Facies Labelling

Problem

Exercise: Cluster industry-related data using the K-means approach

Input: a dataset of wireline log measurements:

  • GR=Gamma ray

  • ILD_log10=Resistivity logging

  • DeltaPHI=Neutron-density porosity difference

  • PHIND=Average neutron-density porosity

  • PE=Photoelectric effect

Challenge: Assign each input data record to a rock facies

Imports

from fastai.basics import *
from nlphero.data.external import *
import sklearn as sk
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

Data Extraction

#kaggle datasets download -d rahuketu86/facieswells
path = untar_data("kaggle_datasets::rahuketu86/facieswells"); path
/Landmark2/pdo/.nlphero/archive/facieswells.zip
Path('/Landmark2/pdo/.nlphero/data/facieswells')
path.ls()
(#1) [Path('/Landmark2/pdo/.nlphero/data/facieswells/facies_labels.csv')]
df = pd.read_csv(path/"facies_labels.csv"); df.head()
   Facies   Depth     GR  ILD_log10  DeltaPHI   PHIND   PE   WellName  FaciesLabel  FaciesDescription
0       3  2793.0  77.45      0.664       9.9  11.915  4.6  SHRIMPLIN  FSiS         Nonmarine fine siltstone
1       3  2793.5  78.26      0.661      14.2  12.565  4.1  SHRIMPLIN  FSiS         Nonmarine fine siltstone
2       3  2794.0  79.05      0.658      14.8  13.050  3.6  SHRIMPLIN  FSiS         Nonmarine fine siltstone
3       3  2794.5  86.10      0.655      13.9  13.115  3.5  SHRIMPLIN  FSiS         Nonmarine fine siltstone
4       3  2795.0  74.58      0.647      13.5  13.300  3.4  SHRIMPLIN  FSiS         Nonmarine fine siltstone

Data Exploration

df[['Facies', 'FaciesLabel']].drop_duplicates().reset_index(drop=True)
   Facies  FaciesLabel
0       3  FSiS
1       2  CSiS
2       8  PS
3       6  WS
4       7  D
5       4  SiSh
6       5  MS
7       9  BS
8       1  SS
df['WellName'].unique()
array(['SHRIMPLIN', 'ALEXANDER D', 'SHANKLE', 'LUKE G U', 'KIMZEY A',
       'CROSS H CATTLE', 'NOLAN', 'Recruit F9', 'NEWBY',
       'CHURCHMAN BIBLE'], dtype=object)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4149 entries, 0 to 4148
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Facies             4149 non-null   int64  
 1   Depth              4149 non-null   float64
 2   GR                 4149 non-null   float64
 3   ILD_log10          4149 non-null   float64
 4   DeltaPHI           4149 non-null   float64
 5   PHIND              4149 non-null   float64
 6   PE                 3232 non-null   float64
 7   WellName           4149 non-null   object 
 8   FaciesLabel        4149 non-null   object 
 9   FaciesDescription  4149 non-null   object 
dtypes: float64(6), int64(1), object(3)
memory usage: 324.3+ KB
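
The `df.info()` output shows that PE is the only column with gaps: 3232 of 4149 values are present, so 917 are missing. Below is a minimal sketch of how one might check where such gaps sit. It uses a small hypothetical frame (`toy`, with made-up well names and values) in place of the real data, so the counts it prints say nothing about the actual wells.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; all values are made up.
toy = pd.DataFrame({
    "WellName": ["A", "A", "B", "B", "B"],
    "GR": [77.4, 78.2, 60.1, 61.0, 59.5],
    "PE": [4.6, 4.1, np.nan, np.nan, 3.4],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of missing PE values in each well.
missing_pe_per_well = toy["PE"].isna().groupby(toy["WellName"]).sum()

print(missing_per_column)
print(missing_pe_per_well)
```

Running the same two expressions on `df` would show whether the missing PE readings are spread across all wells or concentrated in a few of them.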

Preprocessing

Select Columns of interest

df.columns
Index(['Facies', 'Depth', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE',
       'WellName', 'FaciesLabel', 'FaciesDescription'],
      dtype='object')
df[['Depth', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]
       Depth      GR  ILD_log10  DeltaPHI   PHIND     PE
0     2793.0  77.450      0.664     9.900  11.915  4.600
1     2793.5  78.260      0.661    14.200  12.565  4.100
2     2794.0  79.050      0.658    14.800  13.050  3.600
3     2794.5  86.100      0.655    13.900  13.115  3.500
4     2795.0  74.580      0.647    13.500  13.300  3.400
...      ...     ...        ...       ...     ...    ...
4144  3120.5  46.719      0.947     1.828   7.254  3.617
4145  3121.0  44.563      0.953     2.241   8.013  3.344
4146  3121.5  49.719      0.964     2.925   8.013  3.190
4147  3122.0  51.469      0.965     3.083   7.708  3.152
4148  3122.5  50.031      0.970     2.609   6.668  3.295

4149 rows × 6 columns

inp = df[[ 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]
inp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4149 entries, 0 to 4148
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   GR         4149 non-null   float64
 1   ILD_log10  4149 non-null   float64
 2   DeltaPHI   4149 non-null   float64
 3   PHIND      4149 non-null   float64
 4   PE         3232 non-null   float64
dtypes: float64(5)
memory usage: 162.2 KB
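
Because PE still has missing values in `inp`, the inputs need to be filled and brought onto a common scale before distance-based clustering. The sketch below illustrates what a default `SimpleImputer` followed by a default `MinMaxScaler` does, on a tiny hypothetical array `X` (made-up numbers, not taken from the dataset).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical two-column input (made-up values); the second column has a gap.
X = np.array([[10.0, 2.0],
              [20.0, np.nan],
              [30.0, 4.0]])

# SimpleImputer() defaults to strategy="mean": the NaN becomes mean(2, 4) = 3.
X_imputed = SimpleImputer().fit_transform(X)

# MinMaxScaler() rescales each column to the [0, 1] range by default.
X_scaled = MinMaxScaler().fit_transform(X_imputed)

print(X_imputed)
print(X_scaled)
```

Mean imputation is only the default; whether it suits the PE log is a modelling choice that this sketch does not settle.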

Normalize

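# Fill missing PE values with the column mean (SimpleImputer's default strategy)
imp = SimpleImputer()
# Rescale every feature to the [0, 1] range
sc = MinMaxScaler()
# 9 clusters, matching the 9 facies classes in the labelled data
model = KMeans(n_clusters=9, random_state=42)
pipe = Pipeline([('imp', imp), ('sc', sc), ('model', model)])
# fit_transform returns each sample's distance to the 9 cluster centres
pipe.fit_transform(inp)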
array([[0.3386812 , 0.31230835, 0.33234149, ..., 0.21724934, 0.27155135,
        0.18648717],
       [0.40458037, 0.33378014, 0.37788947, ..., 0.26987168, 0.37232316,
        0.24162093],
       [0.4251983 , 0.32234207, 0.39128932, ..., 0.26971391, 0.40303073,
        0.25479664],
       ...,
       [0.16679157, 0.38264676, 0.39475868, ..., 0.22015877, 0.21032196,
        0.12523228],
       [0.17061741, 0.38297956, 0.39137533, ..., 0.22070974, 0.21661977,
        0.12612891],
       [0.15224809, 0.39269409, 0.39269105, ..., 0.22791892, 0.19900377,
        0.12644818]])
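
The array returned by `fit_transform` on a pipeline that ends in `KMeans` is not the scaled input. It holds the distance from every sample to every cluster centre, with one column per cluster. The sketch below shows this on hypothetical random data (`X`, 200 samples with 5 features, and 3 clusters chosen only to keep the example small).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical random data standing in for the five log features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

demo_pipe = Pipeline([
    ("sc", MinMaxScaler()),
    ("model", KMeans(n_clusters=3, n_init=10, random_state=42)),
])

# One row per sample, one column per cluster: distance to each cluster centre.
distances = demo_pipe.fit_transform(X)

# Assigning each sample to its nearest centre reproduces predict().
nearest = distances.argmin(axis=1)
predicted = demo_pipe.predict(X)

print(distances.shape)
print((nearest == predicted).all())
```

For the pipeline above, the same reasoning gives an array with 4149 rows and 9 columns.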
len(model.labels_)
4149
df.shape
(4149, 11)
df['PredictedLabels'] = model.labels_
# Percentage of each true facies within every predicted cluster
(df[['PredictedLabels', 'FaciesLabel']]
   .groupby(['PredictedLabels', 'FaciesLabel'])['FaciesLabel']
   .agg(Frequency='count')
   .groupby(level=0)
   .apply(lambda x: 100*x/float(x.sum())))
                             Frequency
PredictedLabels FaciesLabel
0               BS            4.359673
                FSiS          0.272480
                MS           11.989101
                PS           32.697548
                SiSh          2.997275
...                                ...
8               MS           18.791946
                PS           15.771812
                SS            2.013423
                SiSh         12.751678
                WS           28.523490

73 rows × 1 columns
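
The table above gives, for each predicted cluster, the percentage of its records that belong to each true facies. The same composition can be computed more compactly with `pd.crosstab`. The sketch below is an alternative to the `groupby` chain, not the notebook's own method, and it runs on a small hypothetical frame (`toy`, with made-up labels) rather than on `df`.

```python
import pandas as pd

# Hypothetical labels (made up) standing in for the two df columns.
toy = pd.DataFrame({
    "PredictedLabels": [0, 0, 0, 0, 1, 1, 1, 1],
    "FaciesLabel": ["PS", "PS", "MS", "BS", "WS", "WS", "WS", "SS"],
})

# Row-normalised contingency table, scaled to percentages.
composition = pd.crosstab(
    toy["PredictedLabels"], toy["FaciesLabel"], normalize="index"
) * 100

# Most frequent true facies in each cluster, and its share of that cluster.
dominant = composition.idxmax(axis=1)
purity = composition.max(axis=1)

print(composition.round(1))
print(dominant)
print(purity)
```

Each row of `composition` sums to 100. A cluster whose largest share is low, such as the 32.7% PS share in cluster 0 above, mixes several facies and does not map cleanly onto a single label.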