Facies Labelling¶

Problem

Exercise: Cluster industry-related data using the K-means approach

Input: a dataset of wireline log measurements:

GR = Gamma ray
ILD_log10 = Resistivity logging
DeltaPHI = Neutron-density porosity difference
PHIND = Average neutron-density porosity
PE = Photoelectric effect

Challenge: Assign each input record to a rock facies

Imports¶

from fastai.basics import *
from nlphero.data.external import *
import sklearn as sk
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

Data Extraction¶

#kaggle datasets download -d rahuketu86/facieswells
path = untar_data("kaggle_datasets::rahuketu86/facieswells"); path

/Landmark2/pdo/.nlphero/archive/facieswells.zip

Path('/Landmark2/pdo/.nlphero/data/facieswells')

path.ls()

(#1) [Path('/Landmark2/pdo/.nlphero/data/facieswells/facies_labels.csv')]

df = pd.read_csv(path/"facies_labels.csv"); df.head()

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	WellName	FaciesLabel	FaciesDescription
0	3	2793.0	77.45	0.664	9.9	11.915	4.6	SHRIMPLIN	FSiS	Nonmarine fine siltstone
1	3	2793.5	78.26	0.661	14.2	12.565	4.1	SHRIMPLIN	FSiS	Nonmarine fine siltstone
2	3	2794.0	79.05	0.658	14.8	13.050	3.6	SHRIMPLIN	FSiS	Nonmarine fine siltstone
3	3	2794.5	86.10	0.655	13.9	13.115	3.5	SHRIMPLIN	FSiS	Nonmarine fine siltstone
4	3	2795.0	74.58	0.647	13.5	13.300	3.4	SHRIMPLIN	FSiS	Nonmarine fine siltstone

Data Exploration¶

df[['Facies', 'FaciesLabel']].drop_duplicates().reset_index(drop=True)

	Facies	FaciesLabel
0	3	FSiS
1	2	CSiS
2	8	PS
3	6	WS
4	7	D
5	4	SiSh
6	5	MS
7	9	BS
8	1	SS

df['WellName'].unique()

array(['SHRIMPLIN', 'ALEXANDER D', 'SHANKLE', 'LUKE G U', 'KIMZEY A',
       'CROSS H CATTLE', 'NOLAN', 'Recruit F9', 'NEWBY',
       'CHURCHMAN BIBLE'], dtype=object)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4149 entries, 0 to 4148
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Facies             4149 non-null   int64  
 1   Depth              4149 non-null   float64
 2   GR                 4149 non-null   float64
 3   ILD_log10          4149 non-null   float64
 4   DeltaPHI           4149 non-null   float64
 5   PHIND              4149 non-null   float64
 6   PE                 3232 non-null   float64
 7   WellName           4149 non-null   object 
 8   FaciesLabel        4149 non-null   object 
 9   FaciesDescription  4149 non-null   object 
dtypes: float64(6), int64(1), object(3)
memory usage: 324.3+ KB
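
The summary above shows that PE is the only column with gaps (3232 non-null values out of 4149 rows). A minimal sketch of how such gaps could be counted per column and per well is given below. It runs on a small made-up frame (`toy`, with hypothetical wells "A" and "B") rather than on the real data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the layout of the facies data.
toy = pd.DataFrame({
    "WellName": ["A", "A", "B", "B", "B"],
    "GR": [77.4, 78.2, 46.7, 44.5, 49.7],
    "PE": [4.6, 4.1, np.nan, np.nan, 3.2],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of missing PE values in each well; this shows whether the
# gaps are spread evenly or concentrated in a few wells.
missing_pe_per_well = toy["PE"].isna().groupby(toy["WellName"]).sum()

print(missing_per_column)
print(missing_pe_per_well)
```

On the real frame the same two expressions would be applied to `df` instead of `toy`.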

Preprocessing¶

Select Columns of interest¶

df.columns

Index(['Facies', 'Depth', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE',
       'WellName', 'FaciesLabel', 'FaciesDescription'],
      dtype='object')

df[['Depth', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]

	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE
0	2793.0	77.450	0.664	9.900	11.915	4.600
1	2793.5	78.260	0.661	14.200	12.565	4.100
2	2794.0	79.050	0.658	14.800	13.050	3.600
3	2794.5	86.100	0.655	13.900	13.115	3.500
4	2795.0	74.580	0.647	13.500	13.300	3.400
...	...	...	...	...	...	...
4144	3120.5	46.719	0.947	1.828	7.254	3.617
4145	3121.0	44.563	0.953	2.241	8.013	3.344
4146	3121.5	49.719	0.964	2.925	8.013	3.190
4147	3122.0	51.469	0.965	3.083	7.708	3.152
4148	3122.5	50.031	0.970	2.609	6.668	3.295

4149 rows × 6 columns

inp = df[[ 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]
inp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4149 entries, 0 to 4148
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   GR         4149 non-null   float64
 1   ILD_log10  4149 non-null   float64
 2   DeltaPHI   4149 non-null   float64
 3   PHIND      4149 non-null   float64
 4   PE         3232 non-null   float64
dtypes: float64(5)
memory usage: 162.2 KB

Normalize¶

imp = SimpleImputer()

sc = MinMaxScaler()

# One cluster per facies class: the data contains nine distinct facies.
model = KMeans(n_clusters=9, random_state=42)

pipe = Pipeline([('imp', imp), ('sc', sc), ('model', model)])
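
The first two pipeline steps can be illustrated on a small made-up array. With default settings, `SimpleImputer()` replaces each missing value with the mean of its column, and `MinMaxScaler()` rescales every column to the range [0, 1] using (x - min) / (max - min). The array `X_toy` below is an assumption chosen for illustration, not part of the facies data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 3x2 array with one missing value in the second column.
X_toy = np.array([
    [10.0, 2.0],
    [20.0, np.nan],
    [30.0, 6.0],
])

# Default strategy is "mean": the NaN becomes (2 + 6) / 2 = 4.
X_imputed = SimpleImputer().fit_transform(X_toy)

# Each column is mapped to [0, 1] independently.
X_scaled = MinMaxScaler().fit_transform(X_imputed)

print(X_imputed)
print(X_scaled)
```

Scaling matters here because K-means relies on Euclidean distance, so an unscaled column with a large numeric range (such as GR) would otherwise dominate the clustering.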

pipe.fit_transform(inp)

array([[0.3386812 , 0.31230835, 0.33234149, ..., 0.21724934, 0.27155135,
        0.18648717],
       [0.40458037, 0.33378014, 0.37788947, ..., 0.26987168, 0.37232316,
        0.24162093],
       [0.4251983 , 0.32234207, 0.39128932, ..., 0.26971391, 0.40303073,
        0.25479664],
       ...,
       [0.16679157, 0.38264676, 0.39475868, ..., 0.22015877, 0.21032196,
        0.12523228],
       [0.17061741, 0.38297956, 0.39137533, ..., 0.22070974, 0.21661977,
        0.12612891],
       [0.15224809, 0.39269409, 0.39269105, ..., 0.22791892, 0.19900377,
        0.12644818]])
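
When the last step of a pipeline is `KMeans`, `fit_transform` returns the distance from each row to every cluster centre, so the array above has one column per cluster. The sketch below shows this on synthetic points; `X_toy` and the two-cluster setup are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated hypothetical blobs of 20 points each.
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=42)

# Shape is (n_samples, n_clusters): one distance per cluster centre.
distances = km.fit_transform(X_toy)

# The assigned label is the index of the nearest centre.
nearest = distances.argmin(axis=1)

print(distances.shape)
print(np.array_equal(nearest, km.labels_))
```

In the notebook, the cluster assignments themselves are read from `model.labels_` rather than from this distance matrix.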

len(model.labels_)

df.shape

(4149, 11)

df['PredictedLabels'] = model.labels_

# Percentage of each facies label within every predicted cluster.
(df[['PredictedLabels', 'FaciesLabel']]
    .groupby(['PredictedLabels', 'FaciesLabel'])['FaciesLabel']
    .agg(Frequency='count')
    .groupby(level=0)
    .apply(lambda x: 100*x/float(x.sum())))

		Frequency
PredictedLabels	FaciesLabel
0	BS	4.359673
	FSiS	0.272480
	MS	11.989101
	PS	32.697548
	SiSh	2.997275
...	...	...
8	MS	18.791946
	PS	15.771812
	SS	2.013423
	SiSh	12.751678
	WS	28.523490

73 rows × 1 columns
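
The same row-wise percentages can also be laid out as a cluster-by-facies matrix with `pd.crosstab`, which may be easier to scan than the long format above. The sketch below uses made-up labels in a hypothetical frame `toy`. The `adjusted_rand_score` call is an addition not used in the original analysis, included as one possible single-number summary of how well clusters agree with the known facies.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignments and facies labels.
toy = pd.DataFrame({
    "PredictedLabels": [0, 0, 0, 0, 1, 1, 1, 1],
    "FaciesLabel": ["PS", "PS", "PS", "MS", "WS", "WS", "MS", "MS"],
})

# normalize="index" divides each row by its total, so every cluster
# row sums to 100 after scaling to percentages.
pct = pd.crosstab(
    toy["PredictedLabels"], toy["FaciesLabel"], normalize="index"
) * 100

# Agreement between the two labelings; 1.0 means identical partitions,
# values near 0 mean agreement no better than chance.
ari = adjusted_rand_score(toy["FaciesLabel"], toy["PredictedLabels"])

print(pct)
print(ari)
```

On the real data the two columns would be taken from `df` after `PredictedLabels` has been assigned.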
