Facies Labelling¶
Problem
Exercise: cluster industry-related data with the K-means approach.
Input: a dataset of wireline log measurements:
GR = Gamma ray
ILD_log10 = Resistivity log (base-10 logarithm)
DeltaPHI = Neutron-density porosity difference
PHIND = Average neutron-density porosity
PE = Photoelectric effect
Challenge: assign each input record to a rock facies.
Imports¶
from fastai.basics import *
from nlphero.data.external import *
import sklearn as sk
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
Data Extraction¶
#kaggle datasets download -d rahuketu86/facieswells
path = untar_data("kaggle_datasets::rahuketu86/facieswells"); path
/Landmark2/pdo/.nlphero/archive/facieswells.zip
Path('/Landmark2/pdo/.nlphero/data/facieswells')
path.ls()
(#1) [Path('/Landmark2/pdo/.nlphero/data/facieswells/facies_labels.csv')]
df = pd.read_csv(path/"facies_labels.csv"); df.head()
| | Facies | Depth | GR | ILD_log10 | DeltaPHI | PHIND | PE | WellName | FaciesLabel | FaciesDescription |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2793.0 | 77.45 | 0.664 | 9.9 | 11.915 | 4.6 | SHRIMPLIN | FSiS | Nonmarine fine siltstone |
| 1 | 3 | 2793.5 | 78.26 | 0.661 | 14.2 | 12.565 | 4.1 | SHRIMPLIN | FSiS | Nonmarine fine siltstone |
| 2 | 3 | 2794.0 | 79.05 | 0.658 | 14.8 | 13.050 | 3.6 | SHRIMPLIN | FSiS | Nonmarine fine siltstone |
| 3 | 3 | 2794.5 | 86.10 | 0.655 | 13.9 | 13.115 | 3.5 | SHRIMPLIN | FSiS | Nonmarine fine siltstone |
| 4 | 3 | 2795.0 | 74.58 | 0.647 | 13.5 | 13.300 | 3.4 | SHRIMPLIN | FSiS | Nonmarine fine siltstone |
Data Exploration¶
df[['Facies', 'FaciesLabel']].drop_duplicates().reset_index(drop=True)
| | Facies | FaciesLabel |
|---|---|---|
| 0 | 3 | FSiS |
| 1 | 2 | CSiS |
| 2 | 8 | PS |
| 3 | 6 | WS |
| 4 | 7 | D |
| 5 | 4 | SiSh |
| 6 | 5 | MS |
| 7 | 9 | BS |
| 8 | 1 | SS |
df['WellName'].unique()
array(['SHRIMPLIN', 'ALEXANDER D', 'SHANKLE', 'LUKE G U', 'KIMZEY A',
'CROSS H CATTLE', 'NOLAN', 'Recruit F9', 'NEWBY',
'CHURCHMAN BIBLE'], dtype=object)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4149 entries, 0 to 4148
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Facies 4149 non-null int64
1 Depth 4149 non-null float64
2 GR 4149 non-null float64
3 ILD_log10 4149 non-null float64
4 DeltaPHI 4149 non-null float64
5 PHIND 4149 non-null float64
6 PE 3232 non-null float64
7 WellName 4149 non-null object
8 FaciesLabel 4149 non-null object
9 FaciesDescription 4149 non-null object
dtypes: float64(6), int64(1), object(3)
memory usage: 324.3+ KB
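The `df.info()` output shows that `PE` has only 3232 non-null values out of 4149, so roughly a fifth of the rows lack a photoelectric reading. A minimal sketch of how such gaps could be counted per column and per well follows. The small frame `toy` is invented stand-in data, not the real dataset; in the notebook, `df` would take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`; all values are invented for illustration.
toy = pd.DataFrame({
    "WellName": ["A", "A", "B", "B", "B"],
    "GR": [77.4, 78.2, 46.7, 44.5, 49.7],
    "PE": [4.6, 4.1, np.nan, np.nan, 3.2],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of missing PE readings in each well
missing_pe_per_well = toy["PE"].isna().groupby(toy["WellName"]).sum()

print(missing_per_column)
print(missing_pe_per_well)
```

If the missing values turn out to be concentrated in a few wells, mean imputation fills entire wells with a single constant, which is worth keeping in mind when reading the clusters.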
Preprocessing¶
Select Columns of interest¶
df.columns
Index(['Facies', 'Depth', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE',
'WellName', 'FaciesLabel', 'FaciesDescription'],
dtype='object')
df[['Depth', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]
| | Depth | GR | ILD_log10 | DeltaPHI | PHIND | PE |
|---|---|---|---|---|---|---|
| 0 | 2793.0 | 77.450 | 0.664 | 9.900 | 11.915 | 4.600 |
| 1 | 2793.5 | 78.260 | 0.661 | 14.200 | 12.565 | 4.100 |
| 2 | 2794.0 | 79.050 | 0.658 | 14.800 | 13.050 | 3.600 |
| 3 | 2794.5 | 86.100 | 0.655 | 13.900 | 13.115 | 3.500 |
| 4 | 2795.0 | 74.580 | 0.647 | 13.500 | 13.300 | 3.400 |
| ... | ... | ... | ... | ... | ... | ... |
| 4144 | 3120.5 | 46.719 | 0.947 | 1.828 | 7.254 | 3.617 |
| 4145 | 3121.0 | 44.563 | 0.953 | 2.241 | 8.013 | 3.344 |
| 4146 | 3121.5 | 49.719 | 0.964 | 2.925 | 8.013 | 3.190 |
| 4147 | 3122.0 | 51.469 | 0.965 | 3.083 | 7.708 | 3.152 |
| 4148 | 3122.5 | 50.031 | 0.970 | 2.609 | 6.668 | 3.295 |
4149 rows × 6 columns
inp = df[[ 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]
inp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4149 entries, 0 to 4148
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GR 4149 non-null float64
1 ILD_log10 4149 non-null float64
2 DeltaPHI 4149 non-null float64
3 PHIND 4149 non-null float64
4 PE 3232 non-null float64
dtypes: float64(5)
memory usage: 162.2 KB
Normalize¶
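imp = SimpleImputer()  # fill missing PE values with the column mean (default strategy)
sc = MinMaxScaler()  # rescale every feature to [0, 1]
model = KMeans(n_clusters=9, random_state=42)  # one cluster per facies class
pipe = Pipeline([('imp', imp), ('sc', sc), ('model', model)])
# Returns each row's distance to the 9 cluster centres
pipe.fit_transform(inp)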
array([[0.3386812 , 0.31230835, 0.33234149, ..., 0.21724934, 0.27155135,
0.18648717],
[0.40458037, 0.33378014, 0.37788947, ..., 0.26987168, 0.37232316,
0.24162093],
[0.4251983 , 0.32234207, 0.39128932, ..., 0.26971391, 0.40303073,
0.25479664],
...,
[0.16679157, 0.38264676, 0.39475868, ..., 0.22015877, 0.21032196,
0.12523228],
[0.17061741, 0.38297956, 0.39137533, ..., 0.22070974, 0.21661977,
0.12612891],
[0.15224809, 0.39269409, 0.39269105, ..., 0.22791892, 0.19900377,
0.12644818]])
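The array above has one column per cluster because `KMeans.fit_transform` returns the distance from each row to every cluster centre, not the cluster labels. The sketch below walks through the same three steps on a toy array so that each stage can be inspected on its own. The input values are invented, and 2 clusters are used instead of 9 to keep the example small; `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy input (invented values): 6 rows, 2 features, one missing entry
X = np.array([
    [0.0, 10.0],
    [1.0, np.nan],
    [2.0, 30.0],
    [10.0, 20.0],
    [11.0, 40.0],
    [12.0, 50.0],
])

# Step 1: the default strategy="mean" fills the gap with the column mean
X_imputed = SimpleImputer().fit_transform(X)

# Step 2: each column is rescaled to the range [0, 1]
X_scaled = MinMaxScaler().fit_transform(X_imputed)

# Step 3: the same steps chained in a pipeline that ends in K-means
toy_pipe = Pipeline([
    ("imp", SimpleImputer()),
    ("sc", MinMaxScaler()),
    ("model", KMeans(n_clusters=2, n_init=10, random_state=42)),
])
distances = toy_pipe.fit_transform(X)  # shape: (n_rows, n_clusters)
labels = toy_pipe.named_steps["model"].labels_

print(X_imputed[1])      # second row after imputation
print(distances.shape)
print(labels)
```

Each row's label is the index of its smallest distance, so `distances.argmin(axis=1)` recovers `labels`.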
len(model.labels_)
4149
df['PredictedLabels'] = model.labels_
df.shape
(4149, 11)
# Share (%) of each true facies label within every predicted cluster
counts = df.groupby(['PredictedLabels', 'FaciesLabel']).size()
(100 * counts / counts.groupby(level=0).transform('sum')).to_frame('Frequency')
| PredictedLabels | FaciesLabel | Frequency |
|---|---|---|
| 0 | BS | 4.359673 |
| | FSiS | 0.272480 |
| | MS | 11.989101 |
| | PS | 32.697548 |
| | SiSh | 2.997275 |
| ... | ... | ... |
| 8 | MS | 18.791946 |
| | PS | 15.771812 |
| | SS | 2.013423 |
| | SiSh | 12.751678 |
| | WS | 28.523490 |
73 rows × 1 columns
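The frequency table lists, for each predicted cluster, the percentage of rows that belong to each true facies. A more compact view of the same comparison, sketched here under assumptions, is a row-normalised contingency table together with a single agreement score. The labels in `toy` are invented; in the notebook, `df["FaciesLabel"]` and `df["PredictedLabels"]` would be used instead. The adjusted Rand index is one possible score, not one the original analysis computes.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Invented labels for illustration only
toy = pd.DataFrame({
    "FaciesLabel": ["SS", "SS", "SS", "MS", "MS", "WS", "WS", "WS"],
    "PredictedLabels": [0, 0, 1, 1, 1, 2, 2, 0],
})

# Row-normalised contingency table: percentage of each facies per cluster
composition = 100 * pd.crosstab(
    toy["PredictedLabels"], toy["FaciesLabel"], normalize="index"
)

# Most frequent facies in each cluster
dominant = composition.idxmax(axis=1)

# Agreement between the two labelings (1.0 means identical partitions)
ari = adjusted_rand_score(toy["FaciesLabel"], toy["PredictedLabels"])

print(composition.round(1))
print(dominant)
print(f"Adjusted Rand index: {ari:.3f}")
```

Cluster ids from K-means are arbitrary, so they cannot be compared to the `Facies` codes directly; the dominant facies per cluster and a permutation-invariant score such as the adjusted Rand index avoid that problem.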