Lay of the Land - Spooky Author Identification¶
Let's start by getting the lay of the land with a code-first approach. The idea is to jump quickly into training NLP models on clean datasets from Kaggle. I am going to follow this wonderful kernel from Abhishek Thakur, a prolific Kaggle Grandmaster, for the Spooky Author Identification competition.
Note
Whenever we start a Kaggle competition, it is useful to look at the evaluation metric first. Real-life projects are not like Kaggle; in the real world, the following activities happen before any modelling:
Identify a business problem
Clarify and refine the problem, and translate it into an ML problem
Collect relevant datasets
Preprocess and clean the dataset
Define an evaluation metric relevant for value proposition
However, when we are learning, it is useful to start with a clean dataset and a predefined evaluation metric, so that we can concentrate on building modelling skills.
Imports¶
It is better to keep all imports at the top of the notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plot
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn import metrics, pipeline
from fastai.basics import *
from nlphero.data.external import *
%matplotlib inline
data = untar_data(KAGGLEs.SPOOKY)
Simple EDA¶
First, let's have a look at the data.
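The competition files are read into train, test and sample DataFrames. A minimal sketch, assuming untar_data returns a path containing Kaggle's standard files (train.csv, test.csv, sample_submission.csv):
train = pd.read_csv(data/"train.csv")                # id, text, author
test = pd.read_csv(data/"test.csv")                  # id, text
sample = pd.read_csv(data/"sample_submission.csv")   # id, EAP, HPL, MWS
train.head()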
| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall. | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling might be a mere mistake. | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction. | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair. | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk. | HPL |
(Plot of the author class counts.)
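A plausible sketch of the cell behind that plot (the original code is not shown in this export; it assumes the author column seen in the table above):
train['author'].value_counts()                   # rows per author
train['author'].value_counts().plot(kind='bar')  # bar chart of class counts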
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19579 entries, 0 to 19578
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 19579 non-null object
1 text 19579 non-null object
2 author 19579 non-null object
dtypes: object(3)
memory usage: 459.0+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8392 entries, 0 to 8391
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 8392 non-null object
1 text 8392 non-null object
dtypes: object(2)
memory usage: 131.2+ KB
sample.head()
| | id | EAP | HPL | MWS |
|---|---|---|---|---|
| 0 | id02310 | 0.403494 | 0.287808 | 0.308698 |
| 1 | id24541 | 0.403494 | 0.287808 | 0.308698 |
| 2 | id00134 | 0.403494 | 0.287808 | 0.308698 |
| 3 | id27757 | 0.403494 | 0.287808 | 0.308698 |
| 4 | id04081 | 0.403494 | 0.287808 | 0.308698 |
So we have roughly 19.6k training rows and about 8.4k test rows (around 43% of the training set size). The classes are more or less equally distributed, and for each test row we need to predict the probability that it was written by each of the three authors (EAP, HPL, MWS).
Evaluation Metric¶
For this competition, Kaggle has defined multi-class logarithmic loss as the evaluation metric (see the competition's evaluation page). What does it mean?
For each id, we must predict the probability of each author.
The evaluation formula is the multi-class log loss:
\( \text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}) \)
where \(N\) is the number of rows, \(M\) the number of authors, \(y_{ij}\) is 1 if row \(i\) was written by author \(j\) (0 otherwise), and \(p_{ij}\) is the predicted probability that row \(i\) was written by author \(j\). We can implement it directly:
def convert_multiclass(actual, n_classes):
    # one-hot encode the integer labels: row i gets a 1 in column actual[i]
    actual2 = np.zeros((actual.shape[0], n_classes))
    for i, val in enumerate(actual):
        actual2[i, val] = 1
    return actual2

def multiclass_logloss(actual, predicted, eps=1e-15):
    # actual: integer class labels; predicted: (n_samples, n_classes) probabilities
    y_ij = convert_multiclass(actual, predicted.shape[1])
    N = len(actual)
    p_ij = np.clip(predicted, eps, 1 - eps)   # clip to avoid log(0)
    return -(y_ij * np.log(p_ij)).sum() / N
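As a quick sanity check on the metric: predicting a uniform 1/3 for every author gives a log loss of \(\ln{3} \approx 1.099\), so any useful model should score well below that. A small sketch using hypothetical labels:
y_true = np.array([0, 1, 2, 1])                 # four hypothetical rows
uniform = np.full((4, 3), 1/3)                  # predict 1/3 for each of the 3 authors
print(multiclass_logloss(y_true, uniform))      # ~1.0986, i.e. ln(3)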
Train Test Split¶
We split the data into training and validation sets, stratifying on the target (author).
x_train, y_train = train[['id','text']], train['author']
lbl_encoder = LabelEncoder()
y_train = lbl_encoder.fit_transform(y_train)
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, stratify=y_train, test_size=0.33, random_state=42)
len(x_train), len(x_valid)
(13117, 6462)
Modelling¶
TF-IDF with Logistic Regression¶
Term frequency inverse document frequency
A good reference can be found here
\(TF \to \text{Frequency of a term in a document}\): indicates whether a word is used many times or only rarely within a document.
\(IDF \to \text{Inverse Document Frequency}\): a measure of how significant the term is across the whole corpus; terms that appear in few documents get a higher IDF.
\( TF.IDF(t, d)=W_t^{d} = TF_t^{d}\ln{\frac{N}{DF^t}}\)
\(N=\text{Total number of documents}\)
\(DF^t=\text{Number of document with term t}\)
Put simply, the higher the TF.IDF weight of a term in a document, the more frequent it is in that document and the rarer it is across the corpus.
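To make the formula concrete, here is a toy computation of the raw TF.IDF weight defined above. This is only an illustration; sklearn's TfidfVectorizer (used below) adds IDF smoothing, sublinear TF and L2 normalisation, so its values will differ.
import math
from collections import Counter

docs = ["the raven perched upon the bust",
        "the black cat crossed the dark hall",
        "the raven and the black cat"]
tokens = [d.split() for d in docs]
N = len(tokens)                                       # total number of documents
df = Counter(t for doc in tokens for t in set(doc))   # document frequency of each term

def tf_idf(term, doc):
    return doc.count(term) * math.log(N / df[term])   # TF_t^d * ln(N / DF^t)

print(tf_idf("raven", tokens[0]))   # appears in 2 of 3 docs -> 1 * ln(3/2) ≈ 0.41
print(tf_idf("the", tokens[0]))     # appears in every doc  -> 2 * ln(3/3) = 0.0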
tfv = TfidfVectorizer(min_df=1,
max_features=None,
strip_accents="unicode",
analyzer='word',
token_pattern=r'\w{1,}',
ngram_range=(1,3),
use_idf=1,
smooth_idf=1,
sublinear_tf=1,
stop_words='english'
)
tfv
TfidfVectorizer(ngram_range=(1, 3), smooth_idf=1, stop_words='english',
strip_accents='unicode', sublinear_tf=1,
token_pattern='\\w{1,}', use_idf=1)
tfv.fit(np.concatenate([x_train['text'].values,x_valid['text'].values]))
TfidfVectorizer(ngram_range=(1, 3), smooth_idf=1, stop_words='english',
strip_accents='unicode', sublinear_tf=1,
token_pattern='\\w{1,}', use_idf=1)
x_train_tfv = tfv.transform(x_train['text'].values)
x_train_tfv
<13117x400219 sparse matrix of type '<class 'numpy.float64'>'
with 413765 stored elements in Compressed Sparse Row format>
x_valid_tfv = tfv.transform(x_valid['text'].values)
x_valid_tfv
<6462x400219 sparse matrix of type '<class 'numpy.float64'>'
with 205383 stored elements in Compressed Sparse Row format>
clf = LogisticRegression(C=1.0)
clf.fit(x_train_tfv, y_train)
LogisticRegression()
val_preds = clf.predict_proba(x_valid_tfv)
val_preds
array([[0.41254593, 0.26687746, 0.32057661],
[0.26133285, 0.13761174, 0.60105541],
[0.29159974, 0.4086124 , 0.29978786],
...,
[0.60269293, 0.18377996, 0.21352711],
[0.20320163, 0.16185855, 0.63493981],
[0.56495872, 0.15974122, 0.27530006]])
print ("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.806
For comparison, here is an alternative implementation of the same metric; it produces the same result.
def multiclass_logloss2(actual, predicted, eps=1e-15):
"""Multi class version of Logarithmic Loss metric.
:param actual: Array containing the actual target classes
:param predicted: Matrix with class predictions, one probability per class
"""
# Convert 'actual' to a binary array if it's not already:
if len(actual.shape) == 1:
actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
for i, val in enumerate(actual):
actual2[i, val] = 1
actual = actual2
clip = np.clip(predicted, eps, 1 - eps)
rows = actual.shape[0]
vsota = np.sum(actual * np.log(clip))
return -1.0 / rows * vsota
print ("logloss: %0.3f " % multiclass_logloss2(y_valid, val_preds))
logloss: 0.806
CountVectorizer with Logistic Regression¶
CountVectorizer simply counts how often each word (or n-gram) appears in a document, with no IDF weighting; a toy example follows.
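A tiny illustration of what CountVectorizer produces on two toy sentences (not part of the competition pipeline):
toy_cv = CountVectorizer(ngram_range=(1, 2))
counts = toy_cv.fit_transform(["the raven spoke", "the raven spoke again and again"])
print(toy_cv.vocabulary_)    # maps each unigram/bigram to a column index
print(counts.toarray())      # raw counts per document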
cv = CountVectorizer(analyzer='word',
token_pattern=r'\w{1,}',
ngram_range=(1,3),
stop_words='english'
)
cv
CountVectorizer(ngram_range=(1, 3), stop_words='english',
token_pattern='\\w{1,}')
cv.fit(np.concatenate([x_train['text'].values,x_valid['text'].values]))
CountVectorizer(ngram_range=(1, 3), stop_words='english',
token_pattern='\\w{1,}')
x_train_cv = cv.transform(x_train['text'].values)
x_train_cv
<13117x400266 sparse matrix of type '<class 'numpy.int64'>'
with 413776 stored elements in Compressed Sparse Row format>
x_valid_cv = cv.transform(x_valid['text'].values)
x_valid_cv
<6462x400266 sparse matrix of type '<class 'numpy.int64'>'
with 205395 stored elements in Compressed Sparse Row format>
clf = LogisticRegression(C=1.0)
clf.fit(x_train_cv, y_train)
clf
LogisticRegression()
val_preds = clf.predict_proba(x_valid_cv)
val_preds
array([[0.53572102, 0.11567325, 0.34860574],
[0.03049094, 0.00427 , 0.96523906],
[0.33594102, 0.3222241 , 0.34183488],
...,
[0.97113843, 0.01189312, 0.01696845],
[0.05768433, 0.02785335, 0.91446231],
[0.8622879 , 0.020955 , 0.1167571 ]])
print ("logloss: %0.3f " % multiclass_logloss2(y_valid, val_preds))
logloss: 0.557
MultinomialNB¶
clf = MultinomialNB()
clf.fit(x_train_tfv, y_train)
MultinomialNB()
val_preds = clf.predict_proba(x_valid_tfv)
val_preds
array([[0.43479056, 0.2630958 , 0.30211364],
[0.33038311, 0.16973709, 0.4998798 ],
[0.37577295, 0.34463046, 0.27959659],
...,
[0.54395868, 0.18701702, 0.2690243 ],
[0.27495975, 0.18696675, 0.5380735 ],
[0.56420466, 0.16682221, 0.26897313]])
print ("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.853
clf = MultinomialNB()
clf.fit(x_train_cv, y_train)
val_preds = clf.predict_proba(x_valid_cv)
print ("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.467
def check_score(clf, vectorizer,
                x_train=x_train, x_valid=x_valid,
                y_train=y_train, y_valid=y_valid):
    # Fit the vectorizer on all of the text, vectorize train and validation,
    # fit the classifier and return the validation multi-class log loss.
    vectorizer.fit(np.concatenate([x_train['text'].values,
                                   x_valid['text'].values]))
    x_train_vect = vectorizer.transform(x_train['text'].values)
    x_valid_vect = vectorizer.transform(x_valid['text'].values)
    clf.fit(x_train_vect, y_train)
    val_preds = clf.predict_proba(x_valid_vect)
    mcll = multiclass_logloss(y_valid, val_preds)
    return mcll
clf = MultinomialNB()
vectorizer = CountVectorizer(analyzer='word',
token_pattern=r'\w{1,}',
ngram_range=(1,3),
stop_words='english'
)
score = check_score(clf, vectorizer)
print ("logloss: %0.3f " % score)
logloss: 0.467
score = check_score(clf, tfv)
print ("logloss: %0.3f " % score)
logloss: 0.853
SVD¶
TruncatedSVD (essentially latent semantic analysis) compresses the huge sparse count matrix into 120 dense components; we then standardise those components and fit an SVM on them.
svd = TruncatedSVD(n_components=120)
svd.fit(x_train_cv)
TruncatedSVD(n_components=120)
xtrain_svd = svd.transform(x_train_cv)
xvalid_svd = svd.transform(x_valid_cv)
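It is worth checking how much variance the 120 components actually retain; a quick diagnostic, not in the original kernel:
# fraction of the total variance captured by the 120 SVD components
print("explained variance: %0.3f" % svd.explained_variance_ratio_.sum())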
scl = StandardScaler()
scl.fit(xtrain_svd)
StandardScaler()
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)
clf = SVC(C=1.0, probability=True) # since we need probabilities
clf.fit(xtrain_svd_scl, y_train)
val_preds = clf.predict_proba(xvalid_svd_scl)
print("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.787
x_train_tfv.tocsc()
<13117x400219 sparse matrix of type '<class 'numpy.float64'>'
with 413765 stored elements in Compressed Sparse Column format>
x_train_tfv
<13117x400219 sparse matrix of type '<class 'numpy.float64'>'
with 413765 stored elements in Compressed Sparse Row format>
XGBoost¶
CountVectorizer¶
clf = xgb.XGBClassifier(max_depth=7,
n_estimators=200,
colsample_bytree=0.8,
subsample=0.8,
nthread=10,
learning_rate=0.1)
vectorizer = CountVectorizer(analyzer='word',
token_pattern=r'\w{1,}',
ngram_range=(1,3),
stop_words='english'
)
score = check_score(clf, vectorizer)
print ("logloss: %0.3f " % score)
logloss: 0.776
TfidfVectorizer¶
clf = xgb.XGBClassifier(max_depth=7,
n_estimators=200,
colsample_bytree=0.8,
subsample=0.8,
nthread=10,
learning_rate=0.1)
vectorizer = TfidfVectorizer(min_df=1,
max_features=None,
strip_accents="unicode",
analyzer='word',
token_pattern=r'\w{1,}',
ngram_range=(1,3),
use_idf=1,
smooth_idf=1,
sublinear_tf=1,
stop_words='english'
)
score = check_score(clf, vectorizer)
print ("logloss: %0.3f " % score)
logloss: 0.787
SVD Transformation¶
clf = xgb.XGBClassifier(max_depth=7,
n_estimators=200,
colsample_bytree=0.8,
subsample=0.8,
nthread=10,
learning_rate=0.1)
clf.fit(xtrain_svd, y_train)
val_preds = clf.predict_proba(xvalid_svd)
score = multiclass_logloss(y_valid, val_preds)
print ("logloss: %0.3f " % score)
logloss: 0.805
Grid Search¶
Let's do a grid search over an SVD + StandardScaler + LogisticRegression pipeline. First we wrap multiclass_logloss in a scorer; because greater_is_better=False, scikit-learn negates the metric, so the "Best Score" reported below is the negative of the log loss.
mll_scorer = metrics.make_scorer(multiclass_logloss,
greater_is_better=False,
needs_proba=True)
mll_scorer
make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True)
svd = TruncatedSVD()
scl = StandardScaler()
lr_model = LogisticRegression()
clf = pipeline.Pipeline([
('svd', svd),
('scl', scl),
('lr', lr_model)
])
clf
Pipeline(steps=[('svd', TruncatedSVD()), ('scl', StandardScaler()),
('lr', LogisticRegression())])
param_grid = {
    'svd__n_components': [120, 180],
    'lr__C': [0.1, 1.0, 10],
    # note: LogisticRegression's default lbfgs solver only supports 'l2';
    # the 'l1' candidates will fail to fit unless e.g. solver='liblinear' is used
    'lr__penalty': ['l1', 'l2']
}
param_grid
{'svd__n_components': [120, 180],
'lr__C': [0.1, 1.0, 10],
'lr__penalty': ['l1', 'l2']}
model = GridSearchCV(estimator=clf,
param_grid=param_grid,
scoring=mll_scorer,
verbose=10,
n_jobs=-1,
iid=True,
refit=True,
cv=2
)
model
GridSearchCV(cv=2,
estimator=Pipeline(steps=[('svd', TruncatedSVD()),
('scl', StandardScaler()),
('lr', LogisticRegression())]),
iid=True, n_jobs=-1,
param_grid={'lr__C': [0.1, 1.0, 10], 'lr__penalty': ['l1', 'l2'],
'svd__n_components': [120, 180]},
scoring=make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True),
verbose=10)
model.fit(x_train_tfv, y_train)
model
Fitting 2 folds for each of 12 candidates, totalling 24 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.5min
[Parallel(n_jobs=-1)]: Done 3 out of 24 | elapsed: 1.5min remaining: 10.7min
[Parallel(n_jobs=-1)]: Done 6 out of 24 | elapsed: 1.6min remaining: 4.7min
[Parallel(n_jobs=-1)]: Done 9 out of 24 | elapsed: 1.6min remaining: 2.7min
[Parallel(n_jobs=-1)]: Done 12 out of 24 | elapsed: 1.6min remaining: 1.6min
[Parallel(n_jobs=-1)]: Done 15 out of 24 | elapsed: 2.2min remaining: 1.3min
[Parallel(n_jobs=-1)]: Done 18 out of 24 | elapsed: 2.2min remaining: 43.9s
[Parallel(n_jobs=-1)]: Done 21 out of 24 | elapsed: 2.2min remaining: 18.9s
[Parallel(n_jobs=-1)]: Done 24 out of 24 | elapsed: 2.2min remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 24 out of 24 | elapsed: 2.2min finished
/home/ubuntu/anaconda3/envs/nlphero/lib/python3.8/site-packages/sklearn/model_selection/_search.py:847: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.
warnings.warn(
GridSearchCV(cv=2,
estimator=Pipeline(steps=[('svd', TruncatedSVD()),
('scl', StandardScaler()),
('lr', LogisticRegression())]),
iid=True, n_jobs=-1,
param_grid={'lr__C': [0.1, 1.0, 10], 'lr__penalty': ['l1', 'l2'],
'svd__n_components': [120, 180]},
scoring=make_scorer(multiclass_logloss, greater_is_better=False, needs_proba=True),
verbose=10)
print(f"Best Score: {model.best_score_:.3}")
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
print(param_name, best_parameters[param_name])
Best Score: -0.792
Best parameters set:
lr__C 0.1
lr__penalty l2
svd__n_components 180
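Since refit=True, model.best_estimator_ has been refit on the full x_train_tfv, so we can sanity-check it on the held-out validation split (a quick check, not part of the original notebook):
best = model.best_estimator_
val_preds = best.predict_proba(x_valid_tfv)
print("validation logloss: %0.3f" % multiclass_logloss(y_valid, val_preds))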
Re-running the identical grid search picks a slightly different winner (lr__C = 10 instead of 0.1) with an essentially identical score, most likely because TruncatedSVD's randomized algorithm is not given a fixed random_state.
model = GridSearchCV(estimator=clf,
param_grid=param_grid,
scoring=mll_scorer,
verbose=10,
n_jobs=-1,
iid=True,
refit=True,
cv=2)
model.fit(x_train_tfv, y_train)
print(f"Best Score: {model.best_score_:.3}")
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
print(param_name, best_parameters[param_name])
Fitting 2 folds for each of 12 candidates, totalling 24 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.4min
[Parallel(n_jobs=-1)]: Done 3 out of 24 | elapsed: 1.5min remaining: 10.3min
[Parallel(n_jobs=-1)]: Done 6 out of 24 | elapsed: 1.5min remaining: 4.6min
[Parallel(n_jobs=-1)]: Done 9 out of 24 | elapsed: 1.6min remaining: 2.7min
[Parallel(n_jobs=-1)]: Done 12 out of 24 | elapsed: 1.6min remaining: 1.6min
[Parallel(n_jobs=-1)]: Done 15 out of 24 | elapsed: 2.2min remaining: 1.3min
[Parallel(n_jobs=-1)]: Done 18 out of 24 | elapsed: 2.2min remaining: 43.9s
[Parallel(n_jobs=-1)]: Done 21 out of 24 | elapsed: 2.2min remaining: 18.9s
[Parallel(n_jobs=-1)]: Done 24 out of 24 | elapsed: 2.2min remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 24 out of 24 | elapsed: 2.2min finished
/home/ubuntu/anaconda3/envs/nlphero/lib/python3.8/site-packages/sklearn/model_selection/_search.py:847: FutureWarning: The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.
warnings.warn(
Best Score: -0.789
Best parameters set:
lr__C 10
lr__penalty l2
svd__n_components 180