Lay of the Land - Spooky Author Identification

Let's start by getting a lay of the land using a code-first approach. The idea is to quickly jump into training NLP models using clean datasets from Kaggle. I am going to follow this wonderful kernel from Abhishek Thakur, a prolific Kaggle GM, for the Spooky Author Identification competition.

Note

Whenever we start a Kaggle competition, it is useful to look at the evaluation metric first. Real-life projects are not like Kaggle. In the real world, the following activities happen before we ever get to modelling:

  • Identify a business problem

  • Clarify and refine the problem and convert it into an ML problem.

  • Collect relevant datasets

  • Preprocess and clean the dataset

  • Define an evaluation metric relevant to the value proposition

However, when we are learning, it is useful to start with a clean dataset and a predefined evaluation metric, so that we can concentrate on learning modelling skills.

Imports

It is better to keep all imports at the top of the notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plot
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn import metrics, pipeline
from fastai.basics import *
from nlphero.data.external import *
%matplotlib inline
data = untar_data(KAGGLEs.SPOOKY)

Simple EDA

First we should check out the data
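The preview below comes from reading the competition CSVs into dataframes; a minimal sketch of that step, assuming the untarred folder contains the usual train.csv, test.csv and sample_submission.csv files:

# `data` is the Path returned by untar_data above; the file names are assumed
train = pd.read_csv(data/"train.csv")
test = pd.read_csv(data/"test.csv")
sample = pd.read_csv(data/"sample_submission.csv")
train.head()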

id text author
0 id26305 This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall. EAP
1 id17569 It never once occurred to me that the fumbling might be a mere mistake. HPL
2 id11008 In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction. EAP
3 id27763 How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair. MWS
4 id12958 Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk. HPL
[Figure: bar chart of the number of training samples per author]
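The bar chart above shows how many training sentences each author contributes; something along these lines reproduces it (the original plotting cell is not shown in this export):

train['author'].value_counts().plot(kind='bar')  # EAP, HPL and MWS are roughly balanced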
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19579 entries, 0 to 19578
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      19579 non-null  object
 1   text    19579 non-null  object
 2   author  19579 non-null  object
dtypes: object(3)
memory usage: 459.0+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8392 entries, 0 to 8391
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      8392 non-null   object
 1   text    8392 non-null   object
dtypes: object(2)
memory usage: 131.2+ KB
sample.head()
# print("HHH")
id EAP HPL MWS
0 id02310 0.403494 0.287808 0.308698
1 id24541 0.403494 0.287808 0.308698
2 id00134 0.403494 0.287808 0.308698
3 id27757 0.403494 0.287808 0.308698
4 id04081 0.403494 0.287808 0.308698

So we have roughly 20k training rows and about 8.4k test rows (about 43% of the training set size). Classes are more or less equally distributed. For each test sentence we need to predict the probability of each author.

Evaluation Metric

For this competition, Kaggle has defined multi-class log loss as the evaluation metric, described here. What does it mean?

  • For each id, we must predict the probability of each author.

  • The formula for the evaluation metric is defined as

\[\begin{align}
L &= \text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij}\,\log p_{ij}\\
N &= \text{number of samples}\\
M &= \text{number of prediction classes}\\
y_{ij} &= \begin{cases} 1 & \text{if observation } i \text{ belongs to class } j\\ 0 & \text{otherwise} \end{cases}\\
p_{ij} &= \text{predicted probability that observation } i \text{ belongs to class } j, \quad \text{where } \textstyle\sum_{j} p_{ij} = 1\\
\hat{p}_{ij} &= \max\big(\min(p_{ij},\, 1 - 10^{-15}),\, 10^{-15}\big) \quad \text{(clipped probability used in the sum; } \textstyle\sum_{j} \hat{p}_{ij} \text{ need not equal } 1\text{)}
\end{align}\]
def convert_multiclass(actual, n_classes):
    """One-hot encode an array of integer class labels."""
    actual2 = np.zeros((actual.shape[0], n_classes))
    for i, val in enumerate(actual):
        actual2[i, val] = 1
    return actual2

def multiclass_logloss(actual, predicted, eps=10**-15):
    """Multi-class log loss, with predicted probabilities clipped to avoid log(0)."""
    y_ij = convert_multiclass(actual, predicted.shape[1])
    N = len(actual)
    num_pij = np.clip(predicted, eps, 1 - eps)
    L = -(y_ij * np.log(num_pij)).sum() / N
    return L
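As a quick sanity check of the implementation (not in the original kernel): perfect one-hot predictions should give a loss of (almost) zero, while predicting a uniform 1/3 for every class should give \(\ln 3 \approx 1.0986\):

y_true = np.array([0, 1, 2])
perfect = np.eye(3)               # probability 1 on the true class
uniform = np.full((3, 3), 1 / 3)  # equal probability for every class

print(multiclass_logloss(y_true, perfect))  # ~0 (bounded below by the eps clipping)
print(multiclass_logloss(y_true, uniform))  # ~1.0986 == ln(3)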

Train Test Split

We split the training data into train and validation sets, stratifying on the target labels.

x_train, y_train = train[['id','text']], train['author']
lbl_encoder = LabelEncoder()
y_train = lbl_encoder.fit_transform(y_train)
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, stratify=y_train, test_size=0.33, random_state=42)
len(x_train), len(x_valid)
(13117, 6462)
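To keep track of which integer label corresponds to which author (useful when reading the probability matrices below), we can inspect the fitted encoder; LabelEncoder sorts classes alphabetically, so the order is EAP, HPL, MWS:

print(lbl_encoder.classes_)  # array(['EAP', 'HPL', 'MWS'], dtype=object)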

Modelling

TF-IDF with Logistic Regression

Term frequency inverse document frequency

A good reference can be found here

  • \(TF \to \text{frequency of a term in a document}\): indicates whether a word is used many times in a document or only a few.

  • \(IDF \to \text{inverse document frequency}\): a measure of how significant a term is across the whole corpus.

  • \( TF\text{-}IDF(t, d) = W_t^{d} = TF_t^{d}\,\ln{\frac{N}{DF^t}}\)

    • \(N=\text{total number of documents}\)

    • \(DF^t=\text{number of documents containing term } t\)

  • Put simply, the higher the TF-IDF score (weight), the rarer the term, and vice versa; see the small worked example after this list.
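To make the formula concrete, here is a tiny hand-computed example on a made-up three-sentence corpus (a sketch of the raw \(TF_t^{d}\ln(N/DF^t)\) weighting above; scikit-learn's TfidfVectorizer additionally smooths the IDF and L2-normalises each row, so its exact numbers differ):

import math

docs = ["the raven and the cat",
        "the black cat sat",
        "the raven spoke"]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.split().count(term)               # raw term frequency in this document
    df = sum(term in d.split() for d in docs)  # number of documents containing the term
    return tf * math.log(N / df)

print(tf_idf("the", docs[0]))    # 2 * ln(3/3) = 0.0  -> appears everywhere, carries no weight
print(tf_idf("raven", docs[0]))  # 1 * ln(3/2) ≈ 0.41 -> rarer, hence a higher weight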

tfv = TfidfVectorizer(min_df=1,
                      max_features=None,
                      strip_accents="unicode",
                      analyzer='word',
                      token_pattern=r'\w{1,}',
                      ngram_range=(1,3),
                      use_idf=1,
                      smooth_idf=1,
                      sublinear_tf=1,
                      stop_words='english'
                     )
tfv
TfidfVectorizer(ngram_range=(1, 3), smooth_idf=1, stop_words='english',
                strip_accents='unicode', sublinear_tf=1,
                token_pattern='\\w{1,}', use_idf=1)
tfv.fit(np.concatenate([x_train['text'].values,x_valid['text'].values]))
TfidfVectorizer(ngram_range=(1, 3), smooth_idf=1, stop_words='english',
                strip_accents='unicode', sublinear_tf=1,
                token_pattern='\\w{1,}', use_idf=1)
x_train_tfv = tfv.transform(x_train['text'].values)
x_train_tfv
<13117x400219 sparse matrix of type '<class 'numpy.float64'>'
	with 413765 stored elements in Compressed Sparse Row format>
x_valid_tfv = tfv.transform(x_valid['text'].values)
x_valid_tfv
<6462x400219 sparse matrix of type '<class 'numpy.float64'>'
	with 205383 stored elements in Compressed Sparse Row format>
clf = LogisticRegression(C=1.0)
clf.fit(x_train_tfv, y_train)
LogisticRegression()
val_preds = clf.predict_proba(x_valid_tfv)
val_preds
array([[0.41254593, 0.26687746, 0.32057661],
       [0.26133285, 0.13761174, 0.60105541],
       [0.29159974, 0.4086124 , 0.29978786],
       ...,
       [0.60269293, 0.18377996, 0.21352711],
       [0.20320163, 0.16185855, 0.63493981],
       [0.56495872, 0.15974122, 0.27530006]])
print ("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.806 
def multiclass_logloss2(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
print ("logloss: %0.3f " % multiclass_logloss2(y_valid, val_preds))
logloss: 0.806 

CountVectorizer with Logistic Regression

CountVectorizer simply counts the frequency of each word (or n-gram) in every document.
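A tiny illustration (on a made-up two-sentence corpus) of what CountVectorizer produces, namely a document-term matrix of raw counts:

toy = CountVectorizer()
m = toy.fit_transform(["the cat sat", "the cat ran after the cat"])
print(sorted(toy.vocabulary_, key=toy.vocabulary_.get))  # ['after', 'cat', 'ran', 'sat', 'the']
print(m.toarray())                                       # [[0 1 0 1 1]
                                                         #  [1 2 1 0 2]]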

cv = CountVectorizer(analyzer='word',
                     token_pattern=r'\w{1,}',
                     ngram_range=(1,3),
                     stop_words='english'
                    )
cv
CountVectorizer(ngram_range=(1, 3), stop_words='english',
                token_pattern='\\w{1,}')
cv.fit(np.concatenate([x_train['text'].values,x_valid['text'].values]))
CountVectorizer(ngram_range=(1, 3), stop_words='english',
                token_pattern='\\w{1,}')
x_train_cv = cv.transform(x_train['text'].values)
x_train_cv
<13117x400266 sparse matrix of type '<class 'numpy.int64'>'
	with 413776 stored elements in Compressed Sparse Row format>
x_valid_cv = cv.transform(x_valid['text'].values)
x_valid_cv
<6462x400266 sparse matrix of type '<class 'numpy.int64'>'
	with 205395 stored elements in Compressed Sparse Row format>
clf = LogisticRegression(C=1.0)
clf.fit(x_train_cv, y_train)
clf
LogisticRegression()
val_preds = clf.predict_proba(x_valid_cv)
val_preds
array([[0.53572102, 0.11567325, 0.34860574],
       [0.03049094, 0.00427   , 0.96523906],
       [0.33594102, 0.3222241 , 0.34183488],
       ...,
       [0.97113843, 0.01189312, 0.01696845],
       [0.05768433, 0.02785335, 0.91446231],
       [0.8622879 , 0.020955  , 0.1167571 ]])
print ("logloss: %0.3f " % multiclass_logloss2(y_valid, val_preds))
logloss: 0.557 

MultinomialNB

clf = MultinomialNB()
clf.fit(x_train_tfv, y_train)
MultinomialNB()
val_preds = clf.predict_proba(x_valid_tfv)
val_preds
array([[0.43479056, 0.2630958 , 0.30211364],
       [0.33038311, 0.16973709, 0.4998798 ],
       [0.37577295, 0.34463046, 0.27959659],
       ...,
       [0.54395868, 0.18701702, 0.2690243 ],
       [0.27495975, 0.18696675, 0.5380735 ],
       [0.56420466, 0.16682221, 0.26897313]])
print ("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.853 
clf = MultinomialNB()
clf.fit(x_train_cv, y_train)
val_preds = clf.predict_proba(x_valid_cv)
print ("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.467 
def check_score(clf, vectorizer, 
                x_train=x_train, x_valid=x_valid,
                y_train=y_train, y_valid=y_valid):
    """Fit `vectorizer` on the combined train + validation text, train `clf`
    on the vectorized training set and return the multi-class log loss
    on the validation set."""
    vectorizer.fit(np.concatenate([x_train['text'].values,
                                   x_valid['text'].values]))
    x_train_vect = vectorizer.transform(x_train['text'].values)
    x_valid_vect = vectorizer.transform(x_valid['text'].values)
    clf.fit(x_train_vect, y_train)
    val_preds = clf.predict_proba(x_valid_vect)
    mcll = multiclass_logloss(y_valid, val_preds)
    return mcll
clf = MultinomialNB()
vectorizer = CountVectorizer(analyzer='word',
                     token_pattern=r'\w{1,}',
                     ngram_range=(1,3),
                     stop_words='english'
                    )
score = check_score(clf, vectorizer)
print ("logloss: %0.3f " % score)
logloss: 0.467 
score = check_score(clf, tfv)
print ("logloss: %0.3f " % score)
logloss: 0.853 

SVD with SVC

svd = TruncatedSVD(n_components=120)
svd.fit(x_train_cv)
TruncatedSVD(n_components=120)
xtrain_svd = svd.transform(x_train_cv)
xvalid_svd = svd.transform(x_valid_cv)
scl = StandardScaler()
scl.fit(xtrain_svd)
StandardScaler()
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)
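As a side note (not part of the original kernel), it is worth checking how much of the variance the 120 SVD components actually retain:

# fraction of the total variance captured by the 120 components
print("explained variance: %.3f" % svd.explained_variance_ratio_.sum())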
clf = SVC(C=1.0, probability=True) # since we need probabilities
clf.fit(xtrain_svd_scl, y_train)
val_preds = clf.predict_proba(xvalid_svd_scl)
print("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.787 
# .tocsc() returns a copy in Compressed Sparse Column format;
# the original CSR matrix (shown below) is left unchanged.
x_train_tfv.tocsc()
<13117x400219 sparse matrix of type '<class 'numpy.float64'>'
	with 413765 stored elements in Compressed Sparse Column format>
x_train_tfv
<13117x400219 sparse matrix of type '<class 'numpy.float64'>'
	with 413765 stored elements in Compressed Sparse Row format>

XGBoost

CountVectorizer

clf = xgb.XGBClassifier(max_depth=7, 
                        n_estimators=200, 
                        colsample_bytree=0.8, 
                        subsample=0.8, 
                        nthread=10, 
                        learning_rate=0.1)
vectorizer = CountVectorizer(analyzer='word',
                     token_pattern=r'\w{1,}',
                     ngram_range=(1,3),
                     stop_words='english'
                    )
score = check_score(clf, vectorizer)
print ("logloss: %0.3f " % score)
logloss: 0.776 

TfidfVectorizer

clf = xgb.XGBClassifier(max_depth=7, 
                        n_estimators=200, 
                        colsample_bytree=0.8, 
                        subsample=0.8, 
                        nthread=10, 
                        learning_rate=0.1)

vectorizer = TfidfVectorizer(min_df=1,
                      max_features=None,
                      strip_accents="unicode",
                      analyzer='word',
                      token_pattern=r'\w{1,}',
                      ngram_range=(1,3),
                      use_idf=1,
                      smooth_idf=1,
                      sublinear_tf=1,
                      stop_words='english'
                     )
score = check_score(clf, vectorizer)
print ("logloss: %0.3f " % score)
logloss: 0.787 

SVD Transformation

clf = xgb.XGBClassifier(max_depth=7, 
                        n_estimators=200, 
                        colsample_bytree=0.8, 
                        subsample=0.8, 
                        nthread=10, 
                        learning_rate=0.1)
clf.fit(xtrain_svd, y_train)
val_preds = clf.predict_proba(xvalid_svd)
score = multiclass_logloss(y_valid, val_preds)
print ("logloss: %0.3f " % score)
logloss: 0.805
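Finally, the imports at the top also bring in pipeline and GridSearchCV, which are not used in this section; as a teaser for hyper-parameter tuning, here is a minimal sketch of how a TF-IDF plus logistic-regression pipeline could be grid-searched against log loss (the parameter grid values are my own assumptions, not taken from the original kernel):

text_clf = pipeline.Pipeline([
    ("tfidf", TfidfVectorizer(token_pattern=r"\w{1,}", ngram_range=(1, 2),
                              sublinear_tf=True, stop_words="english")),
    ("lr", LogisticRegression(max_iter=1000)),
])

param_grid = {"lr__C": [0.1, 1.0, 10.0]}       # assumed values; tune as needed
grid = GridSearchCV(text_clf, param_grid,
                    scoring="neg_log_loss",    # sklearn's negated multi-class log loss
                    cv=3, n_jobs=-1)
grid.fit(x_train["text"].values, y_train)
print(grid.best_params_, -grid.best_score_)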