Lay of the Land - Spooky Author Identification

Let's start by getting a lay of the land using a code-first approach. The idea is to quickly jump into training NLP models on clean datasets from Kaggle. I am going to follow this wonderful kernel from Abhishek Thakur, a prolific Kaggle GM, for the Spooky Author Identification competition.

Note

Whenever we start a Kaggle competition, it is useful to look at the evaluation metric first. Real-life projects are not like Kaggle. In the real world, the following activities happen before any modelling:

  • Identify a business problem

  • Clarify and refine the problem, and convert it into an ML problem

  • Collect relevant datasets

  • Preprocess and clean the dataset

  • Define an evaluation metric that is relevant to the value proposition

However, when we are learning, it is useful to start with a clean dataset and a predefined evaluation metric, so that we can concentrate on building modelling skills.

Imports

It is better to keep all imports at the top of the notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plot
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn import metrics, pipeline
from fastai.basics import *
from nlphero.data.external import *
%matplotlib inline
data = untar_data(KAGGLEs.SPOOKY)

Simple EDA

First we should check out the data
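A minimal sketch of the data-loading step, assuming the archive unpacks to the standard Kaggle files (train.csv, test.csv, sample_submission.csv) and that untar_data returns a path object:

# hypothetical loading cell; file names follow the standard Kaggle competition layout
train = pd.read_csv(data/'train.csv')
test = pd.read_csv(data/'test.csv')
sample = pd.read_csv(data/'sample_submission.csv')
train.head()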

id text author
0 id26305 This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall. EAP
1 id17569 It never once occurred to me that the fumbling might be a mere mistake. HPL
2 id11008 In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction. EAP
3 id27763 How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair. MWS
4 id12958 Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk. HPL
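The class-distribution plot below was most likely produced by a value-counts bar chart; a guess at the plotting cell:

train['author'].value_counts().plot(kind='bar')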
[Figure: bar plot of the number of training sentences per author (EAP, HPL, MWS); the three classes are roughly balanced]
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19579 entries, 0 to 19578
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      19579 non-null  object
 1   text    19579 non-null  object
 2   author  19579 non-null  object
dtypes: object(3)
memory usage: 459.0+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8392 entries, 0 to 8391
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      8392 non-null   object
 1   text    8392 non-null   object
dtypes: object(2)
memory usage: 131.2+ KB
sample.head()
id EAP HPL MWS
0 id02310 0.403494 0.287808 0.308698
1 id24541 0.403494 0.287808 0.308698
2 id00134 0.403494 0.287808 0.308698
3 id27757 0.403494 0.287808 0.308698
4 id04081 0.403494 0.287808 0.308698

So we have roughly 19.6k training rows and about 8.4k test rows (around 43% of the size of the training data). The classes are more or less equally distributed. For each test row we need to predict the probability of each author.

Evaluation Metric

For this competition, Kaggle has defined multi-class log loss as the evaluation metric, described in the link here. What does it mean?

  • For each id, we must predict the probability of each author

  • The formula for the evaluation metric is

$$
L = \mathrm{Logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij}) \tag{1}
$$

where

  • N = number of samples

  • M = number of prediction classes

  • y_{ij} = 1 if observation i belongs to class j, and 0 otherwise

  • p_{ij} = predicted probability that observation i belongs to class j, with \sum_j p_{ij} = 1

  • submitted probabilities are clipped as p_{ij} = \max(\min(p_{ij}, 1 - 10^{-15}), 10^{-15}), so after clipping the row sums need not be exactly 1
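As a worked instance of the formula: for a single sample whose true author is EAP and whose predicted probabilities are (0.70, 0.20, 0.10) for (EAP, HPL, MWS), only the EAP term is non-zero, so its contribution to the loss is $-\log(0.70) \approx 0.357$; the final score is the average of these contributions over all samples.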
def convert_multiclass(actual, n_classes):
    """One-hot encode a vector of integer class labels into an (N x n_classes) array."""
    actual2 = np.zeros((actual.shape[0], n_classes))
    for i, val in enumerate(actual):
        actual2[i, val] = 1
    return actual2

def multiclass_logloss(actual, predicted, eps=10**-15):
    """Multi-class log loss: `actual` is a vector of class labels,
    `predicted` is an (N x M) matrix of predicted class probabilities."""
    y_ij = convert_multiclass(actual, predicted.shape[1])
    N = len(actual)
    num_pij = np.clip(predicted, eps, 1 - eps)
    L = -(y_ij * np.log(num_pij)).sum() / N
    return L
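As a quick sanity check (not part of the original kernel), the implementation can be compared against scikit-learn's metrics.log_loss on a small made-up example:

toy_actual = np.array([0, 2, 1])
toy_preds = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.2, 0.6],
                      [0.3, 0.4, 0.3]])
print(multiclass_logloss(toy_actual, toy_preds))
print(metrics.log_loss(toy_actual, toy_preds, labels=[0, 1, 2]))  # should agree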

Train Test Split

We split the data into training and validation sets, stratifying on the target so that the author distribution is preserved in both splits.

x_train, y_train = train[['id','text']], train['author']
lbl_encoder = LabelEncoder()
y_train = lbl_encoder.fit_transform(y_train)
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, stratify=y_train, test_size=0.33, random_state=42)
len(x_train), len(x_valid)
(13117, 6462)

Modelling

TF-IDF with Logistic Regression

Term frequency inverse document frequency

A good reference can be found here

  • TF (term frequency): the frequency of a term within a document; it indicates whether a word is used many times or only a few times in that document.

  • IDF (inverse document frequency): a measure of how significant a term is across the whole corpus; terms that occur in many documents get a low IDF.

  • The TF-IDF weight of term t in document d is

    $$\mathrm{TFIDF}(t, d) = W_{t,d} = TF_{t,d} \cdot \ln\frac{N}{DF_t}$$

    • N = total number of documents

    • DF_t = number of documents containing term t

  • Put simply, the higher the TF-IDF weight of a term, the rarer (and more distinctive) it is in the corpus, and vice versa. A toy example follows below.
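To make the weighting concrete, here is a tiny toy example that is not part of the original kernel (the three sentences are made up):

toy_docs = ["the raven tapped", "the raven croaked", "the cat slept"]
toy_tfv = TfidfVectorizer()
toy_weights = toy_tfv.fit_transform(toy_docs).toarray()
print(sorted(toy_tfv.vocabulary_, key=toy_tfv.vocabulary_.get))  # column order
print(toy_weights.round(2))
# "the" occurs in every document, so it gets the lowest idf and hence the lowest weight;
# words that occur in a single document ("tapped", "croaked", ...) get the highest weights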

tfv = TfidfVectorizer(min_df=1,
                      max_features=None,
                      strip_accents="unicode",
                      analyzer='word',
                      token_pattern=r'\w{1,}',
                      ngram_range=(1,3),
                      use_idf=1,
                      smooth_idf=1,
                      sublinear_tf=1,
                      stop_words='english'
                     )
tfv
TfidfVectorizer(ngram_range=(1, 3), smooth_idf=1, stop_words='english',
                strip_accents='unicode', sublinear_tf=1,
                token_pattern='\\w{1,}', use_idf=1)
tfv.fit(np.concatenate([x_train['text'].values,x_valid['text'].values]))
TfidfVectorizer(ngram_range=(1, 3), smooth_idf=1, stop_words='english',
                strip_accents='unicode', sublinear_tf=1,
                token_pattern='\\w{1,}', use_idf=1)
x_train_tfv = tfv.transform(x_train['text'].values)
x_train_tfv
<13117x400219 sparse matrix of type '<class 'numpy.float64'>'
	with 413765 stored elements in Compressed Sparse Row format>
x_valid_tfv = tfv.transform(x_valid['text'].values)
x_valid_tfv
<6462x400219 sparse matrix of type '<class 'numpy.float64'>'
	with 205383 stored elements in Compressed Sparse Row format>
clf = LogisticRegression(C=1.0)
clf.fit(x_train_tfv, y_train)
LogisticRegression()
val_preds = clf.predict_proba(x_valid_tfv)
val_preds
array([[0.41254593, 0.26687746, 0.32057661],
       [0.26133285, 0.13761174, 0.60105541],
       [0.29159974, 0.4086124 , 0.29978786],
       ...,
       [0.60269293, 0.18377996, 0.21352711],
       [0.20320163, 0.16185855, 0.63493981],
       [0.56495872, 0.15974122, 0.27530006]])
print ("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.806 
For comparison, here is an alternative implementation of the same metric; it produces the same value:

def multiclass_logloss2(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
print ("logloss: %0.3f " % multiclass_logloss2(y_valid, val_preds))
logloss: 0.806 

CountVectorizer with Logistic Regression

CountVectorizer simply counts how often each word (or n-gram) occurs in every document; a small illustration follows below.
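A tiny toy example (not part of the original kernel; the sentences are made up):

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(["the raven and the cat", "the raven croaked"])
print(sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get))  # column order
print(toy_counts.toarray())  # one row per document, raw occurrence counts per word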

cv = CountVectorizer(analyzer='word',
                     token_pattern=r'\w{1,}',
                     ngram_range=(1,3),
                     stop_words='english'
                    )
cv
CountVectorizer(ngram_range=(1, 3), stop_words='english',
                token_pattern='\\w{1,}')
cv.fit(np.concatenate([x_train['text'].values,x_valid['text'].values]))
CountVectorizer(ngram_range=(1, 3), stop_words='english',
                token_pattern='\\w{1,}')
x_train_cv = cv.transform(x_train['text'].values)
x_train_cv
<13117x400266 sparse matrix of type '<class 'numpy.int64'>'
	with 413776 stored elements in Compressed Sparse Row format>
x_valid_cv = cv.transform(x_valid['text'].values)
x_valid_cv
<6462x400266 sparse matrix of type '<class 'numpy.int64'>'
	with 205395 stored elements in Compressed Sparse Row format>
clf = LogisticRegression(C=1.0)
clf.fit(x_train_cv, y_train)
clf
LogisticRegression()
val_preds = clf.predict_proba(x_valid_cv)
val_preds
array([[0.53572102, 0.11567325, 0.34860574],
       [0.03049094, 0.00427   , 0.96523906],
       [0.33594102, 0.3222241 , 0.34183488],
       ...,
       [0.97113843, 0.01189312, 0.01696845],
       [0.05768433, 0.02785335, 0.91446231],
       [0.8622879 , 0.020955  , 0.1167571 ]])
print ("logloss: %0.3f " % multiclass_logloss2(y_valid, val_preds))
logloss: 0.557 

MultinomialNB

Multinomial Naive Bayes is a simple and fast baseline that tends to work well with word-count style features. We try it on both the TF-IDF and the count features.

clf = MultinomialNB()
clf.fit(x_train_tfv, y_train)
MultinomialNB()
val_preds = clf.predict_proba(x_valid_tfv)
val_preds
array([[0.43479056, 0.2630958 , 0.30211364],
       [0.33038311, 0.16973709, 0.4998798 ],
       [0.37577295, 0.34463046, 0.27959659],
       ...,
       [0.54395868, 0.18701702, 0.2690243 ],
       [0.27495975, 0.18696675, 0.5380735 ],
       [0.56420466, 0.16682221, 0.26897313]])
print ("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.853 
clf = MultinomialNB()
clf.fit(x_train_cv, y_train)
val_preds = clf.predict_proba(x_valid_cv)
print ("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.467 
def check_score(clf, vectorizer, 
                x_train=x_train, x_valid=x_valid,
                y_train=y_train, y_valid=y_valid):
    """Fit the vectorizer on the combined train + validation text, train the
    classifier on the vectorized training set, and return the multi-class
    log loss of its predicted probabilities on the validation set."""
    vectorizer.fit(np.concatenate([x_train['text'].values,
                                   x_valid['text'].values]))
    x_train_vect = vectorizer.transform(x_train['text'].values)
    x_valid_vect = vectorizer.transform(x_valid['text'].values)
    clf.fit(x_train_vect, y_train)
    val_preds = clf.predict_proba(x_valid_vect)
    mcll = multiclass_logloss(y_valid, val_preds)
    return mcll
clf = MultinomialNB()
vectorizer = CountVectorizer(analyzer='word',
                     token_pattern=r'\w{1,}',
                     ngram_range=(1,3),
                     stop_words='english'
                    )
score = check_score(clf, vectorizer)
print ("logloss: %0.3f " % score)
logloss: 0.467 
score = check_score(clf, tfv)
print ("logloss: %0.3f " % score)
logloss: 0.853 

SVD

The sparse count matrix has roughly 400k columns, which is far too many for an SVC to train on in reasonable time; so we first reduce it to 120 dense components with truncated SVD and standardize the result before fitting the SVC.

svd = TruncatedSVD(n_components=120)
svd.fit(x_train_cv)
TruncatedSVD(n_components=120)
xtrain_svd = svd.transform(x_train_cv)
xvalid_svd = svd.transform(x_valid_cv)
scl = StandardScaler()
scl.fit(xtrain_svd)
StandardScaler()
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)
clf = SVC(C=1.0, probability=True) # since we need probabilities
clf.fit(xtrain_svd_scl, y_train)
val_preds = clf.predict_proba(xvalid_svd_scl)
print("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))
logloss: 0.787 
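The same three steps (SVD, scaling, SVC) can also be chained with scikit-learn's Pipeline, which is already imported above. A sketch that is not part of the original kernel; the score should be similar to the one above (randomized SVD introduces small differences):

svd_svc = pipeline.Pipeline([
    ('svd', TruncatedSVD(n_components=120)),
    ('scl', StandardScaler()),
    ('svc', SVC(C=1.0, probability=True)),  # probability=True because the metric needs probabilities
])
svd_svc.fit(x_train_cv, y_train)
val_preds = svd_svc.predict_proba(x_valid_cv)
print("logloss: %0.3f " % multiclass_logloss(y_valid, val_preds))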
As an aside, a sparse matrix can be converted from compressed sparse row (CSR) to compressed sparse column (CSC) format with .tocsc(); some algorithms work more efficiently with one layout than the other.

x_train_tfv.tocsc()
<13117x400219 sparse matrix of type '<class 'numpy.float64'>'
	with 413765 stored elements in Compressed Sparse Column format>
x_train_tfv
<13117x400219 sparse matrix of type '<class 'numpy.float64'>'
	with 413765 stored elements in Compressed Sparse Row format>

XGBoost

Finally, we try a gradient-boosted tree classifier (XGBoost) on the same three feature sets: raw counts, TF-IDF, and the SVD components.

CountVectorizer

clf = xgb.XGBClassifier(max_depth=7, 
                        n_estimators=200, 
                        colsample_bytree=0.8, 
                        subsample=0.8, 
                        nthread=10, 
                        learning_rate=0.1)
vectorizer = CountVectorizer(analyzer='word',
                     token_pattern=r'\w{1,}',
                     ngram_range=(1,3),
                     stop_words='english'
                    )
score = check_score(clf, vectorizer)
print ("logloss: %0.3f " % score)
logloss: 0.776 

TfidfVectorizer

clf = xgb.XGBClassifier(max_depth=7, 
                        n_estimators=200, 
                        colsample_bytree=0.8, 
                        subsample=0.8, 
                        nthread=10, 
                        learning_rate=0.1)

vectorizer = TfidfVectorizer(min_df=1,
                      max_features=None,
                      strip_accents="unicode",
                      analyzer='word',
                      token_pattern=r'\w{1,}',
                      ngram_range=(1,3),
                      use_idf=1,
                      smooth_idf=1,
                      sublinear_tf=1,
                      stop_words='english'
                     )
score = check_score(clf, vectorizer)
print ("logloss: %0.3f " % score)
logloss: 0.787 

SVD Transformation

clf = xgb.XGBClassifier(max_depth=7, 
                        n_estimators=200, 
                        colsample_bytree=0.8, 
                        subsample=0.8, 
                        nthread=10, 
                        learning_rate=0.1)
clf.fit(xtrain_svd, y_train)
val_preds = clf.predict_proba(xvalid_svd)
score = multiclass_logloss(y_valid, val_preds)
print ("logloss: %0.3f " % score)
logloss: 0.805 