Lecture 1: Introduction to FastAI¶

In this notebook, I run through several fastai examples at a surface level to understand the basic structure of the library. I will do this by

Running the different applications it offers, namely Vision, Segmentation, Text, Tabular and Collaborative filtering.
Running my own datasets from Kaggle, such as Titanic / CIFAR, to reinforce the structure and concepts.
Jotting down pointers and comments from the video and Chapter 1 of the book.

Imports¶

Owing to some design decisions, nlphero currently has to be imported after fastai.basics. I will try to fix this in the future.
Do we need to start Jupyter after activating the nlphero env for the fastai import to work?
Autocompletion has issues when Jupyter is launched from the nlphero env. Try the fix below.

    pip3 install jupyter-tabnine
    jupyter nbextension install --py jupyter_tabnine
    jupyter nbextension enable --py jupyter_tabnine
    jupyter serverextension enable --py jupyter_tabnine

%config IPCompleter.greedy=True

from PIL import Image

Application Examples¶

Vision¶

from fastai.basics import *
from fastai.vision.all import *
from nlphero.data.external import *

Dogs vs Cats¶

path = untar_data(URLs.PETS)/"images"

(path).ls()

(#7393) [Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/wheaten_terrier_112.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/havanese_87.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/Bombay_144.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/British_Shorthair_74.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/Ragdoll_93.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/keeshond_40.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/great_pyrenees_88.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/Maine_Coon_155.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/basset_hound_14.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/shiba_inu_69.jpg')...]

imgs = get_image_files(path); len(imgs)

imgs[0]; imgs[2]

Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/Bombay_144.jpg')

Image.open(imgs[0])

Image.open(imgs[2])

This dataset has

7393 files in the images folder
7390 images
Some file names start with an uppercase letter. These are cats.
Some file names start with a lowercase letter. These are dogs.
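
The gap between 7393 files and 7390 images presumably comes from a few non-image files that get_image_files skips. Below is a minimal sketch of that kind of filter, assuming it keeps only files whose extension maps to an image MIME type. The helper keep_images and the file name notes_1.mat are hypothetical stand-ins, not fastai code or a listing of the real folder.

```python
import mimetypes
from pathlib import PurePosixPath

# Extensions that Python's mimetypes table maps to an image/* type
IMAGE_EXTENSIONS = {ext for ext, kind in mimetypes.types_map.items()
                    if kind.startswith("image/")}

def keep_images(names):
    """Return only the names whose suffix is a known image extension."""
    return [n for n in names
            if PurePosixPath(n).suffix.lower() in IMAGE_EXTENSIONS]

# Hypothetical folder content: two images and one non-image file
files = ["Bombay_144.jpg", "havanese_87.jpg", "notes_1.mat"]
images = keep_images(files)
print(len(files), "files,", len(images), "images")
```

Comparing set(path.ls()) with set(imgs) on the real folder would show which files are actually dropped.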

Now let’s make a model

def is_cat(x): return x[0].isupper()

is_cat(imgs[0].name), is_cat(imgs[2].name)

(False, True)

imgs[2].name

'Bombay_144.jpg'
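
As a quick sanity check of the labelling rule, the sketch below applies the same is_cat function to the ten file names shown in the folder listing above and counts each class. It uses plain strings only, so it runs without fastai or the dataset.

```python
from collections import Counter

def is_cat(name):
    """Cat breeds start with an uppercase letter in this dataset."""
    return name[0].isupper()

# File names copied from the first ten entries of the folder listing
names = [
    "wheaten_terrier_112.jpg", "havanese_87.jpg", "Bombay_144.jpg",
    "British_Shorthair_74.jpg", "Ragdoll_93.jpg", "keeshond_40.jpg",
    "great_pyrenees_88.jpg", "Maine_Coon_155.jpg", "basset_hound_14.jpg",
    "shiba_inu_69.jpg",
]

counts = Counter("cat" if is_cat(n) else "dog" for n in names)
print(counts)
```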

doc(ImageDataLoaders.from_name_func)

!pip uninstall jupyter-tabnine -y

WARNING: Skipping jupyter-tabnine as it is not installed.

# path already points at the images folder, so it is passed as-is
dls = ImageDataLoaders.from_name_func(path, imgs, label_func=is_cat, 
                                      valid_pct=0.2, seed=42, item_tfms=Resize(224))
dls

<fastai.data.core.DataLoaders at 0x7f24fc1e8940>

learn = cnn_learner(dls, resnet34, metrics=error_rate)

learn

<fastai.learner.Learner at 0x7f24fc1e8790>

learn.fine_tune(1)

epoch	train_loss	valid_loss	error_rate	time
0	0.172422	0.016313	0.005413	04:18

epoch	train_loss	valid_loss	error_rate	time
0	0.051411	0.018620	0.006766	06:20

CIFAR10¶

path = untar_data(URLs.CIFAR);path

Path('/Landmark2/pdo/.nlphero/data/cifar10')

(path/"train").ls()

(#10) [Path('/Landmark2/pdo/.nlphero/data/cifar10/train/deer'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/automobile'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/bird'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/horse'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/cat'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/airplane'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/frog'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/truck'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/ship'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/dog')]

(path/"test").ls()

(#10) [Path('/Landmark2/pdo/.nlphero/data/cifar10/test/horse'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/frog'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/automobile'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/truck'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/dog'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/deer'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/airplane'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/bird'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/cat'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/ship')]

doc(ImageDataLoaders.from_folder)

dls = ImageDataLoaders.from_folder(path, train='train', valid_pct=0.2, seed=42, item_tfms=Resize(32));dls

<fastai.data.core.DataLoaders at 0x7fa3549e4130>

learn = cnn_learner(dls, resnet18, metrics=error_rate)

learn.fine_tune(4)

epoch	train_loss	valid_loss	error_rate	time
0	0.901235	0.882115	0.301667	01:38

epoch	train_loss	valid_loss	error_rate	time
0	0.777667	0.767829	0.268333	03:56
1	0.648126	0.686950	0.233250	03:59
2	0.447472	0.660180	0.218583	03:58
3	0.303529	0.693361	0.219833	03:57
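
A rough check on what ended up in the validation set: each printed error_rate should equal a whole number of mistakes divided by the validation set size. The sketch below tests two candidate sizes against the rates above, using the standard CIFAR-10 sizes (50,000 train and 10,000 test images). The helper consistent_with is my own, not a fastai function.

```python
def consistent_with(n_valid, error_rates, decimals=6):
    """True if every rate could be (whole number of errors) / n_valid."""
    # Slack caused by the rate being printed with a fixed number of decimals
    tol = n_valid * 0.5 * 10 ** -decimals + 1e-9
    return all(abs(r * n_valid - round(r * n_valid)) <= tol for r in error_rates)

# error_rate column from the fine_tune(4) run above
rates = [0.301667, 0.268333, 0.233250, 0.218583, 0.219833]

# CIFAR-10 has 50,000 train and 10,000 test images
candidates = {
    "20% of train only": 50_000 // 5,
    "20% of train + test": 60_000 // 5,
}
for label, n in candidates.items():
    print(label, n, consistent_with(n, rates))
```

Only the 12,000 candidate fits, which suggests the 20% split was drawn from the train and test folders together. That would match from_folder collecting every image under path when valid_pct is given, though I have not confirmed this in the source.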

dls2 = ImageDataLoaders.from_folder(path, train='train', valid_pct=0.2, seed=42, item_tfms=Resize(28));dls2

<fastai.data.core.DataLoaders at 0x7fa34f572340>

learn2 = cnn_learner(dls2, resnet18, metrics=error_rate)

learn2.fine_tune(1)

epoch	train_loss	valid_loss	error_rate	time
0	1.772746	1.594504	0.568250	01:11

epoch	train_loss	valid_loss	error_rate	time
0	1.063720	0.954283	0.337167	03:56

Some Questions to figure out¶

How to get an intuition for the sizes of the different resnet architectures?
What is the impact of different Resize values (16, 32, 64, 128, 224, 299, etc.), in conjunction with resnet sizes?
What is the intuition behind the number of epochs to run?
What does fine_tune do internally? How to understand it in the context of fit_one_cycle?
Should we keep the test set completely blind and split train into train & validation? Or use test as the validation set for the CIFAR dataset?
Intuition behind the number 7 in terms of sizes?
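
One possible reading of the question about the number 7: a resnet halves the spatial size five times, so a 224 x 224 input ends up as a 7 x 7 feature map before pooling. The sketch below works this out with the usual convolution size formula, assuming the standard torchvision layout for resnet18/34 (7x7 stride-2 stem, 3x3 stride-2 max-pool, then one stride-2 convolution at the start of stages 2, 3 and 4). The function names are my own.

```python
def conv_out(n, kernel, stride, padding):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) of the five stride-2 layers in a torchvision-style
# resnet18/34: stem conv, max-pool, then the first conv of stages 2, 3 and 4
DOWNSAMPLING = [(7, 2, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

def final_feature_size(n):
    """Side length of the last feature map for an n x n input."""
    for kernel, stride, padding in DOWNSAMPLING:
        n = conv_out(n, kernel, stride, padding)
    return n

for size in (16, 32, 64, 128, 224, 299):
    print(size, "->", final_feature_size(size))
```

Under these assumptions, inputs resized to 32 or 28 pixels leave a single 1 x 1 cell in the last feature map, which is worth keeping in mind when comparing Resize values.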

Tabular Data¶

Adult Salary¶

from fastai.basics import *
from fastai.tabular.all import *
from nlphero.data.external import *

path = untar_data(URLs.ADULT_SAMPLE); path

Path('/Landmark2/pdo/.nlphero/data/adult_sample')

path.ls()

(#3) [Path('/Landmark2/pdo/.nlphero/data/adult_sample/adult.csv'),Path('/Landmark2/pdo/.nlphero/data/adult_sample/models'),Path('/Landmark2/pdo/.nlphero/data/adult_sample/export.pkl')]

df = pd.read_csv(path/"adult.csv"); df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  int64  
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   32074 non-null  float64
 5   marital-status  32561 non-null  object 
 6   occupation      32049 non-null  object 
 7   relationship    32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32561 non-null  int64  
 12  hours-per-week  32561 non-null  int64  
 13  native-country  32561 non-null  object 
 14  salary          32561 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.7+ MB

In this dataset, from the documentation I understand

We have to predict “salary”. This is a categorical variable
Categorical Inputs: workclass, education, marital-status, occupation, relationship, race, sex, native-country
Numerical Inputs: age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week

doc(learn.fine_tune)

To get something closer to fit_one_cycle, we probably need to set freeze_epochs to the number of epochs we want to train without unfreezing the layers.
I need to understand what discriminative LR is.
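
My current reading of fine_tune is: freeze the pretrained body, run fit_one_cycle for freeze_epochs, halve base_lr, unfreeze, then run fit_one_cycle for epochs with a learning-rate range from base_lr/lr_mult to base_lr spread across the layer groups. That would also explain why every fine_tune call prints two result tables. The sketch below only describes the two phases and trains nothing. fine_tune_plan is a hypothetical helper, n_groups=3 is an assumed number of layer groups, and the log-spaced spread is my understanding of how a learning-rate slice is expanded.

```python
def even_mults(start, stop, n):
    """n log-spaced values from start to stop (one per layer group)."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

def fine_tune_plan(epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100, n_groups=3):
    """Describe the two training phases without training anything."""
    frozen = {"phase": "frozen body, head only", "epochs": freeze_epochs,
              "max_lr": base_lr}
    base_lr /= 2  # the second phase uses half the learning rate
    unfrozen = {"phase": "all layers unfrozen", "epochs": epochs,
                # earliest layer group gets the smallest rate, the head the largest
                "group_lrs": even_mults(base_lr / lr_mult, base_lr, n_groups)}
    return [frozen, unfrozen]

# Mirrors a call like learn.fine_tune(4) with default learning rates
plan = fine_tune_plan(epochs=4)
for phase in plan:
    print(phase)
```

Giving earlier layers smaller learning rates than the head is what "discriminative LR" refers to here.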

df.nunique()

age                  73
workclass             9
fnlwgt            21648
education            16
education-num        16
marital-status        7
occupation           15
relationship          6
race                  5
sex                   2
capital-gain        119
capital-loss         92
hours-per-week       94
native-country       42
salary                2
dtype: int64

In the video setup, Jeremy leaves out native-country & sex. Why?
Why not use capital-gain, capital-loss, hours-per-week?

# path is passed by keyword; the second positional argument of from_csv is not the path
dls = TabularDataLoaders.from_csv(path/"adult.csv", path=path, y_names="salary",
                           cat_names=["workclass", "education", "marital-status", 
                                      "occupation", "relationship", "race",
                                     ], # sex and native-country left out
                           cont_names = ['age', 'fnlwgt', 'education-num'],
                           procs = [Categorify, FillMissing, Normalize]
                           ) 
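
To get a feel for the three procs, here is a rough pure-Python sketch of what I understand each one to do, applied to the occupation and education-num values from df.head() above. It assumes Categorify reserves code 0 for missing values, FillMissing fills with the median and records which rows were missing, and Normalize uses the population standard deviation. These helper functions are illustrations, not the fastai implementations.

```python
from statistics import mean, median, pstdev

def categorify(values):
    """Map each category to an integer code; 0 is reserved for missing."""
    vocab = ["#na#"] + sorted({v for v in values if v is not None})
    index = {v: i for i, v in enumerate(vocab)}
    return [index.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace missing numbers by the median and flag the rows that were missing."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values], [v is None for v in values]

def normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Values taken from the five rows shown by df.head()
occupation = [None, "Exec-managerial", None, "Prof-specialty", "Other-service"]
education_num = [12.0, 14.0, None, 15.0, None]

codes, vocab = categorify(occupation)
filled, was_missing = fill_missing(education_num)
scaled = normalize(filled)
print(codes, vocab)
print(filled, was_missing)
print([round(x, 3) for x in scaled])
```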

learn = tabular_learner(dls, metrics=accuracy)

learn.fine_tune(epochs=3, freeze_epochs=5)

epoch	train_loss	valid_loss	accuracy	time
0	0.340121	0.358194	0.836302	00:04
1	0.336939	0.364154	0.830467	00:04
2	0.351640	0.362292	0.833538	00:04
3	0.353321	0.363159	0.833538	00:04
4	0.350493	0.364292	0.833077	00:04

epoch	train_loss	valid_loss	accuracy	time
0	0.348241	0.361380	0.833845	00:04
1	0.333852	0.361079	0.835995	00:04
2	0.328043	0.361146	0.834920	00:04

doc(tabular_learner)

Text Data¶

IMDB Reviews¶

from fastai.basics import *
from fastai.text.all import *
from nlphero.data.external import *

path = untar_data(URLs.IMDB); path

Path('/Landmark2/pdo/.nlphero/data/imdb')

path.ls()

(#7) [Path('/Landmark2/pdo/.nlphero/data/imdb/train'),Path('/Landmark2/pdo/.nlphero/data/imdb/README'),Path('/Landmark2/pdo/.nlphero/data/imdb/imdb.vocab'),Path('/Landmark2/pdo/.nlphero/data/imdb/test'),Path('/Landmark2/pdo/.nlphero/data/imdb/tmp_lm'),Path('/Landmark2/pdo/.nlphero/data/imdb/tmp_clas'),Path('/Landmark2/pdo/.nlphero/data/imdb/unsup')]

!cat /Landmark2/pdo/.nlphero/data/imdb/README

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included and there are an
even number of reviews > 5 and <= 5.

Files

There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention [[id]_[rating].txt] where [id] is a unique id and [rating] is
the star rating for that review on a 1-10 scale. For example, the file
[test/pos/200_8.txt] is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
[train/unsup/] directory has 0 for all ratings because the ratings are
omitted for this portion of the dataset.

We also include the IMDb URLs for each review in a separate
[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will
have its URL on line 200 of this file. Due the ever-changing IMDb, we
are unable to link directly to the review, but only to the movie's
review page.

In addition to the review text files, we include already-tokenized bag
of words (BoW) features that were used in our experiments. These 
are stored in .feat files in the train/test directories. Each .feat
file is in LIBSVM format, an ascii sparse-vector format for labeled
data.  The feature indices in these files start from 0, and the text
tokens corresponding to a feature index is found in [imdb.vocab]. So a
line with 0:7 in a .feat file means the first word in [imdb.vocab]
(the) appears 7 times in that review.

LIBSVM page for details on .feat file format:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

We also include [imdbEr.txt] which contains the expected rating for
each token in [imdb.vocab] as computed by (Potts, 2011). The expected
rating is a good way to get a sense for the average polarity of a word
in the dataset.

Citing the dataset

When using this dataset please cite our ACL 2011 paper which
introduces it. This paper also contains classification results which
you may want to compare against.


@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas
amaas@cs.stanford.edu

(path/"train").ls()

(#4) [Path('/Landmark2/pdo/.nlphero/data/imdb/train/labeledBow.feat'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/unsupBow.feat'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/neg')]

(path/"train/pos").ls()

(#12500) [Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/5701_8.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/4334_10.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/6794_10.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/4109_10.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/1842_9.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/4261_9.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/8862_8.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/3072_8.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/5202_8.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/2317_10.txt')...]

(path/"test").ls()

(#3) [Path('/Landmark2/pdo/.nlphero/data/imdb/test/labeledBow.feat'),Path('/Landmark2/pdo/.nlphero/data/imdb/test/pos'),Path('/Landmark2/pdo/.nlphero/data/imdb/test/neg')]

!head /Landmark2/pdo/.nlphero/data/imdb/train/pos/5701_8.txt

John Holmes is so famous, he's infamous (as the Three Amigos would say). This is a Rashomon-like story about the events surrounding the Wonderland Murders of the early 1980's, in Los Angeles. The story is pieced together from the retelling of a few of the participants. There is story from the friend's perspective, namely David Lind (played by Dylan McDermott). He is a participant in the robbery assault at Eddie Nash's place (Eddie Nash is a infamous drug dealer - and is the suppose to be the same character Alfred Molina played in Boogie Nights) and is heavily into the drug scene. There is John Holmes' perspective (played by Val Kilmer), which makes him out to be a pawn stuck between two kings (with a severe case of cocaine cravings). There is also the patchwork recollections of John's wife (Sharon - played by Lisa Kudrow) and his girlfriend (Dawn - played by Kate Bosworth) that fill in the spaces between the two stories. It is basically the same time frame that we are looking at, just each character's version. The only thing that is missing is the perspective from the dead people. <br /><br />Paul Thomas Anderson's Boogie Nights portrays John Holmes as a slightly heroic character, with a tragic yet comedic karma. He is a caricature of a real person. He was more of less, a mixed up kid that got what he got through his "large" endowment. Director James Cox turns the comedy off and makes this episode in John's life into a nightmare for all of us watching. The details of the real life murders make this movie even more eerie.<br /><br />Val Kilmer took what he learned of Jim Morrison, from the Doors, enhanced the performance for the Salton Sea, and then further enhanced that to bring us the deterioration of John Holmes through cocaine. All of the actors pull off very realistic looking portrayal's of cocaine junkies. Josh Lucas' performance stands out as one of the best in the movie. 
He plays Ron Launius (I think this character is suppose to be the same as the Thomas Jane character from Boogie Nights). Ron was the leader of the gang, loved having John Holmes around as a novelty and had a cocaine craving like sharks enjoy blood. The cocaine use seems so realistic as to make one think. Did they really use Splenda ?? <br /><br />Where Boogie Nights has a bubblegum pop feel to it (lots of color and 70's nostalgia), Wonderland is dark. The action is fast and furious, with a lot of jumps. It is twitchy and grainy. There is no comedy, just a never ending pace, as if the director is trying to put us into the nervous, fast paced, edgy cocaine high to make us feel what the characters are feeling. This is a graphic movie. It has one of the most intensely violent scenes I have ever seen in a movie. It actually shows the murders themselves (through the eyes of John Holmes at first and then from a third person perspective). It is so graphic, it looks like police evidence of a crime. I had to pause after this scene and remind myself this was just a movie. This movie is definitely not recommended for everyone. I recommend it as a good alternative to Boogie Nights, for those interested in the other sides of John Holmes.<br /><br />-Celluloid Rehab

From the dataset

Labels come from the folder name (“pos”, “neg”)
Train and test folders are available
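
The folder and file naming convention described in the README can be unpacked with a few lines of pathlib. This is a small sketch for illustration; parse_review_path is my own helper and not part of fastai.

```python
from pathlib import PurePosixPath

def parse_review_path(p):
    """Split '<split>/<label>/<id>_<rating>.txt' into its four parts."""
    p = PurePosixPath(p)
    split, label = p.parts[-3], p.parts[-2]
    review_id, rating = (int(x) for x in p.stem.split("_"))
    return split, label, review_id, rating

example = "/Landmark2/pdo/.nlphero/data/imdb/train/pos/5701_8.txt"
split, label, review_id, rating = parse_review_path(example)
print(split, label, review_id, rating)

# README: positive reviews score >= 7, negative reviews score <= 4
assert (label == "pos") == (rating >= 7)
```

TextDataLoaders.from_folder only needs the folder names for the labels; the rating in the file name is extra information.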

TextDataLoaders.from_folder??

dls = TextDataLoaders.from_folder(path, valid='test');dls

<fastai.data.core.DataLoaders at 0x7f451e4bb940>

text_classifier_learner?

learn = text_classifier_learner(dls, AWD_LSTM, 
                                drop_mult=0.5, metrics=accuracy)
learn

<fastai.text.learner.TextLearner at 0x7f451e7a29d0>

learn.fine_tune(4, base_lr=1e-2)

epoch	train_loss	valid_loss	accuracy	time
0	0.605775	0.409991	0.812320	37:13

epoch	train_loss	valid_loss	accuracy	time
0	0.309280	0.256412	0.892680	1:16:42
1	0.237081	0.210406	0.916120	1:05:55
2	0.174406	0.195768	0.927280	58:30
3	0.165149	0.194630	0.928560	1:00:53
