Lecture 1: Introduction to FastAI

In this notebook, I run through several fastai examples at a surface level to understand the basic structure of the library. I will do this by:

  1. Running the different applications it offers, namely Vision, Segmentation, Text, Tabular, and Collaborative Filtering.

  2. Running my own datasets from Kaggle, such as Titanic and CIFAR, to reinforce the structure and concepts.

  3. Jotting down pointers and comments from the video and Chapter 1 of the book.

Imports

  • Owing to some design decisions, nlphero currently has to be imported after fastai.basics. I will try to fix this in the future.

  • Do we need to start Jupyter after activating the nlphero env for the fastai import to work?

  • Autocompletion has issues when Jupyter is launched from the nlphero env. Try the fix below:

    pip3 install jupyter-tabnine
    jupyter nbextension install --py jupyter_tabnine
    jupyter nbextension enable --py jupyter_tabnine
    jupyter serverextension enable --py jupyter_tabnine
%config IPCompleter.greedy=True
from PIL import Image

Application Examples

Vision

from fastai.basics import *
from fastai.vision.all import *
from nlphero.data.external import *

Dogs vs Cats

path = untar_data(URLs.PETS)/"images"
(path).ls()
(#7393) [Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/wheaten_terrier_112.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/havanese_87.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/Bombay_144.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/British_Shorthair_74.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/Ragdoll_93.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/keeshond_40.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/great_pyrenees_88.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/Maine_Coon_155.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/basset_hound_14.jpg'),Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/shiba_inu_69.jpg')...]
imgs = get_image_files(path); len(imgs)
7390
imgs[0]; imgs[2]
Path('/Landmark2/pdo/.nlphero/data/oxford-iiit-pet/images/Bombay_144.jpg')
Image.open(imgs[0])
../_images/Lecture1_14_0.png
Image.open(imgs[2])
../_images/Lecture1_15_0.png

This dataset has:

  • 7393 files in the images folder

  • 7390 images

  • Some file names start with a capital letter. These are cats.

  • Some file names start with a lowercase letter. These are dogs.
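
The gap between 7393 folder entries and 7390 images suggests that a few entries in the folder are not images. A minimal sketch to tally entries by extension, using only the standard library; `count_suffixes` is a hypothetical helper of mine, not part of fastai:

```python
from collections import Counter
from pathlib import Path

def count_suffixes(folder):
    """Tally the files directly inside `folder` by lowercase extension."""
    return Counter(p.suffix.lower() for p in Path(folder).iterdir() if p.is_file())

# Usage, assuming `path` is the images folder from the cell above:
# count_suffixes(path).most_common()
```

Any extension other than the image ones in the result would account for the difference.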

Now let’s make a model

def is_cat(x): return x[0].isupper()
is_cat(imgs[0].name), is_cat(imgs[2].name)
(False, True)
imgs[2].name
'Bombay_144.jpg'
doc(ImageDataLoaders.from_name_func)
!pip uninstall jupyter-tabnine -y
WARNING: Skipping jupyter-tabnine as it is not installed.
dls = ImageDataLoaders.from_name_func(path, imgs, label_func=is_cat, 
                                      valid_pct=0.2, seed=42, item_tfms=Resize(224))
dls
<fastai.data.core.DataLoaders at 0x7f24fc1e8940>
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn
<fastai.learner.Learner at 0x7f24fc1e8790>
learn.fine_tune(1)
epoch train_loss valid_loss error_rate time
0 0.172422 0.016313 0.005413 04:18
epoch train_loss valid_loss error_rate time
0 0.051411 0.018620 0.006766 06:20

CIFAR10

path = untar_data(URLs.CIFAR);path
Path('/Landmark2/pdo/.nlphero/data/cifar10')
(path/"train").ls()
(#10) [Path('/Landmark2/pdo/.nlphero/data/cifar10/train/deer'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/automobile'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/bird'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/horse'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/cat'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/airplane'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/frog'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/truck'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/ship'),Path('/Landmark2/pdo/.nlphero/data/cifar10/train/dog')]
(path/"test").ls()
(#10) [Path('/Landmark2/pdo/.nlphero/data/cifar10/test/horse'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/frog'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/automobile'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/truck'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/dog'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/deer'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/airplane'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/bird'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/cat'),Path('/Landmark2/pdo/.nlphero/data/cifar10/test/ship')]
doc(ImageDataLoaders.from_folder)
dls = ImageDataLoaders.from_folder(path, train='train', valid_pct=0.2, seed=42, item_tfms=Resize(32));dls
<fastai.data.core.DataLoaders at 0x7fa3549e4130>
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)
epoch train_loss valid_loss error_rate time
0 0.901235 0.882115 0.301667 01:38
epoch train_loss valid_loss error_rate time
0 0.777667 0.767829 0.268333 03:56
1 0.648126 0.686950 0.233250 03:59
2 0.447472 0.660180 0.218583 03:58
3 0.303529 0.693361 0.219833 03:57
dls2 = ImageDataLoaders.from_folder(path, train='train', valid_pct=0.2, seed=42, item_tfms=Resize(28));dls2
<fastai.data.core.DataLoaders at 0x7fa34f572340>
learn2 = cnn_learner(dls2, resnet18, metrics=error_rate)
learn2.fine_tune(1)
epoch train_loss valid_loss error_rate time
0 1.772746 1.594504 0.568250 01:11
epoch train_loss valid_loss error_rate time
0 1.063720 0.954283 0.337167 03:56
Some Questions to figure out
  • How to get intuition on sizes for different resnet architectures?

  • What is the impact of different Resize values (16, 32, 64, 128, 224, 299, etc.), and how do they interact with the resnet sizes?

  • What is the intuition behind the number of epochs to run?

  • What does fine_tune do internally? How to understand it in the context of fit_one_cycle?

  • Should we keep the test set completely blind and split train into train and validation? Or use test as the validation set for the CIFAR dataset?

  • Intuition behind number 7 in terms of sizes?
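
One possible reading of the "number 7" question (an assumption on my part): a standard ResNet halves the spatial size five times, so a 224-pixel input reaches the last convolutional block as a 7x7 feature map. The sketch below only does that arithmetic. `final_feature_size` is a hypothetical helper, and each halving is modelled as ceil(n / 2), which is what the usual padded stride-2 layers produce.

```python
import math

def final_feature_size(input_size, n_halvings=5):
    """Spatial size after `n_halvings` stride-2 stages, each rounding up."""
    size = input_size
    for _ in range(n_halvings):
        size = math.ceil(size / 2)
    return size

# Sizes from the Resize question above
for s in (16, 32, 64, 128, 224, 299):
    print(s, "->", final_feature_size(s))
```

With Resize(32) or Resize(28), as in the CIFAR runs, this arithmetic gives a 1x1 final map, which is worth keeping in mind when comparing input sizes.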

Tabular Data

Adult Salary

from fastai.basics import *
from fastai.tabular.all import *
from nlphero.data.external import *
path = untar_data(URLs.ADULT_SAMPLE); path
Path('/Landmark2/pdo/.nlphero/data/adult_sample')
path.ls()
(#3) [Path('/Landmark2/pdo/.nlphero/data/adult_sample/adult.csv'),Path('/Landmark2/pdo/.nlphero/data/adult_sample/models'),Path('/Landmark2/pdo/.nlphero/data/adult_sample/export.pkl')]
df = pd.read_csv(path/"adult.csv"); df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  int64  
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   32074 non-null  float64
 5   marital-status  32561 non-null  object 
 6   occupation      32049 non-null  object 
 7   relationship    32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32561 non-null  int64  
 12  hours-per-week  32561 non-null  int64  
 13  native-country  32561 non-null  object 
 14  salary          32561 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.7+ MB

From the documentation, I understand the following about this dataset:

  • We have to predict “salary”. This is a categorical variable

  • Categorical Inputs: workclass, education, marital-status, occupation, relationship, race, sex, native-country

  • Numerical Inputs: age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week
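
The split above can be reproduced from the column dtypes and unique counts alone. The sketch below mirrors, as far as I understand it, the rule used by fastai's cont_cat_split helper: float columns, and integer columns with more than max_card distinct values, count as continuous, and everything else as categorical. `split_columns` is a hypothetical stand-in written in plain Python, so treat it as an approximation rather than the library code.

```python
def split_columns(dtypes, nunique, dep_var, max_card=20):
    """Split column names into (continuous, categorical) lists."""
    cont, cat = [], []
    for name, dtype in dtypes.items():
        if name == dep_var:
            continue  # the target is neither kind of input
        is_float = dtype.startswith("float")
        is_int = dtype.startswith("int")
        if is_float or (is_int and nunique[name] > max_card):
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat

# dtypes copied from df.info(); unique counts as reported by df.nunique() for this frame
dtypes = {"age": "int64", "workclass": "object", "fnlwgt": "int64",
          "education": "object", "education-num": "float64",
          "marital-status": "object", "occupation": "object",
          "relationship": "object", "race": "object", "sex": "object",
          "capital-gain": "int64", "capital-loss": "int64",
          "hours-per-week": "int64", "native-country": "object",
          "salary": "object"}
nunique = {"age": 73, "workclass": 9, "fnlwgt": 21648, "education": 16,
           "education-num": 16, "marital-status": 7, "occupation": 15,
           "relationship": 6, "race": 5, "sex": 2, "capital-gain": 119,
           "capital-loss": 92, "hours-per-week": 94, "native-country": 42,
           "salary": 2}

cont_names, cat_names = split_columns(dtypes, nunique, dep_var="salary")
print(cont_names)
print(cat_names)
```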

doc(learn.fine_tune)
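  • To get something closer to a plain fit_one_cycle run with the pretrained layers kept frozen, we probably need to set freeze_epochs to the number of epochs we want to train for, since those epochs run before fine_tune unfreezes the layers.

  • I need to understand what discriminative learning rates (discriminative LR) are.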
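
As far as I can tell from the fastai source (worth confirming with learn.fine_tune??), fine_tune is two fit_one_cycle calls: one with the pretrained body frozen for freeze_epochs, then one with everything unfrozen for epochs, at half the base learning rate and with a range of learning rates across layer groups. That range is the discriminative LR: earlier layers get a smaller rate than the head. The sketch below only describes the two phases as plain data and does not call fastai. `fine_tune_plan` is a hypothetical helper, and the defaults are the ones I believe fine_tune uses.

```python
def fine_tune_plan(epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100):
    """Describe the two fit_one_cycle phases that fine_tune appears to run."""
    # Phase 1: pretrained body frozen, only the new head trains
    frozen = {"frozen": True, "epochs": freeze_epochs, "lr_max": (base_lr,)}
    # Phase 2: everything unfrozen, base learning rate halved,
    # lowest rate for the earliest layers and highest for the head
    base_lr = base_lr / 2
    unfrozen = {"frozen": False, "epochs": epochs,
                "lr_max": (base_lr / lr_mult, base_lr)}
    return [frozen, unfrozen]

for phase in fine_tune_plan(epochs=3, freeze_epochs=5):
    print(phase)
```

Under this reading, the two result tables printed by fine_tune correspond to the frozen and unfrozen phases, in that order.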

df.nunique()
age                  73
workclass             9
fnlwgt            21648
education            16
education-num        16
marital-status        7
occupation           15
relationship          6
race                  5
sex                   2
capital-gain        119
capital-loss         92
hours-per-week       94
native-country       42
salary                2
dtype: int64
  • In the video setup, Jeremy leaves out native-country and sex. Why?

  • Why not use capital-gain, capital-loss, and hours-per-week?

dls = TabularDataLoaders.from_csv(path/"adult.csv", path, y_names="salary",
                           cat_names=["workclass", "education", "marital-status", 
                                      "occupation", "relationship", "race",
                                     ], #sex, native-country,
                           cont_names = ['age', 'fnlwgt', 'education-num'],
                           procs = [Categorify, FillMissing, Normalize]
                           ) 
learn = tabular_learner(dls, metrics=accuracy)
learn.fine_tune(epochs=3, freeze_epochs=5)
epoch train_loss valid_loss accuracy time
0 0.340121 0.358194 0.836302 00:04
1 0.336939 0.364154 0.830467 00:04
2 0.351640 0.362292 0.833538 00:04
3 0.353321 0.363159 0.833538 00:04
4 0.350493 0.364292 0.833077 00:04
epoch train_loss valid_loss accuracy time
0 0.348241 0.361380 0.833845 00:04
1 0.333852 0.361079 0.835995 00:04
2 0.328043 0.361146 0.834920 00:04
doc(tabular_learner)

Text Data

IMDB Reviews

from fastai.basics import *
from fastai.text.all import *
from nlphero.data.external import *
path = untar_data(URLs.IMDB); path
Path('/Landmark2/pdo/.nlphero/data/imdb')
path.ls()
(#7) [Path('/Landmark2/pdo/.nlphero/data/imdb/train'),Path('/Landmark2/pdo/.nlphero/data/imdb/README'),Path('/Landmark2/pdo/.nlphero/data/imdb/imdb.vocab'),Path('/Landmark2/pdo/.nlphero/data/imdb/test'),Path('/Landmark2/pdo/.nlphero/data/imdb/tmp_lm'),Path('/Landmark2/pdo/.nlphero/data/imdb/tmp_clas'),Path('/Landmark2/pdo/.nlphero/data/imdb/unsup')]
!cat /Landmark2/pdo/.nlphero/data/imdb/README
Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included and there are an
even number of reviews > 5 and <= 5.

Files

There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention [[id]_[rating].txt] where [id] is a unique id and [rating] is
the star rating for that review on a 1-10 scale. For example, the file
[test/pos/200_8.txt] is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
[train/unsup/] directory has 0 for all ratings because the ratings are
omitted for this portion of the dataset.

We also include the IMDb URLs for each review in a separate
[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will
have its URL on line 200 of this file. Due the ever-changing IMDb, we
are unable to link directly to the review, but only to the movie's
review page.

In addition to the review text files, we include already-tokenized bag
of words (BoW) features that were used in our experiments. These 
are stored in .feat files in the train/test directories. Each .feat
file is in LIBSVM format, an ascii sparse-vector format for labeled
data.  The feature indices in these files start from 0, and the text
tokens corresponding to a feature index is found in [imdb.vocab]. So a
line with 0:7 in a .feat file means the first word in [imdb.vocab]
(the) appears 7 times in that review.

LIBSVM page for details on .feat file format:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

We also include [imdbEr.txt] which contains the expected rating for
each token in [imdb.vocab] as computed by (Potts, 2011). The expected
rating is a good way to get a sense for the average polarity of a word
in the dataset.

Citing the dataset

When using this dataset please cite our ACL 2011 paper which
introduces it. This paper also contains classification results which
you may want to compare against.


@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas
amaas@cs.stanford.edu
(path/"train").ls()
(#4) [Path('/Landmark2/pdo/.nlphero/data/imdb/train/labeledBow.feat'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/unsupBow.feat'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/neg')]
(path/"train/pos").ls()
(#12500) [Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/5701_8.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/4334_10.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/6794_10.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/4109_10.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/1842_9.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/4261_9.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/8862_8.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/3072_8.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/5202_8.txt'),Path('/Landmark2/pdo/.nlphero/data/imdb/train/pos/2317_10.txt')...]
(path/"test").ls()
(#3) [Path('/Landmark2/pdo/.nlphero/data/imdb/test/labeledBow.feat'),Path('/Landmark2/pdo/.nlphero/data/imdb/test/pos'),Path('/Landmark2/pdo/.nlphero/data/imdb/test/neg')]
!head /Landmark2/pdo/.nlphero/data/imdb/train/pos/5701_8.txt
John Holmes is so famous, he's infamous (as the Three Amigos would say). This is a Rashomon-like story about the events surrounding the Wonderland Murders of the early 1980's, in Los Angeles. The story is pieced together from the retelling of a few of the participants. There is story from the friend's perspective, namely David Lind (played by Dylan McDermott). He is a participant in the robbery assault at Eddie Nash's place (Eddie Nash is a infamous drug dealer - and is the suppose to be the same character Alfred Molina played in Boogie Nights) and is heavily into the drug scene. There is John Holmes' perspective (played by Val Kilmer), which makes him out to be a pawn stuck between two kings (with a severe case of cocaine cravings). There is also the patchwork recollections of John's wife (Sharon - played by Lisa Kudrow) and his girlfriend (Dawn - played by Kate Bosworth) that fill in the spaces between the two stories. It is basically the same time frame that we are looking at, just each character's version. The only thing that is missing is the perspective from the dead people. <br /><br />Paul Thomas Anderson's Boogie Nights portrays John Holmes as a slightly heroic character, with a tragic yet comedic karma. He is a caricature of a real person. He was more of less, a mixed up kid that got what he got through his "large" endowment. Director James Cox turns the comedy off and makes this episode in John's life into a nightmare for all of us watching. The details of the real life murders make this movie even more eerie.<br /><br />Val Kilmer took what he learned of Jim Morrison, from the Doors, enhanced the performance for the Salton Sea, and then further enhanced that to bring us the deterioration of John Holmes through cocaine. All of the actors pull off very realistic looking portrayal's of cocaine junkies. Josh Lucas' performance stands out as one of the best in the movie. 
He plays Ron Launius (I think this character is suppose to be the same as the Thomas Jane character from Boogie Nights). Ron was the leader of the gang, loved having John Holmes around as a novelty and had a cocaine craving like sharks enjoy blood. The cocaine use seems so realistic as to make one think. Did they really use Splenda ?? <br /><br />Where Boogie Nights has a bubblegum pop feel to it (lots of color and 70's nostalgia), Wonderland is dark. The action is fast and furious, with a lot of jumps. It is twitchy and grainy. There is no comedy, just a never ending pace, as if the director is trying to put us into the nervous, fast paced, edgy cocaine high to make us feel what the characters are feeling. This is a graphic movie. It has one of the most intensely violent scenes I have ever seen in a movie. It actually shows the murders themselves (through the eyes of John Holmes at first and then from a third person perspective). It is so graphic, it looks like police evidence of a crime. I had to pause after this scene and remind myself this was just a movie. This movie is definitely not recommended for everyone. I recommend it as a good alternative to Boogie Nights, for those interested in the other sides of John Holmes.<br /><br />-Celluloid Rehab

From the dataset:

  • Labels come from the folder name (“pos”, “neg”)

  • Train and test folders are available
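
To make the folder convention concrete, here is a small standard-library sketch that reads the split, label, id, and rating off a review path, following the [id]_[rating].txt naming described in the README. `parse_review_path` is a hypothetical helper of mine; fastai derives the label from the parent folder in its own way.

```python
from pathlib import Path

def parse_review_path(p):
    """Return (split, label, review_id, rating) for a path like train/pos/5701_8.txt."""
    p = Path(p)
    review_id, rating = p.stem.split("_")
    # parent folder is the label, grandparent folder is the split
    return p.parent.parent.name, p.parent.name, int(review_id), int(rating)

print(parse_review_path("imdb/train/pos/5701_8.txt"))
# ('train', 'pos', 5701, 8)
```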

TextDataLoaders.from_folder??
dls = TextDataLoaders.from_folder(path, valid='test');dls
<fastai.data.core.DataLoaders at 0x7f451e4bb940>
text_classifier_learner?
learn = text_classifier_learner(dls, AWD_LSTM, 
                                drop_mult=0.5, metrics=accuracy)
learn
<fastai.text.learner.TextLearner at 0x7f451e7a29d0>
learn.fine_tune(4, base_lr=1e-2)
epoch train_loss valid_loss accuracy time
0 0.605775 0.409991 0.812320 37:13
epoch train_loss valid_loss accuracy time
0 0.309280 0.256412 0.892680 1:16:42
1 0.237081 0.210406 0.916120 1:05:55
2 0.174406 0.195768 0.927280 58:30
3 0.165149 0.194630 0.928560 1:00:53