The purpose of this notebook is to show how sentiment classification is done via the classic techniques of Naive Bayes
, Logistic regression
, and Ngrams
. We will be using sklearn
and the fastai
library.
In a future lesson, we will revisit sentiment classification using deep learning
, so that you can compare the two approaches.
The content here was extended from Lesson 10 of the fast.ai Machine Learning course. Linear model is pretty close to the state of the art here. Jeremy surpassed state of the art using a RNN in fall 2017.
We will begin using the fastai library (version 1.0) in this notebook. We will use it more once we move on to neural networks.
The fastai library is built on top of PyTorch and encodes many state-of-the-art best practices. It is used in production at a number of companies. You can read more about it here:
With conda:
conda install -c pytorch -c fastai fastai=1.0
Or with pip:
pip install fastai==1.0
More installation information here.
Beginning in lesson 4, we will be using GPUs, so if you want, you could switch to a cloud option now to setup fastai.
The large movie review dataset contains a collection of 50,000 reviews from IMDB, We will use the version hosted as part fast.ai datasets on AWS Open Datasets.
The dataset contains an even number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets. The training set is the same 25,000 labeled reviews.
The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai import *
from fastai.text import *
from fastai.utils.mem import GPUMemTrace #call with mtrace
import sklearn.feature_extraction.text as sklearn_text
import pickle
fast.ai has a number of datasets hosted via AWS Open Datasets for easy download. We can see them by checking the docs for URLs (remember ??
is a helpful command):
?? URLs
It is always good to start working on a sample of your data before you use the full dataset-- this allows for quicker computations as you debug and get your code working. For IMDB, there is a sample dataset already available:
path = untar_data(URLs.IMDB_SAMPLE)
path
WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb_sample')
is_valid
flag, respectively. is_valid
is a boolean flag indicating whether the row is from the validation set or not.¶df = pd.read_csv(path/'texts.csv')
df.head()
label | text | is_valid | |
---|---|---|---|
0 | negative | Un-bleeping-believable! Meg Ryan doesn't even ... | False |
1 | positive | This is a extremely well-made film. The acting... | False |
2 | negative | Every once in a long while a movie will come a... | False |
3 | positive | Name just says it all. I watched this movie wi... | False |
4 | negative | This movie succeeds at being one of the most u... | False |
%%time
# throws `BrokenProcessPool' Error sometimes. Keep trying `till it works!
count = 0
error = True
while error:
try:
# Preprocessing steps
movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
.split_from_df(col=2)
.label_from_df(cols=0))
error = False
print(f'failure count is {count}\n')
except: # catch *all* exceptions
# accumulate failure count
count = count + 1
print(f'failure count is {count}')
failure count is 1 Wall time: 28.2 s
A good first step for any data problem is to explore the data and get a sense of what it looks like. In this case we are looking at movie reviews, which have been labeled as "positive" or "negative". The reviews have already been tokenized
, i.e. split into tokens
, basic units such as words, prefixes, punctuation, capitalization, and other features of the text.
movie_reviews
LabelLists; Train: LabelList (800 items) x: TextList xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !,xxbos xxmaj this is a extremely well - made film . xxmaj the acting , script and camera - work are all first - rate . xxmaj the music is good , too , though it is mostly early in the film , when things are still relatively xxunk . xxmaj there are no really xxunk in the cast , though several faces will be familiar . xxmaj the entire cast does an excellent job with the script . xxmaj but it is hard to watch , because there is no good end to a situation like the one presented . xxmaj it is now xxunk to blame the xxmaj british for setting xxmaj hindus and xxmaj muslims against each other , and then xxunk xxunk them into two countries . xxmaj there is some merit in this view , but it 's also true that no one forced xxmaj hindus and xxmaj muslims in the region to xxunk each other as they did around the time of partition . xxmaj it seems more likely that the xxmaj british simply saw the xxunk between the xxunk and were clever enough to exploit them to their own ends . xxmaj the result is that there is much cruelty and inhumanity in the situation and this is very unpleasant to remember and to see on the screen . xxmaj but it is never painted as a black - and - white case . xxmaj there is xxunk and xxunk on both sides , and also the hope for change in the younger generation . xxmaj there is redemption of a sort , in the end , when xxmaj xxunk has to make a hard choice between a man who has ruined her life , but also truly loved her , and her family which has xxunk her , then later come looking for her . xxmaj but by that point , she has no xxunk that is without great pain for her . xxmaj this film carries the message that both xxmaj muslims and xxmaj hindus have their grave xxunk , and also that both can be xxunk and caring people . xxmaj the reality of partition makes that xxunk all the more wrenching , since there can never be real xxunk across the xxmaj india / xxmaj pakistan border . xxmaj in that sense , it is similar to " xxmaj mr & xxmaj xxunk xxmaj xxunk " . xxmaj in the end , we were glad to have seen the film , even though the resolution was xxunk . xxmaj if the xxup uk and xxup us could deal with their own xxunk of racism with this kind of xxunk , they would certainly be better off .,xxbos xxmaj every once in a long while a movie will come along that will be so awful that i feel compelled to warn people . xxmaj if i labor all my days and i can save but one soul from watching this movie , how great will be my joy . xxmaj where to begin my discussion of pain . xxmaj for xxunk , there was a musical xxunk every five minutes . xxmaj there was no character development . xxmaj every character was a stereotype . xxmaj we had xxunk guy , fat guy who eats donuts , goofy foreign guy , etc . xxmaj the script felt as if it were being written as the movie was being shot . xxmaj the production value was so incredibly low that it felt like i was watching a junior high video presentation . xxmaj have the directors , producers , etc . ever even seen a movie before ? xxmaj xxunk is getting worse and worse with every new entry . xxmaj the concept for this movie sounded so funny . xxmaj how could you go wrong with xxmaj gary xxmaj coleman and a handful of somewhat legitimate actors . xxmaj but trust me when i say this , things went wrong , xxup very xxup wrong .,xxbos xxmaj name just says it all . i watched this movie with my dad when it came out and having served in xxmaj xxunk he had great admiration for the man . xxmaj the disappointing thing about this film is that it only xxunk on a short period of the man 's life - interestingly enough the man 's entire life would have made such an epic bio - xxunk that it is staggering to imagine the cost for production . xxmaj some posters xxunk to the flawed xxunk about the man , which are cheap shots . xxmaj the theme of the movie " xxmaj duty , xxmaj honor , xxmaj country " are not just mere words xxunk from the lips of a high - xxunk officer - it is the deep xxunk of one man 's total devotion to his country . xxmaj ironically xxmaj xxunk being the liberal that he was xxunk a better understanding of the man . xxmaj he does a great job showing the xxunk general xxunk with the xxunk side of the man .,xxbos xxmaj this movie succeeds at being one of the most unique movies you 've seen . xxmaj however this comes from the fact that you ca n't make heads or xxunk of this mess . xxmaj it almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid . xxmaj if you do n't want to feel xxunk you 'll sit through this horrible film and develop a real sense of pity for the actors involved , they 've all seen better days , but then you realize they actually got paid quite a bit of money to do this and you 'll lose pity for them just like you 've already done for the film . i ca n't go on enough about this horrible movie , its almost something that xxmaj ed xxmaj wood would have made and in that case it surely would have been his masterpiece . xxmaj to start you are forced to sit through an opening dialogue the likes of which you 've never seen / heard , this thing has got to be five minutes long . xxmaj on top of that it is narrated , as to suggest that you the viewer can not read . xxmaj then we meet xxmaj mr. xxmaj xxunk and the xxunk of terrible lines gets xxunk , it is as if he is xxunk solely to get lines on to the movie poster xxunk line . xxmaj soon we meet xxmaj stephen xxmaj xxunk , who i typically enjoy ) and he does his best not to drown in this but ultimately he does . xxmaj then comes the ultimate insult , xxmaj tara xxmaj xxunk playing an intelligent role , oh help us ! xxmaj tara xxmaj xxunk is not a very talented actress and somehow she xxunk gets roles in movies , in my opinion though she should stick to movies of the xxmaj american pie type . xxmaj all in all you just may want to see this for yourself when it comes out on video , i know that i got a kick out of it , i mean lets all be honest here , sometimes its comforting to xxunk in the shortcomings of others . y: CategoryList negative,positive,negative,positive,negative Path: C:\Users\cross-entropy\.fastai\data\imdb_sample; Valid: LabelList (200 items) x: TextList xxbos xxmaj this very funny xxmaj british comedy shows what might happen if a section of xxmaj london , in this case xxmaj xxunk , were to xxunk itself independent from the rest of the xxup uk and its laws , xxunk & post - war xxunk . xxmaj merry xxunk is what would happen . xxmaj the explosion of a wartime bomb leads to the xxunk of ancient xxunk which show that xxmaj xxunk was xxunk to the xxmaj xxunk of xxmaj xxunk xxunk ago , a small historical xxunk long since forgotten . xxmaj to the new xxmaj xxunk , however , this is an unexpected opportunity to live as they please , free from any xxunk from xxmaj xxunk . xxmaj stanley xxmaj xxunk is excellent as the minor city xxunk who suddenly finds himself leading one of the world 's xxunk xxunk . xxmaj xxunk xxmaj margaret xxmaj xxunk is a delight as the history professor who sides with xxmaj xxunk . xxmaj others in the stand - out cast include xxmaj xxunk xxmaj xxunk , xxmaj paul xxmaj xxunk , xxmaj xxunk xxmaj xxunk , xxmaj xxunk xxmaj xxunk & xxmaj sir xxmaj michael xxmaj xxunk . xxmaj welcome to xxmaj xxunk !,xxbos i saw this movie once as a kid on the late - late show and fell in love with it . xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk,xxbos xxmaj this is , in my opinion , a very good film , especially for xxmaj michael xxmaj jackson lovers . xxmaj it contains a message on drugs , stunning special effects , and an awesome music video . xxmaj the main film is xxunk around the song and music video ' xxmaj smooth xxmaj criminal . ' xxmaj unlike the four - minute music video , it is normal speed and , in my opinion , much xxunk to watch . xxmaj the plot is rather weird , however . xxmaj michael xxmaj jackson plays a xxunk ' gangster ' that , when he sees a shooting star , he xxunk into a piece of xxunk . xxmaj throughout the film , he xxunk into a race car , a giant robot , and a space ship . xxmaj the robot scene in particular is a bit drawn out and strange . i found it a little out - of - whack compared to the rest of the film . a child is kidnapped , xxmaj michael tries to save her , is tortured and beaten , and suddenly turns into a giant robot that blows up all the bad guys . a little weird ? xxmaj yeah . xxmaj but besides the bizarre robot scene , it 's a very good movie , and any xxmaj michael xxmaj jackson fan will enjoy both the xxmaj smooth xxmaj criminal music video and the movie .,xxbos xxmaj in xxmaj iran , women are not xxunk to attend men 's sporting events , apparently to " xxunk " them from all the xxunk and foul language they might hear xxunk from the male fans ( so since men ca n't xxunk or xxunk themselves , women are forced to suffer . xxmaj go figure . ) . " xxmaj xxunk " tells the tale of a half dozen or so young women who , dressed like men , attempt to xxunk into the high - xxunk match between xxmaj iran and xxmaj xxunk that , in xxunk , qualified xxmaj iran to go to the xxmaj world xxmaj cup ( the movie was actually filmed in large part during that game ) . " xxmaj xxunk " is a xxunk - of - life comedy that will remind you of all those great xxunk films ( " xxmaj the xxmaj shop on xxmaj main xxmaj street , " " xxmaj loves of a xxmaj blonde , " " xxmaj closely xxmaj watched xxmaj trains " etc . ) that xxunk out of xxmaj communist xxmaj xxunk as part of the " xxmaj xxunk xxmaj xxunk " in the mid xxunk 's . xxmaj as with many of those works , " xxmaj xxunk " is more concerned with xxunk life than with xxunk any kind of xxunk contrived fictional narrative . xxmaj indeed , it is the simplicity of the xxunk and the xxunk of the style that make the movie so effective . xxmaj once their xxunk is discovered , the girls are xxunk into a small xxunk right outside the xxunk where they can hear the xxunk xxunk xxunk from the game inside . xxmaj stuck where they are , all they can do is xxunk with the security guards to let them go in , guards who are basically xxunk , good - xxunk xxunk who are compelled to do their duty as a part of their xxunk military service . xxmaj even most of the men going into the xxunk do n't seem particularly xxunk at the thought of these women being allowed in . xxmaj still the prohibition xxunk . xxmaj yet , how can one not be impressed by the very real courage and xxunk displayed by these women as they go up against a system that continues to xxunk such a xxunk xxunk and xxunk xxunk ? xxmaj and , yet , the purpose of these women is not to xxunk behind a cause or to make a " point . " xxmaj they are simply obsessed fans with a burning desire to watch a soccer game and , like all the men in the country , xxunk on their team . xxmaj it 's hard to tell just how much of the dialogue is scripted and how much of it is xxunk , but , in either case , the actors , with their xxunk xxunk faces , do a magnificent job making each moment seem utterly real and convincing . xxmaj xxunk xxmaj xxunk - xxunk and xxmaj xxunk xxmaj xxunk are notable xxunk in a xxunk excellent cast . xxmaj the structure of the film is also very loose and xxunk , as writer / director xxmaj xxunk xxmaj xxunk and co - writer xxmaj xxunk xxmaj xxunk focus for a few brief moments on one or two of the characters , then move xxunk and xxunk onto others . xxmaj with this documentary - type approach , we come to feel as if we are xxunk an actual event xxunk in " real time . " xxmaj very often , it 's quite easy for us to forget we 're actually watching a movie . xxmaj it was a very smart move on the part of the filmmakers to include so much good - xxunk humor in the film ( it 's what the xxmaj xxunk filmmakers did as well ) , the better to point up the utter absurdity of the situation and xxunk the appeal of the film for audiences both domestic and foreign . " xxmaj xxunk " is obviously a cry for justice , but it is one that is made all the more effective by its xxunk to make of its story a heavy - breathing tragedy . xxmaj instead , it realizes that nothing breaks down social xxunk quite as xxunk as humor and an appeal to the audience 's common humanity . xxmaj and is n't that what true art is supposed to be all about ? xxmaj in its own quiet , xxunk way , " xxmaj xxunk " is one of the great , under - appreciated xxunk of xxunk .,xxbos " xxmaj in xxmaj xxunk xxunk , the xxmaj university of xxmaj xxunk xxunk to xxunk xxmaj xxunk xxmaj national xxmaj xxunk , with an xxunk of xxmaj xxunk xxunk offering to xxunk the research . xxmaj xxunk xxunk became the first " national " xxunk . xxmaj it did not , however , remain at its original location in the xxmaj xxunk forest . xxmaj in xxunk , it moved xxunk west from the " xxmaj xxunk xxmaj city " to a new site on xxmaj xxunk xxunk . xxmaj when xxmaj xxunk xxmaj xxunk visited xxmaj xxunk 's director , xxmaj walter xxmaj xxunk , in xxunk , he asked him what kind of xxunk was to be built at the new site . xxmaj when xxmaj xxunk described a heavy - water xxunk xxunk at one - xxunk the power of the xxmaj xxunk xxmaj xxunk xxmaj xxunk under design at xxmaj xxunk xxmaj xxunk , xxmaj xxunk xxunk it would be xxunk if xxmaj xxunk took the xxmaj xxunk xxmaj xxunk design and xxunk the xxmaj xxunk xxmaj xxunk xxmaj xxunk at one - xxunk capacity . xxmaj the joke proved unintentionally xxunk . " xxmaj the xxup xxunk plant used xxunk to separate the xxunk in thousands of tall xxunk . xxmaj it was built next to the xxup xxunk power plant , which provided the necessary steam . xxmaj much less xxunk than xxup xxunk , the xxup xxunk plant was torn down after the war . xxmaj concerned that the xxmaj xxunk xxmaj energy xxmaj xxunk research program might become too xxunk , xxmaj xxunk xxunk a xxunk of industrial xxunk , and during a xxmaj xxunk visit to xxmaj xxunk xxmaj xxunk , he xxunk with xxmaj clark xxmaj center , manager of xxmaj xxunk & xxmaj xxunk , a xxunk of xxmaj union xxmaj xxunk xxmaj corporation at xxmaj xxunk xxmaj xxunk , the possibility of the company xxunk xxunk of the xxmaj xxunk . xxmaj prince xxmaj henry ( of xxmaj xxunk ) xxmaj xxunk in xxmaj washington and xxmaj visiting the xxmaj german xxmaj xxunk ( xxunk ) . xxmaj xxunk , with xxmaj prince xxmaj henry of xxmaj xxunk according to the xxunk of science and its xxunk their were already concerns with the xxunk of new science with military xxunk . xxmaj the xxmaj xxunk ( xxunk / xxup ii ) , " xxmaj xxunk xxmaj xxunk 's splendid xxunk at the xxunk xxmaj st. xxmaj xxunk , xxmaj new xxmaj york . xxmaj taken at the exact moment of xxmaj prince xxmaj henry 's xxunk , and the raising of the xxunk standard . " xxmaj if xxmaj xxunk knew of these necessary xxunk to xxunk xxunk then what was the xxunk of the xxunk xxup xxunk and xxup wwii . xxmaj the quality of xxunk control i xxunk ? xxmaj thus , did the xxunk of xxmaj xxunk xxmaj xxunk xxunk for a military mission , or a business plan , based on the security xxunk of xxmaj xxunk xxunk ? xxmaj because supposedly their were no survivors , and the ones who were caught in xxmaj europe ordered to be executed . xxmaj of the xxunk man commando team the survivors who were captured were executed under orders of the xxmaj german xxmaj army against xxunk , and xxunk acts of the xxmaj state of xxmaj germany . xxmaj the xxmaj xxunk xxmaj no . xxunk / xxunk xxunk xxmaj xxunk . xxup xxunk / xxunk , xxmaj xxunk xxup xxunk , 18 xxmaj xxunk xxunk , ( xxunk ) xxmaj xxunk xxmaj hitler ; xxmaj translation of xxmaj document no . xxup xxunk , xxmaj office of xxup u.s. xxmaj chief of xxmaj xxunk , xxunk true copy xxmaj xxunk xxmaj major , xxunk xxup xxunk xxunk xxmaj march xxunk , xxunk , xxunk at the xxup u.s. xxmaj national xxmaj xxunk . xxmaj the xxup xxunk xxmaj society xxunk xxunk xxmaj xxunk xxmaj xxunk . , xxunk xxunk , xxup xxunk xxunk y: CategoryList positive,positive,positive,positive,positive Path: C:\Users\cross-entropy\.fastai\data\imdb_sample; Test: None
movie_reviews
object:¶dir(movie_reviews)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'add_test', 'add_test_folder', 'databunch', 'filter_by_func', 'get_processors', 'label_const', 'label_empty', 'label_from_df', 'label_from_folder', 'label_from_func', 'label_from_list', 'label_from_lists', 'label_from_re', 'lists', 'load_empty', 'load_state', 'path', 'process', 'test', 'train', 'transform', 'transform_y', 'valid']
movie_reviews
splits the data into training and validation sets, .train
and .valid
¶print(f'There are {len(movie_reviews.train.x)} and {len(movie_reviews.valid.x)} reviews in the training and validations sets, respectively.')
There are 800 and 200 reviews in the training and validations sets, respectively.
All those tokens starting with "xx" are fastai special tokens. You can see the list of all of them and their meanings (in the fastai docs):
training set
¶print(f'\fThere are {len(movie_reviews.train.x)} movie reviews in the training set\n')
print(movie_reviews.train)
There are 800 movie reviews in the training set LabelList (800 items) x: TextList xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk !,xxbos xxmaj this is a extremely well - made film . xxmaj the acting , script and camera - work are all first - rate . xxmaj the music is good , too , though it is mostly early in the film , when things are still relatively xxunk . xxmaj there are no really xxunk in the cast , though several faces will be familiar . xxmaj the entire cast does an excellent job with the script . xxmaj but it is hard to watch , because there is no good end to a situation like the one presented . xxmaj it is now xxunk to blame the xxmaj british for setting xxmaj hindus and xxmaj muslims against each other , and then xxunk xxunk them into two countries . xxmaj there is some merit in this view , but it 's also true that no one forced xxmaj hindus and xxmaj muslims in the region to xxunk each other as they did around the time of partition . xxmaj it seems more likely that the xxmaj british simply saw the xxunk between the xxunk and were clever enough to exploit them to their own ends . xxmaj the result is that there is much cruelty and inhumanity in the situation and this is very unpleasant to remember and to see on the screen . xxmaj but it is never painted as a black - and - white case . xxmaj there is xxunk and xxunk on both sides , and also the hope for change in the younger generation . xxmaj there is redemption of a sort , in the end , when xxmaj xxunk has to make a hard choice between a man who has ruined her life , but also truly loved her , and her family which has xxunk her , then later come looking for her . xxmaj but by that point , she has no xxunk that is without great pain for her . xxmaj this film carries the message that both xxmaj muslims and xxmaj hindus have their grave xxunk , and also that both can be xxunk and caring people . xxmaj the reality of partition makes that xxunk all the more wrenching , since there can never be real xxunk across the xxmaj india / xxmaj pakistan border . xxmaj in that sense , it is similar to " xxmaj mr & xxmaj xxunk xxmaj xxunk " . xxmaj in the end , we were glad to have seen the film , even though the resolution was xxunk . xxmaj if the xxup uk and xxup us could deal with their own xxunk of racism with this kind of xxunk , they would certainly be better off .,xxbos xxmaj every once in a long while a movie will come along that will be so awful that i feel compelled to warn people . xxmaj if i labor all my days and i can save but one soul from watching this movie , how great will be my joy . xxmaj where to begin my discussion of pain . xxmaj for xxunk , there was a musical xxunk every five minutes . xxmaj there was no character development . xxmaj every character was a stereotype . xxmaj we had xxunk guy , fat guy who eats donuts , goofy foreign guy , etc . xxmaj the script felt as if it were being written as the movie was being shot . xxmaj the production value was so incredibly low that it felt like i was watching a junior high video presentation . xxmaj have the directors , producers , etc . ever even seen a movie before ? xxmaj xxunk is getting worse and worse with every new entry . xxmaj the concept for this movie sounded so funny . xxmaj how could you go wrong with xxmaj gary xxmaj coleman and a handful of somewhat legitimate actors . xxmaj but trust me when i say this , things went wrong , xxup very xxup wrong .,xxbos xxmaj name just says it all . i watched this movie with my dad when it came out and having served in xxmaj xxunk he had great admiration for the man . xxmaj the disappointing thing about this film is that it only xxunk on a short period of the man 's life - interestingly enough the man 's entire life would have made such an epic bio - xxunk that it is staggering to imagine the cost for production . xxmaj some posters xxunk to the flawed xxunk about the man , which are cheap shots . xxmaj the theme of the movie " xxmaj duty , xxmaj honor , xxmaj country " are not just mere words xxunk from the lips of a high - xxunk officer - it is the deep xxunk of one man 's total devotion to his country . xxmaj ironically xxmaj xxunk being the liberal that he was xxunk a better understanding of the man . xxmaj he does a great job showing the xxunk general xxunk with the xxunk side of the man .,xxbos xxmaj this movie succeeds at being one of the most unique movies you 've seen . xxmaj however this comes from the fact that you ca n't make heads or xxunk of this mess . xxmaj it almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid . xxmaj if you do n't want to feel xxunk you 'll sit through this horrible film and develop a real sense of pity for the actors involved , they 've all seen better days , but then you realize they actually got paid quite a bit of money to do this and you 'll lose pity for them just like you 've already done for the film . i ca n't go on enough about this horrible movie , its almost something that xxmaj ed xxmaj wood would have made and in that case it surely would have been his masterpiece . xxmaj to start you are forced to sit through an opening dialogue the likes of which you 've never seen / heard , this thing has got to be five minutes long . xxmaj on top of that it is narrated , as to suggest that you the viewer can not read . xxmaj then we meet xxmaj mr. xxmaj xxunk and the xxunk of terrible lines gets xxunk , it is as if he is xxunk solely to get lines on to the movie poster xxunk line . xxmaj soon we meet xxmaj stephen xxmaj xxunk , who i typically enjoy ) and he does his best not to drown in this but ultimately he does . xxmaj then comes the ultimate insult , xxmaj tara xxmaj xxunk playing an intelligent role , oh help us ! xxmaj tara xxmaj xxunk is not a very talented actress and somehow she xxunk gets roles in movies , in my opinion though she should stick to movies of the xxmaj american pie type . xxmaj all in all you just may want to see this for yourself when it comes out on video , i know that i got a kick out of it , i mean lets all be honest here , sometimes its comforting to xxunk in the shortcomings of others . y: CategoryList negative,positive,negative,positive,negative Path: C:\Users\cross-entropy\.fastai\data\imdb_sample
string
, which contains the tokens separated by spaces. Here is the text of the first review:¶print(movie_reviews.train.x[0].text)
print(f'\nThere are {len(movie_reviews.train.x[0].text)} characters in the review')
xxbos xxmaj un - xxunk - believable ! xxmaj meg xxmaj ryan does n't even look her usual xxunk lovable self in this , which normally makes me forgive her shallow xxunk acting xxunk . xxmaj hard to believe she was the producer on this dog . xxmaj plus xxmaj kevin xxmaj kline : what kind of suicide trip has his career been on ? xxmaj xxunk ... xxmaj xxunk ! ! ! xxmaj finally this was directed by the guy who did xxmaj big xxmaj xxunk ? xxmaj must be a replay of xxmaj jonestown - hollywood style . xxmaj xxunk ! There are 511 characters in the review
print(movie_reviews.train.x[0].text.split())
print(f'\nThe review has {len(movie_reviews.train.x[0].text.split())} tokens')
['xxbos', 'xxmaj', 'un', '-', 'xxunk', '-', 'believable', '!', 'xxmaj', 'meg', 'xxmaj', 'ryan', 'does', "n't", 'even', 'look', 'her', 'usual', 'xxunk', 'lovable', 'self', 'in', 'this', ',', 'which', 'normally', 'makes', 'me', 'forgive', 'her', 'shallow', 'xxunk', 'acting', 'xxunk', '.', 'xxmaj', 'hard', 'to', 'believe', 'she', 'was', 'the', 'producer', 'on', 'this', 'dog', '.', 'xxmaj', 'plus', 'xxmaj', 'kevin', 'xxmaj', 'kline', ':', 'what', 'kind', 'of', 'suicide', 'trip', 'has', 'his', 'career', 'been', 'on', '?', 'xxmaj', 'xxunk', '...', 'xxmaj', 'xxunk', '!', '!', '!', 'xxmaj', 'finally', 'this', 'was', 'directed', 'by', 'the', 'guy', 'who', 'did', 'xxmaj', 'big', 'xxmaj', 'xxunk', '?', 'xxmaj', 'must', 'be', 'a', 'replay', 'of', 'xxmaj', 'jonestown', '-', 'hollywood', 'style', '.', 'xxmaj', 'xxunk', '!'] The review has 103 tokens
numericalized
, ie. mapped to integers. So a movie review is also stored as an array of integers:¶print(movie_reviews.train.x[0].data)
print(f'\nThe array contains {len(movie_reviews.train.x[0].data)} numericalized tokens')
[ 2 5 4622 25 ... 10 5 0 52] The array contains 103 numericalized tokens
movie_revews
object also contains a .vocab
property, even though it is not shown withdir()
. (This may be an error in the fastai
library.)¶movie_reviews.vocab
<fastai.text.transform.Vocab at 0x25a0c5b2048>
vocab
object is a kind of reversible dictionary that translates back and forth between tokens and their integer representations. It has two methods of particular interest: stoi
and itos
, which stand for string-to-index
and index-to-string
¶movie_reviews.vocab.stoi
maps vocabulary tokens to their indexes
in vocab¶movie_reviews.vocab.stoi
defaultdict(int, {'xxunk': 0, 'xxpad': 1, 'xxbos': 2, 'xxeos': 3, 'xxfld': 4, 'xxmaj': 5, 'xxup': 6, 'xxrep': 7, 'xxwrep': 8, 'the': 9, '.': 10, ',': 11, 'and': 12, 'a': 13, 'of': 14, 'to': 15, 'is': 16, 'it': 17, 'in': 18, 'i': 19, 'that': 20, 'this': 21, '"': 22, "'s": 23, '\n \n ': 24, '-': 25, 'was': 26, 'as': 27, 'for': 28, 'movie': 29, 'with': 30, 'but': 31, 'film': 32, 'you': 33, ')': 34, 'on': 35, '(': 36, "n't": 37, 'are': 38, 'he': 39, 'his': 40, 'not': 41, 'have': 42, 'be': 43, 'one': 44, 'they': 45, 'all': 46, 'at': 47, 'by': 48, 'an': 49, 'from': 50, 'like': 51, '!': 52, 'so': 53, 'who': 54, 'there': 55, 'about': 56, 'just': 57, 'out': 58, 'if': 59, 'or': 60, 'do': 61, 'what': 62, 'her': 63, 'has': 64, "'": 65, 'some': 66, 'more': 67, 'good': 68, 'when': 69, 'up': 70, 'very': 71, '?': 72, 'she': 73, 'would': 74, 'no': 75, 'really': 76, 'were': 77, 'their': 78, 'my': 79, 'had': 80, 'time': 81, 'can': 82, 'only': 83, 'which': 84, 'even': 85, 'see': 86, 'story': 87, 'me': 88, 'into': 89, 'did': 90, ':': 91, 'well': 92, 'we': 93, 'will': 94, 'does': 95, 'than': 96, 'also': 97, 'get': 98, '...': 99, 'people': 100, 'other': 101, 'bad': 102, 'been': 103, 'could': 104, 'first': 105, 'much': 106, 'how': 107, 'most': 108, 'any': 109, 'because': 110, 'two': 111, 'then': 112, 'great': 113, 'him': 114, 'its': 115, 'too': 116, 'made': 117, 'them': 118, 'after': 119, 'movies': 120, 'make': 121, '/': 122, 'way': 123, 'think': 124, 'never': 125, 'watch': 126, 'acting': 127, 'seen': 128, ';': 129, 'films': 130, 'plot': 131, 'being': 132, 'many': 133, 'over': 134, 'where': 135, 'character': 136, 'man': 137, 'little': 138, 'better': 139, 'life': 140, 'characters': 141, 'love': 142, 'your': 143, 'here': 144, 'know': 145, 'scenes': 146, 'best': 147, 'end': 148, 'show': 149, 'while': 150, 'through': 151, 'should': 152, 'off': 153, 'ever': 154, 'these': 155, 'go': 156, 'such': 157, 'say': 158, '--': 159, 'something': 160, 'scene': 161, 'still': 162, 'before': 163, 'though': 164, 'watching': 165, 'between': 166, 'actually': 167, 'old': 168, '10': 169, 'find': 170, 'back': 171, 'now': 172, 'why': 173, 'years': 174, "'ve": 175, 'actors': 176, 'fact': 177, 'those': 178, "'m": 179, 'thing': 180, 'pretty': 181, 'quite': 182, 'part': 183, 'going': 184, 'same': 185, 'real': 186, 'another': 187, 'down': 188, 'funny': 189, 'nothing': 190, 'look': 191, 'makes': 192, '*': 193, 'new': 194, 'want': 195, 'action': 196, '&': 197, 'director': 198, 'work': 199, 'few': 200, "'re": 201, 'seems': 202, 'around': 203, 'world': 204, 'point': 205, 'without': 206, 'cast': 207, 'again': 208, 'own': 209, 'both': 210, 'lot': 211, 'enough': 212, 'every': 213, 'family': 214, 'got': 215, 'ca': 216, "'ll": 217, 'probably': 218, 'big': 219, 'bit': 220, 'might': 221, 'things': 222, 'horror': 223, 'us': 224, 'almost': 225, 'may': 226, 'right': 227, 'must': 228, 'away': 229, 'thought': 230, 'interesting': 231, 'least': 232, 'whole': 233, 'series': 234, 'gets': 235, 'each': 236, 'give': 237, 'young': 238, 'however': 239, 'making': 240, 'day': 241, 'fun': 242, 'anything': 243, 'minutes': 244, 'kind': 245, 'come': 246, 'girl': 247, 'saw': 248, 'script': 249, 'take': 250, 'long': 251, 'times': 252, 'someone': 253, 'found': 254, 'done': 255, 'feel': 256, 'far': 257, 'since': 258, 'role': 259, 'original': 260, 'course': 261, 'goes': 262, 'last': 263, 'true': 264, 'simply': 265, 'always': 266, "'d": 267, 'tv': 268, 'hard': 269, 'place': 270, 'set': 271, 'trying': 272, 'believe': 273, 'shot': 274, 'comes': 275, 'actor': 276, 'yet': 277, '4': 278, 'having': 279, 'book': 280, 'looks': 281, 'guy': 282, 'screen': 283, 'later': 284, 'shows': 285, 'performance': 286, 'worth': 287, 'audience': 288, 'comedy': 289, 'sure': 290, 'looking': 291, 'sense': 292, 'star': 293, 'effects': 294, 'read': 295, 'takes': 296, 'although': 297, 'ending': 298, 'john': 299, 'anyone': 300, 'worst': 301, 'american': 302, 'year': 303, 'especially': 304, 'women': 305, 'together': 306, 'dvd': 307, 'instead': 308, 'different': 309, 'am': 310, 'woman': 311, 'men': 312, '2': 313, 'our': 314, 'played': 315, 'music': 316, 'special': 317, 'three': 318, 'rest': 319, 'put': 320, 'maybe': 321, 'wife': 322, 'kids': 323, 'war': 324, 'left': 325, 'black': 326, 'once': 327, 'second': 328, 'watched': 329, 'next': 330, 'friends': 331, 'rather': 332, 'let': 333, '\x96': 334, 'job': 335, 'start': 336, 'others': 337, 'budget': 338, 'need': 339, 'mind': 340, 'said': 341, 'main': 342, 'else': 343, 'wrong': 344, 'beautiful': 345, 'half': 346, 'high': 347, 'idea': 348, 'death': 349, 'tell': 350, 'help': 351, 'nice': 352, 'seem': 353, 'perhaps': 354, 'hollywood': 355, 'everyone': 356, 'play': 357, 'case': 358, 'production': 359, 'piece': 360, 'episode': 361, 'camera': 362, 'low': 363, 'already': 364, 'top': 365, 'poor': 366, 'during': 367, '3': 368, 'stars': 369, 'house': 370, '..': 371, 'couple': 372, 'boring': 373, 'reason': 374, 'try': 375, 'along': 376, 'name': 377, 'small': 378, 'plays': 379, 'father': 380, 'everything': 381, 'used': 382, 'video': 383, 'getting': 384, 'money': 385, 'full': 386, 'less': 387, 'performances': 388, 'often': 389, 'liked': 390, 'came': 391, '1': 392, 'robert': 393, 'either': 394, 'fan': 395, 'given': 396, 'hand': 397, 'kill': 398, 'felt': 399, 'yes': 400, 'completely': 401, 'night': 402, 'children': 403, 'himself': 404, 'girls': 405, 'early': 406, 'awful': 407, 'oh': 408, 'live': 409, 'picture': 410, 'parts': 411, 'throughout': 412, 'until': 413, 'become': 414, 'town': 415, 'written': 416, 'terrible': 417, 'turn': 418, 'child': 419, 'despite': 420, 'moments': 421, 'boy': 422, 'problem': 423, 'able': 424, 'head': 425, 'stupid': 426, 'beginning': 427, 'home': 428, 'version': 429, 'excellent': 430, 'sometimes': 431, 'overall': 432, 'recommend': 433, 'sex': 434, 'keep': 435, 'human': 436, 'drama': 437, 'hero': 438, 'supposed': 439, 'seemed': 440, 'use': 441, 'writing': 442, 'wo': 443, 'remember': 444, 'went': 445, 'enjoy': 446, 'classic': 447, 'person': 448, 'killer': 449, 'lost': 450, 'late': 451, '5': 452, 'title': 453, 'king': 454, 'entire': 455, 'history': 456, 'son': 457, 'school': 458, 'lead': 459, 'english': 460, 'sound': 461, 'cinema': 462, 'seeing': 463, 'unfortunately': 464, 'genre': 465, 'sort': 466, 'mean': 467, 'friend': 468, 'fans': 469, 'close': 470, 'quality': 471, 'definitely': 472, 'james': 473, 'worse': 474, 'says': 475, 'except': 476, 'doing': 477, 'itself': 478, 'past': 479, 'certainly': 480, 'days': 481, 'five': 482, 'dialogue': 483, 'line': 484, 'anyway': 485, 'under': 486, 'tries': 487, 'called': 488, 'fine': 489, 'guys': 490, 'care': 491, 'style': 492, 'hope': 493, 'short': 494, 'lines': 495, 'told': 496, 'car': 497, 'decent': 498, 'brother': 499, 'killed': 500, 'wanted': 501, 'entertaining': 502, 'based': 503, 'absolutely': 504, 'feeling': 505, 'truly': 506, 'etc': 507, 'heard': 508, 'serious': 509, 'run': 510, 'wonderful': 511, 'lives': 512, 'gives': 513, 'moment': 514, 'game': 515, 'documentary': 516, 'self': 517, 'several': 518, 'waste': 519, 'dead': 520, 'blood': 521, 'matter': 522, 'wonder': 523, 'humor': 524, 'thinking': 525, 'against': 526, 'white': 527, 'side': 528, 'works': 529, 'mother': 530, 'flick': 531, 'stuff': 532, 'turns': 533, 'finally': 534, 'loved': 535, 'group': 536, 'wants': 537, 'face': 538, 'guess': 539, 'dark': 540, 'city': 541, 'events': 542, 'starts': 543, 'hour': 544, 'took': 545, 'george': 546, 'themselves': 547, 'red': 548, 'behind': 549, 'talking': 550, 'hit': 551, 'eyes': 552, 'attempt': 553, 'direction': 554, 'novel': 555, 'saying': 556, 'word': 557, 'dull': 558, 'light': 559, 'view': 560, 'playing': 561, 'opinion': 562, 'expect': 563, 'evil': 564, 'ten': 565, 'violence': 566, 'local': 567, 'final': 568, 'gave': 569, 'leave': 570, 'paul': 571, 'crap': 572, 'happens': 573, 'knows': 574, 'problems': 575, 'example': 576, 'relationship': 577, 'non': 578, 'michael': 579, 'victor': 580, 'ridiculous': 581, 'god': 582, 'similar': 583, 'general': 584, 'major': 585, 'bunch': 586, 'sister': 587, 'oscar': 588, 'turned': 589, 'brilliant': 590, 'highly': 591, 'nearly': 592, 'de': 593, 'please': 594, 'romance': 595, 'body': 596, 'extremely': 597, 'mr.': 598, 'soon': 599, 'yourself': 600, 'known': 601, 'lack': 602, 'age': 603, 'interest': 604, 'ago': 605, 'stories': 606, 'exactly': 607, 'finds': 608, 'modern': 609, 'voice': 610, 'perfect': 611, 'heart': 612, 'alone': 613, 'tells': 614, 'daughter': 615, 'directed': 616, 'needs': 617, 'kid': 618, 'lady': 619, 'sad': 620, 'fight': 621, 'happened': 622, 'eye': 623, 'favorite': 624, 'using': 625, 'upon': 626, 'ben': 627, 'none': 628, 'beyond': 629, 'nature': 630, 'change': 631, 'save': 632, 'shots': 633, 'country': 634, 'number': 635, 'shown': 636, 'surprised': 637, 'romantic': 638, 'huge': 639, 'murder': 640, 'steve': 641, 'slow': 642, 'myself': 643, 'woods': 644, 'apparently': 645, 'lake': 646, 'cheap': 647, 'involved': 648, 'roles': 649, '6': 650, 'gore': 651, 'obviously': 652, 'knew': 653, 'level': 654, '8': 655, 'experience': 656, 'became': 657, 'gone': 658, 'cover': 659, 'amazing': 660, 'create': 661, 'living': 662, 'usually': 663, 'order': 664, 'monster': 665, 'happen': 666, 'list': 667, 'clearly': 668, 'power': 669, 'features': 670, 're': 671, 'subject': 672, 'across': 673, 'parents': 674, 'seriously': 675, 'ways': 676, 'room': 677, 'filmed': 678, 'cheesy': 679, 'disappointed': 680, 'important': 681, 'plenty': 682, '7': 683, 'particular': 684, 'started': 685, 'today': 686, 'enjoyed': 687, 'cinematography': 688, 'annoying': 689, 'looked': 690, 'supporting': 691, 'mostly': 692, 'message': 693, 'somewhat': 694, 'viewer': 695, 'type': 696, 'certain': 697, 'release': 698, 'effort': 699, 'possible': 700, 'add': 701, 'figure': 702, 'named': 703, 'wish': 704, 'difficult': 705, 'falls': 706, 'four': 707, 'husband': 708, 'score': 709, 'leads': 710, 'form': 711, 'working': 712, 'writer': 713, 'sets': 714, 'including': 715, 'enjoyable': 716, 'ok': 717, 'note': 718, 'spent': 719, 'review': 720, 'art': 721, 'police': 722, 'sit': 723, 'horrible': 724, 'actress': 725, 'ones': 726, 'bring': 727, 'greatest': 728, 'dance': 729, 'earth': 730, 'becomes': 731, 'happy': 732, 'cut': 733, 'straight': 734, 'soundtrack': 735, 'leading': 736, 'laugh': 737, 'strange': 738, 'space': 739, 'b': 740, 'tale': 741, 'comic': 742, 'near': 743, 'due': 744, 'weak': 745, 'earlier': 746, 'follow': 747, 'british': 748, 'ends': 749, 'typical': 750, 'attention': 751, 'points': 752, 'talent': 753, 'tom': 754, 'female': 755, 'future': 756, 'fall': 757, 'laughs': 758, 'stop': 759, 'easy': 760, 'moving': 761, 'apart': 762, 'chance': 763, 'running': 764, 'york': 765, 'particularly': 766, 'luke': 767, 'bill': 768, 'forced': 769, 'theme': 770, 'easily': 771, 'rating': 772, 'coming': 773, 'davis': 774, 'totally': 775, 'realistic': 776, 'simple': 777, 'hours': 778, 'taken': 779, 'indeed': 780, 'released': 781, 'sexual': 782, 'feels': 783, 'french': 784, 'screenplay': 785, 'la': 786, 'jokes': 787, 'sequences': 788, 'chase': 789, 'portrayed': 790, 'dramatic': 791, 'mention': 792, 'talk': 793, 'gun': 794, 'thriller': 795, 'jimmy': 796, 'career': 797, 'reality': 798, 'incredibly': 799, 'whether': 800, 'towards': 801, 'entertainment': 802, 'feature': 803, 'western': 804, 'dialog': 805, 'business': 806, 'suspense': 807, 'focus': 808, 'doubt': 809, 'possibly': 810, 'water': 811, 'gay': 812, 'blob': 813, 'comments': 814, 'brothers': 815, 'clear': 816, 'agree': 817, 'allen': 818, 'door': 819, 'editing': 820, 'third': 821, 'deserves': 822, 'silly': 823, 'fantastic': 824, 'convincing': 825, 'hardly': 826, 'lame': 827, 'act': 828, 'former': 829, 'material': 830, 'appears': 831, 'understand': 832, 'twist': 833, 'episodes': 834, 'buy': 835, 'secret': 836, 'richard': 837, 'south': 838, 'bourne': 839, 'deal': 840, 'musical': 841, 'words': 842, 'unique': 843, 'mess': 844, 'opening': 845, 'society': 846, 'avoid': 847, 'footage': 848, 'joe': 849, 'free': 850, 'forget': 851, 'herself': 852, 'appear': 853, 'obvious': 854, 'box': 855, 'single': 856, 'average': 857, 'indian': 858, 'rent': 859, 'okay': 860, 'scary': 861, 'within': 862, 'office': 863, 'crime': 864, 'science': 865, '80': 866, 'believable': 867, 'period': 868, 'showing': 869, 'call': 870, 'return': 871, 'keeps': 872, 'lee': 873, 'expected': 874, 'stay': 875, 'middle': 876, 'jack': 877, 'hands': 878, 'david': 879, 'attempts': 880, 'strong': 881, 'tension': 882, 'crew': 883, 'hilarious': 884, 'grade': 885, 'outside': 886, 'means': 887, 'viewing': 888, 'sadly': 889, 'hell': 890, 'whatever': 891, 'sorry': 892, 'recently': 893, 'stage': 894, 'decides': 895, 'hear': 896, 'team': 897, 'learn': 898, 'nor': 899, 'open': 900, 'break': 901, 'question': 902, 'remake': 903, 'porn': 904, 'pain': 905, 'imagine': 906, 'deep': 907, 'zombie': 908, 'basically': 909, 'killing': 910, 'company': 911, 'poorly': 912, 'dr.': 913, 'predictable': 914, 'taking': 915, 'large': 916, 'language': 917, 'giving': 918, 'public': 919, 'audiences': 920, 'ask': 921, 'cool': 922, 'america': 923, 'slasher': 924, 'west': 925, 'mentioned': 926, 'die': 927, 'christmas': 928, 'complete': 929, 'needed': 930, 'martin': 931, 'makers': 932, 'cgi': 933, 'boys': 934, 'vargas': 935, 'usual': 936, 'begin': 937, 'dad': 938, 'total': 939, 'somehow': 940, 'stick': 941, 'shame': 942, 'successful': 943, 'sitting': 944, 'fred': 945, 'meets': 946, 'unless': 947, 'dancing': 948, 'sounds': 949, 'above': 950, 'elements': 951, 'whose': 952, 'german': 953, 'considering': 954, 'caught': 955, 'credit': 956, 'interested': 957, 'move': 958, 'filming': 959, 'truth': 960, 'eventually': 961, 'share': 962, 'ability': 963, 'meaning': 964, 'agent': 965, 'fast': 966, 'stand': 967, 'onto': 968, 'plain': 969, 'comment': 970, 'kept': 971, 'situation': 972, 'setting': 973, 'value': 974, 'willing': 975, 'realize': 976, 'acted': 977, 'weird': 978, 'alive': 979, 'fairly': 980, 'dream': 981, 'building': 982, 'hair': 983, 'bored': 984, 'minute': 985, 'emotional': 986, 'directing': 987, 'theatrical': 988, 'famous': 989, 'begins': 990, 'front': 991, 'catch': 992, 'sequence': 993, 'runs': 994, 'follows': 995, 'song': 996, 'government': 997, 'miss': 998, 'actual': 999, ...})
movie_reviews.vocab.itos
maps the indexes
of vocabulary tokens to strings
¶movie_reviews.vocab.itos
['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep', 'the', '.', ',', 'and', 'a', 'of', 'to', 'is', 'it', 'in', 'i', 'that', 'this', '"', "'s", '\n \n ', '-', 'was', 'as', 'for', 'movie', 'with', 'but', 'film', 'you', ')', 'on', '(', "n't", 'are', 'he', 'his', 'not', 'have', 'be', 'one', 'they', 'all', 'at', 'by', 'an', 'from', 'like', '!', 'so', 'who', 'there', 'about', 'just', 'out', 'if', 'or', 'do', 'what', 'her', 'has', "'", 'some', 'more', 'good', 'when', 'up', 'very', '?', 'she', 'would', 'no', 'really', 'were', 'their', 'my', 'had', 'time', 'can', 'only', 'which', 'even', 'see', 'story', 'me', 'into', 'did', ':', 'well', 'we', 'will', 'does', 'than', 'also', 'get', '...', 'people', 'other', 'bad', 'been', 'could', 'first', 'much', 'how', 'most', 'any', 'because', 'two', 'then', 'great', 'him', 'its', 'too', 'made', 'them', 'after', 'movies', 'make', '/', 'way', 'think', 'never', 'watch', 'acting', 'seen', ';', 'films', 'plot', 'being', 'many', 'over', 'where', 'character', 'man', 'little', 'better', 'life', 'characters', 'love', 'your', 'here', 'know', 'scenes', 'best', 'end', 'show', 'while', 'through', 'should', 'off', 'ever', 'these', 'go', 'such', 'say', '--', 'something', 'scene', 'still', 'before', 'though', 'watching', 'between', 'actually', 'old', '10', 'find', 'back', 'now', 'why', 'years', "'ve", 'actors', 'fact', 'those', "'m", 'thing', 'pretty', 'quite', 'part', 'going', 'same', 'real', 'another', 'down', 'funny', 'nothing', 'look', 'makes', '*', 'new', 'want', 'action', '&', 'director', 'work', 'few', "'re", 'seems', 'around', 'world', 'point', 'without', 'cast', 'again', 'own', 'both', 'lot', 'enough', 'every', 'family', 'got', 'ca', "'ll", 'probably', 'big', 'bit', 'might', 'things', 'horror', 'us', 'almost', 'may', 'right', 'must', 'away', 'thought', 'interesting', 'least', 'whole', 'series', 'gets', 'each', 'give', 'young', 'however', 'making', 'day', 'fun', 'anything', 'minutes', 'kind', 'come', 'girl', 'saw', 'script', 'take', 'long', 'times', 'someone', 'found', 'done', 'feel', 'far', 'since', 'role', 'original', 'course', 'goes', 'last', 'true', 'simply', 'always', "'d", 'tv', 'hard', 'place', 'set', 'trying', 'believe', 'shot', 'comes', 'actor', 'yet', '4', 'having', 'book', 'looks', 'guy', 'screen', 'later', 'shows', 'performance', 'worth', 'audience', 'comedy', 'sure', 'looking', 'sense', 'star', 'effects', 'read', 'takes', 'although', 'ending', 'john', 'anyone', 'worst', 'american', 'year', 'especially', 'women', 'together', 'dvd', 'instead', 'different', 'am', 'woman', 'men', '2', 'our', 'played', 'music', 'special', 'three', 'rest', 'put', 'maybe', 'wife', 'kids', 'war', 'left', 'black', 'once', 'second', 'watched', 'next', 'friends', 'rather', 'let', '\x96', 'job', 'start', 'others', 'budget', 'need', 'mind', 'said', 'main', 'else', 'wrong', 'beautiful', 'half', 'high', 'idea', 'death', 'tell', 'help', 'nice', 'seem', 'perhaps', 'hollywood', 'everyone', 'play', 'case', 'production', 'piece', 'episode', 'camera', 'low', 'already', 'top', 'poor', 'during', '3', 'stars', 'house', '..', 'couple', 'boring', 'reason', 'try', 'along', 'name', 'small', 'plays', 'father', 'everything', 'used', 'video', 'getting', 'money', 'full', 'less', 'performances', 'often', 'liked', 'came', '1', 'robert', 'either', 'fan', 'given', 'hand', 'kill', 'felt', 'yes', 'completely', 'night', 'children', 'himself', 'girls', 'early', 'awful', 'oh', 'live', 'picture', 'parts', 'throughout', 'until', 'become', 'town', 'written', 'terrible', 'turn', 'child', 'despite', 'moments', 'boy', 'problem', 'able', 'head', 'stupid', 'beginning', 'home', 'version', 'excellent', 'sometimes', 'overall', 'recommend', 'sex', 'keep', 'human', 'drama', 'hero', 'supposed', 'seemed', 'use', 'writing', 'wo', 'remember', 'went', 'enjoy', 'classic', 'person', 'killer', 'lost', 'late', '5', 'title', 'king', 'entire', 'history', 'son', 'school', 'lead', 'english', 'sound', 'cinema', 'seeing', 'unfortunately', 'genre', 'sort', 'mean', 'friend', 'fans', 'close', 'quality', 'definitely', 'james', 'worse', 'says', 'except', 'doing', 'itself', 'past', 'certainly', 'days', 'five', 'dialogue', 'line', 'anyway', 'under', 'tries', 'called', 'fine', 'guys', 'care', 'style', 'hope', 'short', 'lines', 'told', 'car', 'decent', 'brother', 'killed', 'wanted', 'entertaining', 'based', 'absolutely', 'feeling', 'truly', 'etc', 'heard', 'serious', 'run', 'wonderful', 'lives', 'gives', 'moment', 'game', 'documentary', 'self', 'several', 'waste', 'dead', 'blood', 'matter', 'wonder', 'humor', 'thinking', 'against', 'white', 'side', 'works', 'mother', 'flick', 'stuff', 'turns', 'finally', 'loved', 'group', 'wants', 'face', 'guess', 'dark', 'city', 'events', 'starts', 'hour', 'took', 'george', 'themselves', 'red', 'behind', 'talking', 'hit', 'eyes', 'attempt', 'direction', 'novel', 'saying', 'word', 'dull', 'light', 'view', 'playing', 'opinion', 'expect', 'evil', 'ten', 'violence', 'local', 'final', 'gave', 'leave', 'paul', 'crap', 'happens', 'knows', 'problems', 'example', 'relationship', 'non', 'michael', 'victor', 'ridiculous', 'god', 'similar', 'general', 'major', 'bunch', 'sister', 'oscar', 'turned', 'brilliant', 'highly', 'nearly', 'de', 'please', 'romance', 'body', 'extremely', 'mr.', 'soon', 'yourself', 'known', 'lack', 'age', 'interest', 'ago', 'stories', 'exactly', 'finds', 'modern', 'voice', 'perfect', 'heart', 'alone', 'tells', 'daughter', 'directed', 'needs', 'kid', 'lady', 'sad', 'fight', 'happened', 'eye', 'favorite', 'using', 'upon', 'ben', 'none', 'beyond', 'nature', 'change', 'save', 'shots', 'country', 'number', 'shown', 'surprised', 'romantic', 'huge', 'murder', 'steve', 'slow', 'myself', 'woods', 'apparently', 'lake', 'cheap', 'involved', 'roles', '6', 'gore', 'obviously', 'knew', 'level', '8', 'experience', 'became', 'gone', 'cover', 'amazing', 'create', 'living', 'usually', 'order', 'monster', 'happen', 'list', 'clearly', 'power', 'features', 're', 'subject', 'across', 'parents', 'seriously', 'ways', 'room', 'filmed', 'cheesy', 'disappointed', 'important', 'plenty', '7', 'particular', 'started', 'today', 'enjoyed', 'cinematography', 'annoying', 'looked', 'supporting', 'mostly', 'message', 'somewhat', 'viewer', 'type', 'certain', 'release', 'effort', 'possible', 'add', 'figure', 'named', 'wish', 'difficult', 'falls', 'four', 'husband', 'score', 'leads', 'form', 'working', 'writer', 'sets', 'including', 'enjoyable', 'ok', 'note', 'spent', 'review', 'art', 'police', 'sit', 'horrible', 'actress', 'ones', 'bring', 'greatest', 'dance', 'earth', 'becomes', 'happy', 'cut', 'straight', 'soundtrack', 'leading', 'laugh', 'strange', 'space', 'b', 'tale', 'comic', 'near', 'due', 'weak', 'earlier', 'follow', 'british', 'ends', 'typical', 'attention', 'points', 'talent', 'tom', 'female', 'future', 'fall', 'laughs', 'stop', 'easy', 'moving', 'apart', 'chance', 'running', 'york', 'particularly', 'luke', 'bill', 'forced', 'theme', 'easily', 'rating', 'coming', 'davis', 'totally', 'realistic', 'simple', 'hours', 'taken', 'indeed', 'released', 'sexual', 'feels', 'french', 'screenplay', 'la', 'jokes', 'sequences', 'chase', 'portrayed', 'dramatic', 'mention', 'talk', 'gun', 'thriller', 'jimmy', 'career', 'reality', 'incredibly', 'whether', 'towards', 'entertainment', 'feature', 'western', 'dialog', 'business', 'suspense', 'focus', 'doubt', 'possibly', 'water', 'gay', 'blob', 'comments', 'brothers', 'clear', 'agree', 'allen', 'door', 'editing', 'third', 'deserves', 'silly', 'fantastic', 'convincing', 'hardly', 'lame', 'act', 'former', 'material', 'appears', 'understand', 'twist', 'episodes', 'buy', 'secret', 'richard', 'south', 'bourne', 'deal', 'musical', 'words', 'unique', 'mess', 'opening', 'society', 'avoid', 'footage', 'joe', 'free', 'forget', 'herself', 'appear', 'obvious', 'box', 'single', 'average', 'indian', 'rent', 'okay', 'scary', 'within', 'office', 'crime', 'science', '80', 'believable', 'period', 'showing', 'call', 'return', 'keeps', 'lee', 'expected', 'stay', 'middle', 'jack', 'hands', 'david', 'attempts', 'strong', 'tension', 'crew', 'hilarious', 'grade', 'outside', 'means', 'viewing', 'sadly', 'hell', 'whatever', 'sorry', 'recently', 'stage', 'decides', 'hear', 'team', 'learn', 'nor', 'open', 'break', 'question', 'remake', 'porn', 'pain', 'imagine', 'deep', 'zombie', 'basically', 'killing', 'company', 'poorly', 'dr.', 'predictable', 'taking', 'large', 'language', 'giving', 'public', 'audiences', 'ask', 'cool', 'america', 'slasher', 'west', 'mentioned', 'die', 'christmas', 'complete', 'needed', 'martin', 'makers', 'cgi', 'boys', 'vargas', 'usual', 'begin', 'dad', 'total', 'somehow', 'stick', 'shame', 'successful', 'sitting', 'fred', 'meets', 'unless', 'dancing', 'sounds', 'above', 'elements', 'whose', 'german', 'considering', 'caught', 'credit', 'interested', 'move', 'filming', 'truth', 'eventually', 'share', 'ability', 'meaning', 'agent', 'fast', 'stand', 'onto', 'plain', 'comment', 'kept', 'situation', 'setting', 'value', 'willing', 'realize', 'acted', 'weird', 'alive', 'fairly', 'dream', 'building', 'hair', 'bored', 'minute', 'emotional', 'directing', 'theatrical', 'famous', 'begins', 'front', 'catch', 'sequence', 'runs', 'follows', 'song', 'government', 'miss', 'actual', ...]
See Hint below
print('itos ', 'length ',len(movie_reviews.vocab.itos),type(movie_reviews.vocab.itos) )
print('stoi ', 'length ',len(movie_reviews.vocab.stoi),type(movie_reviews.vocab.stoi) )
itos length 6016 <class 'list'> stoi length 19160 <class 'collections.defaultdict'>
stoi
is an instance of the class defaultdict
¶defaultdict
, rare words that appear fewer than three times in the corpus, and words that are not in the dictionary, are mapped to a default value
, in this case, zero¶rare_words = ['acrid','a_random_made_up_nonexistant_word','acrimonious','allosteric','anodyne','antikythera']
for word in rare_words:
print(movie_reviews.vocab.stoi[word])
0 0 0 0 0 0
token
corresponding to the default
value?¶print(movie_reviews.vocab.itos[0])
xxunk
stoi
(string-to-int) is larger than itos
(int-to-string).¶print(f'len(stoi) = {len(movie_reviews.vocab.stoi)}')
print(f'len(itos) = {len(movie_reviews.vocab.itos)}')
print(f'len(stoi) - len(itos) = {len(movie_reviews.vocab.stoi) - len(movie_reviews.vocab.itos)}')
len(stoi) = 19165 len(itos) = 6016 len(stoi) - len(itos) = 13149
unknown
. We can confirm here:¶unk = []
for word, num in movie_reviews.vocab.stoi.items():
if num==0:
unk.append(word)
len(unk)
13155
Hint: remember the list of rare words we used to query stoi
a few cells back?
unknown
¶unk[:25]
['xxunk', 'bleeping', 'pert', 'ticky', 'schtick', 'whoosh', 'banzai', 'chill', 'wooofff', 'cheery', 'superstars', 'fashionable', 'cruelly', 'separating', 'mistreat', 'tensions', 'religions', 'baseness', 'nobility', 'puro', 'disowned', 'option', 'faults', 'dignified', 'realisation']
print(f'There are {len(movie_reviews.vocab.itos)} unique tokens in the IMDb review sample vocabulary')
print(f'The numericalized token values run from {min(movie_reviews.vocab.stoi.values())} to {max(movie_reviews.vocab.stoi.values())} ')
There are 6016 unique tokens in the IMDb review sample vocabulary The numericalized token values run from 0 to 6015
embedding vector
whose indices correspond to the numericalized tokens, and whose values are the number of times the corresponding token appeared in the review. To do this efficiently we need to learn a bit about Counters
.¶A Counter is a useful Python object. A Counter applied to a list returns an ordered dictionary whose keys are the unique elements in the list, and whose values are the counts of the unique elements. Counters are from the collections module (along with OrderedDict, defaultdict, deque, and namedtuple). Here is how Counters work:
TokenCounter = lambda review_index : Counter((movie_reviews.train.x)[review_index].data)
TokenCounter(0).items()
dict_items([(2, 1), (5, 15), (4622, 1), (25, 3), (0, 8), (867, 1), (52, 5), (3776, 1), (1800, 1), (95, 1), (37, 1), (85, 1), (191, 1), (63, 2), (936, 1), (2740, 1), (517, 1), (18, 1), (21, 3), (11, 1), (84, 1), (2418, 1), (192, 1), (88, 1), (3777, 1), (1801, 1), (127, 1), (10, 3), (269, 1), (15, 1), (273, 1), (73, 1), (26, 2), (9, 2), (1360, 1), (35, 2), (1213, 1), (1144, 1), (1145, 1), (2419, 1), (91, 1), (62, 1), (245, 1), (14, 2), (1361, 1), (1447, 1), (64, 1), (40, 1), (797, 1), (103, 1), (72, 2), (99, 1), (534, 1), (616, 1), (48, 1), (282, 1), (54, 1), (90, 1), (219, 1), (228, 1), (43, 1), (13, 1), (3778, 1), (3779, 1), (355, 1), (492, 1)])
keys
are the numericalized tokens
that apper in the review¶TokenCounter(0).keys()
dict_keys([2, 5, 4622, 25, 0, 867, 52, 3776, 1800, 95, 37, 85, 191, 63, 936, 2740, 517, 18, 21, 11, 84, 2418, 192, 88, 3777, 1801, 127, 10, 269, 15, 273, 73, 26, 9, 1360, 35, 1213, 1144, 1145, 2419, 91, 62, 245, 14, 1361, 1447, 64, 40, 797, 103, 72, 99, 534, 616, 48, 282, 54, 90, 219, 228, 43, 13, 3778, 3779, 355, 492])
values
are the token multiplicities
, i.e the number of times each token
appears in the review¶TokenCounter(0).values()
dict_values([1, 15, 1, 3, 8, 1, 5, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
embedding vectors
¶n_terms = len(movie_reviews.vocab.itos)
n_docs = len(movie_reviews.train.x)
make_token_counter = lambda review_index: Counter(movie_reviews.train.x[review_index].data)
def count_vectorizer(review_index,n_terms = n_terms,make_token_counter = make_token_counter):
# input: review index, n_terms, and tokenizer function
# output: embedding vector for the review
embedding_vector = np.zeros(n_terms)
keys = list(make_token_counter(review_index).keys())
values = list(make_token_counter(review_index).values())
embedding_vector[keys] = values
return embedding_vector
# make the embedding vector for the first review
embedding_vector = count_vectorizer(0)
embedding vector
for the first review in the training data set¶print(f'The review is embedded in a {len(embedding_vector)} dimensional vector')
embedding_vector
The review is embedded in a 6016 dimensional vector
array([8., 0., 1., 0., ..., 0., 0., 0., 0.])
which words
were used in a review, and how often each word got used
. This is known as the bag of words
approach, and it suggests a really simple way to store a document (in this case, a movie review).¶vector
whose length
is the number of tokens in the vocabulary, which we will call n
. The indexes
of this vector
correspond to the tokens
in the IMDb vocabulary
, and thevalues
of the vector are the number of times the corresponding tokens appeared in the review. For example the values stored at indexes 0, 1, 2, 3, 4 of the vector record the number of times the 5 tokens ['xxunk','xxpad','xxbos','xxeos','xxfld'] appeared in the review, respectively.¶m
reviews, and each review is represented by a vector
of length n
, then vertically stacking the row vectors for all the reviews creates a matrix representation of the IMDb, which we call its document-term matrix
. The rows
correspond to documents
(reviews), while the columns
correspond to terms
(or tokens in the vocabulary).¶In the previous lesson, we used sklearn's CountVectorizer to generate the vectors
that represent individual reviews. Today we will create our own (similar) version. This is for two reasons:
# Define a function to build the full document-term matrix
print(f'there are {n_docs} reviews, and {n_terms} unique tokens in the vocabulary')
def make_full_doc_term_matrix(count_vectorizer,n_terms=n_terms,n_docs=n_docs):
# loop through the movie reviews
for doc_index in range(n_docs):
# make the embedding vector for the current review
embedding_vector = count_vectorizer(doc_index,n_terms)
# append the embedding vector to the document-term matrix
if(doc_index == 0):
A = embedding_vector
else:
A = np.vstack((A,embedding_vector))
# return the document-term matrix
return A
# Build the full document term matrix for the movie_reviews training set
A = make_full_doc_term_matrix(count_vectorizer)
there are 800 reviews, and 6016 unique tokens in the vocabulary
sparsity
of the document-term matrix¶sparsity
of a matrix is defined as the fraction of of zero-valued elements¶NNZ = np.count_nonzero(A)
sparsity = (A.size-NNZ)/A.size
print(f'Only {NNZ} of the {A.size} elements in the document-term matrix are nonzero')
print(f'The sparsity of the document-term matrix is {sparsity}')
Only 112413 of the 4812800 elements in the document-term matrix are nonzero The sparsity of the document-term matrix is 0.9766429105718085
spy
method, we can visualize the structure of the document-term matrix
¶spy
plots the array, indicating each non-zero value with a dot.
fig = plt.figure()
plt.spy(A, markersize=0.10, aspect = 'auto')
fig.set_size_inches(8,6)
fig.savefig('doc_term_matrix.png', dpi=800)
sparse
ie. has a high proportion of zeros!left
edge. This makes sense because the tokens are ordered by usage frequency, with frequency increasing toward the left
.density ripples
. If anyone has an explanation, please let me know!scipy
provides tools for efficient sparse matrix representatin and operations.¶sparse
(the opposite of sparse is dense
). For sparse matrices, you can save a lot of memory by only storing the non-zero values.¶There are the most common sparse storage formats:
Let's start out with a presecription for the CSR format (ref. https://en.wikipedia.org/wiki/Sparse_matrix)
Given a full matrix A
that has m
rows, n
columns, and N
nonzero values, the CSR (Compressed Sparse Row) representation uses three arrays as follows:
Val[0:N]
contains the values of the N
non-zero elements.
Col[0:N]
contains the column indices of the N
non-zero elements.
For each row i
of A
, RowPointer[i]
contains the index in Val of the the first nonzero value in row i
. If there are no nonzero values in the ith row, then RowPointer[i] = None
. And, by convention, an extra value RowPointer[m] = N
is tacked on at the end.
Question: How many floats and ints does it take to store the matrix A
in CSR format?
Let's walk through a few examples at the Emory University website
i.e. given the TextList
object containing the list of reviews, return the three arrays (values, column_indices, row_pointer)
From the Scipy Sparse Matrix Documentation
This is done by implementing the definition of CSR format
, given above.
# construct the document-term matrix in CSR format
# i.e. return (values, column_indices, row_pointer)
def get_doc_term_matrix(text_list, n_terms):
# inputs:
# text_list, a TextList object
# n_terms, the number of tokens in our IMDb vocabulary
# output:
# the CSR format sparse representation of the document-term matrix in the form of a
# scipy.sparse.csr.csr_matrix object
# initialize arrays
values = []
column_indices = []
row_pointer = []
row_pointer.append(0)
# from the TextList object
for _, doc in enumerate(text_list):
feature_counter = Counter(doc.data)
column_indices.extend(feature_counter.keys())
values.extend(feature_counter.values())
# Tack on N (number of nonzero elements in the matrix) to the end of the row_pointer array
row_pointer.append(len(values))
return scipy.sparse.csr_matrix((values, column_indices, row_pointer),
shape=(len(row_pointer) - 1, n_terms),
dtype=int)
%%time
train_doc_term = get_doc_term_matrix(movie_reviews.train.x, len(movie_reviews.vocab.itos))
Wall time: 129 ms
type(train_doc_term)
scipy.sparse.csr.csr_matrix
train_doc_term.shape
(800, 6016)
%%time
valid_doc_term = get_doc_term_matrix(movie_reviews.valid.x, len(movie_reviews.vocab.itos))
Wall time: 32.9 ms
type(valid_doc_term)
scipy.sparse.csr.csr_matrix
valid_doc_term.shape
(200, 6016)
First create $\text{m}\times \text{n}$ matrix with all zeros. We will recover $\text{A}$ by overwriting the entries in the zeros matrix row by row with the non-zero entries in $\text{A}$ as follows:
def CSR_to_full(values, column_indices, row_ptr, m,n):
A = zeros(m,n)
for row in range(n):
if row_ptr is not null:
A[row,column_indices[row_ptr[row]:row_ptr[row+1]]] = values[row_ptr[row]:row_ptr[row+1]]
return A
.todense()
method converts a sparse matrix back to a regular (dense) matrix.¶valid_doc_term
<200x6016 sparse matrix of type '<class 'numpy.int32'>' with 27848 stored elements in Compressed Sparse Row format>
valid_doc_term.todense()[:10,:10]
matrix([[32, 0, 1, 0, ..., 1, 0, 0, 10], [ 9, 0, 1, 0, ..., 1, 0, 0, 7], [ 6, 0, 1, 0, ..., 0, 0, 0, 12], [78, 0, 1, 0, ..., 0, 0, 0, 44], ..., [ 8, 0, 1, 0, ..., 0, 0, 0, 8], [43, 0, 1, 0, ..., 8, 1, 0, 25], [ 7, 0, 1, 0, ..., 1, 0, 0, 9], [19, 0, 1, 0, ..., 2, 0, 0, 5]])
review = movie_reviews.valid.x[1]
review
Text xxbos i saw this movie once as a kid on the late - late show and fell in love with it . xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk
Exercise 1: How many times does the word "it" appear in this review? Confirm that the correct values is stored in the document-term matrix, for the row corresponding to this review and the column corresponding to the word "it".
# try it!
# Your code here.
Exercise 2: Confirm that the review has 144 tokens, 81 of which are distinct
valid_doc_term[1]
<1x6016 sparse matrix of type '<class 'numpy.int32'>' with 81 stored elements in Compressed Sparse Row format>
valid_doc_term[1].sum()
144
len(set(review.data))
81
Exercise 3: How could you convert review.data back to text (without just using review.text)?
review.data
array([ 2, 19, 248, 21, ..., 9, 0, 10, 0], dtype=int64)
word_list = [movie_reviews.vocab.itos[a] for a in review.data]
print(word_list)
['xxbos', 'i', 'saw', 'this', 'movie', 'once', 'as', 'a', 'kid', 'on', 'the', 'late', '-', 'late', 'show', 'and', 'fell', 'in', 'love', 'with', 'it', '.', '\n \n ', 'xxmaj', 'it', 'took', '30', '+', 'years', ',', 'but', 'i', 'recently', 'did', 'find', 'it', 'on', 'xxup', 'dvd', '-', 'it', 'was', "n't", 'cheap', ',', 'either', '-', 'in', 'a', 'xxunk', 'that', 'xxunk', 'in', 'war', 'movies', '.', 'xxmaj', 'we', 'watched', 'it', 'last', 'night', 'for', 'the', 'first', 'time', '.', 'xxmaj', 'the', 'audio', 'was', 'good', ',', 'however', 'it', 'was', 'grainy', 'and', 'had', 'the', 'trailers', 'between', 'xxunk', '.', 'xxmaj', 'even', 'so', ',', 'it', 'was', 'better', 'than', 'i', 'remembered', 'it', '.', 'i', 'was', 'also', 'impressed', 'at', 'how', 'true', 'it', 'was', 'to', 'the', 'play', '.', '\n \n ', 'xxmaj', 'the', 'xxunk', 'is', 'around', 'here', 'xxunk', '.', 'xxmaj', 'if', 'you', "'re", 'xxunk', 'in', 'finding', 'it', ',', 'fire', 'me', 'a', 'xxunk', 'and', 'i', "'ll", 'see', 'if', 'i', 'can', 'get', 'you', 'the', 'xxunk', '.', 'xxunk']
reconstructed_text = ' '.join(word_list)
print(reconstructed_text)
xxbos i saw this movie once as a kid on the late - late show and fell in love with it . xxmaj it took 30 + years , but i recently did find it on xxup dvd - it was n't cheap , either - in a xxunk that xxunk in war movies . xxmaj we watched it last night for the first time . xxmaj the audio was good , however it was grainy and had the trailers between xxunk . xxmaj even so , it was better than i remembered it . i was also impressed at how true it was to the play . xxmaj the xxunk is around here xxunk . xxmaj if you 're xxunk in finding it , fire me a xxunk and i 'll see if i can get you the xxunk . xxunk
bag of words model
considers a movie review as equivalent to a list of the counts of all the tokens that it contains. When you do this, you throw away the rich information that comes from the sequential arrangement of the tokens into sentences and paragraphs.¶token counts
, you can usually still get a pretty good sense of whether the review was good or bad. How do you do this? By mentally gauging the overall positive
or negative
sentiment that the collection of words conveys, right?¶Naive Bayes Classifier
is an algorithm that encodes this simple reasoning process mathematically. It is based on two important pieces of information that we can learn from the training set:¶class priors
, i.e. the probabilities that a randomly chosen review will be positive
, or negative
token likelihoods
i.e. how likely is it that a given token would appear in a positive
or negative
reviewprior probabilities
for reviews of each class, and the class occurrence counts
and class likelihood ratios
for each token
in the vocabulary
.¶class priors
$p$ and $q$, which are the overall probabilities that a randomly chosen review is in the positive
, or negative
class, resepectively.¶positive
and negative
reviews, and $N$ is the total number of reviews in the training set, so that¶occurrence counts
¶occurrence counts
of token $t$ in positive
and negative
reviews, respectively, and $N^{+}$ and $N^{-}$ be the total numbers ofpositive
and negative
reviews in the data set, respectively.¶occurrence counts
¶dir(movie_reviews)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__slotnames__', '__str__', '__subclasshook__', '__weakref__', 'add_test', 'add_test_folder', 'databunch', 'filter_by_func', 'get_processors', 'label_const', 'label_empty', 'label_from_df', 'label_from_folder', 'label_from_func', 'label_from_list', 'label_from_lists', 'label_from_re', 'lists', 'load_empty', 'load_state', 'path', 'process', 'test', 'train', 'transform', 'transform_y', 'valid']
movie_reviews.y.c
2
movie_reviews.y.classes
['negative', 'positive']
positive = movie_reviews.y.c2i['positive']
negative = movie_reviews.y.c2i['negative']
print(f'Integer representations: positive: {positive}, negative: {negative}')
Integer representations: positive: 1, negative: 0
x = train_doc_term
y = movie_reviews.train.y
valid_y = movie_reviews.valid.y
v = movie_reviews.vocab
x.shape
(800, 260402)
count arrays
C1
and C0
list the total occurrence counts
of the tokens in positive
and negative
reviews, respectively.¶C1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
C0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))
For each vocabulary token, we are summing up how many positive reviews it is in, and how many negative reviews it is in. Here are the occurrence counts for the first 10 tokens in the vocabulary.
print(C1[:10])
print(C0[:10])
[ 6468 0 383 0 0 10267 674 57 0 5260] [ 7153 0 417 0 0 10741 908 53 1 6150]
C0
and C1
to do some more data exploration!¶Exercise 4: Compare how often the word "loved" appears in positive reviews vs. negative reviews. Do the same for the word "hate"
# Exercise: How often does the word "love" appear in neg vs. pos reviews?
ind = v.stoi['love']
pos_counts = C1[ind]
neg_counts = C0[ind]
print(f'The word "love" appears {pos_counts} and {neg_counts} times in positive and negative documents, respectively')
The word "love" appears 133 and 75 times in positive and negative documents, respectively
# Exercise: How often does the word "hate" appear in neg vs. pos reviews?
ind = v.stoi['hate']
pos_counts = C1[ind]
neg_counts = C0[ind]
print(f'The word "hate" appears {pos_counts} and {neg_counts} times in positive and negative documents, respectively')
The word "hate" appears 5 and 13 times in positive and negative documents, respectively
index = v.stoi['hated']
a = np.argwhere((x[:,index] > 0))[:,0]
print(a)
b = np.argwhere(y.items==positive)[:,0]
print(b)
c = list(set(a).intersection(set(b)))[0]
review = movie_reviews.train.x[c]
review.text
[ 15 49 304 351 393 612 695 773] [ 1 3 10 11 ... 787 789 790 797]
'xxbos xxmaj there are numerous films relating to xxup xxunk , but xxmaj mother xxmaj night is quite xxunk among them : xxmaj in this film , we are introduced to xxmaj howard xxmaj campbell ( xxmaj nolte ) , an xxmaj american living in xxmaj berlin and married to a xxmaj german , xxmaj xxunk xxmaj xxunk ( xxmaj lee ) , who decides to accept the role of a spy : xxmaj more specifically , a xxup cia agent xxmaj major xxmaj xxunk ( xxmaj goodman ) recruits xxmaj campbell who becomes a xxmaj nazi xxunk in order to enter the highest xxunk of the xxmaj hitler xxunk . xxmaj however , the deal is that the xxup us xxmaj government will never xxunk xxmaj campbell \'s role in the war for national security reasons , and so xxmaj campbell becomes a hated figure across the xxup us . xxmaj after the war , he tries to xxunk his identity , but the past comes back and xxunk him . xxmaj his only " friend " is xxmaj xxunk , but even he can not do much for the xxunk of events that fall upon poor xxmaj campbell ... \n \n xxmaj the story is deeply touching , as we watch the tragedy of xxmaj campbell who although a great patriot , is treated by xxunk by everybody who xxunk him . xxmaj not only that , but he also gradually realizes that even the persons who are most close to him , have many xxunk of their own . xxmaj vonnegut provides us with a moving atmosphere , with xxmaj campbell \'s despair building up and almost choking the viewer . \n \n xxmaj nolte plays the role of his life , in my opinion ; he is even better than in " xxmaj xxunk " , although in both roles he plays tragic figures who are destined to self - destruction . xxmaj xxunk xxmaj lee is also excellent , and the same can be said for the whole cast in general . \n \n i have n\'t read the book , so i can not xxunk how the film compares to it . xxmaj in any case , this is something of no importance here : xxmaj my xxunk is upon the film per xxunk , and the film xxunk deserves a 9 / 10 .'
index = v.stoi['loved']
a = np.argwhere((x[:,index] > 0))[:,0]
print(a)
b = np.argwhere(y.items==negative)[:,0]
print(b)
c = list(set(a).intersection(set(b)))[0]
review = movie_reviews.train.x[c]
review.text
[ 1 15 29 69 75 79 174 185 200 205 262 296 303 333 350 351 398 407 440 489 496 528 538 600 602 605 627 642 657 660 700 712 729 735 755 767 785] [ 0 2 4 5 ... 795 796 798 799]
'xxbos xxmaj oh if only i could give this rubbish less than one star ! xxmaj there were two mildly amusing parts in the whole film and that is it ! one was where a line or two from the song xxmaj do n\'t xxmaj worry , xxmaj be xxmaj happy was xxunk by the slugs and the other was where xxmaj roddy fell of the toilet roll and landed with his feet and legs apart so that everything else he landed on on the way down hit him in the xxunk . xxmaj that is it there was nothing more amusing than that , at least not for me anyway ! xxmaj xxunk is not right in saying \' xxmaj fans of the completely terrible " xxmaj shrek " might enjoy , but " xxmaj wallace & xxmaj xxunk " fans will probably turn away in xxunk . \' xxmaj as i loved xxmaj shrek 1 2 and 3 and i also love xxmaj wallace and xxmaj xxunk . xxmaj you see what it xxunk down to is that if an animation is done extremely well then it is definitely worth watching , this however was about as far from done well as you can possibly get ! xxmaj the continuity mistakes were too big in number . xxmaj some were pointed out by the makers of this site others were not . i wo n\'t point out all of the others , but here are a few more to see : xxmaj when the young daughter leaves at the start of the film the catch to the cage door comes down and the hook part of it that is on the right clearly goes back around behind the round xxunk thus effectively making sure xxmaj roddy would not be able to get out and yet he does just by simply kicking at it . xxmaj at one point the ruby falls down xxmaj roddy \'s back and gets pushed straight up into the the air by xxmaj xxunk all the while the ship is moving forwards . xxmaj in the next scene xxmaj roddy has caught it again . xxmaj this is impossible . xxmaj seeing as how the ship is moving forwards the only place when the ruby was xxunk out from under the back of xxmaj roddy \'s shirt the only place it could have landed was in the water not in xxmaj roddy \'s hand . xxmaj there was a third one i wanted to point out but for now i have forgotten it . \n \n xxmaj too many , for want of a better word , \' jokes \' were repeated in one way or another , there was not enough time to establish any sort of connection with any of the characters , the characters were xxunk , shallow and empty , and the whole film left you wanting xxrep 4 . wanting to watch xxunk minutes of anything else ! xxmaj paint xxunk or grass growing are two superb xxunk !'
log-count ratio
¶log-count ratio
ranks tokens by their relative affinities for positive and negative reviews¶positive
reviews are more likely to contain this tokennegative
reviews are more likely to contain this tokenpositive
and negative
reviewsoccurrence count
arrays, we can compute the class likelihoods
and log-count ratios
of all the tokens in the vocabulary.¶class likelihoods
¶conditional likelihoods
, by adding 1 to the numerator and denominator to insure numerically stability.¶L1 = (C1+1) / ((y.items==positive).sum() + 1)
L0 = (C0+1) / ((y.items==negative).sum() + 1)
log-count ratios
¶R = np.log(L1/L0)
print(R)
[-0.015811 0.084839 0. 0.084839 ... 0.084839 0.084839 0.084839 0.084839]
n_tokens = 10
highest_R = np.argpartition(R, -n_tokens)[-n_tokens:]
lowest_R = np.argpartition(R, n_tokens)[:n_tokens]
print(f'Highest {n_tokens} log-count ratios: {R[list(highest_R)]}\n')
print(f'Lowest {n_tokens} log-count ratios: {R[list(lowest_R)]}')
Highest 10 log-count ratios: [2.569746 2.649788 2.649788 2.723896 2.723896 2.649788 2.792889 2.857428 2.975211 3.029278] Lowest 10 log-count ratios: [-2.68775 -2.554218 -2.8596 -3.134037 -2.623211 -3.093215 -2.805533 -2.748374 -2.636457 -2.554218]
highest_R
array([1723, 1662, 1620, 796, 1529, 1666, 1386, 1358, 1212, 1143], dtype=int64)
[v.itos[k] for k in highest_R]
['sport', 'davies', 'jabba', 'jimmy', 'felix', 'gilliam', 'noir', 'astaire', 'fanfan', 'biko']
token = 'biko'
train_doc_term[:,v.stoi[token]]
<800x1 sparse matrix of type '<class 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
index = np.argmax(train_doc_term[:,v.stoi[token]])
n_times = train_doc_term[index,v.stoi[token]]
print(f'review # {index} has {n_times} occurrences of "{token}"\n')
print(movie_reviews.train.x[index].text)
review # 515 has 14 occurrences of "biko" xxbos " xxmaj the xxmaj true xxmaj story xxmaj of xxmaj the xxmaj friendship xxmaj that xxmaj shook xxmaj south xxmaj africa xxmaj and xxmaj xxunk xxmaj the xxmaj world . " xxmaj richard xxmaj attenborough , who directed " a xxmaj bridge xxmaj too xxmaj far " and " xxmaj gandhi " , wanted to bring the story of xxmaj steve xxmaj biko to life , and the journey and trouble that xxunk xxmaj donald xxmaj woods went through in order to get his story told . xxmaj the films uses xxmaj wood 's two books for it 's information and basis - " xxmaj biko " and " xxmaj asking for xxmaj trouble " . xxmaj the film takes place in the late 1970 's , in xxmaj south xxmaj africa . xxmaj south xxmaj africa is in the grip of the terrible apartheid , which keeps the blacks separated from the whites and xxunk the whites as the superior race . xxmaj the blacks are forced to live in xxunk on the xxunk of the cities and xxunk , and they come under frequent xxunk by the police and the army . xxmaj we are shown a dawn xxunk on a xxunk , as xxunk and armed police force their way through the camp beating and even killing the inhabitants . xxmaj then we are introduced to xxmaj donald xxmaj woods ( xxmaj kevin xxmaj kline ) , who is the editor of a popular newspaper . xxmaj after xxunk a negative story about black xxunk xxmaj steve xxmaj biko ( xxmaj denzel xxmaj washington ) , xxmaj woods goes to meet with him . xxmaj the two are xxunk of each other at first , but they soon become good friends and xxmaj biko shows the horrors of the apartheid system from a black persons point of view to xxmaj woods . xxmaj this xxunk xxmaj woods to speak out against what 's happening around him , and makes him desperate to bring xxmaj steve xxmaj biko 's story out of the xxunk of the white man 's xxmaj south xxmaj africa and to the world . xxmaj soon , xxmaj steve xxmaj biko is arrested and is killed in prison . xxmaj now xxmaj woods and his family are daring to escape from xxmaj south xxmaj africa to xxmaj england , where xxmaj woods can xxunk his book about xxmaj steve xxmaj biko and the apartheid . xxmaj when i first heard of " xxmaj cry xxmaj freedom " , i was under the impression that it was a movie completely dedicated to the life of xxmaj steve xxmaj biko . i had never actually heard of xxmaj steve xxmaj biko before i seen this film , as the events in this film were really before my time . xxmaj but it 's more about the story of xxmaj donald xxmaj woods and his journey across the border into xxmaj xxunk as he tried to xxunk the xxmaj south xxmaj african xxunk . xxmaj woods was put on a five year type house xxunk after xxmaj steve xxmaj biko was killed . xxmaj so in order to xxunk his xxunk on xxmaj steve xxmaj biko , he had to escape . xxmaj because the xxunk would be considered xxunk in xxmaj south xxmaj africa and that could have resulted in xxmaj woods meeting a fate similar to that of xxmaj biko 's . xxmaj the real xxmaj donald xxmaj woods and his wife acted as xxunk to this film . xxmaj denzel xxmaj washington is only in the film for the first hour , and i was disappointed with that as i was expecting to see him for the entire movie . xxmaj but he was amazing as xxmaj steve xxmaj biko , and captured his personality from what i 've read really well and his accent sounded perfect . xxmaj his performance earned him an xxmaj oscar nomination for xxmaj best xxmaj supporting xxmaj actor . xxmaj kevin xxmaj kline delivers a excellent and thought - xxunk performance as xxmaj donald xxmaj woods , and xxmaj penelope xxmaj xxunk is excellent as his wife xxmaj xxunk . xxmaj filming took place in xxmaj xxunk , as needless to say problems xxunk when they tried to film it in xxmaj south xxmaj africa . xxmaj while in xxmaj south xxmaj africa , the xxmaj south xxmaj african xxunk followed the film crew everywhere , so they got the bad xxunk and they pulled out and went to xxunk xxmaj xxunk instead . xxmaj despite everything , and the fact that the apartheid did n't end ' xxunk seven years later , " xxmaj cry xxmaj freedom " was n't xxunk in xxmaj south xxmaj africa . xxmaj but xxunk showing the movie received bomb threats . xxmaj richard xxmaj attenborough brings the horrors of the apartheid to the screen with extreme force and determination . xxmaj he does n't hold back at the end of the movie when showing what was supposed to be a xxunk xxunk by students in a xxunk , turns into a massacre when police open fire on them . xxmaj the film ends with the names of all the anti - apartheid xxunk who died in prison , and the explanations for their deaths . xxmaj many had " xxmaj no xxmaj explanation " . xxmaj quite a few were " xxmaj xxunk " , which is hard to believe , and many more either fell from the top of the xxunk or were " xxmaj suicide from xxmaj hanging " . xxmaj no one will ever know what really happened to them , but i think it 's fair to say that none of these men died at their own hands , but at the hands of others ; or to be more xxunk , at the hands of the police . " xxmaj cry xxmaj freedom " is a must - see movie for it 's portrayal and story of xxmaj steve xxmaj biko . xxmaj it 's also a xxunk and xxunk portrayal of a beautiful land divided and in the xxunk grips of racial xxunk and violence .
lowest_R
array([1345, 1545, 572, 904, 1438, 935, 1189, 1213, 301, 1544], dtype=int64)
[v.itos[k] for k in lowest_R]
['crater', 'soderbergh', 'crap', 'porn', 'disappointment', 'vargas', 'naschy', 'dog', 'worst', 'fuqua']
token = 'soderbergh'
train_doc_term[:,v.stoi[token]]
<800x1 sparse matrix of type '<class 'numpy.int32'>' with 1 stored elements in Compressed Sparse Row format>
index = np.argmax(train_doc_term[:,v.stoi[token]])
n_times = train_doc_term[index,v.stoi[token]]
print(f'review # {index} has {n_times} occurrences of "{token}"\n')
print(movie_reviews.train.x[index].text)
review # 434 has 13 occurrences of "soderbergh" xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " xxmaj at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after xxunk years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside " edgy " projects . xxmaj none of this excuses him this present , almost diabolical failure . xxmaj as xxmaj david xxmaj xxunk xxunk , " two parts of xxmaj che do n't ( even ) make a whole " . xxmaj epic xxunk in name only , xxmaj che(2008 ) barely qualifies as a feature film ! xxmaj it certainly has no legs , xxunk as except for its xxunk ultimate resolution forced upon it by history , xxmaj soderbergh 's xxunk - long xxunk just goes nowhere . xxmaj even xxmaj margaret xxmaj xxunk , the more xxunk of xxmaj australia 's xxmaj at xxmaj the xxmaj movies duo , noted about xxmaj soderbergh 's xxunk waste of ( xxup xxunk digital xxunk ) : " you 're in the woods ... xxunk in the woods ... xxunk in the woods ... " . i too am surprised xxmaj soderbergh did n't give us another xxunk of xxup that somewhere between his xxunk two xxmaj parts , because he still left out massive xxunk of xxmaj che 's " xxunk " life ! xxmaj for a xxunk of an important but infamous historical figure , xxmaj soderbergh xxunk xxunk , if not deliberately insults , his audiences by 1 . never providing most of xxmaj che 's story ; 2 . xxunk xxunk film xxunk with mere xxunk xxunk ; 3 . xxunk both true xxunk and a narrative of events ; 4 . barely developing an idea , or a character ; 5 . remaining xxunk episodic ; 6 . xxunk proper context for scenes --- whatever we do get is xxunk in xxunk xxunk ; 7 . xxunk xxunk all audiences ( even xxmaj spanish - xxunk will be confused by the xxunk xxunk in xxmaj english ) ; and 8 . xxunk xxunk his main subject into one dimension . xxmaj why , at xxup this late stage ? xxmaj the t - shirt franchise has been a success ! xxmaj our sense of xxunk is surely due to xxmaj peter xxmaj xxunk and xxmaj benjamin xxunk xxmaj xxunk xxunk their screenplay solely on xxmaj xxunk 's memoirs . xxmaj so , like a poor student who has read only xxup one of his xxunk xxunk for his xxunk , xxmaj soderbergh 's product is xxunk limited in perspective . xxmaj the audience is held captive within the same xxunk knowledge , scenery and circumstances of the " revolutionaries " , but that does n't xxunk our sympathy . xxmaj instead , it xxunk on us that " xxmaj ah , xxmaj soderbergh 's trying to xxunk his audiences the same as the xxmaj latino peasants were at the time " . xxmaj but these are the xxup same illiterate xxmaj latino peasants who xxunk out the good doctor to his enemies . xxmaj why does xxmaj soderbergh feel the need to xxunk us with them , and keep us equally mentally captive ? xxmaj such audience xxunk must have a purpose . xxmaj part2 is more xxunk than xxmaj part1 , but it 's literally mind - numbing with its repetitive bush - bashing , misery of xxunk , and lack of variety or character xxunk . deltoro 's xxmaj che has no opportunity to grow as a person while he struggles to xxunk his own ill - xxunk troops . xxmaj the only xxunk is the humour as xxmaj che deals with his sometimes deeply ignorant " revolutionaries " , some of whom xxunk lack self - control around local peasants or food . xxmaj we certainly get no insight into what caused the conditions , nor any xxunk xxunk of their xxunk xxunk , such as it was . xxmaj part2 's xxunk xxunk remains xxunk episodic : again , nothing is telegraphed or xxunk . xxmaj thus even the scenes with xxmaj xxunk xxmaj xxunk ( xxmaj xxunk xxmaj xxunk ) are unexpected and disconcerting . xxmaj any xxunk events are portrayed xxunk and xxmaj latino - xxunk , with xxmaj part1 's interviews xxunk by time - xxunk xxunk between the corrupt xxmaj xxunk president ( xxmaj xxunk de xxmaj xxunk ) and xxup us xxmaj government xxunk promising xxup cia xxunk ( ! ) . xxmaj the rest of xxmaj part2 's " woods " and day - for - night blue xxunk just xxunk the audience until they 're xxunk the xxunk . xxmaj perhaps deltoro felt too xxunk the frustration of many non - xxmaj american xxmaj latinos about never getting a truthful , xxunk history of xxmaj che 's xxunk within their own countries . xxmaj when foreign xxunk still wo n't deliver a free press to their people -- for whatever reason -- then one can see how a popular xxmaj american indie producer might set out to xxunk the not - so - well - read ( " i may not be able to read or write , but i 'm xxup not xxunk . xxmaj the xxmaj inspector xxmaj xxunk ) ) out to their own local xxunk . xxmaj the film 's obvious xxunk and gross over - xxunk hint very strongly that it 's aiming only at the xxunk of the less - informed xxup who xxup still xxup speak xxup little xxmaj english . xxmaj if they did , they 'd have read xxunk on the subject already , and xxunk the relevant social issues amongst themselves -- learning the lessons of history as they should . xxmaj such insights are precisely what societies still need -- and not just the remaining illiterate xxmaj latinos of xxmaj central and xxmaj south xxmaj america -- yet it 's what xxmaj che(2008 ) xxunk fails to deliver . xxmaj soderbergh xxunk his lead because he 's weak on narrative . i am xxunk why xxmaj xxunk deltoro deliberately chose xxmaj soderbergh for this project if he knew this . xxmaj it 's been xxunk , xxunk about xxmaj xxunk was xxunk wanted : it 's what i went to see this film for , but the director xxunk robs us of that . xxmaj david xxmaj xxunk , writing in xxmaj the xxmaj australian ( xxunk ) observed that while xxmaj part1 was " uneven " , xxmaj part2 actually " goes rapidly downhill " from there , " xxunk xxmaj che 's final xxunk in xxmaj xxunk in xxunk detail " , which " ... feels almost unbearably slow and turgid " . xxmaj che : xxmaj the xxmaj xxunk aka xxmaj part2 is certainly no xxunk for xxmaj xxunk , painting it a picture of misery and xxunk . xxmaj the entire second half is only xxunk by the aforementioned humour , and the dramatic -- yet tragic -- capture and execution of the film 's subject . xxmaj the rest of this xxunk cinema xxunk is just confusing , irritating misery -- xxunk , for a xxmaj soderbergh film , to be avoided at all costs . xxmaj it is bound to break the hearts of all who know even just a xxunk about the xxunk / 10 )
train_doc_term[:,v.stoi[token]]
<800x1 sparse matrix of type '<class 'numpy.int32'>' with 1 stored elements in Compressed Sparse Row format>
p = (y.items==positive).mean()
q = (y.items==negative).mean()
print(f'The prior probabilities for positive and negative classes are {p} annd {q}')
The prior probabilities for positive and negative classes are 0.47875 annd 0.52125
b = np.log((y.items==positive).mean() / (y.items==negative).mean())
print(f'The log probability ratio is L = {b}')
The log probability ratio is L = -0.08505123261815539
negative
reviews.¶In this section, we'll start with a discussion of Bayes' Theorem, then we'll use it to derive the Naive Bayes Classifier. Next we'll apply the Naive Bayes classifier to our movie reviews problem. Finally we'll review the prescription for building a Naive Bayes Classifier.
Consider two events, $A$ and $B$
Then the probability of $A$ and $B$ occurring together can be written in two ways:
$p(A,B) = p(A|B)\cdot p(B)$
$p(A,B) = p(B|A)\cdot p(A)$
where $p(A|B)$ and $p(B|A)$ are conditional probabilities: $p(A|B)$ is the probability of $A$ occurring given that $B$ has occurred, $p(A)$ is the probability that $A$ occurs, $p(B)$ is the probabilityt that $B$ occurs
$\textbf{Bayes Theorem}$ is just the statement that the right hand sides of the above two equations are equal:
$p(A|B) \cdot p(B) = p(B|A) \cdot p(A)$
Applying $\textbf{Bayes Theorem}$ to our IMDb movie review problem:
We identify $A$ and $B$ as
$A \equiv \text{class}$, i.e. positive or negative, and
$B \equiv \text{tokens}$, i.e. the "bag" of tokens used in the review
Then $\textbf{Bayes Theorem}$ says
$p(\text{class}|\text{tokens})\cdot p(\text{tokens}) = p(\text{tokens}|\text{class}) \cdot p(\text{class})$
so that
$p(\text{class}|\text{tokens}) = p(\text{tokens}|\text{class})\cdot \frac{p(\text{class})}{p(\text{tokens})}$
Since $p(\text{tokens})$ is a constant, we have the proportionality
$p(\text{class}|\text{tokens}) \propto p(\text{tokens}|\text{class})\cdot p(\text{class})$
The left hand side of the above expression is called the $\textbf{posterior class probability}$, the probability that the review is positive (or negative), given the tokens it contains. This is exactly what we want to predict!
positive
or negative
¶posterior class probabilities
.¶positive
or negative
, and $\text{tokens}$ is the list of tokens that appear in the review.¶positive
and negative
classes, we have¶linear
problem:¶bias
, and the second term is the dot product of the binarized embedding vector and the log-count ratios¶positive
, else we predict the review is negative
.¶binarized document-term matrix
, whose rows are the binarized embedding vectors for the movie reviewslog-count ratios
for the tokens, andW = train_doc_term.sign()
preds_train = (W @ R + b) > 0
train_accuracy = (preds_train == y.items).mean()
print(f'The prediction accuracy for the training set is {train_accuracy}')
The prediction accuracy for the training set is 0.9
W = valid_doc_term.sign()
preds_valid = (W @ R + b) > 0
valid_accuracy = (preds_valid == valid_y.items).mean()
print(f'The prediction accuracy for the validation set is {valid_accuracy}')
The prediction accuracy for the validation set is 0.68
x
and the training labels y
¶C0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))
C1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
L0 = (C0+1) / ((y.items==negative).sum() + 1)
L1 = (C1+1) / ((y.items==positive).sum() + 1)
R = np.log(L1/L0)
b = np.log((y.items==positive).mean() / (y.items==negative).mean())
preds = (W @ R + b) > 0,
where the weights matrix W = valid_doc_term.sign() is the binarizedvalid_doc_term matrix
whose rows are the binarized embedding vectors for the movie reviews for which you want to predict ratings.
Now that we have our approach working on a smaller sample of the data, we can try using it on the full dataset.
path = untar_data(URLs.IMDB)
path.ls()
[WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/data_clas.pkl'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/data_lm.pkl'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/finetuned.pth'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/finetuned_enc.pth'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/imdb.vocab'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ld.pkl'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ll_clas.pkl'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/ll_lm.pkl'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/models'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/pretrained'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/README'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/test'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/tmp_clas'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/tmp_lm'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/unsup'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/vocab_lm.pkl')]
(path/'train').ls()
[WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/labeledBow.feat'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/neg'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/pos'), WindowsPath('C:/Users/cross-entropy/.fastai/data/imdb/train/unsupBow.feat')]
BrokenProcessPool
error; we apply a brute force
approach, trying repeatedly until we succeed. Takes 10 minutes if it goes on the first try.¶%%time
# throws `BrokenProcessPool' Error sometimes. Keep trying `till it works!
count = 0
error = True
while error:
try:
# Preprocessing steps
reviews_full = (TextList.from_folder(path)
# Make a `TextList` object that is a list of `WindowsPath` objects,
# each of which contains the full path to one of the data files.
.split_by_folder(valid='test')
# Generate a `LabelLists` object that splits files by training and validation folders
# Note: .label_from_folder in next line causes the `BrokenProcessPool` error
.label_from_folder(classes=['neg', 'pos']))
# Create a `CategoryLists` object which contains the data and
# its labels that are derived from folder names
error = False
print(f'failure count is {count}\n')
except: # catch *all* exceptions
# accumulate failure count
count = count + 1
print(f'failure count is {count}')
failure count is 9 Wall time: 13min 4s
%%time
valid_doc_term = get_doc_term_matrix(reviews_full.valid.x, len(reviews_full.vocab.itos))
Wall time: 3.72 s
%%time
train_doc_term = get_doc_term_matrix(reviews_full.train.x, len(reviews_full.vocab.itos))
Wall time: 3.78 s
When storing data like this, always make sure it's included in your .gitignore
file
scipy.sparse.save_npz("train_doc_term.npz", train_doc_term)
scipy.sparse.save_npz("valid_doc_term.npz", valid_doc_term)
with open('reviews_full.pickle', 'wb') as handle:
pickle.dump(reviews_full, handle, protocol=pickle.HIGHEST_PROTOCOL)
train_doc_term = scipy.sparse.load_npz("train_doc_term.npz")
valid_doc_term = scipy.sparse.load_npz("valid_doc_term.npz")
with open('reviews_full.pickle', 'rb') as handle:
pickle.load(handle)
$^\dagger$API $\equiv$ Application Programming Interface
LabelLists
object, which contains LabelList
objects train
, valid
and potentially test
¶type(reviews_full)
fastai.data_block.LabelLists
type(reviews_full.valid)
fastai.data_block.LabelList
vocab
object though it is not shown with the dir() command. This is an error.¶print(reviews_full.vocab)
<fastai.text.transform.Vocab object at 0x0000025A0AC634A8>
vocabulary
in a variable full_vocab
¶full_vocab = reviews_full.vocab
vocab
object has a method itos
which returns a list of tokens¶full_vocab.itos[100:110]
['bad', 'people', 'will', 'other', 'also', 'into', 'first', 'because', 'great', 'how']
TextList
object x
and a CategoryList
object y
¶reviews_full.valid
LabelList (25000 items) x: TextList xxbos xxmaj once again xxmaj mr. xxmaj costner has dragged out a movie for far longer than necessary . xxmaj aside from the terrific sea rescue sequences , of which there are very few i just did not care about any of the characters . xxmaj most of us have ghosts in the closet , and xxmaj costner 's character are realized early on , and then forgotten until much later , by which time i did not care . xxmaj the character we should really care about is a very cocky , overconfident xxmaj ashton xxmaj kutcher . xxmaj the problem is he comes off as kid who thinks he 's better than anyone else around him and shows no signs of a cluttered closet . xxmaj his only obstacle appears to be winning over xxmaj costner . xxmaj finally when we are well past the half way point of this stinker , xxmaj costner tells us all about xxmaj kutcher 's ghosts . xxmaj we are told why xxmaj kutcher is driven to be the best with no prior inkling or foreshadowing . xxmaj no magic here , it was all i could do to keep from turning it off an hour in .,xxbos xxmaj this is an example of why the majority of action films are the same . xxmaj generic and boring , there 's really nothing worth watching here . a complete waste of the then barely - tapped talents of xxmaj ice - t and xxmaj ice xxmaj cube , who 've each proven many times over that they are capable of acting , and acting well . xxmaj do n't bother with this one , go see xxmaj new xxmaj jack xxmaj city , xxmaj ricochet or watch xxmaj new xxmaj york xxmaj undercover for xxmaj ice - t , or xxmaj boyz n the xxmaj hood , xxmaj higher xxmaj learning or xxmaj friday for xxmaj ice xxmaj cube and see the real deal . xxmaj ice - t 's horribly cliched dialogue alone makes this film grate at the teeth , and i 'm still wondering what the heck xxmaj bill xxmaj paxton was doing in this film ? xxmaj and why the heck does he always play the exact same character ? xxmaj from xxmaj aliens onward , every film i 've seen with xxmaj bill xxmaj paxton has him playing the exact same irritating character , and at least in xxmaj aliens his character died , which made it somewhat gratifying ... xxmaj overall , this is second - rate action trash . xxmaj there are countless better films to see , and if you really want to see this one , watch xxmaj judgement xxmaj night , which is practically a carbon copy but has better acting and a better script . xxmaj the only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing , which comes close to making up for the horrible film itself - but not quite . 4 / 10 .,xxbos xxmaj first of all i hate those moronic rappers , who could'nt act if they had a gun pressed against their foreheads . xxmaj all they do is curse and shoot each other and acting like xxunk version of gangsters . xxmaj the movie does n't take more than five minutes to explain what is going on before we 're already at the warehouse xxmaj there is not a single sympathetic character in this movie , except for the homeless guy , who is also the only one with half a brain . xxmaj bill xxmaj paxton and xxmaj william xxmaj sadler are both hill xxunk and xxmaj xxunk character is just as much a villain as the gangsters . i did'nt like him right from the start . xxmaj the movie is filled with pointless violence and xxmaj walter xxmaj hills specialty : people falling through windows with glass flying everywhere . xxmaj there is pretty much no plot and it is a big problem when you root for no - one . xxmaj everybody dies , except from xxmaj paxton and the homeless guy and everybody get what they deserve . xxmaj the only two black people that can act is the homeless guy and the junkie but they 're actors by profession , not annoying ugly brain dead rappers . xxmaj stay away from this crap and watch 48 hours 1 and 2 instead . xxmaj at lest they have characters you care about , a sense of humor and nothing but real actors in the cast .,xxbos xxmaj not even the xxmaj beatles could write songs everyone liked , and although xxmaj walter xxmaj hill is no mop - top he 's second to none when it comes to thought provoking action movies . xxmaj the nineties came and social platforms were changing in music and film , the emergence of the xxmaj rapper turned movie star was in full swing , the acting took a back seat to each man 's overpowering regional accent and transparent acting . xxmaj this was one of the many ice - t movies i saw as a kid and loved , only to watch them later and cringe . xxmaj bill xxmaj paxton and xxmaj william xxmaj sadler are firemen with basic lives until a burning building tenant about to go up in flames hands over a map with gold implications . i hand it to xxmaj walter for quickly and neatly setting up the main characters and location . xxmaj but i fault everyone involved for turning out xxmaj lame - o performances . xxmaj ice - t and cube must have been red hot at this time , and while i 've enjoyed both their careers as rappers , in my opinion they fell flat in this movie . xxmaj it 's about ninety minutes of one guy ridiculously turning his back on the other guy to the point you find yourself locked in multiple states of disbelief . xxmaj now this is a movie , its not a documentary so i wo nt waste my time recounting all the stupid plot twists in this movie , but there were many , and they led nowhere . i got the feeling watching this that everyone on set was xxunk of confused and just playing things off the cuff . xxmaj there are two things i still enjoy about it , one involves a scene with a needle and the other is xxmaj sadler 's huge 45 pistol . xxmaj bottom line this movie is like domino 's pizza . xxmaj yeah ill eat it if i 'm hungry and i do n't feel like cooking , xxmaj but i 'm well aware it tastes like crap . 3 stars , meh .,xxbos xxmaj brass pictures ( movies is not a fitting word for them ) really are somewhat brassy . xxmaj their alluring visual qualities are reminiscent of expensive high class xxup tv commercials . xxmaj but unfortunately xxmaj brass pictures are feature films with the pretense of wanting to entertain viewers for over two hours ! xxmaj in this they fail miserably , their undeniable , but rather soft and flabby than steamy , erotic qualities non withstanding . xxmaj xxunk ' 45 is a remake of a film by xxmaj luchino xxmaj visconti with the same title and xxmaj alida xxmaj valli and xxmaj farley xxmaj granger in the lead . xxmaj the original tells a story of senseless love and lust in and around xxmaj venice during the xxmaj italian wars of independence . xxmaj brass moved the action from the 19th into the 20th century , 1945 to be exact , so there are xxmaj mussolini xxunk , men in black shirts , xxmaj german uniforms or the tattered garb of the xxunk . xxmaj but it is just window dressing , the historic context is completely negligible . xxmaj anna xxmaj xxunk plays the attractive aristocratic woman who falls for the amoral xxup ss guy who always puts on too much lipstick . xxmaj she is an attractive , versatile , well trained xxmaj italian actress and clearly above the material . xxmaj her wide range of facial expressions ( xxunk boredom , loathing , delight , fear , hate ... and ecstasy ) are the best reason to watch this picture and worth two stars . xxmaj she endures this basically trashy stuff with an astonishing amount of dignity . i wish some really good parts come along for her . xxmaj she really deserves it . y: CategoryList neg,neg,neg,neg,neg Path: C:\Users\cross-entropy\.fastai\data\imdb
TextList
object is a list of Text
objects containing the reviews as items¶type(reviews_full.valid.x[0])
fastai.text.data.Text
reviews_full.valid.x[0].text
"xxbos xxmaj once again xxmaj mr. xxmaj costner has dragged out a movie for far longer than necessary . xxmaj aside from the terrific sea rescue sequences , of which there are very few i just did not care about any of the characters . xxmaj most of us have ghosts in the closet , and xxmaj costner 's character are realized early on , and then forgotten until much later , by which time i did not care . xxmaj the character we should really care about is a very cocky , overconfident xxmaj ashton xxmaj kutcher . xxmaj the problem is he comes off as kid who thinks he 's better than anyone else around him and shows no signs of a cluttered closet . xxmaj his only obstacle appears to be winning over xxmaj costner . xxmaj finally when we are well past the half way point of this stinker , xxmaj costner tells us all about xxmaj kutcher 's ghosts . xxmaj we are told why xxmaj kutcher is driven to be the best with no prior inkling or foreshadowing . xxmaj no magic here , it was all i could do to keep from turning it off an hour in ."
data
, which is an array of integers representing the tokens in the review:¶reviews_full.valid.x[0].data
array([ 2, 5, 303, 192, ..., 50, 555, 18, 10], dtype=int64)
Text
object also has a method .items
which returns the integer array representations for all the reviews¶reviews_full.valid.x.items
array([array([ 2, 5, 303, 192, ..., 50, 555, 18, 10], dtype=int64), array([ 2, 5, 20, 16, ..., 236, 126, 182, 10], dtype=int64), array([ 2, 5, 106, 14, ..., 18, 9, 197, 10], dtype=int64), array([ 2, 5, 38, 77, ..., 399, 11, 23500, 10], dtype=int64), ..., array([ 2, 5, 279, 19, ..., 32312, 78, 608, 10], dtype=int64), array([ 2, 5, 53, 9, ..., 51, 336, 56, 10], dtype=int64), array([ 2, 5, 20, 30, ..., 44, 1161, 5947, 10], dtype=int64), array([ 2, 19, 161, 130, ..., 78, 127, 3208, 10], dtype=int64)], dtype=object)
CategoryList
object¶type(reviews_full.valid.y)
fastai.data_block.CategoryList
CategoryList
object is a list of Category
objects¶type(reviews_full.valid.y[0])
fastai.core.Category
reviews_full.valid.y[0]
Category neg
Category
object also has a method .items
which returns an array of integers labels for all the reviews¶reviews_full.valid.y.items
array([0, 0, 0, 0, ..., 1, 1, 1, 1], dtype=int64)
reviews_full.valid.y[0]
Category neg
reviews_full.valid.y.classes
['neg', 'pos']
reviews_full.valid.y.c
2
reviews_full.valid.y.c2i
{'neg': 0, 'pos': 1}
reviews_full.valid.y[0].data
0
reviews_full.valid.y[0].obj
'neg'
len(reviews_full.train), len(reviews_full.valid)
(25000, 25000)
x=train_doc_term
y=reviews_full.train.y
valid_y = reviews_full.valid.y.items
x
<25000x38464 sparse matrix of type '<class 'numpy.int32'>' with 3716501 stored elements in Compressed Sparse Row format>
positive = y.c2i['pos']
negative = y.c2i['neg']
C0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))
C1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
C0
array([26553, 0, 12500, 0, ..., 0, 0, 0, 0], dtype=int32)
C1
array([28399, 0, 12500, 0, ..., 0, 0, 0, 0], dtype=int32)
L1 = (C1+1) / ((y.items==positive).sum() + 1)
L0 = (C0+1) / ((y.items==negative).sum() + 1)
R = np.log(L1/L0)
Check that log-count ratios are negative for words with negative
sentiment and positive for words with positive
sentiment!
R[full_vocab.stoi['hated']]
-0.7133498878774648
R[full_vocab.stoi['loved']]
1.1563661500586044
R[full_vocab.stoi['liked']]
0.4418327522790391
R[full_vocab.stoi['worst']]
-2.2826243504315076
R[full_vocab.stoi['best']]
0.7225576052173609
bias
$b$ is 0.¶b = np.log((y.items==positive).mean() / (y.items==negative).mean())
print(f'The bias term b is {b}')
The bias term b is 0.0
# predict labels for the validation data
W = valid_doc_term.sign()
preds = (W @ R + b) > 0
valid_accuracy = (preds == valid_y).mean()
print(f'Validation accuracy is {valid_accuracy} for the full data set')
Validation accuracy is 0.83292 for the full data set
sci-kit learn
library, we can fit logistic a regression model where the features are the unigrams. Here $C$ is a regularization parameter.¶from sklearn.linear_model import LogisticRegression
document-term matrix
:¶m = LogisticRegression(C=0.1, dual=False,solver = 'liblinear')
# 'liblinear' and 'newton-cg' solvers both get 0.88328 accuracy
# 'sag', 'saga', and 'lbfgs' don't converge
m.fit(train_doc_term, y.items.astype(int))
preds = m.predict(valid_doc_term)
valid_accuracy = (preds==valid_y).mean()
print(f'Validation accuracy is {valid_accuracy} using the full doc-term matrix')
Validation accuracy is 0.88328 using the full doc-term matrix
document-term
matrix gets a slightly higher accuracy:¶m = LogisticRegression(C=0.1, dual=False,solver = 'liblinear')
m.fit(train_doc_term.sign(), y.items.astype(int))
preds = m.predict(valid_doc_term.sign())
valid_accuracy = (preds==valid_y).mean()
print(f'Validation accuracy is {valid_accuracy} using the binarized doc-term matrix')
Validation accuracy is 0.88532 using the binarized doc-term matrix
Trigram
representation of the IMDb_sample
: preprocessing¶IMDb_sample
data set.¶ngrams
?¶path = untar_data(URLs.IMDB_SAMPLE)
TextList
API sometimes (about 50% of the time) throws a BrokenProcessPool
Error. This is puzzling, I don't know why it happens. But usually works on 1st or 2nd try.¶%%time
# throws `BrokenProcessPool' Error sometimes. Keep trying `till it works!
count = 0
error = True
while error:
try:
# Preprocessing steps
movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')
.split_from_df(col=2)
.label_from_df(cols=0))
error = False
print(f'failure count is {count}\n')
except: # catch *all* exceptions
# accumulate failure count
count = count + 1
print(f'failure count is {count}')
failure count is 0 Wall time: 14.9 s
vocab_sample = movie_reviews.vocab.itos
vocab_len = len(vocab_sample)
print(f'IMDb_sample vocabulary has {vocab_len} tokens')
IMDb_sample vocabulary has 6016 tokens
ngram-doc matrix
for the training data¶doc-term matrix
encodes the token
features, the ngram-doc matrix
encodes the ngram
features.¶min_n=1
max_n=3
j_indices = []
indptr = []
values = []
indptr.append(0)
num_tokens = vocab_len
itongram = dict()
ngramtoi = dict()
%%time
for i, doc in enumerate(movie_reviews.train.x):
feature_counter = Counter(doc.data)
j_indices.extend(feature_counter.keys())
values.extend(feature_counter.values())
this_doc_ngrams = list()
m = 0
for n in range(min_n, max_n + 1):
for k in range(vocab_len - n + 1):
ngram = doc.data[k: k + n]
if str(ngram) not in ngramtoi:
if len(ngram)==1:
num = ngram[0]
ngramtoi[str(ngram)] = num
itongram[num] = ngram
else:
ngramtoi[str(ngram)] = num_tokens
itongram[num_tokens] = ngram
num_tokens += 1
this_doc_ngrams.append(ngramtoi[str(ngram)])
m += 1
ngram_counter = Counter(this_doc_ngrams)
j_indices.extend(ngram_counter.keys())
values.extend(ngram_counter.values())
indptr.append(len(j_indices))
Wall time: 2min 53s
itongram
(index to n-gram) and ngramtoi
(n-gram to index) dictionaries. This takes a few minutes...¶%%time
train_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
shape=(len(indptr) - 1, len(ngramtoi)),
dtype=int)
Wall time: 161 ms
train_ngram_doc_matrix
<800x260402 sparse matrix of type '<class 'numpy.int32'>' with 678912 stored elements in Compressed Sparse Row format>
len(ngramtoi), len(itongram)
(260402, 260402)
itongram[20005]
array([125, 340, 10], dtype=int64)
ngramtoi[str(itongram[20005])]
20005
vocab_sample[125],vocab_sample[340],vocab_sample[10],
('never', 'mind', '.')
itongram[100000]
array([42, 49], dtype=int64)
vocab_sample[42], vocab_sample[49]
('have', 'an')
itongram[100010]
array([ 38, 862], dtype=int64)
vocab_sample[38], vocab_sample[862]
('are', 'within')
itongram[6116]
array([867, 52, 5], dtype=int64)
vocab_sample[867], vocab_sample[52], vocab_sample[5]
('believable', '!', 'xxmaj')
itongram[6119]
array([3776, 5, 1800], dtype=int64)
vocab_sample[3376], vocab_sample[5], vocab_sample[1800]
('parallel', 'xxmaj', 'ryan')
itongram[80000]
array([ 0, 1240, 0], dtype=int64)
vocab_sample[0], vocab_sample[1240], vocab_sample[0]
('xxunk', 'involving', 'xxunk')
ngram-doc matrix
for the validation data¶%%time
j_indices = []
indptr = []
values = []
indptr.append(0)
for i, doc in enumerate(movie_reviews.valid.x):
feature_counter = Counter(doc.data)
j_indices.extend(feature_counter.keys())
values.extend(feature_counter.values())
this_doc_ngrams = list()
m = 0
for n in range(min_n, max_n + 1):
for k in range(vocab_len - n + 1):
ngram = doc.data[k: k + n]
if str(ngram) in ngramtoi:
this_doc_ngrams.append(ngramtoi[str(ngram)])
m += 1
ngram_counter = Counter(this_doc_ngrams)
j_indices.extend(ngram_counter.keys())
values.extend(ngram_counter.values())
indptr.append(len(j_indices))
Wall time: 40.8 s
%%time
valid_ngram_doc_matrix = scipy.sparse.csr_matrix((values, j_indices, indptr),
shape=(len(indptr) - 1, len(ngramtoi)),
dtype=int)
Wall time: 37.9 ms
valid_ngram_doc_matrix
<200x260402 sparse matrix of type '<class 'numpy.int32'>' with 121597 stored elements in Compressed Sparse Row format>
ngram
data so we won't have to spend the time to generate it again¶scipy.sparse.save_npz("train_ngram_matrix.npz", train_ngram_doc_matrix)
scipy.sparse.save_npz("valid_ngram_matrix.npz", valid_ngram_doc_matrix)
with open('itongram.pickle', 'wb') as handle:
pickle.dump(itongram, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('ngramtoi.pickle', 'wb') as handle:
pickle.dump(ngramtoi, handle, protocol=pickle.HIGHEST_PROTOCOL)
ngram
data¶train_ngram_doc_matrix = scipy.sparse.load_npz("train_ngram_matrix.npz")
valid_ngram_doc_matrix = scipy.sparse.load_npz("valid_ngram_matrix.npz")
with open('itongram.pickle', 'rb') as handle:
b = pickle.load(handle)
with open('ngramtoi.pickle', 'rb') as handle:
b = pickle.load(handle)
x=train_ngram_doc_matrix
x
<800x260402 sparse matrix of type '<class 'numpy.int32'>' with 678912 stored elements in Compressed Sparse Row format>
k = x.shape[1]
print(f'There are {k} 1-gram, 2-gram, and 3-gram features in the IMDb_sample vocabulary')
There are 260402 1-gram, 2-gram, and 3-gram features in the IMDb_sample vocabulary
y=movie_reviews.train.y
y.items
y.items.shape
(800,)
positive = y.c2i['positive']
negative = y.c2i['negative']
print(f'positive and negative review labels are represented numerically by {positive} and {negative}')
positive and negative review labels are represented numerically by 1 and 0
valid_labels = [label == positive for label in movie_reviews.valid.y.items]
valid_labels=np.array(valid_labels)[:,np.newaxis]
valid_labels.shape
(200, 1)
positive
and negative
reviews in the training set¶pos = (y.items == positive)
neg = (y.items == negative)
ngram_doc_matrix
¶occurrence count
vectors¶The kernel dies if I use the sparse matrix x here, so converting x to a dense matrix
C0 = np.squeeze(x.todense()[neg].sum(0))
C1 = np.squeeze(x.todense()[pos].sum(0))
class likelihood
vectors¶L0 = (C0+1) / (neg.sum() + 1)
L1 = (C1+1) / (pos.sum() + 1)
log-count ratio
column vector¶R = np.log(L1/L0).reshape((-1,1))
(y.items==positive).mean(), (y.items==negative).mean()
(0.47875, 0.52125)
b = np.log((y.items==positive).mean() / (y.items==negative).mean())
print(b)
-0.08505123261815539
ngram_doc_matrix
¶W = valid_ngram_doc_matrix
preds = W @ R + b
preds = preds > 0
accuracy = (preds == valid_labels).mean()
print(f'Accuracy for Naive Bayes with the full trigrams Model = {accuracy}' )
Accuracy for Naive Bayes with the full trigrams Model = 0.76
n_gram_doc_matrix
¶x = train_ngram_doc_matrix.sign()
x
<800x260402 sparse matrix of type '<class 'numpy.int32'>' with 566499 stored elements in Compressed Sparse Row format>
occurrence count
vectors¶The kernel dies if I use the sparse matrix x here, so converting x to a dense matrix
C0 = np.squeeze(x.todense()[neg].sum(0))
C1 = np.squeeze(x.todense()[pos].sum(0))
class likelihood
vectors¶L1 = (C1+1) / ((y.items==positive).sum() + 1)
L0 = (C0+1) / ((y.items==negative).sum() + 1)
log-count ratio
column vector¶R = np.log(L1/L0).reshape((-1,1))
print(R)
[[-0.005675] [ 0.084839] [ 0. ] [ 0.084839] ... [-0.608308] [-0.608308] [-0.608308] [-0.608308]]
ngram_doc_matrix
¶W = valid_ngram_doc_matrix.sign()
preds = W @ R + b
preds = preds>0
accuracy = (preds==valid_labels).mean()
print(f'Accuracy for Binarized Naive Bayes with Trigrams Model = {accuracy}' )
Accuracy for Binarized Naive Bayes with Trigrams Model = 0.735
regularized
logistic regression where the features are the trigrams.¶from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer
to create the train_ngram_doc
matrix¶veczr = CountVectorizer(ngram_range=(1,3), preprocessor=noop, tokenizer=noop, max_features=800000)
train_docs = movie_reviews.train.x
train_words = [[movie_reviews.vocab.itos[o] for o in doc.data] for doc in train_docs]
valid_docs = movie_reviews.valid.x
valid_words = [[movie_reviews.vocab.itos[o] for o in doc.data] for doc in valid_docs]
%%time
train_ngram_doc_matrix_veczr = veczr.fit_transform(train_words)
train_ngram_doc_matrix_veczr
Wall time: 1.35 s
<800x260401 sparse matrix of type '<class 'numpy.int64'>' with 565699 stored elements in Compressed Sparse Row format>
valid_ngram_doc_matrix_veczr = veczr.transform(valid_words)
valid_ngram_doc_matrix_veczr
<200x260401 sparse matrix of type '<class 'numpy.int64'>' with 93549 stored elements in Compressed Sparse Row format>
vocab = veczr.get_feature_names()
vocab[200000:200005]
['the running man', 'the rural', 'the rural xxmaj', 'the sad', 'the sad recognition']
# fit model
m = LogisticRegression(C=0.1, dual=False, solver = 'liblinear')
m.fit(train_ngram_doc_matrix_veczr.sign(), y.items);
# get predictions
preds = m.predict(valid_ngram_doc_matrix_veczr.sign())
valid_labels = [label == positive for label in movie_reviews.valid.y.items]
# check accuracy
accuracy = (preds==valid_labels).mean()
print(f'Accuracy = {accuracy} for Logistic Regression, with binarized trigram counts from `CountVectorizer`' )
Accuracy = 0.83 for Logistic Regression, with binarized trigram counts from `CountVectorizer`
Performance is worse with full trigram counts.
m = LogisticRegression(C=0.1, dual=False, solver = 'liblinear')
m.fit(train_ngram_doc_matrix_veczr, y.items);
preds = m.predict(valid_ngram_doc_matrix_veczr)
accuracy =(preds==valid_labels).mean()
print(f'Accuracy = {accuracy} for Logistic Regression, with full trigram counts from `CountVectorizer`' )
Accuracy = 0.78 for Logistic Regression, with full trigram counts from `CountVectorizer`
our
ngrams to create the train_ngram_doc
matrix¶train_ngram_doc_matrix.shape
(800, 260402)
m2=None
m2 = LogisticRegression(C=0.1, dual=False, solver = 'liblinear')
m2.fit(train_ngram_doc_matrix.sign(), y.items)
preds = m2.predict(valid_ngram_doc_matrix.sign())
accuracy = (preds==valid_labels).mean()
print(f'Accuracy = {accuracy} for Logistic Regression, with our binarized trigram counts' )
Accuracy = 0.83 for Logistic Regression, with our binarized trigram counts
Performance is again worse with full trigram counts.
m2 = LogisticRegression(C=0.1, dual=False,solver='liblinear')
m2.fit(train_ngram_doc_matrix, y.items)
preds = m2.predict(valid_ngram_doc_matrix)
accuracy = (preds==valid_labels).mean()
print(f'Accuracy = {accuracy} for Not-Binarized Logistic Regression, with our Trigrams' )
Accuracy = 0.795 for Not-Binarized Logistic Regression, with our Trigrams
x=train_ngram_doc_matrix.sign()
valid_x=valid_ngram_doc_matrix.sign()
C0 = np.squeeze(x.todense()[neg].sum(axis=0))
C1 = np.squeeze(x.todense()[pos].sum(axis=0))
L1 = (C1+1) / ((pos).sum() + 1)
L0 = (C0+1) / ((neg).sum() + 1)
R = np.log(L1/L0)
R.shape
(1, 260402)
R_tile = np.tile(R,[x.shape[0],1])
print(R_tile.shape)
(800, 260402)
# The next line causes the kernel to die?
# x_nb = x.multiply(R)
# As a workaround, use the full matrices
x_nb = np.multiply(x.todense(),R_tile)
m = LogisticRegression(dual=False, C=0.1,solver='liblinear')
m.fit(x_nb, y.items);
# why does valid_x.multiply(R) work but x.multiply(R) does not?
valid_x_nb = valid_x.multiply(R)
preds = m.predict(valid_x_nb)
accuracy = (preds==valid_labels).mean()
print(f'Accuracy = {accuracy} for Logistic Regression, with trigram log-count ratios' )
Accuracy = 0.835 for Logistic Regression, with trigram log-count ratios
from IPython.display import HTML, display
# Note: to install the `tabulate` package,
# go to a shell terminal and run the command
# `conda install tabulate`
import tabulate
table = [["Model","Data Set","Token Unit","Validation Accuracy(%)"],
["Naive Bayes","IMDb_sample", "Full Unigram","64.5 (from video #5)"],
["Naive Bayes","IMDb_sample", "Binarized Unigram","68.0"],
["Naive Bayes","IMDb_sample", "Full Trigram","76.0"],
["Naive Bayes","IMDb_sample", "Binarized Trigram","73.5"],
["Logistic Regression","IMDb_sample", "Full Trigram","78.0, 80.0 (our Trigrams)"],
["Logistic Regression","IMDb_sample", "Binarized Trigram","83.0"],
["Logistic Regression","IMDb_sample", "Binarized Trigram log-count ratios","83.5"],
["Naive Bayes","Full IMDb","IMDb_sample", "Binarized Trigram","83.3"],
["Logistic Regression","Full IMDb", "Full Trigram","88.3"],
["Logistic Regression","Full IMDb", "Binarized Trigram","88.5"]]
display(HTML(tabulate.tabulate(table, tablefmt='html')))
Model | Data Set | Token Unit | Validation Accuracy(%) | |
Naive Bayes | IMDb_sample | Full Unigram | 64.5 (from video #5) | |
Naive Bayes | IMDb_sample | Binarized Unigram | 68.0 | |
Naive Bayes | IMDb_sample | Full Trigram | 76.0 | |
Naive Bayes | IMDb_sample | Binarized Trigram | 73.5 | |
Logistic Regression | IMDb_sample | Full Trigram | 78.0, 80.0 (our Trigrams) | |
Logistic Regression | IMDb_sample | Binarized Trigram | 83.0 | |
Logistic Regression | IMDb_sample | Binarized Trigram log-count ratios | 83.5 | |
Naive Bayes | Full IMDb | IMDb_sample | Binarized Trigram | 83.3 |
Logistic Regression | Full IMDb | Full Trigram | 88.3 | |
Logistic Regression | Full IMDb | Binarized Trigram | 88.5 |