In [0]:

!pip install autoviml

Collecting autoviml
  Downloading https://files.pythonhosted.org/packages/bb/99/ef8a21805d516a47a5d51ed427c73d300a68fd728d62443acc3aebe2d382/autoviml-0.1.623-py3-none-any.whl (92kB)
     |████████████████████████████████| 102kB 4.2MB/s 
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from autoviml) (1.0.3)
Collecting catboost
  Downloading https://files.pythonhosted.org/packages/b1/61/2b8106c8870601671d99ca94d8b8d180f2b740b7cdb95c930147508abcf9/catboost-0.23-cp36-none-manylinux1_x86_64.whl (64.7MB)
     |████████████████████████████████| 64.8MB 58kB/s 
Requirement already satisfied: ipython in /usr/local/lib/python3.6/dist-packages (from autoviml) (5.5.0)
Requirement already satisfied: gensim in /usr/local/lib/python3.6/dist-packages (from autoviml) (3.6.0)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from autoviml) (3.2.1)
Requirement already satisfied: textblob in /usr/local/lib/python3.6/dist-packages (from autoviml) (0.15.3)
Requirement already satisfied: nltk in /usr/local/lib/python3.6/dist-packages (from autoviml) (3.2.5)
Requirement already satisfied: jupyter in /usr/local/lib/python3.6/dist-packages (from autoviml) (1.0.0)
Requirement already satisfied: regex in /usr/local/lib/python3.6/dist-packages (from autoviml) (2019.12.20)
Collecting vaderSentiment
  Downloading https://files.pythonhosted.org/packages/44/a3/1218a3b5651dbcba1699101c84e5c84c36cbba360d9dbf29f2ff18482982/vaderSentiment-3.3.1-py2.py3-none-any.whl (125kB)
     |████████████████████████████████| 133kB 32.4MB/s 
Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (from autoviml) (0.90)
Requirement already satisfied: seaborn in /usr/local/lib/python3.6/dist-packages (from autoviml) (0.10.1)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.6/dist-packages (from autoviml) (0.4.3)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.6/dist-packages (from autoviml) (4.6.3)
Requirement already satisfied: scikit-learn>=0.22 in /usr/local/lib/python3.6/dist-packages (from autoviml) (0.22.2.post1)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->autoviml) (2.8.1)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.6/dist-packages (from pandas->autoviml) (1.18.4)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->autoviml) (2018.9)
Requirement already satisfied: plotly in /usr/local/lib/python3.6/dist-packages (from catboost->autoviml) (4.4.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from catboost->autoviml) (1.4.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from catboost->autoviml) (1.12.0)
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from catboost->autoviml) (0.10.1)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/dist-packages (from ipython->autoviml) (46.1.3)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/dist-packages (from ipython->autoviml) (4.3.3)
Requirement already satisfied: decorator in /usr/local/lib/python3.6/dist-packages (from ipython->autoviml) (4.4.2)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.6/dist-packages (from ipython->autoviml) (1.0.18)
Requirement already satisfied: pexpect; sys_platform != "win32" in /usr/local/lib/python3.6/dist-packages (from ipython->autoviml) (4.8.0)
Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from ipython->autoviml) (2.1.3)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython->autoviml) (0.8.1)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/dist-packages (from ipython->autoviml) (0.7.5)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.6/dist-packages (from gensim->autoviml) (2.0.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->autoviml) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->autoviml) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->autoviml) (2.4.7)
Requirement already satisfied: nbconvert in /usr/local/lib/python3.6/dist-packages (from jupyter->autoviml) (5.6.1)
Requirement already satisfied: ipywidgets in /usr/local/lib/python3.6/dist-packages (from jupyter->autoviml) (7.5.1)
Requirement already satisfied: ipykernel in /usr/local/lib/python3.6/dist-packages (from jupyter->autoviml) (4.10.1)
Requirement already satisfied: jupyter-console in /usr/local/lib/python3.6/dist-packages (from jupyter->autoviml) (5.2.0)
Requirement already satisfied: qtconsole in /usr/local/lib/python3.6/dist-packages (from jupyter->autoviml) (4.7.3)
Requirement already satisfied: notebook in /usr/local/lib/python3.6/dist-packages (from jupyter->autoviml) (5.2.2)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.22->autoviml) (0.14.1)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly->catboost->autoviml) (1.3.3)
Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.2->ipython->autoviml) (0.2.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython->autoviml) (0.1.9)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.6/dist-packages (from pexpect; sys_platform != "win32"->ipython->autoviml) (0.6.0)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim->autoviml) (2.23.0)
Requirement already satisfied: boto in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim->autoviml) (2.49.0)
Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim->autoviml) (1.13.3)
Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter->autoviml) (0.3)
Requirement already satisfied: nbformat>=4.4 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter->autoviml) (5.0.6)
Requirement already satisfied: bleach in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter->autoviml) (3.1.5)
Requirement already satisfied: jinja2>=2.4 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter->autoviml) (2.11.2)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter->autoviml) (0.6.0)
Requirement already satisfied: testpath in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter->autoviml) (0.4.4)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter->autoviml) (1.4.2)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter->autoviml) (0.8.4)
Requirement already satisfied: jupyter-core in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter->autoviml) (4.6.3)
Requirement already satisfied: widgetsnbextension~=3.5.0 in /usr/local/lib/python3.6/dist-packages (from ipywidgets->jupyter->autoviml) (3.5.1)
Requirement already satisfied: jupyter-client in /usr/local/lib/python3.6/dist-packages (from ipykernel->jupyter->autoviml) (5.3.4)
Requirement already satisfied: tornado>=4.0 in /usr/local/lib/python3.6/dist-packages (from ipykernel->jupyter->autoviml) (4.5.3)
Requirement already satisfied: pyzmq>=17.1 in /usr/local/lib/python3.6/dist-packages (from qtconsole->jupyter->autoviml) (19.0.0)
Requirement already satisfied: qtpy in /usr/local/lib/python3.6/dist-packages (from qtconsole->jupyter->autoviml) (1.9.0)
Requirement already satisfied: terminado>=0.3.3; sys_platform != "win32" in /usr/local/lib/python3.6/dist-packages (from notebook->jupyter->autoviml) (0.8.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim->autoviml) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim->autoviml) (2.9)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim->autoviml) (2020.4.5.1)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim->autoviml) (1.24.3)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim->autoviml) (0.9.5)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim->autoviml) (0.3.3)
Requirement already satisfied: botocore<1.17.0,>=1.16.3 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim->autoviml) (1.16.3)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.6/dist-packages (from nbformat>=4.4->nbconvert->jupyter->autoviml) (2.6.0)
Requirement already satisfied: webencodings in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert->jupyter->autoviml) (0.5.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert->jupyter->autoviml) (20.3)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2>=2.4->nbconvert->jupyter->autoviml) (1.1.1)
Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.17.0,>=1.16.3->boto3->smart-open>=1.2.1->gensim->autoviml) (0.15.2)
Installing collected packages: catboost, vaderSentiment, autoviml
Successfully installed autoviml-0.1.623 catboost-0.23 vaderSentiment-3.3.1

In [0]:

import tensorflow_datasets as tfds
import numpy as np
import pandas as pd

In [0]:

# batch_size=-1 loads the whole split at once as a dict of tensors
dataset, info = tfds.load('amazon_us_reviews/Personal_Care_Appliances_v1_00', with_info=True, batch_size=-1)
train_dataset = dataset['train']

Downloading and preparing dataset amazon_us_reviews/Personal_Care_Appliances_v1_00/0.1.0 (download: 16.82 MiB, generated: Unknown size, total: 16.82 MiB) to /root/tensorflow_datasets/amazon_us_reviews/Personal_Care_Appliances_v1_00/0.1.0...

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)

Shuffling and writing examples to /root/tensorflow_datasets/amazon_us_reviews/Personal_Care_Appliances_v1_00/0.1.0.incompleteW5XX91/amazon_us_reviews-train.tfrecord

ERROR:absl:Statistics generation doesn't work for nested structures yet

Dataset amazon_us_reviews downloaded and prepared to /root/tensorflow_datasets/amazon_us_reviews/Personal_Care_Appliances_v1_00/0.1.0. Subsequent calls will reuse this data.

In [0]:

info

Out[0]:

tfds.core.DatasetInfo(
name='amazon_us_reviews',
version=0.1.0,
description='Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazons iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.

Over 130+ million customer reviews are available to researchers as part of this release. The data is available in TSV files in the amazon-reviews-pds S3 bucket in AWS US East Region. Each line in the data files corresponds to an individual review (tab delimited, with no quote and escape characters).

Each Dataset contains the following columns :
marketplace - 2 letter country code of the marketplace where the review was written.
customer_id - Random identifier that can be used to aggregate reviews written by a single author.
review_id - The unique ID of the review.
product_id - The unique Product ID the review pertains to. In the multilingual dataset the reviews
for the same product in different countries can be grouped by the same product_id.
product_parent - Random identifier that can be used to aggregate reviews for the same product.
product_title - Title of the product.
product_category - Broad product category that can be used to group reviews
(also used to group the dataset into coherent parts).
star_rating - The 1-5 star rating of the review.
helpful_votes - Number of helpful votes.
total_votes - Number of total votes the review received.
vine - Review was written as part of the Vine program.
verified_purchase - The review is on a verified purchase.
review_headline - The title of the review.
review_body - The review text.
review_date - The date the review was written.
',
homepage='https://s3.amazonaws.com/amazon-reviews-pds/readme.html',
features=FeaturesDict({
'data': FeaturesDict({
'customer_id': tf.string,
'helpful_votes': tf.int32,
'marketplace': tf.string,
'product_category': tf.string,
'product_id': tf.string,
'product_parent': tf.string,
'product_title': tf.string,
'review_body': tf.string,
'review_date': tf.string,
'review_headline': tf.string,
'review_id': tf.string,
'star_rating': tf.int32,
'total_votes': tf.int32,
'verified_purchase': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
'vine': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
}),
}),
total_num_examples=85981,
splits={
'train': 85981,
},
supervised_keys=None,
citation="""""",
redistribution_info=,
)

In [0]:

# Convert the tensors to NumPy arrays; string features come back as bytes objects
dataset = tfds.as_numpy(train_dataset)

In [0]:

dataset

Out[0]:

{'data': {'customer_id': array([b'13986323', b'50574716', b'50593972', ..., b'40719682',
         b'35596948', b'29430209'], dtype=object),
  'helpful_votes': array([0, 3, 0, ..., 0, 0, 0], dtype=int32),
  'marketplace': array([b'US', b'US', b'US', ..., b'US', b'US', b'US'], dtype=object),
  'product_category': array([b'Personal_Care_Appliances', b'Personal_Care_Appliances',
         b'Personal_Care_Appliances', ..., b'Personal_Care_Appliances',
         b'Personal_Care_Appliances', b'Personal_Care_Appliances'],
        dtype=object),
  'product_id': array([b'B00847JQZ6', b'B00N5HD340', b'B0077L1X24', ..., b'B000UZ8X2W',
         b'B000NURPPK', b'B001EY5GNW'], dtype=object),
  'product_parent': array([b'997683625', b'955577225', b'120764066', ..., b'96066145',
         b'58591097', b'986877728'], dtype=object),
  'product_title': array([b'SE - Reading Glass - Spring Loaded Hinges, 4.0x - RTS62400',
         b'Straight Razor',
         b'Philips Sonicare Flexcare & Healthy White Plastic Travel Handle Case New Bulk Package',
         ...,
         b'Remington R-9200 Microflex Ultra TCT Shaver [Health and Beauty]',
         b'SUNBEAM Cool Mist HUMIDIFIER with PermaFilter # 1120 \xe2\x80\x93 1 Ea',
         b'Andis Blade Set for T-Outliner Trimmer'], dtype=object),
  'review_body': array([b"These glasses are an excellent value.  The fit is good and they are very comfortable.  Because of my legal blindness, there aren't a lot of options to try to see better, but I believe these help with my other visual aids, and because they are reasonably priced I can have more than one pair available.",
         b"Always wanted to try straight razor shaving (as a DE safety razor user), and this was a cheap way for me to determine I was not into it.<br /><br />Because the blades are disposable and always sharp, I could put a new one in and reasonably rely upon that fact that cuts were probably due to my technique and not the blade.<br /><br />It's very hard to do straight razor shaving on yourself because the ANGLE is difficult to control without switching hands. Being very right-handed, I really couldn't do that. I bet I could shave someone else's face with it though.<br /><br />An immediate upside? Using a DE safety razor (slant edged even) seems SUPER safe now! I'm increased my speed with the DE due to that confidence, and I'd been using it for years now.",
         b'I usually either throw my toothbrush in a plastic bag with spare head so this product is very convenient for keeping all the parts apart, dry and undamaged, and i now keep it in my travel bag all the time ready to go.',
         ...,
         b"I have had a Remington before but needed a new one when the batteries died and the cutters were all but gone.  It was cheaper to buy a new one.  The new one has a nice charge level but the trimmer didn't work when I got it.",
         b"I was surprised that it really didn't do much compared to the 1950s version that I'd inherited. Keeping a wet wash cloth next to my bed for when I start coughing in the middle of the night works better.",
         b'The blades were an excellent fit for my T-line trimmers.  Within five minutes I had my trimmers cleaned, the blades installed, and was putting them to use.  I saw the blades in several locations for almost twice the price I paid, so this worked out to be an awesome deal.'],
        dtype=object),
  'review_date': array([b'2015-01-04', b'2015-08-05', b'2012-11-17', ..., b'2008-02-08',
         b'2007-09-07', b'2012-07-26'], dtype=object),
  'review_headline': array([b'These glasses are an excellent value. The fit is good and they are ...',
         b'A fantastic way to cheaply try straight razor shaving.',
         b'Great for travel', ..., b'Trimmer Not Working',
         b'Loud and ineffectual', b'Excellent product, awesomoe price'],
        dtype=object),
  'review_id': array([b'R3VEUFVA9QJY55', b'R2DTQV5SMJ0CK7', b'R3OJ06NK99WLNJ', ...,
         b'R1ZQ0XZXOD9N18', b'R1FJ9OU429X00Y', b'RI28R1W94N1R6'],
        dtype=object),
  'star_rating': array([4, 5, 4, ..., 3, 2, 5], dtype=int32),
  'total_votes': array([0, 3, 0, ..., 0, 0, 0], dtype=int32),
  'verified_purchase': array([0, 0, 0, ..., 1, 0, 0]),
  'vine': array([1, 1, 1, ..., 1, 1, 1])}}

In [0]:

helpful_votes = dataset['data']['helpful_votes']
review_headline = dataset['data']['review_headline']
review_body = dataset['data']['review_body']
rating = dataset['data']['star_rating']

In [0]:

# Stack the four 1-D arrays as columns; mixing ints and bytes yields an object array
reviews_df = pd.DataFrame(
    np.hstack((helpful_votes[:, None], review_headline[:, None],
               review_body[:, None], rating[:, None])),
    columns=['votes', 'headline', 'reviews', 'rating'])

In [0]:

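# Target dtype per column. Note: str on a bytes value keeps the b'...' wrapper,
# it does not decode the text.
convert_dict = {'votes': int,
                'headline': str,
                'reviews': str,
                'rating': int
               }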

In [0]:

reviews_df = reviews_df.astype(convert_dict) 
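The `astype` step is needed because `np.hstack` on an integer column and a bytes column falls back to a single `object` array. A minimal sketch of an alternative, assuming small stand-in arrays (`votes_demo` and `headline_demo` are illustrative names, not part of this notebook): passing a dict of columns to `pd.DataFrame` keeps each column's own dtype.

```python
import numpy as np
import pandas as pd

# Stand-in arrays shaped like the TFDS output (illustrative values only)
votes_demo = np.array([0, 3, 0], dtype=np.int32)
headline_demo = np.array([b'Great for travel', b'Five Stars', b'YES!'], dtype=object)

# Stacking mixed columns forces one common dtype: object
stacked = np.hstack((votes_demo[:, None], headline_demo[:, None]))
print(stacked.dtype)

# A dict of columns preserves the per-column dtypes
demo_df = pd.DataFrame({'votes': votes_demo, 'headline': headline_demo})
print(demo_df.dtypes)
```

With the dict form, only the text columns would still need converting.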

In [0]:

reviews_df

Out[0]:

	votes	headline	reviews	rating
0	0	b'These glasses are an excellent value. The fi...	b"These glasses are an excellent value. The f...	4
1	3	b'A fantastic way to cheaply try straight razo...	b"Always wanted to try straight razor shaving ...	5
2	0	b'Great for travel'	b'I usually either throw my toothbrush in a pl...	4
3	0	b'Five Stars'	b'Top quality.'	5
4	1	b'*Product sent not as shown'	b'Today I received 1 Fl. Oz, Natures Balance ...	3
...	...	...	...	...
85976	2	b'YES!'	b"This is the real deal. Don't bother with the...	5
85977	1	b'Bryton Picks'	b'I like the Bryton Picks very much. Have orde...	5
85978	0	b'Trimmer Not Working'	b"I have had a Remington before but needed a n...	3
85979	0	b'Loud and ineffectual'	b"I was surprised that it really didn't do muc...	2
85980	0	b'Excellent product, awesomoe price'	b'The blades were an excellent fit for my T-li...	5

85981 rows × 4 columns
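The `b'...'` prefixes in the `headline` and `reviews` columns above come from converting bytes with `str`, which keeps the literal wrapper rather than decoding it. A minimal sketch of decoding instead, assuming the text is UTF-8 encoded (the `\xe2\x80\x93` sequence in one product title suggests it is) and using a stand-in series `headline_demo` that is not part of this notebook:

```python
import pandas as pd

# Stand-in byte strings copied from the output above
headline_demo = pd.Series([b'Five Stars', b'Great for travel'], dtype=object)

# str() on bytes keeps the b'...' wrapper instead of decoding
print(str(b'Five Stars'))

# .str.decode turns each bytes value into real text
decoded = headline_demo.str.decode('utf-8')
print(decoded.tolist())
```

Applied to the real arrays before building the DataFrame, this would give the downstream text cleaning plain strings to work with.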

In [0]:

# Binary label: 1 for reviews with 4 or 5 stars, 0 otherwise
reviews_df["target"] = (reviews_df["rating"] >= 4).astype(int)

In [0]:

reviews_df

Out[0]:

	votes	headline	reviews	rating	target
0	0	b'These glasses are an excellent value. The fi...	b"These glasses are an excellent value. The f...	4	1
1	3	b'A fantastic way to cheaply try straight razo...	b"Always wanted to try straight razor shaving ...	5	1
2	0	b'Great for travel'	b'I usually either throw my toothbrush in a pl...	4	1
3	0	b'Five Stars'	b'Top quality.'	5	1
4	1	b'*Product sent not as shown'	b'Today I received 1 Fl. Oz, Natures Balance ...	3	0
...	...	...	...	...	...
85976	2	b'YES!'	b"This is the real deal. Don't bother with the...	5	1
85977	1	b'Bryton Picks'	b'I like the Bryton Picks very much. Have orde...	5	1
85978	0	b'Trimmer Not Working'	b"I have had a Remington before but needed a n...	3	0
85979	0	b'Loud and ineffectual'	b"I was surprised that it really didn't do muc...	2	0
85980	0	b'Excellent product, awesomoe price'	b'The blades were an excellent fit for my T-li...	5	1

85981 rows × 5 columns

In [0]:

reviews_df.shape[0]

Out[0]:

85981

In [0]:

reviews_df["target"].value_counts()

Out[0]:

1    62554
0    23427
Name: target, dtype: int64
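The counts above show an imbalanced target. As a rough sketch of what that implies, the snippet below computes the class shares and the accuracy a trivial always-majority model would reach; the counts are copied from the output above and the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
counts = {1: 62554, 0: 23427}
total = sum(counts.values())

# Share of each class in the full dataset
shares = {label: n / total for label, n in counts.items()}
print(shares)

# Accuracy of a trivial model that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
print(majority_label, round(baseline_accuracy, 4))
```

Any plain-accuracy figure should be read against this baseline, which is one reason to prefer a class-balanced metric here.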

In [0]:

reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85981 entries, 0 to 85980
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   votes     85981 non-null  int64 
 1   headline  85981 non-null  object
 2   reviews   85981 non-null  object
 3   rating    85981 non-null  int64 
 4   target    85981 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 3.3+ MB

In [0]:

from sklearn.model_selection import train_test_split
train, test = train_test_split(reviews_df, test_size=0.25)
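The split above is random and unseeded, so it changes from run to run and the class ratio can drift slightly between the two parts. A hedged sketch of a reproducible, stratified variant on a toy frame (`demo_df` and the seed value 42 are illustrative assumptions, not taken from this notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with roughly the same 73/27 class balance (illustrative data only)
demo_df = pd.DataFrame({'reviews': ['text'] * 1000,
                        'target': [1] * 730 + [0] * 270})

# stratify keeps the class ratio equal in both parts; random_state fixes the shuffle
train_demo, test_demo = train_test_split(
    demo_df, test_size=0.25, stratify=demo_df['target'], random_state=42)

print(len(train_demo), len(test_demo))
print(train_demo['target'].mean(), test_demo['target'].mean())
```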

In [0]:

from autoviml.Auto_NLP import Auto_NLP

/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/shakespeare.zip.
[nltk_data]    | Downloading package stopwords to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/stopwords.zip.
[nltk_data]    | Downloading package treebank to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/treebank.zip.
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/twitter_samples.zip.
[nltk_data]    | Downloading package omw to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/omw.zip.
[nltk_data]    | Downloading package wordnet to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/wordnet.zip.
[nltk_data]    | Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/wordnet_ic.zip.
[nltk_data]    | Downloading package words to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/words.zip.
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data]    | Downloading package punkt to /root/nltk_data...
[nltk_data]    |   Unzipping tokenizers/punkt.zip.
[nltk_data]    | Downloading package snowball_data to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | 
[nltk_data]  Done downloading collection popular
Imported Auto_NLP version: 0.0.33.. Call using:
     train_nlp, test_nlp, nlp_pipeline, predictions = Auto_NLP(
                nlp_column, train, test, target, score_type='balanced-accuracy',
                modeltype='Classification',top_num_features=200, verbose=0,
                build_model=True)

In [0]:

nlp_column = 'reviews'
target = 'target'
train_nlp, test_nlp, nlp_transformer, preds = Auto_NLP(
                nlp_column, train, test, target, score_type='balanced_accuracy',
                modeltype='Classification',top_num_features=50, verbose=2,
                build_model=True)

Auto NLP processing on NLP Column: reviews
Shape of Train Data: 64485 rows
    Shape of Test Data: 21496 rows

    Added 9 summary columns for counts of words and characters in each row
    Cleaning text in reviews before doing transformation...
Train and Test data Text cleaning completed. Time taken = 337 seconds
    A U T O - N L P   P R O C E S S I N G  O N   N L P   C O L U M N = reviews 
#################################################################################
Generating new features for NLP column = reviews using NLP Transformers
    Cleaning text in reviews before doing transformation...
    However max_features limit = 4846 will limit numerous features from being generated

#### Optimizing Count Vectorizer with best max_df=0.50, 1-3 n-grams and high features...
    balanced_accuracy Metrics for 4846 features = 0.7589

#### Using Count Vectorizer with limited max_features and a min_df=2 with n_gram (1-5)
    balanced_accuracy Metrics for 4846 features = 0.7677

# Using TFIDF vectorizer with min_df=2, ngram (1,3) and very high max_features

In [0]:

nlp_transformer

In [0]:

nlp_transformer.predict(test[nlp_column])
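Predictions like these can be scored with the same metric Auto_NLP reports. The sketch below hand-rolls balanced accuracy as the mean of per-class recall; `balanced_accuracy`, `y_true_demo` and `always_positive` are hypothetical names with made-up values, and it assumes the predictions are 0/1 labels aligned row by row with `test[target]`.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, the usual definition of balanced accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == label] == label)
               for label in np.unique(y_true)]
    return float(np.mean(recalls))

# Hypothetical labels and predictions, not output from this notebook
y_true_demo = [1, 1, 1, 0]
always_positive = [1, 1, 1, 1]

# Balanced accuracy versus plain accuracy for an always-positive predictor
print(balanced_accuracy(y_true_demo, always_positive))
print(float(np.mean(np.asarray(y_true_demo) == np.asarray(always_positive))))
```

The always-positive predictor reaches high plain accuracy on skewed labels but only chance-level balanced accuracy, which is why the latter is the more informative score for this target.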
