Walkthrough of the s3 helper functions

Forgot where you put some sensitive data in your cloud storage?
Tired of downloading files from S3 only to preview them in Excel or a text editor?
Want to touch and reshape data but feel it's caged off from you?
Look no further: your S3 problems are solved!

Table of Contents

This Jupyter notebook gives a walkthrough of several handy functions from the s3 module.
Wherever possible, these functions mirror the interfaces of the standard library, while employing popular open source projects on the backend.

The notebook highlights seven things:
  1. List files (with wildcards) in an S3 bucket/key using ls()
  2. Read files into a string or bytes using read() and open()
  3. Read CSV and JSON files on S3 into Pandas DataFrames using read_csv() and read_json()
  4. Write CSV and JSON files from Pandas DataFrames to S3 using to_csv() and to_json()
  5. Write local files to S3 using disk_2_s3()
  6. Save and load Scikit-Learn classifiers using dump_clf() and load_clf()
  7. Move files to new buckets and keys using mv() and cp()

The only requirements are setting your AWS credentials (either as environment variables or via the AWS CLI configuration) and installing the modules listed in requirements.txt.
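If you're not configuring the AWS CLI, the credentials can be supplied as environment variables before the s3 module is imported. A minimal sketch with placeholder values (these are the standard AWS variable names; substitute your own keys):

import os

# placeholder values only; use your own credentials and region
os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_ACCESS_KEY'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET_KEY'
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'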

For this tutorial, we'll use the red wine quality dataset from UCI Center for Machine Learning and Intelligent Systems.

In [1]:
import os
import s3

Listing files in an S3 bucket and key using ls()

s3.ls lists all the files and directories in a bucket/key, akin to os.listdir().
see the code

In [2]:
s3_path = 's3://prod-datalytics/playground/'

It takes either a bucket or a bucket/key pair.

In [3]:
s3.ls(s3_path)
Out[3]:
['s3://prod-datalytics/playground/json_bourne.json',
 's3://prod-datalytics/playground/wine_is_fine.csv',
 's3://prod-datalytics/playground/wine_is_not_fine.tsv']

s3.ls also supports wildcard patterns, just like glob.glob().

In [4]:
s3.ls(s3_path + '*.csv')
Out[4]:
['s3://prod-datalytics/playground/wine_is_fine.csv']

With a programmatic way of getting S3 file paths, we can start doing some cool stuff.

Read files in S3 with open()

see the code

In [5]:
f = s3.ls(s3_path + '*.csv')[0]
f
Out[5]:
's3://prod-datalytics/playground/wine_is_fine.csv'

We can open the file as a streaming body of bytes.

In [6]:
s3.open(f)
Out[6]:
<botocore.response.StreamingBody at 0x10a52ada0>

This is helpful sometimes, but typically we want to read a file the way Python's native open() does:

with open(filename, 'r') as f:
    f.read()

In [7]:
s3.read(f, encoding='utf-8')[:200] # displays the first 200 characters.
Out[7]:
'fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality\n7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5'

For more structured data, we can leverage Pandas' parsing engines...

Read S3 files into memory with read_csv() and read_json()

s3.read_csv and s3.read_json work just like their Pandas counterparts, which they use under the hood.
With these handy functions, the data comes back as a DataFrame in a nice tabular format:
see the code

In [8]:
df = s3.read_csv(f, sep=',')
df.head(3)
Out[8]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5

A CSV is the simplest use case; we can handle alternative delimiters and JSON files too.

In [9]:
files = s3.ls(s3_path)
files
Out[9]:
['s3://prod-datalytics/playground/json_bourne.json',
 's3://prod-datalytics/playground/wine_is_fine.csv',
 's3://prod-datalytics/playground/wine_is_not_fine.tsv']

Here is a tab-separated values (TSV) file.

In [10]:
print("We can read the {} tsv easily.".format(files[-1]))

df = s3.read_csv(files[-1], sep='\t')
df.tail(3)
We can read s3://prod-datalytics/playground/wine_is_not_fine.tsv easily.
Out[10]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

Here's a JSON file:

In [11]:
print("We can also read the {} file easily.".format(files[0]))

df = s3.read_json(files[0])
df.sample(3)
We can also read the s3://prod-datalytics/playground/json_bourne.json file easily.
Out[11]:
alcohol chlorides citric acid density fixed acidity free sulfur dioxide pH quality residual sugar sulphates total sulfur dioxide volatile acidity
1289 10.2 0.068 0.30 0.99914 7.0 20.0 3.30 5 4.5 1.17 110.0 0.60
607 10.5 0.092 0.41 0.99820 8.8 26.0 3.31 6 3.3 0.53 52.0 0.48
675 10.2 0.064 0.39 0.99840 9.3 12.0 3.26 5 2.2 0.65 31.0 0.41

They're actually all the same data, just in different formats!
If you're new to Pandas, you'll be happy to learn that it's the de facto Python tool for data manipulation.

In [12]:
df.dtypes
Out[12]:
alcohol                 float64
chlorides               float64
citric acid             float64
density                 float64
fixed acidity           float64
free sulfur dioxide     float64
pH                      float64
quality                   int64
residual sugar          float64
sulphates               float64
total sulfur dioxide    float64
volatile acidity        float64
dtype: object

Basic stats and distributions are just a function call away.

In [13]:
df.describe().T
Out[13]:
count mean std min 25% 50% 75% max
alcohol 1599.0 10.422983 1.065668 8.40000 9.5000 10.20000 11.100000 14.90000
chlorides 1599.0 0.087467 0.047065 0.01200 0.0700 0.07900 0.090000 0.61100
citric acid 1599.0 0.270976 0.194801 0.00000 0.0900 0.26000 0.420000 1.00000
density 1599.0 0.996747 0.001887 0.99007 0.9956 0.99675 0.997835 1.00369
fixed acidity 1599.0 8.319637 1.741096 4.60000 7.1000 7.90000 9.200000 15.90000
free sulfur dioxide 1599.0 15.874922 10.460157 1.00000 7.0000 14.00000 21.000000 72.00000
pH 1599.0 3.311113 0.154386 2.74000 3.2100 3.31000 3.400000 4.01000
quality 1599.0 5.636023 0.807569 3.00000 5.0000 6.00000 6.000000 8.00000
residual sugar 1599.0 2.538806 1.409928 0.90000 1.9000 2.20000 2.600000 15.50000
sulphates 1599.0 0.658149 0.169507 0.33000 0.5500 0.62000 0.730000 2.00000
total sulfur dioxide 1599.0 46.467792 32.895324 6.00000 22.0000 38.00000 62.000000 289.00000
volatile acidity 1599.0 0.527821 0.179060 0.12000 0.3900 0.52000 0.640000 1.58000

Everything is indexed!
Here we get a quick calculation for the 75th percentile of alcohol content.

In [14]:
df.describe()['alcohol']['75%']
Out[14]:
11.1
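Equivalently, a quick sketch using Pandas' quantile() directly, without building the full describe() table:

# same 75th percentile, computed straight from the column
df['alcohol'].quantile(0.75)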

It's easy to filter a DataFrame.
Here we're going to get all the heavily alcoholic wines...

In [15]:
df_alcoholic = df[df['alcohol'] > df.describe()['alcohol']['75%']]
df_alcoholic.head()
Out[15]:
alcohol chlorides citric acid density fixed acidity free sulfur dioxide pH quality residual sugar sulphates total sulfur dioxide volatile acidity
45 13.1 0.054 0.15 0.9934 4.6 8.0 3.90 4 2.1 0.56 65.0 0.52
95 12.9 0.058 0.17 0.9932 4.7 17.0 3.85 6 2.3 0.60 106.0 0.60
131 13.0 0.049 0.09 0.9937 5.6 17.0 3.63 5 2.3 0.63 99.0 0.50
132 13.0 0.049 0.09 0.9937 5.6 17.0 3.63 5 2.3 0.63 99.0 0.50
142 14.0 0.050 0.00 0.9916 5.2 27.0 3.68 6 1.8 0.79 63.0 0.34
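Conditions can also be combined with boolean operators. Here's a quick sketch (standard Pandas syntax) selecting wines that are both high in alcohol and highly rated:

# high-alcohol AND high-quality wines
strong_and_good = df[(df['alcohol'] > df['alcohol'].quantile(0.75)) & (df['quality'] >= 7)]
strong_and_good.head()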

It's also stupid easy to plot, since Pandas builds on the Matplotlib package.

In [16]:
# this line is run once, typically at the beginning of the notebook, to enable inline plotting.
%matplotlib inline
In [17]:
df_alcoholic.plot(kind='scatter', x='residual sugar', y='density')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x10de76e10>

What is that outlier?

In [2]:
df_alcoholic[df_alcoholic['residual sugar'] > 12]
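Since Pandas' plot() returns a regular Matplotlib Axes object, the scatter plot can also be annotated to flag that region (a small sketch; the threshold of 12 mirrors the filter above):

ax = df_alcoholic.plot(kind='scatter', x='residual sugar', y='density')
ax.set_title('High-alcohol wines: residual sugar vs. density')
ax.axvline(12, linestyle='--', color='red')  # mark the outlier region used in the filter above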

After processing and normalizing the data, we may want to upload this new file to s3.

Write DataFrames to S3 with to_csv() and to_json()

s3.to_csv and s3.to_json are almost identical to their Pandas counterparts.
The difference is that s3.to_csv takes the DataFrame as its first argument, rather than being a method on the DataFrame.
see the code

In [18]:
# where will the file get stored?
s3_target = 's3://prod-datalytics/playground/wine_list.tsv.gz'

We can now use our filtered dataset to write a new file to S3.
Using Pandas' to_csv arguments, we have a lot of control over the output format.

In [19]:
s3.to_csv(df_alcoholic, s3_target, sep='\t',
          index=False, compression='gzip')
Out[19]:
"File uploaded to 's3://prod-datalytics/playground/wine_list.tsv.gz'"


Write local files to S3 with disk_2_s3()

We can send local files to S3 too. First, let's write a file to local disk using the built-in Pandas to_csv().

In [20]:
local_file = 'wine_list.tsv.gz'
In [21]:
df_alcoholic.to_csv(local_file, sep='\t', index=False, compression='gzip')
In [22]:
s3.disk_2_s3(file=local_file,
             s3_path=s3_target)
Out[22]:
"'wine_list.tsv.gz' loaded to 's3://prod-datalytics/playground/wine_list.tsv.gz'"
In [23]:
# purge it!
os.remove(local_file)

Saving and Loading Scikit-Learn Classifiers

If you're into machine learning, you're in luck!
see the code

In [24]:
from sklearn.ensemble import RandomForestClassifier

For this example, let's just use a vanilla Random Forest model.

In [25]:
clf = RandomForestClassifier()
clf
Out[25]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Here is where we'd train and evaluate the model...

In [30]:
# fit and evaluate the model (the training/evaluation code is omitted here)
# clf.fit(X, y)
ACCURACY ON TRAINING SET: 0.99
ACCURACY OF TEST SET: 0.61

On my first run (not shown) I got a test set accuracy of only 61%, which is pretty bad.
You should try to beat that score!

In [ ]:
'''
write some code here:
look into train_test_split, GridSearchCV, and KFold from Scikit-Learn.

This is also a great dataset for practicing:
scaling values (see StandardScaler),
dimensionality reduction (see PCA),
and linear models (see Lasso or LogisticRegression).
'''
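As a starting point, here's a minimal sketch of the kind of thing that cell is asking for, using Scikit-Learn's train_test_split. The feature/target split below is an assumption; the notebook never defines X and y.

from sklearn.model_selection import train_test_split

# assume 'quality' is the target and everything else is a feature
X = df.drop('quality', axis=1)
y = df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf.fit(X_train, y_train)
print("ACCURACY ON TRAINING SET: {:.2f}".format(clf.score(X_train, y_train)))
print("ACCURACY ON TEST SET: {:.2f}".format(clf.score(X_test, y_test)))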

Once you're happy with the performance, we can persist the model as a pickle file.

In [31]:
s3.dump_clf(clf, 's3://prod-datalytics/playground/models/clf.pkl')
Out[31]:
"'clf.pkl' loaded to 's3://prod-datalytics/playground/models/clf.pkl'"

And re-use it when the time is right!

In [32]:
s3.load_clf('s3://prod-datalytics/playground/models/clf.pkl')
Out[32]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
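The returned object is a regular Scikit-Learn estimator, so it can be used for predictions right away (a sketch, assuming a feature matrix X prepared as in the exercise above):

clf = s3.load_clf('s3://prod-datalytics/playground/models/clf.pkl')
predictions = clf.predict(X)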


Movin' Files between buckets and keys

In the interest of good file-keeping, let's move our saved classifier to its own special folder (key).

In [33]:
s3.cp(old_path='s3://prod-datalytics/playground/models/clf.pkl',
      new_path='s3://prod-datalytics/production_space/models/clf.pkl',)
Out[33]:
{'CopyObjectResult': {'ETag': '"fd28ec0656661ce2a86373b097a95b89"',
  'LastModified': datetime.datetime(2017, 3, 2, 0, 13, 43, tzinfo=tzutc())},
 'CopySourceVersionId': 'ov3ei3i4mEGFOcBMEmi2g8atvAbVHKJx',
 'ResponseMetadata': {'HTTPStatusCode': 200,
  'HostId': 'ax6Q2HTAn+86P6wz6v2MWX3ZLsYoksdpqgcJtyKaXcEur80A4awZMiEEDuMLzzcydYNoyX3wBGQ=',
  'RequestId': 'AE98FD0B9CE85D9F'},
 'VersionId': 'OJX.ffdLzhrSD5kYc3wyMAWhtSWBIawN'}

To move the file (and delete the old instance), we use mv instead of cp.

In [34]:
s3.mv(old_path='s3://prod-datalytics/playground/models/clf.pkl',
      new_path='s3://prod-datalytics/production_space/models/clf.pkl',)
Out[34]:
{'DeleteMarker': True,
 'ResponseMetadata': {'HTTPStatusCode': 204,
  'HostId': 'wevPToOM9kIl8QId3r9NzheWcq5c1Rw43kUnH9Js7Ja8N3Ah/8G5DxzfKO9JVaL4uZ8RMkSkN5o=',
  'RequestId': '9E3FC788589304B4'},
 'VersionId': 'tnPVq8F1usk.sBxewcQ2SUWaYOrG3KXN'}
