Walkthrough of the s3 helper functions

Forgot where you put some sensitive data in your cloud storage?
Tired of downloading files from S3 only to preview them in Excel or a text editor?
Want to touch and reshape data but feel it's caged off from you?
Look no further: your S3 problems are solved!

Table of Contents

This Jupyter notebook gives a walkthrough of several handy functions from the s3 module.
Wherever possible, these functions mirror the interfaces of the standard library, while employing popular open source projects on the backend.

The notebook highlights seven things:
  1. List files (with wildcards) in an S3 bucket/key using ls()
  2. Read files into a string or bytes using read() and open()
  3. Read CSV and JSON files on S3 into Pandas DataFrames using read_csv() and read_json()
  4. Write CSV and JSON files from Pandas DataFrames to S3 using to_csv() and to_json()
  5. Write local files to S3 using disk_2_s3()
  6. Save and load Scikit-Learn classifiers using dump_clf() and load_clf()
  7. Move files to new buckets and keys using mv() and cp()

The only requirements are setting your AWS credentials (either as environment variables or via the AWS CLI configuration) and installing the modules listed in requirements.txt.
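If you're not configuring the AWS CLI, the credentials can be supplied as environment variables before the s3 module is imported. A minimal sketch with placeholder values (these are the standard AWS variable names; substitute your own keys):

import os

# placeholder values only; use your own credentials and region
os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_ACCESS_KEY'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET_KEY'
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'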

For this tutorial, we'll use the red wine quality dataset from UCI Center for Machine Learning and Intelligent Systems.

In [1]:
import os
import s3

Listing files in an S3 bucket and key using ls()

s3.ls lists all the files and directories in a bucket/key, akin to os.listdir().
see the code

In [2]:
s3_path = 's3://prod-datalytics/playground/'

It takes either a bucket or a bucket/key pair.

In [3]:
s3.ls(s3_path)
Out[3]:
['s3://prod-datalytics/playground/json_bourne.json',
 's3://prod-datalytics/playground/wine_is_fine.csv',
 's3://prod-datalytics/playground/wine_is_not_fine.tsv']

s3.ls also supports wildcard patterns, just like glob.glob().

In [4]:
s3.ls(s3_path + '*.csv')
Out[4]:
['s3://prod-datalytics/playground/wine_is_fine.csv']

With a programmatic way of getting S3 file paths, we can start doing some cool stuff.

Read files in S3 with open()

see the code

In [5]:
f = s3.ls(s3_path + '*.csv')[0]
f
Out[5]:
's3://prod-datalytics/playground/wine_is_fine.csv'

We can open the file as a streaming body of bytes.

In [6]:
s3.open(f)
Out[6]:
<botocore.response.StreamingBody at 0x10a52ada0>

This is helpful sometimes, but typically we want to read a file the way Python's native open() does:

with open(filename, 'r') as f:
    f.read()

In [7]:
s3.read(f, encoding='utf-8')[:200] # displays the first 200 characters.
Out[7]:
'fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality\n7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5'

For more structured data, we can leverage Pandas' parsing engines...

Read S3 files into memory with read_csv() and read_json()

s3.read_csv and s3.read_json work just like their Pandas counterparts, which they use under the hood.
With these handy functions, the data comes back as a DataFrame in a nice tabular format:
see the code

In [8]:
df = s3.read_csv(f, sep=',')
df.head(3)
Out[8]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5

A CSV is the simplest use case; we can handle alternative delimiters and JSON files too.

In [9]:
files = s3.ls(s3_path)
files
Out[9]:
['s3://prod-datalytics/playground/json_bourne.json',
 's3://prod-datalytics/playground/wine_is_fine.csv',
 's3://prod-datalytics/playground/wine_is_not_fine.tsv']

Here is a tab-separated values (TSV) file.

In [10]:
print("We can read the {} tsv easily.".format(files[-1]))

df = s3.read_csv(files[-1], sep='\t')
df.tail(3)
We can read s3://prod-datalytics/playground/wine_is_not_fine.tsv easily.
Out[10]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

Here's a JSON file:

In [11]:
print("We can also read the {} file easily.".format(files[0]))

df = s3.read_json(files[0])
df.sample(3)
We can also read the s3://prod-datalytics/playground/json_bourne.json file easily.
Out[11]:
alcohol chlorides citric acid density fixed acidity free sulfur dioxide pH quality residual sugar sulphates total sulfur dioxide volatile acidity
1289 10.2 0.068 0.30 0.99914 7.0 20.0 3.30 5 4.5 1.17 110.0 0.60
607 10.5 0.092 0.41 0.99820 8.8 26.0 3.31 6 3.3 0.53 52.0 0.48
675 10.2 0.064 0.39 0.99840 9.3 12.0 3.26 5 2.2 0.65 31.0 0.41

They're actually all the same data, just in different formats!
If you're new to Pandas, you'll be happy to learn that it's the de facto Python tool for data manipulation.

In [12]:
df.dtypes
Out[12]:
alcohol                 float64
chlorides               float64
citric acid             float64
density                 float64
fixed acidity           float64
free sulfur dioxide     float64
pH                      float64
quality                   int64
residual sugar          float64
sulphates               float64
total sulfur dioxide    float64
volatile acidity        float64
dtype: object

Basic stats and distributions are just a function call away.

In [13]:
df.describe().T
Out[13]:
count mean std min 25% 50% 75% max
alcohol 1599.0 10.422983 1.065668 8.40000 9.5000 10.20000 11.100000 14.90000
chlorides 1599.0 0.087467 0.047065 0.01200 0.0700 0.07900 0.090000 0.61100
citric acid 1599.0 0.270976 0.194801 0.00000 0.0900 0.26000 0.420000 1.00000
density 1599.0 0.996747 0.001887 0.99007 0.9956 0.99675 0.997835 1.00369
fixed acidity 1599.0 8.319637 1.741096 4.60000 7.1000 7.90000 9.200000 15.90000
free sulfur dioxide 1599.0 15.874922 10.460157 1.00000 7.0000 14.00000 21.000000 72.00000
pH 1599.0 3.311113 0.154386 2.74000 3.2100 3.31000 3.400000 4.01000
quality 1599.0 5.636023 0.807569 3.00000 5.0000 6.00000 6.000000 8.00000
residual sugar 1599.0 2.538806 1.409928 0.90000 1.9000 2.20000 2.600000 15.50000
sulphates 1599.0 0.658149 0.169507 0.33000 0.5500 0.62000 0.730000 2.00000
total sulfur dioxide 1599.0 46.467792 32.895324 6.00000 22.0000 38.00000 62.000000 289.00000
volatile acidity 1599.0 0.527821 0.179060 0.12000 0.3900 0.52000 0.640000 1.58000

Everything is indexed!
Here we get a quick calculation for the 75th percentile of alcohol content.

In [14]:
df.describe()['alcohol']['75%']
Out[14]:
11.1
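Equivalently, a quick sketch using Pandas' quantile() directly, without building the full describe() table:

# same 75th percentile, computed straight from the column
df['alcohol'].quantile(0.75)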

It's easy to filter a DataFrame.
Here we're going to get all the heavily alcoholic wines...

In [15]:
df_alcoholic = df[df['alcohol'] > df.describe()['alcohol']['75%']]
df_alcoholic.head()
Out[15]:
alcohol chlorides citric acid density fixed acidity free sulfur dioxide pH quality residual sugar sulphates total sulfur dioxide volatile acidity
45 13.1 0.054 0.15 0.9934 4.6 8.0 3.90 4 2.1 0.56 65.0 0.52
95 12.9 0.058 0.17 0.9932 4.7 17.0 3.85 6 2.3 0.60 106.0 0.60
131 13.0 0.049 0.09 0.9937 5.6 17.0 3.63 5 2.3 0.63 99.0 0.50
132 13.0 0.049 0.09 0.9937 5.6 17.0 3.63 5 2.3 0.63 99.0 0.50
142 14.0 0.050 0.00 0.9916 5.2 27.0 3.68 6 1.8 0.79 63.0 0.34
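Conditions can also be combined with boolean operators. Here's a quick sketch (standard Pandas syntax) selecting wines that are both high in alcohol and highly rated:

# high-alcohol AND high-quality wines
strong_and_good = df[(df['alcohol'] > df['alcohol'].quantile(0.75)) & (df['quality'] >= 7)]
strong_and_good.head()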

It's also stupid easy to plot, since Pandas builds on the Matplotlib package.

In [16]:
# this line is run once, typically at the beginning of the notebook, to enable inline plotting.
%matplotlib inline
In [17]:
df_alcoholic.plot(kind='scatter', x='residual sugar', y='density')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x10de76e10>

What is that outlier?

In [2]:
df_alcoholic[df_alcoholic['residual sugar'] > 12]
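Since Pandas' plot() returns a regular Matplotlib Axes object, the scatter plot can also be annotated to flag that region (a small sketch; the threshold of 12 mirrors the filter above):

ax = df_alcoholic.plot(kind='scatter', x='residual sugar', y='density')
ax.set_title('High-alcohol wines: residual sugar vs. density')
ax.axvline(12, linestyle='--', color='red')  # mark the outlier region used in the filter above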

After processing and normalizing the data, we may want to upload this new file to s3.

Write DataFrames to S3 with to_csv() and to_json()

s3.to_csv and s3.to_json are almost identical to their Pandas counterparts.
The difference is that s3.to_csv takes the DataFrame as its first argument, rather than being a method on the DataFrame.
see the code

In [18]:
# where will the file get stored?
s3_target = 's3://prod-datalytics/playground/wine_list.tsv.gz'

We can now use our filtered dataset to write a new file to S3.
Using Pandas' to_csv arguments, we have a lot of control over the output format.

In [19]:
s3.to_csv(df_alcoholic, s3_target, sep='\t',
          index=False, compression='gzip')
Out[19]:
"File uploaded to 's3://prod-datalytics/playground/wine_list.tsv.gz'"


Write local files to S3 with disk_2_s3()

We can send local files to S3 too. First, let's write a file to local disk using the built-in Pandas to_csv().

In [20]:
local_file = 'wine_list.tsv.gz'
In [21]:
df_alcoholic.to_csv(local_file, sep='\t', index=False, compression='gzip')
In [22]:
s3.disk_2_s3(file=local_file,
             s3_path=s3_target)
Out[22]:
"'wine_list.tsv.gz' loaded to 's3://prod-datalytics/playground/wine_list.tsv.gz'"
In [23]:
# purge it!
os.remove(local_file)

Saving and Loading Scikit-Learn Classifiers

If you're into machine learning, you're in luck!
see the code

In [24]:
from sklearn.ensemble import RandomForestClassifier

For this example, let's just use a vanilla Random Forest model.

In [25]:
clf = RandomForestClassifier()
clf
Out[25]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Here is where we'd train and evaluate the model...

In [30]:
# fit and evaluate the model (the training/evaluation code is omitted here)
# clf.fit(X, y)
ACCURACY ON TRAINING SET: 0.99
ACCURACY OF TEST SET: 0.61

On my first run (not shown) I got a test set accuracy of only 61%, which is pretty bad.
You should try to beat that score!

In [ ]:
'''
write some code here:
look into train_test_split, GridSearchCV, and KFold from Scikit-Learn.

This is also a great dataset for practicing:
scaling values (see StandardScaler),
dimensionality reduction (see PCA),
and linear models (see Lasso or LogisticRegression).
'''
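As a starting point, here's a minimal sketch of the kind of thing that cell is asking for, using Scikit-Learn's train_test_split. The feature/target split below is an assumption; the notebook never defines X and y.

from sklearn.model_selection import train_test_split

# assume 'quality' is the target and everything else is a feature
X = df.drop('quality', axis=1)
y = df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf.fit(X_train, y_train)
print("ACCURACY ON TRAINING SET: {:.2f}".format(clf.score(X_train, y_train)))
print("ACCURACY ON TEST SET: {:.2f}".format(clf.score(X_test, y_test)))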

Once you're happy with the performance, we can persist the model as a pickle file.

In [31]:
s3.dump_clf(clf, 's3://prod-datalytics/playground/models/clf.pkl')
Out[31]:
"'clf.pkl' loaded to 's3://prod-datalytics/playground/models/clf.pkl'"

And re-use it when the time is right!

In [32]:
s3.load_clf('s3://prod-datalytics/playground/models/clf.pkl')
Out[32]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
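The returned object is a regular Scikit-Learn estimator, so it can be used for predictions right away (a sketch, assuming a feature matrix X prepared as in the exercise above):

clf = s3.load_clf('s3://prod-datalytics/playground/models/clf.pkl')
predictions = clf.predict(X)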


Movin' Files between buckets and keys

In the interest of good file-keeping, let's move our saved classifier to its own special folder (key).

In [33]:
s3.cp(old_path='s3://prod-datalytics/playground/models/clf.pkl',
      new_path='s3://prod-datalytics/production_space/models/clf.pkl',)
Out[33]:
{'CopyObjectResult': {'ETag': '"fd28ec0656661ce2a86373b097a95b89"',
  'LastModified': datetime.datetime(2017, 3, 2, 0, 13, 43, tzinfo=tzutc())},
 'CopySourceVersionId': 'ov3ei3i4mEGFOcBMEmi2g8atvAbVHKJx',
 'ResponseMetadata': {'HTTPStatusCode': 200,
  'HostId': 'ax6Q2HTAn+86P6wz6v2MWX3ZLsYoksdpqgcJtyKaXcEur80A4awZMiEEDuMLzzcydYNoyX3wBGQ=',
  'RequestId': 'AE98FD0B9CE85D9F'},
 'VersionId': 'OJX.ffdLzhrSD5kYc3wyMAWhtSWBIawN'}

To move the file (and delete the old instance), we use mv instead of cp.

In [34]:
s3.mv(old_path='s3://prod-datalytics/playground/models/clf.pkl',
      new_path='s3://prod-datalytics/production_space/models/clf.pkl',)
Out[34]:
{'DeleteMarker': True,
 'ResponseMetadata': {'HTTPStatusCode': 204,
  'HostId': 'wevPToOM9kIl8QId3r9NzheWcq5c1Rw43kUnH9Js7Ja8N3Ah/8G5DxzfKO9JVaL4uZ8RMkSkN5o=',
  'RequestId': '9E3FC788589304B4'},
 'VersionId': 'tnPVq8F1usk.sBxewcQ2SUWaYOrG3KXN'}
