Forget where you put some sensitive information on your cloud database?
Tired of downloading files from S3 only to preview them in Excel or TextPad?
Want to touch and reshape data but feel it's caged off from you?
Look no further: your S3 problems are solved!
The only requirements are setting your AWS environment variables (or configuring the AWS CLI) and installing the modules in requirements.txt.
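If you skip the AWS CLI, a minimal sketch of the environment-variable route looks like this (the variable names are the standard ones boto3 reads; the values are placeholders):
import os

# standard AWS credential variables picked up by boto3;
# replace the placeholder values with your own keys and region.
os.environ['AWS_ACCESS_KEY_ID'] = '<your-access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your-secret-access-key>'
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'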
For this tutorial, we'll use the red wine quality dataset from the UCI Center for Machine Learning and Intelligent Systems.
import os
import s3
s3.ls will list all the files and directories in a bucket/key, akin to os.listdir().
s3_path = 's3://prod-datalytics/playground/'
It takes either a bucket, or a bucket/key pair.
s3.ls(s3_path)
s3.ls also supports glob-style wildcard patterns, just like glob.glob().
s3.ls(s3_path + '*.csv')
With a programmatic method of getting s3 file paths, we can start doing some cool stuff.
f = s3.ls(s3_path + '*.csv')[0]
f
We can open the file as a streaming body of bytes.
s3.open(f)
This is helpful sometimes, but typically we want to read a file the way Python's native open does:
with open(filename, 'r') as f:
    data = f.read()
s3.read gives us that behavior for S3 paths.
s3.read(f, encoding='utf-8')[:200] # displays the first 200 characters.
For more structured data, we can leverage Pandas' parsing engines...
s3.read_csv and s3.read_json are identical to their Pandas counterparts, which serve as their ancestor and backbone.
Using these handy functions, your data comes back in a nice tabular format:
df = s3.read_csv(f, sep=',')
df.head(3)
A CSV is the simplest use case; we can handle alternative delimiters and JSON files too.
files = s3.ls(s3_path)
files
Here is a file with tab-separated values (TSV).
print("We can read the {} tsv easily.".format(files[-1]))
df = s3.read_csv(files[-1], sep='\t')
df.tail(3)
Here's a JSON file:
print("We can also read the {} file easily.".format(files[0]))
df = s3.read_json(files[0])
df.sample(3)
They're actually all the same file, just in different formats!
If you're new to Pandas, you'll be happy to learn that it is the de facto tool for data manipulation.
df.dtypes
Basic stats and distributions are just a function call away.
df.describe().T
Everything is indexed!
Here we get a quick calculation for the 75th percentile of alcohol content.
df.describe()['alcohol']['75%']
It's easy to filter a dataframe:
Here we're going to get all the heavily alcoholic wines...
df_alcoholic = df[df['alcohol'] > df.describe()['alcohol']['75%']]
df_alcoholic.head()
It's also stupid easy to plot, since Pandas builds on the Matplotlib package.
# this line is run once, typically at the beginning of the notebook, to enable plotting.
%matplotlib inline
df_alcoholic.plot(kind='scatter', x='residual sugar', y='density')
What is that outlier?
df_alcoholic[df_alcoholic['residual sugar'] > 12]
After processing and normalizing the data, we may want to upload this new file to S3.
s3.to_csv and s3.to_json are almost identical to their Pandas ancestor and backbone.
The difference is that s3.to_csv takes the dataframe as its first argument, rather than being a method of the dataframe.
# where will the file get stored?
s3_target = 's3://prod-datalytics/playground/wine_list.tsv.gz'
We can now use our filtered dataset to write a new file to S3.
Using Pandas' to_csv args, we have a lot of control over the output format.
s3.to_csv(df_alcoholic, s3_target, sep='\t',
          index=False, compression='gzip')
We can send local files to S3 too. First, let's write a file to local disk using the built-in Pandas to_csv().
local_file = 'wine_list.tsv.gz'
df_alcoholic.to_csv(local_file, sep='\t', index=False, compression='gzip')
s3.disk_2_s3(file=local_file,
             s3_path=s3_target)
# purge it!
os.remove(local_file)
If you're into machine learning, you're in luck!
see the code
from sklearn.ensemble import RandomForestClassifier
For this example, let's just use a vanilla Random Forest model.
clf = RandomForestClassifier()
clf
Here is where we'd train and evaluate the model...
# fit the model!
# clf.fit(X, y)
On my first run (not shown), I got a test set accuracy of only 61%, which is pretty bad.
You should try to beat that score!
'''
write some code here:
look into train_test_split, GridSearchCV, and KFold from Scikit-Learn.
This is also a great dataset to practice:
scaling values (see StandardScaler)
dimensionality reduction (see PCA)
and a linear model (see Lasso or LogisticRegression)
'''
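As a starting point, here's a minimal sketch of a train/test split and fit; it assumes the label column is 'quality' (as in the UCI wine quality data) and that df still holds the full table we loaded earlier:
from sklearn.model_selection import train_test_split

# assumes 'quality' is the label column, as in the UCI red wine dataset
X = df.drop('quality', axis=1)
y = df['quality']

# hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf.fit(X_train, y_train)
clf.score(X_test, y_test)  # mean accuracy on the held-out test set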
Once you're happy with the performance, we can persist the model as a pickle file.
s3.dump_clf(clf, 's3://prod-datalytics/playground/models/clf.pkl')
And re-use it when the time is right!
s3.load_clf('s3://prod-datalytics/playground/models/clf.pkl')
In the interest of good file-keeping, let's copy our saved classifier to its own folder (key).
s3.cp(old_path='s3://prod-datalytics/playground/models/clf.pkl',
      new_path='s3://prod-datalytics/production_space/models/clf.pkl')
To move the file (and delete the old instance), we use mv instead of cp.
s3.mv(old_path='s3://prod-datalytics/playground/models/clf.pkl',
      new_path='s3://prod-datalytics/production_space/models/clf.pkl')