mlcourse.ai – Open Machine Learning Course

Author: Irina Knyazeva, ODS Slack nickname : iknyazeva

Tutorial

"HANDLING DIFFERENT DATASETS WITH DASK AND TRYING A LITTLE DASK ML"

WHY DO I NEED DASK?

Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don’t fit into main memory. Dask’s high-level collections are alternatives to NumPy and Pandas for large datasets.

YOU DEFINITELY NEED DASK IF

the problem size is close to the limits of RAM, but the data fits on disk

Reading list

This notebook is based mainly on these three sources

In [1]:
import psutil, os
import numpy as np
import pandas as pd
from dask import delayed
import gc
import time
import warnings
warnings.filterwarnings("ignore")

Let's write a small function that tracks how much memory the Python process is using

In [2]:
def memory_footprint():
    mem = psutil.Process(os.getpid()).memory_info().rss
    return (mem / 1024 ** 2)
In [3]:
before = memory_footprint()
print(f'Memory used before is {round(before,2)} MB')
Memory used before is 77.39 MB
In [4]:
N = (1024 ** 2) // 8
x = np.random.randn(50*N)
after = memory_footprint()
print(f'Memory used after is {round(after,2)} MB')
Memory used after is 127.43 MB
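
The jump of roughly 50 MB can be checked by hand. The short sketch below assumes the default float64 dtype (8 bytes per element), so `N` elements occupy exactly 1 MiB; it uses `np.zeros` instead of `np.random.randn` only to keep the check cheap.

```python
import numpy as np

# N float64 values take N * 8 bytes = 1 MiB
N = (1024 ** 2) // 8

# Same length as the array in the notebook; zeros are enough to measure size
x = np.zeros(50 * N)

bytes_per_element = x.itemsize          # 8 for float64
size_mb = x.nbytes / 1024 ** 2          # total buffer size in MiB
print(f'{x.size} elements * {bytes_per_element} bytes = {size_mb} MB')
```

The process footprint reported by `psutil` grows by about the same amount because the array buffer is the only large allocation in that cell.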

The next expression is computed but not bound to a variable, and it still allocates extra memory

In [5]:
x ** 2
after1 = memory_footprint()
print(f'Memory used after the computation is {round(after1,2)} MB')
Memory used after the computation is 177.43 MB

Dask arrays

Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. We coordinate these blocked algorithms using Dask graphs. (See the Dask array documentation.)

Dask has three main structures: the dask array (based on the NumPy array), the dask dataframe (based on the pandas DataFrame), and the dask bag (for unstructured data such as text).
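
To make the idea of a blocked algorithm concrete, here is a minimal NumPy-only sketch of a chunked mean. It is an illustration of the principle, not Dask's actual implementation; the helper name `blocked_mean` is made up for this example.

```python
import numpy as np

def blocked_mean(values, n_chunks):
    """Mean computed chunk by chunk: reduce each block, then combine."""
    chunks = np.array_split(values, n_chunks)
    # Each block is reduced independently (these steps could run in parallel)
    partial_sums = [chunk.sum() for chunk in chunks]
    partial_counts = [chunk.size for chunk in chunks]
    # The small partial results are combined into the final answer
    return sum(partial_sums) / sum(partial_counts)

rng = np.random.default_rng(0)
data = rng.standard_normal(1_000_000)
print(np.isclose(blocked_mean(data, 4), data.mean()))
```

Only one block and a handful of partial results need to be in memory at any time, which is what allows the same pattern to work on arrays larger than RAM.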

In [6]:
import dask.array as da
y = da.from_array(x, chunks=len(x)//4)
print('Dask arrays require little memory:', memory_footprint()-after1)
Dask arrays require little memory: 2.984375
In [7]:
import time
t_start = time.time()
x.mean()
t_end = time.time()
print('Compute the mean of this NumPy array \n')
print('Elapsed time to compute the mean of the NumPy array (ms):', round((t_end - t_start) * 1000))
Compute the mean of this NumPy array 

Elapsed time to compute the mean of the NumPy array (ms): 4
In [8]:
t_start = time.time()
y.mean().compute()
t_end = time.time()
print('Compute the same with Dask \n')
print('Elapsed time to compute the mean of the Dask array (ms):', round((t_end - t_start) * 1000))
Compute the same with Dask 

Elapsed time to compute the mean of the Dask array (ms): 21

In practice you would never do this: if your NumPy array is already in memory, partitioning it only adds scheduling overhead and increases the computation time. But if you need to process data from HDF5 or NetCDF files, or from a large collection of NumPy files on disk, Dask arrays can be extremely useful.
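
As a rough sketch of the on-disk case, the example below writes a few small `.npy` files to a temporary directory, memory-maps each one, and stitches them into a single Dask array. The file names, sizes, and the helper `lazy_array_from_npy` are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np
import dask.array as da

def lazy_array_from_npy(paths, chunk_size):
    """Build one Dask array from several .npy files without loading them."""
    parts = []
    for path in paths:
        # mmap_mode='r' maps the file instead of reading it into RAM
        mapped = np.load(path, mmap_mode='r')
        parts.append(da.from_array(mapped, chunks=chunk_size))
    return da.concatenate(parts)

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        path = os.path.join(tmp, f'part_{i}.npy')
        np.save(path, np.full(1000, float(i)))   # toy data: 0s, 1s, 2s
        paths.append(path)

    big = lazy_array_from_npy(paths, chunk_size=250)
    n_elements = big.shape[0]
    # Data is read chunk by chunk only when compute() is called
    total_mean = float(big.mean().compute())
    del big

print(n_elements, total_mean)
```

With real data the files would be far larger than these toy arrays, and the chunk size would be chosen so that a few chunks fit comfortably in memory.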

Delayed operations with dask

Dask can also be useful for small data thanks to delayed computation, which makes it easy to parallelize work. Let's look at an example with our previous NumPy array.

In [69]:
def f(z):
    return np.sqrt(z + 4)
def g(y):
    return y - 3
def h(x):
    return x ** 2

time_start = time.time()
x = np.random.randn(50*N)
y = h(x)
z = g(x)
w = f(z + y)
time_end = time.time()
print('Elapsed time to compute the composed functions on the NumPy array (ms):', round((time_end - time_start) * 1000))
Elapsed time to compute the composed functions on the NumPy array (ms): 426
In [10]:
y = delayed(h)(x)
z = delayed(g)(x)
w = delayed(f)(z+y)
print('After we get dask delayed object', w)
time_start = time.time()
w.compute()
time_end = time.time()
print('Elapsed time to compute the same functions with dask delayed (ms):', round((time_end - time_start) * 1000))
After we get dask delayed object Delayed('f-10fe1849-e5f7-4f12-97df-e728a4123d43')
Elapsed time to compute the same functions with dask delayed (ms): 98

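The task graph explains part of the speed-up: h(x) and g(x) do not depend on each other, so Dask can run them in parallel. Note, however, that the NumPy timing above also includes generating the random array, so the two numbers are not directly comparable. Now let's do the same with the second way of creating delayed functions, the decorator.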

In [11]:
@delayed
def f(z):
    return np.sqrt(z + 4)
@delayed
def g(y):
    return y - 3
@delayed
def h(x):
    return x ** 2

y = h(x); z = g(x)
w = f(z+y)
w.visualize()
Out[11]:
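
To see that a delayed call only builds a graph and runs nothing until `compute()` is called, here is a small sketch on scalars. The functions `square`, `shift`, and `add` and the `calls` list are made up for this illustration; the list simply records which functions actually executed.

```python
from dask import delayed

calls = []  # records which functions have actually been executed

@delayed
def square(v):
    calls.append('square')
    return v ** 2

@delayed
def shift(v):
    calls.append('shift')
    return v - 3

@delayed
def add(a, b):
    calls.append('add')
    return a + b

# Building the graph: no function body runs yet
total = add(square(5), shift(5))
ran_before = list(calls)

# compute() walks the graph; square and shift are independent tasks
result = total.compute()
ran_after = sorted(calls)

print(ran_before, result, ran_after)
```

Because `square(5)` and `shift(5)` share no dependencies, the scheduler is free to run them at the same time, which is the same structure as `h(x)` and `g(x)` in the notebook.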

Dask dataframe

Dask DataFrames coordinate many Pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These Pandas objects may live on disk or on other machines. [See documentation](http://docs.dask.org/en/latest/dataframe.html)

<img src="http://docs.dask.org/en/latest/_images/dask-dataframe.svg" width="40%" height="40%" />

In [12]:
import dask.dataframe as dd
In [13]:
print('Let\'s return to the start of our ML journey\n')
print('Load the Olympic dataset \n')
PATH = '../../data/athlete_events.csv'
Let's return to the start of our ML journey

Load the Olympic dataset 

In [14]:
df = pd.read_csv(PATH)
df.head()
Out[14]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN
In [15]:
m1=memory_footprint()
dask_df = dd.read_csv(PATH)
m2 = memory_footprint()
print('Dask does not allocate memory on creation:', m2-m1)
Dask does not allocate memory on creation: -5.16015625
In [16]:
print('But we can look at the data just as with a pandas DataFrame:')
dask_df.head()
But we can look at the data just as with a pandas DataFrame:
Out[16]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN
In [17]:
# building a delayed computation
print('We can do many operations the same way as in pandas, but without loading all the data into memory \n ')
sex_distr = dask_df.loc[dask_df['Games'].str.contains('1996')].groupby('Sex')['Age'].min()
We can do many operations the same way as in pandas, but without loading all the data into memory 
 
In [18]:
print('Here we did the selection and aggregation exactly the same way as in pandas \n')
print('But no computation has happened yet; we have only created a Dask structure \n')
sex_distr
Here we did the selection and aggregation exactly the same way as in pandas 

But no computation has happened yet; we have only created a Dask structure 
Out[18]:
Dask Series Structure:
npartitions=1
    float64
        ...
Name: Age, dtype: float64
Dask Name: series-groupby-min-agg, 8 tasks
In [19]:
%%time
print('The computation takes time, but remember that we don\'t need to load all the data into memory for it \n')
print(sex_distr.compute())
The computation takes time, but remember that we don't need to load all the data into memory for it 

Sex
F    12.0
M    14.0
Name: Age, dtype: float64
CPU times: user 665 ms, sys: 82.8 ms, total: 748 ms
Wall time: 746 ms
In [20]:
%%time
print('Pandas is, of course, faster here \n')
print(df.loc[df['Games'].str.contains('1996')].groupby('Sex')['Age'].min())
Pandas is, of course, faster here 

Sex
F    12.0
M    14.0
Name: Age, dtype: float64
CPU times: user 156 ms, sys: 3.07 ms, total: 159 ms
Wall time: 158 ms

Compatibility with Pandas API

  • Unavailable in dask.dataframe:
    • some unsupported file formats (e.g., .xls, .zip,...)
    • sorting
  • Available in dask.dataframe:
    • indexing, selection, & reindexing
    • aggregations: .sum(), .mean(), .std(), .min(), .max() etc.
    • grouping with .groupby()
    • datetime conversion with dd.to_datetime()

Read collections of files to dask dataframe

As an example, I've taken the Alica project: the capstone_user_identification archive (~7 MB, ~60 MB unzipped).

In [21]:
PATH_TO_DATA = '../../data/capstone_user_identification'
In [22]:
print('We can load all files into a single dataframe \n')
print('You don\'t need this in the Alica project, it is just an example \n ')
user10dask = dd.read_csv(os.path.join(PATH_TO_DATA, 
                                       '10users/*.csv'))
We can load all files into a single dataframe 

You don't need this in the Alica project, it is just an example 
 
In [23]:
print('We can look at the data')
print(user10dask)
user10dask.tail()
We can look at the data
Dask DataFrame Structure:
               timestamp    site
npartitions=10                  
                  object  object
                     ...     ...
...                  ...     ...
                     ...     ...
                     ...     ...
Dask Name: from-delayed, 30 tasks
Out[23]:
timestamp site
5327 2014-03-26 15:43:56 www.google.com
5328 2014-03-26 15:43:57 plus.google.com
5329 2014-03-26 15:43:57 mail.google.com
5330 2014-03-26 15:43:58 accounts.google.com
5331 2014-03-26 15:43:58 accounts.youtube.com
In [24]:
print('Let\'s see what happens if we count the visits per site (this can be seen as one more way to build the site dictionary) \n')
count_sites = user10dask.groupby('site')['site'].count()
Let's see what happens if we count the visits per site (this can be seen as one more way to build the site dictionary) 

In [25]:
print('If we visualize this structure we\'ll see the picture of computation \n')
count_sites.visualize()
If we visualize this structure we'll see the picture of computation 

Out[25]:
In [26]:
%%time
count_sites.compute().sort_values(ascending=False)[:20]
CPU times: user 196 ms, sys: 43.8 ms, total: 240 ms
Wall time: 177 ms
Out[26]:
site
s.youtube.com                           8300
www.google.fr                           7813
www.google.com                          5441
mail.google.com                         4158
www.facebook.com                        4141
apis.google.com                         3758
r3---sn-gxo5uxg-jqbe.googlevideo.com    3244
r1---sn-gxo5uxg-jqbe.googlevideo.com    3094
plus.google.com                         2630
accounts.google.com                     2089
r2---sn-gxo5uxg-jqbe.googlevideo.com    1939
fr-mg42.mail.yahoo.com                  1868
www.youtube.com                         1804
r4---sn-gxo5uxg-jqbe.googlevideo.com    1702
clients1.google.com                     1493
download.jboss.org                      1441
s-static.ak.facebook.com                1388
static.ak.facebook.com                  1265
i1.ytimg.com                            1232
twitter.com                             1204
Name: site, dtype: int64

JSON Files into Dask Bags

Dask Bag implements operations like map, filter, fold, and groupby on collections of Python objects. It does this in parallel with a small memory footprint using Python iterators. It is similar to a parallel version of PyToolz or a Pythonic version of the PySpark RDD. (See the Dask bag documentation.)

Dask bags are often used to parallelize simple computations on unstructured or semi-structured data like text data, log files, JSON records, or user defined Python objects.
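
Before turning to the real file, here is a minimal sketch of the typical bag workflow on three invented JSON lines: parse each line, then filter and count. The records and field values are made up; `scheduler='threads'` is passed only to keep this small example in a single process.

```python
import json
import dask.bag as db

# Invented records, one JSON document per line
raw_lines = [
    '{"domain": "medium.com", "title": "First post"}',
    '{"domain": "hackernoon.com", "title": "Second post"}',
    '{"domain": "medium.com", "title": "Third post"}',
]

# Parse lazily: each string becomes a dict only when the bag is computed
records = db.from_sequence(raw_lines, npartitions=2).map(json.loads)

# Count how often each domain occurs
domain_counts = dict(
    records.pluck('domain').frequencies().compute(scheduler='threads')
)

# Keep only one domain and extract the titles
medium_titles = (records
                 .filter(lambda rec: rec['domain'] == 'medium.com')
                 .pluck('title')
                 .compute(scheduler='threads'))

print(domain_counts)
print(medium_titles)
```

The same `map(json.loads)` step applies to a bag created with `db.read_text`, where each element is one line of the file.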

Let's look at an example with our Medium data

In [27]:
import dask.bag as db
import json
In [28]:
print('Path to our medium data \n')
PATH = '../../data/kaggle_medium'
print(PATH)
Path to our medium data 

../../data/kaggle_medium
In [29]:
print('Wrap train json to dask bag format \n')
items = db.read_text(os.path.join(PATH,'train.json'))
items
Wrap train json to dask bag format 

Out[29]:
dask.bag<bag-fro..., npartitions=1>
In [30]:
%%time
print('Let\'s look at one example \n')
print(items.take(1))
Let's look at one example 

('{"_id": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "_timestamp": 1520035195.282891, "_spider": "medium", "url": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "domain": "medium.com", "published": {"$date": "2012-08-13T22:54:53.510Z"}, "title": "Medium Terms of Service \\u2013 Medium Policy \\u2013 Medium", "content": "<div><header class=\\"container u-maxWidth740\\"><div class=\\"uiScale uiScale-ui--regular uiScale-caption--regular postMetaHeader u-paddingBottom10 row\\"><div class=\\"col u-size12of12 js-postMetaLockup\\"><div class=\\"uiScale uiScale-ui--regular uiScale-caption--regular postMetaLockup postMetaLockup--authorWithBio u-flexCenter js-postMetaLockup\\"><div class=\\"u-flex0\\"><a class=\\"link u-baseColor--link avatar\\" href=\\"https://medium.com/@Medium?source=post_header_lockup\\" data-action=\\"show-user-card\\" data-action-source=\\"post_header_lockup\\" data-action-value=\\"504c7870fdb6\\" data-action-type=\\"hover\\" data-user-id=\\"504c7870fdb6\\" dir=\\"auto\\"><div class=\\"u-relative u-inlineBlock u-flex0\\"><img src=\\"https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png\\" class=\\"avatar-image avatar-image--small\\" alt=\\"Go to the profile of Medium\\"><div class=\\"avatar-halo u-absolute u-textColorGreenNormal svgIcon\\" style=\\"width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px\\"><svg viewbox=\\"0 0 114 114\\" xmlns=\\"http://www.w3.org/2000/svg\\"><path d=\\"M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z\\"></path><path d=\\"M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 
35.3710985,113.504153 57,113.504153 C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 L106.33077,81.661427 Z\\"></path></svg></div></div></a></div><div class=\\"u-flex1 u-paddingLeft15 u-overflowHidden\\"><div class=\\"u-lineHeightTightest\\"><a class=\\"ds-link ds-link--styleSubtle ui-captionStrong u-inlineBlock link link--darken link--darker\\" href=\\"https://medium.com/@Medium?source=post_header_lockup\\" data-action=\\"show-user-card\\" data-action-source=\\"post_header_lockup\\" data-action-value=\\"504c7870fdb6\\" data-action-type=\\"hover\\" data-user-id=\\"504c7870fdb6\\" dir=\\"auto\\">Medium</a><span class=\\"followState js-followState\\" data-user-id=\\"504c7870fdb6\\"></span></div><div class=\\"ui-caption ui-xs-clamp2 postMetaInline\\">Everyone\\u2019s stories and ideas</div><div class=\\"ui-caption postMetaInline js-testPostMetaInlineSupplemental\\"><time datetime=\\"2012-08-13T22:54:53.510Z\\">Aug 13, 2012</time><span class=\\"middotDivider u-fontSize12\\"></span><span class=\\"readingTime\\" title=\\"5 min read\\"></span></div></div></div></div></div></header><div class=\\"postArticle-content js-postField js-notesSource js-trackedPost\\" data-post-id=\\"9db0094a1e0f\\" data-source=\\"post_page\\" data-collection-id=\\"675ebe56ac25\\" data-tracking-context=\\"postPage\\"><section name=\\"bb8c\\" class=\\"section section--body section--first section--last\\"><div class=\\"section-divider\\"><hr class=\\"section-divider\\"></div><div class=\\"section-content\\"><div class=\\"section-inner sectionLayout--insetColumn\\"><h1 name=\\"title\\" id=\\"title\\" class=\\"graf graf--h2 graf--leading graf--title\\">Medium Terms of\\u00a0Service</h1><p name=\\"571b\\" id=\\"571b\\" class=\\"graf graf--p graf-after--h2\\"><strong class=\\"markup--strong markup--p-strong\\">Effective: March 7, 2016</strong></p><p name=\\"c90b\\" id=\\"c90b\\" class=\\"graf graf--p graf-after--p\\">These Terms of Service (\\u201cTerms\\u201d) are a contract between you 
and A Medium Corporation. They govern your use of Medium\\u2019s sites, services, mobile apps, products, and content (\\u201cServices\\u201d).</p><p name=\\"238b\\" id=\\"238b\\" class=\\"graf graf--p graf-after--p\\">By using Medium, you agree to these Terms. If you don\\u2019t agree to any of the Terms, you can\\u2019t use Medium.</p><p name=\\"7769\\" id=\\"7769\\" class=\\"graf graf--p graf-after--p\\">We can change these Terms at any time. We keep a <a href=\\"https://github.com/Medium/medium-policy\\" data-href=\\"https://github.com/Medium/medium-policy\\" class=\\"markup--anchor markup--p-anchor\\" rel=\\"nofollow noopener\\" target=\\"_blank\\">historical</a> record of all changes to our Terms on GitHub. If a change is material, we\\u2019ll let you know before they take effect. By using Medium on or after that effective date, you agree to the new Terms. If you don\\u2019t agree to them, you should delete your account before they take effect, otherwise your use of the site and content will be subject to the new Terms.</p><h4 name=\\"8c81\\" id=\\"8c81\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Content rights &amp; responsibilities</strong></h4><p name=\\"ac74\\" id=\\"ac74\\" class=\\"graf graf--p graf-after--h4\\">You own the rights to the content you create and post on Medium.</p><p name=\\"651b\\" id=\\"651b\\" class=\\"graf graf--p graf-after--p\\">By posting content to Medium, you give us a nonexclusive license to publish it on Medium Services, including anything reasonably related to publishing it (like storing, displaying, reformatting, and distributing it). In consideration for Medium granting you access to and use of the Services, you agree that Medium may enable advertising on the Services, including in connection with the display of your content or other information. We may also use your content to promote Medium, including its products and content. 
We will never sell your content to third parties without your explicit permission.</p><p name=\\"2584\\" id=\\"2584\\" class=\\"graf graf--p graf-after--p\\">You\\u2019re responsible for the content you post. This means you assume all risks related to it, including someone else\\u2019s reliance on its accuracy, or claims relating to intellectual property or other legal rights.</p><p name=\\"c207\\" id=\\"c207\\" class=\\"graf graf--p graf-after--p\\">You\\u2019re welcome to post content on Medium that you\\u2019ve published elsewhere, as long as you have the rights you need to do so. By posting content to Medium, you represent that doing so doesn\\u2019t conflict with any other agreement you\\u2019ve made.</p><p name=\\"0372\\" id=\\"0372\\" class=\\"graf graf--p graf-after--p\\">By posting content you didn\\u2019t create to Medium, you are representing that you have the right to do so. For example, you are posting a work that\\u2019s in the public domain, used under license (including a free license, such as <a href=\\"https://creativecommons.org/licenses/\\" data-href=\\"https://creativecommons.org/licenses/\\" class=\\"markup--anchor markup--p-anchor\\" rel=\\"nofollow noopener\\" target=\\"_blank\\">Creative Commons</a>), or a fair use.</p><p name=\\"0472\\" id=\\"0472\\" class=\\"graf graf--p graf-after--p\\">We can remove any content you post for any reason.</p><p name=\\"db2b\\" id=\\"db2b\\" class=\\"graf graf--p graf-after--p\\">You can delete any of your posts, or your account, anytime. Processing the deletion may take a little time, but we\\u2019ll do it as quickly as possible. 
We may keep backup copies of your deleted post or account on our servers for up to 14 days after you delete it.</p><h4 name=\\"baf1\\" id=\\"baf1\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Our content and\\u00a0services</strong></h4><p name=\\"adc7\\" id=\\"adc7\\" class=\\"graf graf--p graf-after--h4\\">We reserve all rights in Medium\\u2019s look and feel. Some parts of Medium are licensed under third-party open source licenses. We also make some of our own code available under open source licenses. As for other parts of Medium, you may not copy or adapt any portion of our code or visual design elements (including logos) without express written permission from Medium unless otherwise permitted by law.</p><p name=\\"20e4\\" id=\\"20e4\\" class=\\"graf graf--p graf-after--p\\">You may not do, or try to do, the following: (1) access or tamper with non-public areas of the Services, our computer systems, or the systems of our technical providers; (2) access or search the Services by any means other than the currently available, published interfaces (e.g., APIs) that we provide; (3) forge any TCP/IP packet header or any part of the header information in any email or posting, or in any way use the Services to send altered, deceptive, or false source-identifying information; or (4) interfere with, or disrupt, the access of any user, host, or network, including sending a virus, overloading, flooding, spamming, mail-bombing the Services, or by scripting the creation of content or accounts in such a manner as to interfere with or create an undue burden on the Services.</p><p name=\\"f5dd\\" id=\\"f5dd\\" class=\\"graf graf--p graf-after--p\\">Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.</p><p name=\\"71a8\\" id=\\"71a8\\" class=\\"graf graf--p graf-after--p\\">We may change, terminate, or restrict access to any aspect of the 
service, at any time, without notice.</p><h4 name=\\"12f1\\" id=\\"12f1\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">No children</strong></h4><p name=\\"2ce7\\" id=\\"2ce7\\" class=\\"graf graf--p graf-after--h4\\">Medium is only for people 13 years old and over. By using Medium, you affirm that you are over 13. If we learn someone under 13 is using Medium, we\\u2019ll terminate their account.</p><h4 name=\\"531c\\" id=\\"531c\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Security</strong></h4><p name=\\"3155\\" id=\\"3155\\" class=\\"graf graf--p graf-after--h4\\">If you find a security vulnerability on Medium, tell us. We have a <a href=\\"https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2\\" data-href=\\"https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">bug bounty disclosure program</a>.</p><h4 name=\\"05cc\\" id=\\"05cc\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Incorporated rules and\\u00a0policies</strong></h4><p name=\\"5207\\" id=\\"5207\\" class=\\"graf graf--p graf-after--h4\\">By using the Services, you agree to let Medium collect and use information as detailed in our <a href=\\"https://medium.com/p/f03bf92035c9\\" data-href=\\"https://medium.com/p/f03bf92035c9\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">Privacy Policy</a>. 
If you\\u2019re outside the United States, you consent to letting Medium transfer, store, and process your information (including your personal information and content) in and out of the United States.</p><p name=\\"6230\\" id=\\"6230\\" class=\\"graf graf--p graf-after--p\\">To enable a functioning community, we have <a href=\\"https://medium.com/policy/medium-rules-30e5502c4eb4\\" data-href=\\"https://medium.com/policy/medium-rules-30e5502c4eb4\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">Rules</a>. To ensure usernames are distributed and used fairly, we have a <a href=\\"https://medium.com/@Medium/medium-username-policy-7054a77fb04f\\" data-href=\\"https://medium.com/@Medium/medium-username-policy-7054a77fb04f\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">Username Policy</a>. Under our <a href=\\"https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695\\" data-href=\\"https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">DMCA Policy</a>, we\\u2019ll remove material after receiving a valid takedown notice. Under our <a href=\\"https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7\\" data-href=\\"https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">Trademark Policy</a>, we\\u2019ll investigate any use of another\\u2019s trademark and respond appropriately.</p><p name=\\"21ad\\" id=\\"21ad\\" class=\\"graf graf--p graf-after--p\\">By using Medium, you agree to follow these Rules and Policies. 
If you don\\u2019t, we may remove content, or suspend or delete your account.</p><h4 name=\\"a2a2\\" id=\\"a2a2\\" class=\\"graf graf--h4 graf-after--p\\"><strong class=\\"markup--strong markup--h4-strong\\">Miscellaneous</strong></h4><p name=\\"b7da\\" id=\\"b7da\\" class=\\"graf graf--p graf-after--h4\\"><em class=\\"markup--em markup--p-em\\">Disclaimer of warranty.</em> Medium provides the Services to you as is. You use them at your own risk and discretion. That means they don\\u2019t come with any warranty. None express, none implied. No implied warranty of merchantability, fitness for a particular purpose, availability, security, title or non-infringement.</p><p name=\\"7073\\" id=\\"7073\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Limitation of Liability</em>. Medium won\\u2019t be liable to you for any damages that arise from your using the Services. This includes if the Services are hacked or unavailable. This includes all types of damages (indirect, incidental, consequential, special or exemplary). And it includes all kinds of legal claims, such as breach of contract, breach of warranty, tort, or any other loss.</p><p name=\\"3d70\\" id=\\"3d70\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">No waiver.</em> If Medium doesn\\u2019t exercise a particular right under these Terms, that doesn\\u2019t waive it.</p><p name=\\"ab04\\" id=\\"ab04\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Severability</em>. 
If any provision of these terms is found invalid by a court of competent jurisdiction, you agree that the court should try to give effect to the parties\\u2019 intentions as reflected in the provision and that other provisions of the Terms will remain in full effect.</p><p name=\\"bde8\\" id=\\"bde8\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Choice of law and jurisdiction.</em> These Terms are governed by California law, without reference to its conflict of laws provisions. You agree that any suit arising from the Services must take place in a court located in San Francisco, California.</p><p name=\\"bbb3\\" id=\\"bbb3\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Entire agreement.</em> These Terms (including any document incorporated by reference into them) are the whole agreement between Medium and you concerning the Services.</p><p name=\\"dbf1\\" id=\\"dbf1\\" class=\\"graf graf--p graf-after--p\\"><em class=\\"markup--em markup--p-em\\">Government use.</em> If you\\u2019re \\u200busing \\u200bMedium for the U.S. Government, <a href=\\"https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7\\" data-href=\\"https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">this Amendment</a> to \\u200bMedium\\u2019s Terms of Service \\u200bapplies to you\\u200b.</p><p name=\\"3318\\" id=\\"3318\\" class=\\"graf graf--p graf-after--p graf--trailing\\">Questions? 
Let us know at <a href=\\"mailto:%20legal@medium.com\\" data-href=\\"mailto:%20legal@medium.com\\" class=\\"markup--anchor markup--p-anchor\\" target=\\"_blank\\">legal@medium.com</a>.</p></div></div></section></div><footer class=\\"u-paddingTop10\\"><div class=\\"container u-maxWidth740\\"><div class=\\"row\\"><div class=\\"col u-size12of12\\"></div></div><div class=\\"row\\"><div class=\\"col u-size12of12 js-postTags\\"><div class=\\"u-paddingBottom10\\"><ul class=\\"tags tags--postTags tags--borderless\\"><li><a class=\\"link u-baseColor--link\\" href=\\"https://medium.com/tag/terms-and-conditions?source=post\\" data-action-source=\\"post\\">Terms And Conditions</a></li><li><a class=\\"link u-baseColor--link\\" href=\\"https://medium.com/tag/terms?source=post\\" data-action-source=\\"post\\">Terms</a></li><li><a class=\\"link u-baseColor--link\\" href=\\"https://medium.com/tag/medium?source=post\\" data-action-source=\\"post\\">Medium</a></li></ul></div></div></div><section class=\\"uiScale uiScale-ui--small uiScale-caption--regular u-borderTopLightest u-marginTop10 u-paddingTop20\\"><div class=\\"ui-h3 u-textColorDarker u-fontSize22\\">One clap, two clap, three clap, forty?</div><p class=\\"ui-body u-marginBottom20 u-textColorDark u-fontSize16\\">By clapping more or less, you can signal to us which stories really stand out.</p></section><div class=\\"postActions js-postActionsFooter\\"><div class=\\"u-flexCenter\\"><div class=\\"u-flex1\\"><div class=\\"multirecommend js-actionMultirecommend u-flexCenter u-width60\\" data-post-id=\\"9db0094a1e0f\\" data-is-icon-29px=\\"true\\" data-is-circle=\\"true\\" data-has-recommend-list=\\"true\\" data-source=\\"post_actions_footer-----9db0094a1e0f---------------------clap_footer\\"><div class=\\"u-relative u-foreground\\"><div class=\\"clapUndo u-width60 u-round u-height32 u-absolute u-borderBox u-paddingRight5 u-transition--transform200Spring u-background--brandSageLighter js-clapUndo\\" style=\\"top: 14px; padding: 
2px;\\"></div></div><span class=\\"u-textAlignCenter u-relative u-background js-actionMultirecommendCount u-marginLeft10\\"></span></div></div><div class=\\"buttonSet u-flex0\\"></div></div></div></div><div class=\\"u-maxWidth740 u-paddingTop20 u-marginTop20 u-borderTopLightest container u-paddingBottom20 u-xs-paddingBottom10 js-postAttributionFooterContainer\\"><div class=\\"row js-postFooterInfo\\"><div class=\\"col u-size6of12 u-xs-size12of12\\"><li class=\\"uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardUser\\"><div class=\\"u-marginLeft20 u-floatRight\\"><span class=\\"followState js-followState\\" data-user-id=\\"504c7870fdb6\\"></span></div><div class=\\"u-tableCell\\"><a class=\\"link u-baseColor--link avatar\\" href=\\"https://medium.com/@Medium?source=footer_card\\" title=\\"Go to the profile of Medium\\" aria-label=\\"Go to the profile of Medium\\" data-action-source=\\"footer_card\\" data-user-id=\\"504c7870fdb6\\" dir=\\"auto\\"><div class=\\"u-relative u-inlineBlock u-flex0\\"><img src=\\"https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png\\" class=\\"avatar-image avatar-image--small\\" alt=\\"Go to the profile of Medium\\"><div class=\\"avatar-halo u-absolute u-textColorGreenNormal svgIcon\\" style=\\"width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px\\"><svg viewbox=\\"0 0 114 114\\" xmlns=\\"http://www.w3.org/2000/svg\\"><path d=\\"M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z\\"></path><path d=\\"M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 35.3710985,113.504153 57,113.504153 
C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 L106.33077,81.661427 Z\\"></path></svg></div></div></a></div><div class=\\"u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15\\"><h3 class=\\"ui-h3 u-fontSize18 u-lineHeightTighter\\"><a class=\\"link link--primary u-accentColor--hoverTextNormal\\" href=\\"https://medium.com/@Medium\\" property=\\"cc:attributionName\\" title=\\"Go to the profile of Medium\\" aria-label=\\"Go to the profile of Medium\\" rel=\\"author cc:attributionUrl\\" data-user-id=\\"504c7870fdb6\\" dir=\\"auto\\">Medium</a></h3><div class=\\"ui-caption u-textColorGreenNormal u-fontSize13 u-tintSpectrum u-accentColor--textNormal u-marginBottom7\\">Medium member since Aug 2017</div><p class=\\"ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4\\">Everyone\\u2019s stories and ideas</p></div></li></div><div class=\\"col u-size6of12 u-xs-size12of12 u-xs-marginTop30\\"><li class=\\"uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardCollection\\"><div class=\\"u-marginLeft20 u-floatRight\\"></div><div class=\\"u-tableCell \\"><a class=\\"link u-baseColor--link avatar avatar--roundedRectangle\\" href=\\"https://medium.com/policy?source=footer_card\\" title=\\"Go to Medium Policy\\" aria-label=\\"Go to Medium Policy\\" data-action-source=\\"footer_card\\"><img src=\\"https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png\\" class=\\"avatar-image u-size60x60\\" alt=\\"Medium Policy\\"></a></div><div class=\\"u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15\\"><h3 class=\\"ui-h3 u-fontSize18 u-lineHeightTighter u-marginBottom4\\"><a class=\\"link link--primary u-accentColor--hoverTextNormal\\" href=\\"https://medium.com/policy?source=footer_card\\" rel=\\"collection\\" data-action-source=\\"footer_card\\">Medium Policy</a></h3><p class=\\"ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4\\">The Fine Print</p><div 
class=\\"buttonSet\\"></div></div></li></div></div></div><div class=\\"js-postFooterPlacements\\"></div><div class=\\"u-padding0 u-clearfix u-backgroundGrayLightest u-print-hide supplementalPostContent js-responsesWrapper\\"></div><div class=\\"supplementalPostContent js-heroPromo\\"></div></footer></div>", "author": {"name": null, "url": "https://medium.com/@Medium", "twitter": "@Medium"}, "image_url": null, "tags": [], "link_tags": {"canonical": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "publisher": "https://plus.google.com/103654360130207659246", "author": "https://medium.com/@Medium", "search": "/osd.xml", "alternate": "android-app://com.medium.reader/https/medium.com/p/9db0094a1e0f", "stylesheet": "https://cdn-static-1.medium.com/_/fp/css/main-branding-base.Ch8g7KPCoGXbtKfJaVXo_w.css", "icon": "https://cdn-static-1.medium.com/_/fp/icons/favicon-rebrand-medium.3Y6xpZ-0FSdWDnPM3hSBIA.ico", "apple-touch-icon": "https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png", "mask-icon": "https://cdn-static-1.medium.com/_/fp/icons/monogram-mask.KPLCSFEZviQN0jQ7veN2RQ.svg"}, "meta_tags": {"viewport": "width=device-width, initial-scale=1", "title": "Medium Terms of Service \\u2013 Medium Policy \\u2013 Medium", "referrer": "unsafe-url", "description": "These Terms of Service (\\u201cTerms\\u201d) are a contract between you and A Medium Corporation. They govern your use of Medium\\u2019s sites, services, mobile apps, products, and content (\\u201cServices\\u201d). By using\\u2026", "theme-color": "#000000", "og:title": "Medium Terms of Service \\u2013 Medium Policy \\u2013 Medium", "og:url": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "fb:app_id": "542599432471018", "og:description": "These Terms of Service (\\u201cTerms\\u201d) are a contract between you and A Medium Corporation. They govern your use of Medium\\u2019s sites, services, mobile apps, products, and content (\\u201cServices\\u201d). 
By using\\u2026", "twitter:description": "These Terms of Service (\\u201cTerms\\u201d) are a contract between you and A Medium Corporation. They govern your use of Medium\\u2019s sites, services, mobile apps, products, and content (\\u201cServices\\u201d). By using\\u2026", "author": "Medium", "og:type": "article", "twitter:card": "summary", "article:publisher": "https://www.facebook.com/medium", "article:author": "https://medium.com/@Medium", "robots": "index, follow", "article:published_time": "2012-08-13T22:54:53.510Z", "twitter:creator": "@Medium", "twitter:site": "@Medium", "og:site_name": "Medium", "twitter:label1": "Reading time", "twitter:data1": "5 min read", "twitter:app:name:iphone": "Medium", "twitter:app:id:iphone": "828256236", "twitter:app:url:iphone": "medium://p/9db0094a1e0f", "al:ios:app_name": "Medium", "al:ios:app_store_id": "828256236", "al:android:package": "com.medium.reader", "al:android:app_name": "Medium", "al:ios:url": "medium://p/9db0094a1e0f", "al:android:url": "medium://p/9db0094a1e0f", "al:web:url": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f"}}\n',)
CPU times: user 16.9 ms, sys: 26.1 ms, total: 43 ms
Wall time: 42.7 ms
In [31]:
print('We can parse the data with the json library and get dict-like objects \n')
dict_items = items.map(json.loads)
print(type(dict_items))
We can parse the data with the json library and get dict-like objects 

<class 'dask.bag.core.Bag'>
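As a small self-contained illustration of the same pattern, the sketch below builds a bag from a few hand-written JSON strings and parses them with json.loads. The sample records and the names raw_bag and parsed_bag are made up for this example, and the synchronous scheduler is chosen only to keep the sketch single-process.

```python
import json

import dask
import dask.bag as db

# Run everything in the current process; this choice is only for the sketch
dask.config.set(scheduler="synchronous")

# Hypothetical sample: three JSON-encoded records, one per string
lines = [
    '{"title": "First post", "domain": "medium.com"}',
    '{"title": "Second post", "domain": "medium.com"}',
    '{"title": "Third post", "domain": "example.org"}',
]

raw_bag = db.from_sequence(lines, npartitions=1)

# map only records the operation; nothing is parsed yet
parsed_bag = raw_bag.map(json.loads)

# take and compute trigger the actual parsing
first_record = parsed_bag.take(1)      # tuple with one dict
all_records = parsed_bag.compute()     # list of dicts

print(type(parsed_bag))
print(first_record)
```

The result of map is still a Bag, which is why type(dict_items) above reports dask.bag.core.Bag rather than a list of dicts.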
In [32]:
dict_items.take(1)
Out[32]:
({'_id': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f',
  '_timestamp': 1520035195.282891,
  '_spider': 'medium',
  'url': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f',
  'domain': 'medium.com',
  'published': {'$date': '2012-08-13T22:54:53.510Z'},
  'title': 'Medium Terms of Service – Medium Policy – Medium',
  'content': '<div><header class="container u-maxWidth740"><div class="uiScale uiScale-ui--regular uiScale-caption--regular postMetaHeader u-paddingBottom10 row"><div class="col u-size12of12 js-postMetaLockup"><div class="uiScale uiScale-ui--regular uiScale-caption--regular postMetaLockup postMetaLockup--authorWithBio u-flexCenter js-postMetaLockup"><div class="u-flex0"><a class="link u-baseColor--link avatar" href="https://medium.com/@Medium?source=post_header_lockup" data-action="show-user-card" data-action-source="post_header_lockup" data-action-value="504c7870fdb6" data-action-type="hover" data-user-id="504c7870fdb6" dir="auto"><div class="u-relative u-inlineBlock u-flex0"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png" class="avatar-image avatar-image--small" alt="Go to the profile of Medium"><div class="avatar-halo u-absolute u-textColorGreenNormal svgIcon" style="width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px"><svg viewbox="0 0 114 114" xmlns="http://www.w3.org/2000/svg"><path d="M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z"></path><path d="M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 35.3710985,113.504153 57,113.504153 C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 L106.33077,81.661427 Z"></path></svg></div></div></a></div><div class="u-flex1 u-paddingLeft15 u-overflowHidden"><div class="u-lineHeightTightest"><a class="ds-link ds-link--styleSubtle ui-captionStrong u-inlineBlock link link--darken link--darker" href="https://medium.com/@Medium?source=post_header_lockup" data-action="show-user-card" 
data-action-source="post_header_lockup" data-action-value="504c7870fdb6" data-action-type="hover" data-user-id="504c7870fdb6" dir="auto">Medium</a><span class="followState js-followState" data-user-id="504c7870fdb6"></span></div><div class="ui-caption ui-xs-clamp2 postMetaInline">Everyone’s stories and ideas</div><div class="ui-caption postMetaInline js-testPostMetaInlineSupplemental"><time datetime="2012-08-13T22:54:53.510Z">Aug 13, 2012</time><span class="middotDivider u-fontSize12"></span><span class="readingTime" title="5 min read"></span></div></div></div></div></div></header><div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="9db0094a1e0f" data-source="post_page" data-collection-id="675ebe56ac25" data-tracking-context="postPage"><section name="bb8c" class="section section--body section--first section--last"><div class="section-divider"><hr class="section-divider"></div><div class="section-content"><div class="section-inner sectionLayout--insetColumn"><h1 name="title" id="title" class="graf graf--h2 graf--leading graf--title">Medium Terms of\xa0Service</h1><p name="571b" id="571b" class="graf graf--p graf-after--h2"><strong class="markup--strong markup--p-strong">Effective: March 7, 2016</strong></p><p name="c90b" id="c90b" class="graf graf--p graf-after--p">These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”).</p><p name="238b" id="238b" class="graf graf--p graf-after--p">By using Medium, you agree to these Terms. If you don’t agree to any of the Terms, you can’t use Medium.</p><p name="7769" id="7769" class="graf graf--p graf-after--p">We can change these Terms at any time. 
We keep a <a href="https://github.com/Medium/medium-policy" data-href="https://github.com/Medium/medium-policy" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">historical</a> record of all changes to our Terms on GitHub. If a change is material, we’ll let you know before they take effect. By using Medium on or after that effective date, you agree to the new Terms. If you don’t agree to them, you should delete your account before they take effect, otherwise your use of the site and content will be subject to the new Terms.</p><h4 name="8c81" id="8c81" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Content rights &amp; responsibilities</strong></h4><p name="ac74" id="ac74" class="graf graf--p graf-after--h4">You own the rights to the content you create and post on Medium.</p><p name="651b" id="651b" class="graf graf--p graf-after--p">By posting content to Medium, you give us a nonexclusive license to publish it on Medium Services, including anything reasonably related to publishing it (like storing, displaying, reformatting, and distributing it). In consideration for Medium granting you access to and use of the Services, you agree that Medium may enable advertising on the Services, including in connection with the display of your content or other information. We may also use your content to promote Medium, including its products and content. We will never sell your content to third parties without your explicit permission.</p><p name="2584" id="2584" class="graf graf--p graf-after--p">You’re responsible for the content you post. This means you assume all risks related to it, including someone else’s reliance on its accuracy, or claims relating to intellectual property or other legal rights.</p><p name="c207" id="c207" class="graf graf--p graf-after--p">You’re welcome to post content on Medium that you’ve published elsewhere, as long as you have the rights you need to do so. 
By posting content to Medium, you represent that doing so doesn’t conflict with any other agreement you’ve made.</p><p name="0372" id="0372" class="graf graf--p graf-after--p">By posting content you didn’t create to Medium, you are representing that you have the right to do so. For example, you are posting a work that’s in the public domain, used under license (including a free license, such as <a href="https://creativecommons.org/licenses/" data-href="https://creativecommons.org/licenses/" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">Creative Commons</a>), or a fair use.</p><p name="0472" id="0472" class="graf graf--p graf-after--p">We can remove any content you post for any reason.</p><p name="db2b" id="db2b" class="graf graf--p graf-after--p">You can delete any of your posts, or your account, anytime. Processing the deletion may take a little time, but we’ll do it as quickly as possible. We may keep backup copies of your deleted post or account on our servers for up to 14 days after you delete it.</p><h4 name="baf1" id="baf1" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Our content and\xa0services</strong></h4><p name="adc7" id="adc7" class="graf graf--p graf-after--h4">We reserve all rights in Medium’s look and feel. Some parts of Medium are licensed under third-party open source licenses. We also make some of our own code available under open source licenses. 
As for other parts of Medium, you may not copy or adapt any portion of our code or visual design elements (including logos) without express written permission from Medium unless otherwise permitted by law.</p><p name="20e4" id="20e4" class="graf graf--p graf-after--p">You may not do, or try to do, the following: (1) access or tamper with non-public areas of the Services, our computer systems, or the systems of our technical providers; (2) access or search the Services by any means other than the currently available, published interfaces (e.g., APIs) that we provide; (3) forge any TCP/IP packet header or any part of the header information in any email or posting, or in any way use the Services to send altered, deceptive, or false source-identifying information; or (4) interfere with, or disrupt, the access of any user, host, or network, including sending a virus, overloading, flooding, spamming, mail-bombing the Services, or by scripting the creation of content or accounts in such a manner as to interfere with or create an undue burden on the Services.</p><p name="f5dd" id="f5dd" class="graf graf--p graf-after--p">Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.</p><p name="71a8" id="71a8" class="graf graf--p graf-after--p">We may change, terminate, or restrict access to any aspect of the service, at any time, without notice.</p><h4 name="12f1" id="12f1" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">No children</strong></h4><p name="2ce7" id="2ce7" class="graf graf--p graf-after--h4">Medium is only for people 13 years old and over. By using Medium, you affirm that you are over 13. 
If we learn someone under 13 is using Medium, we’ll terminate their account.</p><h4 name="531c" id="531c" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Security</strong></h4><p name="3155" id="3155" class="graf graf--p graf-after--h4">If you find a security vulnerability on Medium, tell us. We have a <a href="https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2" data-href="https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2" class="markup--anchor markup--p-anchor" target="_blank">bug bounty disclosure program</a>.</p><h4 name="05cc" id="05cc" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Incorporated rules and\xa0policies</strong></h4><p name="5207" id="5207" class="graf graf--p graf-after--h4">By using the Services, you agree to let Medium collect and use information as detailed in our <a href="https://medium.com/p/f03bf92035c9" data-href="https://medium.com/p/f03bf92035c9" class="markup--anchor markup--p-anchor" target="_blank">Privacy Policy</a>. If you’re outside the United States, you consent to letting Medium transfer, store, and process your information (including your personal information and content) in and out of the United States.</p><p name="6230" id="6230" class="graf graf--p graf-after--p">To enable a functioning community, we have <a href="https://medium.com/policy/medium-rules-30e5502c4eb4" data-href="https://medium.com/policy/medium-rules-30e5502c4eb4" class="markup--anchor markup--p-anchor" target="_blank">Rules</a>. To ensure usernames are distributed and used fairly, we have a <a href="https://medium.com/@Medium/medium-username-policy-7054a77fb04f" data-href="https://medium.com/@Medium/medium-username-policy-7054a77fb04f" class="markup--anchor markup--p-anchor" target="_blank">Username Policy</a>. 
Under our <a href="https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695" data-href="https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695" class="markup--anchor markup--p-anchor" target="_blank">DMCA Policy</a>, we’ll remove material after receiving a valid takedown notice. Under our <a href="https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7" data-href="https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7" class="markup--anchor markup--p-anchor" target="_blank">Trademark Policy</a>, we’ll investigate any use of another’s trademark and respond appropriately.</p><p name="21ad" id="21ad" class="graf graf--p graf-after--p">By using Medium, you agree to follow these Rules and Policies. If you don’t, we may remove content, or suspend or delete your account.</p><h4 name="a2a2" id="a2a2" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Miscellaneous</strong></h4><p name="b7da" id="b7da" class="graf graf--p graf-after--h4"><em class="markup--em markup--p-em">Disclaimer of warranty.</em> Medium provides the Services to you as is. You use them at your own risk and discretion. That means they don’t come with any warranty. None express, none implied. No implied warranty of merchantability, fitness for a particular purpose, availability, security, title or non-infringement.</p><p name="7073" id="7073" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Limitation of Liability</em>. Medium won’t be liable to you for any damages that arise from your using the Services. This includes if the Services are hacked or unavailable. This includes all types of damages (indirect, incidental, consequential, special or exemplary). 
And it includes all kinds of legal claims, such as breach of contract, breach of warranty, tort, or any other loss.</p><p name="3d70" id="3d70" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">No waiver.</em> If Medium doesn’t exercise a particular right under these Terms, that doesn’t waive it.</p><p name="ab04" id="ab04" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Severability</em>. If any provision of these terms is found invalid by a court of competent jurisdiction, you agree that the court should try to give effect to the parties’ intentions as reflected in the provision and that other provisions of the Terms will remain in full effect.</p><p name="bde8" id="bde8" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Choice of law and jurisdiction.</em> These Terms are governed by California law, without reference to its conflict of laws provisions. You agree that any suit arising from the Services must take place in a court located in San Francisco, California.</p><p name="bbb3" id="bbb3" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Entire agreement.</em> These Terms (including any document incorporated by reference into them) are the whole agreement between Medium and you concerning the Services.</p><p name="dbf1" id="dbf1" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Government use.</em> If you’re \u200busing \u200bMedium for the U.S. Government, <a href="https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7" data-href="https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7" class="markup--anchor markup--p-anchor" target="_blank">this Amendment</a> to \u200bMedium’s Terms of Service \u200bapplies to you\u200b.</p><p name="3318" id="3318" class="graf graf--p graf-after--p graf--trailing">Questions? 
Let us know at <a href="mailto:%20legal@medium.com" data-href="mailto:%20legal@medium.com" class="markup--anchor markup--p-anchor" target="_blank">legal@medium.com</a>.</p></div></div></section></div><footer class="u-paddingTop10"><div class="container u-maxWidth740"><div class="row"><div class="col u-size12of12"></div></div><div class="row"><div class="col u-size12of12 js-postTags"><div class="u-paddingBottom10"><ul class="tags tags--postTags tags--borderless"><li><a class="link u-baseColor--link" href="https://medium.com/tag/terms-and-conditions?source=post" data-action-source="post">Terms And Conditions</a></li><li><a class="link u-baseColor--link" href="https://medium.com/tag/terms?source=post" data-action-source="post">Terms</a></li><li><a class="link u-baseColor--link" href="https://medium.com/tag/medium?source=post" data-action-source="post">Medium</a></li></ul></div></div></div><section class="uiScale uiScale-ui--small uiScale-caption--regular u-borderTopLightest u-marginTop10 u-paddingTop20"><div class="ui-h3 u-textColorDarker u-fontSize22">One clap, two clap, three clap, forty?</div><p class="ui-body u-marginBottom20 u-textColorDark u-fontSize16">By clapping more or less, you can signal to us which stories really stand out.</p></section><div class="postActions js-postActionsFooter"><div class="u-flexCenter"><div class="u-flex1"><div class="multirecommend js-actionMultirecommend u-flexCenter u-width60" data-post-id="9db0094a1e0f" data-is-icon-29px="true" data-is-circle="true" data-has-recommend-list="true" data-source="post_actions_footer-----9db0094a1e0f---------------------clap_footer"><div class="u-relative u-foreground"><div class="clapUndo u-width60 u-round u-height32 u-absolute u-borderBox u-paddingRight5 u-transition--transform200Spring u-background--brandSageLighter js-clapUndo" style="top: 14px; padding: 2px;"></div></div><span class="u-textAlignCenter u-relative u-background js-actionMultirecommendCount u-marginLeft10"></span></div></div><div 
class="buttonSet u-flex0"></div></div></div></div><div class="u-maxWidth740 u-paddingTop20 u-marginTop20 u-borderTopLightest container u-paddingBottom20 u-xs-paddingBottom10 js-postAttributionFooterContainer"><div class="row js-postFooterInfo"><div class="col u-size6of12 u-xs-size12of12"><li class="uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardUser"><div class="u-marginLeft20 u-floatRight"><span class="followState js-followState" data-user-id="504c7870fdb6"></span></div><div class="u-tableCell"><a class="link u-baseColor--link avatar" href="https://medium.com/@Medium?source=footer_card" title="Go to the profile of Medium" aria-label="Go to the profile of Medium" data-action-source="footer_card" data-user-id="504c7870fdb6" dir="auto"><div class="u-relative u-inlineBlock u-flex0"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png" class="avatar-image avatar-image--small" alt="Go to the profile of Medium"><div class="avatar-halo u-absolute u-textColorGreenNormal svgIcon" style="width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px"><svg viewbox="0 0 114 114" xmlns="http://www.w3.org/2000/svg"><path d="M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z"></path><path d="M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 35.3710985,113.504153 57,113.504153 C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 L106.33077,81.661427 Z"></path></svg></div></div></a></div><div class="u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15"><h3 class="ui-h3 u-fontSize18 u-lineHeightTighter"><a class="link 
link--primary u-accentColor--hoverTextNormal" href="https://medium.com/@Medium" property="cc:attributionName" title="Go to the profile of Medium" aria-label="Go to the profile of Medium" rel="author cc:attributionUrl" data-user-id="504c7870fdb6" dir="auto">Medium</a></h3><div class="ui-caption u-textColorGreenNormal u-fontSize13 u-tintSpectrum u-accentColor--textNormal u-marginBottom7">Medium member since Aug 2017</div><p class="ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4">Everyone’s stories and ideas</p></div></li></div><div class="col u-size6of12 u-xs-size12of12 u-xs-marginTop30"><li class="uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardCollection"><div class="u-marginLeft20 u-floatRight"></div><div class="u-tableCell "><a class="link u-baseColor--link avatar avatar--roundedRectangle" href="https://medium.com/policy?source=footer_card" title="Go to Medium Policy" aria-label="Go to Medium Policy" data-action-source="footer_card"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png" class="avatar-image u-size60x60" alt="Medium Policy"></a></div><div class="u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15"><h3 class="ui-h3 u-fontSize18 u-lineHeightTighter u-marginBottom4"><a class="link link--primary u-accentColor--hoverTextNormal" href="https://medium.com/policy?source=footer_card" rel="collection" data-action-source="footer_card">Medium Policy</a></h3><p class="ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4">The Fine Print</p><div class="buttonSet"></div></div></li></div></div></div><div class="js-postFooterPlacements"></div><div class="u-padding0 u-clearfix u-backgroundGrayLightest u-print-hide supplementalPostContent js-responsesWrapper"></div><div class="supplementalPostContent js-heroPromo"></div></footer></div>',
  'author': {'name': None,
   'url': 'https://medium.com/@Medium',
   'twitter': '@Medium'},
  'image_url': None,
  'tags': [],
  'link_tags': {'canonical': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f',
   'publisher': 'https://plus.google.com/103654360130207659246',
   'author': 'https://medium.com/@Medium',
   'search': '/osd.xml',
   'alternate': 'android-app://com.medium.reader/https/medium.com/p/9db0094a1e0f',
   'stylesheet': 'https://cdn-static-1.medium.com/_/fp/css/main-branding-base.Ch8g7KPCoGXbtKfJaVXo_w.css',
   'icon': 'https://cdn-static-1.medium.com/_/fp/icons/favicon-rebrand-medium.3Y6xpZ-0FSdWDnPM3hSBIA.ico',
   'apple-touch-icon': 'https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png',
   'mask-icon': 'https://cdn-static-1.medium.com/_/fp/icons/monogram-mask.KPLCSFEZviQN0jQ7veN2RQ.svg'},
  'meta_tags': {'viewport': 'width=device-width, initial-scale=1',
   'title': 'Medium Terms of Service – Medium Policy – Medium',
   'referrer': 'unsafe-url',
   'description': 'These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…',
   'theme-color': '#000000',
   'og:title': 'Medium Terms of Service – Medium Policy – Medium',
   'og:url': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f',
   'fb:app_id': '542599432471018',
   'og:description': 'These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…',
   'twitter:description': 'These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…',
   'author': 'Medium',
   'og:type': 'article',
   'twitter:card': 'summary',
   'article:publisher': 'https://www.facebook.com/medium',
   'article:author': 'https://medium.com/@Medium',
   'robots': 'index, follow',
   'article:published_time': '2012-08-13T22:54:53.510Z',
   'twitter:creator': '@Medium',
   'twitter:site': '@Medium',
   'og:site_name': 'Medium',
   'twitter:label1': 'Reading time',
   'twitter:data1': '5 min read',
   'twitter:app:name:iphone': 'Medium',
   'twitter:app:id:iphone': '828256236',
   'twitter:app:url:iphone': 'medium://p/9db0094a1e0f',
   'al:ios:app_name': 'Medium',
   'al:ios:app_store_id': '828256236',
   'al:android:package': 'com.medium.reader',
   'al:android:app_name': 'Medium',
   'al:ios:url': 'medium://p/9db0094a1e0f',
   'al:android:url': 'medium://p/9db0094a1e0f',
   'al:web:url': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f'}},)
In [33]:
print('We can take any key from all records \n')
title_bag  = dict_items.pluck('title')
print('With the take method we receive a tuple of objects \n')
print(title_bag.take(3))
We can take any key from all records 

With the take method we receive a tuple of objects 

('Medium Terms of Service – Medium Policy – Medium', 'Amendment to Medium Terms of Service Applicable to U.S. Government Users', '走入山與海之間:閩東大刀會和兩岸走私 – Yun-Chen Chien(簡韻真) – Medium')

We can write any function to process the data and apply it to every record with the map method

In [34]:
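def clean_title(text):
    # Replace punctuation and a few extra symbols with spaces, then lowercase
    import string
    cut_set = set(string.punctuation)
    cut_set.update(['”','—','…', "“",'⌘','❤','+','®','➜','¬','–'])
    # Map every character in cut_set to a single space
    text = text.translate(text.maketrans(''.join(cut_set)," " * len(cut_set)))
    text = text.lower()
    return text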
In [35]:
title_bag  = dict_items.pluck('title').map(clean_title)
In [36]:
title_bag.take(3)
Out[36]:
('medium terms of service   medium policy   medium',
 'amendment to medium terms of service applicable to u s  government users',
 '走入山與海之間:閩東大刀會和兩岸走私   yun chen chien(簡韻真)   medium')
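The hand-written character list in clean_title covers only the symbols that were noticed in the data. A possible alternative, shown here as a sketch and not as the tutorial's method, is to decide per character using its Unicode category. The function name clean_title_unicode is hypothetical. Note that it behaves differently from clean_title: it also replaces non-ASCII punctuation such as full-width colons and brackets.

```python
import unicodedata


def clean_title_unicode(text):
    # Categories starting with 'P' are punctuation, with 'S' are symbols
    cleaned = ''.join(
        ' ' if unicodedata.category(ch)[0] in ('P', 'S') else ch
        for ch in text
    )
    return cleaned.lower()


sample = clean_title_unicode('Medium Terms of Service – Medium Policy')
print(sample)
```

Because it is an ordinary function of one string, it can be applied in the same way, for example with dict_items.pluck('title').map(clean_title_unicode).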

Process meta_tags

In [37]:
meta_tags_bag  = dict_items.pluck('meta_tags')
test_meta = meta_tags_bag.take(3)
In [38]:
test_meta[1]
Out[38]:
{'viewport': 'width=device-width, initial-scale=1',
 'title': 'Amendment to Medium Terms of Service Applicable to U.S. Government Users',
 'referrer': 'origin',
 'description': 'This agreement (“Amendment”) is an amendment to Medium’s Terms. It is between Medium and the U.S. Government and applies to the use of Medium Services by the Government. The reason for this Amendment…',
 'theme-color': '#000000',
 'og:title': 'Amendment to Medium Terms of Service Applicable to U.S. Government Users',
 'og:url': 'https://medium.com/policy/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7',
 'fb:app_id': '542599432471018',
 'og:description': 'This agreement (“Amendment”) is an amendment to Medium’s Terms. It is between Medium and the U.S. Government and applies to the use of…',
 'twitter:description': 'This agreement (“Amendment”) is an amendment to Medium’s Terms. It is between Medium and the U.S. Government and applies to the use of…',
 'author': 'Medium',
 'og:type': 'article',
 'twitter:card': 'summary',
 'article:publisher': 'https://www.facebook.com/medium',
 'article:author': 'https://medium.com/@Medium',
 'robots': 'noindex, follow',
 'article:published_time': '2015-08-03T07:44:50.331Z',
 'twitter:creator': '@Medium',
 'twitter:site': '@Medium',
 'og:site_name': 'Medium',
 'twitter:label1': 'Reading time',
 'twitter:data1': '7 min read',
 'twitter:app:name:iphone': 'Medium',
 'twitter:app:id:iphone': '828256236',
 'twitter:app:url:iphone': 'medium://p/fccb00db67d7',
 'al:ios:app_name': 'Medium',
 'al:ios:app_store_id': '828256236',
 'al:android:package': 'com.medium.reader',
 'al:android:app_name': 'Medium',
 'al:ios:url': 'medium://p/fccb00db67d7',
 'al:android:url': 'medium://p/fccb00db67d7',
 'al:web:url': 'https://medium.com/policy/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7'}
In [39]:
def clean_meta_tags(meta):
    author = meta['author'].strip()
    min_reads = int(meta['twitter:data1'].split()[0])
    return {'author':author, 'min_reads':min_reads}
In [40]:
meta_tags_bag= meta_tags_bag.map(clean_meta_tags)
In [41]:
meta_tags_bag.take(1)
Out[41]:
({'author': 'Medium', 'min_reads': 5},)
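
The pluck-then-map pattern used above can be tried on a few in-memory records. This is only a minimal sketch: the `records` list is made up to mimic the structure of the Medium dump, and the synchronous scheduler is chosen just to keep the demo in a single process.

```python
import dask.bag as db

# Hypothetical records that mimic the structure of the Medium dump
records = [
    {"domain": "medium.com",
     "meta_tags": {"author": " Medium ", "twitter:data1": "5 min read"}},
    {"domain": "hackernoon.com",
     "meta_tags": {"author": "Jane Doe", "twitter:data1": "12 min read"}},
    {"domain": "medium.com",
     "meta_tags": {"author": "John Roe", "twitter:data1": "3 min read"}},
]


def clean_meta_tags(meta):
    """Keep only the author name and the reading time in minutes."""
    author = meta["author"].strip()
    min_reads = int(meta["twitter:data1"].split()[0])
    return {"author": author, "min_reads": min_reads}


bag = db.from_sequence(records, npartitions=2)

# pluck and map only extend the task graph; nothing is executed yet
cleaned_bag = bag.pluck("meta_tags").map(clean_meta_tags)

# compute() runs the graph and returns a plain Python list
cleaned = cleaned_bag.compute(scheduler="synchronous")
print(cleaned)
```

Note that `clean_meta_tags` assumes both keys are present in every record; a record without `'twitter:data1'` would raise a `KeyError` only when the bag is computed, not when `map` is called.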

Combine all together

In [42]:
%%time
#content_bag = dict_items.pluck('content').map(clean_content)
title_bag  = dict_items.pluck('title').map(clean_title)
published_bag  = dict_items.pluck('published').map(lambda x: x['$date'])
meta_bag = dict_items.pluck('meta_tags').map(clean_meta_tags)
domain_bag = dict_items.pluck('domain')
CPU times: user 779 µs, sys: 248 µs, total: 1.03 ms
Wall time: 1.03 ms
In [43]:
@delayed
def combine_to_df(list_dict):
    
    list_df = [pd.DataFrame(dict_) for dict_ in list_dict]
    return pd.concat(list_df, axis=1)
In [44]:
combined = combine_to_df([published_bag, meta_bag, domain_bag])
combined.visualize()
Out[44]:
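
To see what `combine_to_df` receives at compute time, here is a small sketch on made-up records (the `records` list and its keys are assumptions for illustration). When bags are passed to a delayed function, dask evaluates them first, so inside the function each bag is already a plain list.

```python
import pandas as pd
import dask.bag as db
from dask import delayed

# Hypothetical records, already cleaned
records = [
    {"published": "2012-08-13T22:54:53.510Z", "domain": "medium.com",
     "meta": {"author": "Medium", "min_reads": 5}},
    {"published": "2015-08-03T07:44:50.331Z", "domain": "hackernoon.com",
     "meta": {"author": "Jane Doe", "min_reads": 7}},
]
bag = db.from_sequence(records, npartitions=2)


@delayed
def combine_to_df(list_of_columns):
    # Each element is a list here: a list of strings becomes one column,
    # a list of dicts becomes one column per key
    frames = [pd.DataFrame(column) for column in list_of_columns]
    return pd.concat(frames, axis=1)


combined = combine_to_df([
    bag.pluck("published"),
    bag.pluck("meta"),
    bag.pluck("domain"),
])

# Nothing has run so far; compute() executes the whole graph
df_small = combined.compute(scheduler="synchronous")
df_small.columns = ["published", "Author", "min_reads", "domain"]
print(df_small)
```

The result is an ordinary pandas DataFrame, so this approach only works while the combined table fits into memory.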
In [45]:
# It takes time, around a minute 
from dask.diagnostics import ProgressBar
with ProgressBar():
    df = combined.compute()
df.columns = ['published', 'Author','min_reads','domain']
df.head()
[########################################] | 100% Completed | 59.9s
Out[45]:
published Author min_reads domain
0 2012-08-13T22:54:53.510Z Medium 5 medium.com
1 2015-08-03T07:44:50.331Z Medium 7 medium.com
2 2017-02-05T13:08:17.410Z Yun-Chen Chien(簡韻真) 2 medium.com
3 2017-05-06T08:16:30.776Z Vaibhav Khulbe 3 medium.com
4 2017-06-04T14:46:25.772Z Vaibhav Khulbe 4 medium.com
In [46]:
print('We can create a dask dataframe from a pandas dataframe \n')
dd_no_content = dd.from_pandas(df, npartitions=4)
We can create a dask dataframe from a pandas dataframe 

In [47]:
dd_no_content
Out[47]:
Dask DataFrame Structure:
published Author min_reads domain
npartitions=4
0 object object int64 object
15579 ... ... ... ...
31158 ... ... ... ...
46737 ... ... ... ...
62312 ... ... ... ...
Dask Name: from_pandas, 4 tasks
In [48]:
%%time
print('Transform the published column to datetime with pandas \n')
df['published'] = pd.to_datetime(df.published, format='%Y-%m-%dT%H:%M:%S.%fZ')
Transform the published column to datetime with pandas 

CPU times: user 277 ms, sys: 2.14 ms, total: 279 ms
Wall time: 277 ms
In [49]:
%%time
print('Transform the published column to datetime with dask \n')
dd_no_content['published'] = dd.to_datetime(dd_no_content.published, format='%Y-%m-%dT%H:%M:%S.%fZ').compute()
Transform the published column to datetime with dask 

CPU times: user 273 ms, sys: 6.49 ms, total: 279 ms
Wall time: 274 ms
In [50]:
dd_no_content.head()
Out[50]:
published Author min_reads domain
0 2012-08-13 22:54:53.510 Medium 5 medium.com
1 2015-08-03 07:44:50.331 Medium 7 medium.com
2 2017-02-05 13:08:17.410 Yun-Chen Chien(簡韻真) 2 medium.com
3 2017-05-06 08:16:30.776 Vaibhav Khulbe 3 medium.com
4 2017-06-04 14:46:25.772 Vaibhav Khulbe 4 medium.com
In [51]:
print('A function with mixed transformations written for a pandas df can be applied to a dask dataframe without changes \n')
def additional_time_features_df(df, to_cat_cols = ['Author','domain', 'month', 'year', 'day_of_week']):
    # The .dt accessor works the same way for pandas and dask datetime columns
    df['month'] = df['published'].dt.month
    df['year'] = df['published'].dt.year
    hour = df['published'].dt.hour
    df['hour'] = hour
    # Part-of-day indicators
    df['morning'] = ((hour >= 7) & (hour <= 11)).astype('float64')
    df['day'] = ((hour >= 12) & (hour <= 18)).astype('int')
    df['evening'] = ((hour >= 19) & (hour <= 23)).astype('int')
    df['night'] = ((hour >= 0) & (hour <= 6)).astype('int')
    # Cyclical encoding of the hour, so that 23:00 and 00:00 end up close together
    df['sin_hour'] = np.sin(2*np.pi*df['hour']/24)
    df['cos_hour'] = np.cos(2*np.pi*df['hour']/24)
    df = df.drop(["hour"], axis=1)
    day_of_week = df['published'].dt.dayofweek.astype('int')
    df['day_of_week']=day_of_week
    df['weekend'] = (day_of_week >= 5).astype('int')
    # turn to categorical 
    df[to_cat_cols] = df[to_cat_cols].astype('category')
    
    return df
A function with mixed transformations written for a pandas df can be applied to a dask dataframe without changes 
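
The `sin_hour` and `cos_hour` features place the hour on a circle, so hours that are close on the clock stay close as features. A minimal standard-library sketch of the same formula (the helper names `encode_hour` and `circle_distance` are made up for illustration):

```python
import math


def encode_hour(hour):
    """Map an hour of the day (0-23) to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def circle_distance(hour_a, hour_b):
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(hour_a), encode_hour(hour_b))


# 23:00 and 00:00 are one hour apart, and the encoding keeps them as close
# as 00:00 and 01:00, although the raw values 23 and 0 look far apart
print(circle_distance(23, 0))
print(circle_distance(0, 1))

# 00:00 and 12:00 are opposite points on the circle
print(circle_distance(0, 12))
```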

In [52]:
%%time
df_medium_train = additional_time_features_df(df.copy())
CPU times: user 694 ms, sys: 15.2 ms, total: 709 ms
Wall time: 707 ms
In [53]:
dd_medium_train = additional_time_features_df(dd_no_content)
In [54]:
%%time
dd_medium_train.compute()
CPU times: user 884 ms, sys: 52.9 ms, total: 937 ms
Wall time: 861 ms
Out[54]:
published Author min_reads domain month year morning day evening night sin_hour cos_hour day_of_week weekend
0 2012-08-13 22:54:53.510 Medium 5 medium.com 8 2012 0.0 0 1 0 -5.000000e-01 8.660254e-01 0 0
1 2015-08-03 07:44:50.331 Medium 7 medium.com 8 2015 1.0 0 0 0 9.659258e-01 -2.588190e-01 0 0
2 2017-02-05 13:08:17.410 Yun-Chen Chien(簡韻真) 2 medium.com 2 2017 0.0 1 0 0 -2.588190e-01 -9.659258e-01 6 1
3 2017-05-06 08:16:30.776 Vaibhav Khulbe 3 medium.com 5 2017 1.0 0 0 0 8.660254e-01 -5.000000e-01 5 1
4 2017-06-04 14:46:25.772 Vaibhav Khulbe 4 medium.com 6 2017 0.0 1 0 0 -5.000000e-01 -8.660254e-01 6 1
5 2017-04-02 16:21:15.171 Kate Reed Petty 7 medium.com 4 2017 0.0 1 0 0 -8.660254e-01 -5.000000e-01 6 1
6 2016-08-15 04:16:02.103 exedre 12 medium.com 8 2016 0.0 0 0 1 8.660254e-01 5.000000e-01 0 0
7 2015-01-14 21:31:07.568 Raghav Haran 5 medium.com 1 2015 0.0 0 1 0 -7.071068e-01 7.071068e-01 2 0
8 2014-02-11 04:11:54.771 Francine Lee 4 medium.com 2 2014 0.0 0 0 1 8.660254e-01 5.000000e-01 1 0
9 2015-10-25 02:58:05.551 Raghav Haran 8 medium.com 10 2015 0.0 0 0 1 5.000000e-01 8.660254e-01 6 1
10 2016-08-15 15:31:13.601 4 medium.com 8 2016 0.0 1 0 0 -7.071068e-01 -7.071068e-01 0 0
11 2016-08-09 21:01:06.303 One Month 9 medium.com 8 2016 0.0 0 1 0 -7.071068e-01 7.071068e-01 1 0
12 2016-09-08 15:47:57.336 Frank DeGeorge 7 hackernoon.com 9 2016 0.0 1 0 0 -7.071068e-01 -7.071068e-01 3 0
13 2016-09-30 18:05:35.950 Gregório Jung 8 medium.com 9 2016 0.0 1 0 0 -1.000000e+00 -1.836970e-16 4 0
14 2017-06-27 15:49:22.909 Stephen Hays 7 hackernoon.com 6 2017 0.0 1 0 0 -7.071068e-01 -7.071068e-01 1 0
15 2015-07-13 06:52:44.618 Andy Raskin 5 medium.com 7 2015 0.0 0 0 1 1.000000e+00 6.123234e-17 0 0
16 2017-05-01 13:22:43.785 Stephen Hays 8 hackernoon.com 5 2017 0.0 1 0 0 -2.588190e-01 -9.659258e-01 0 0
17 2016-08-31 17:11:24.263 Andy Raskin 7 medium.com 8 2016 0.0 1 0 0 -9.659258e-01 -2.588190e-01 2 0
18 2017-06-30 07:55:55.103 Mohit Mamoria 16 hackernoon.com 6 2017 1.0 0 0 0 9.659258e-01 -2.588190e-01 4 0
19 2016-12-13 23:29:35.556 Oscar Boyson 6 medium.com 12 2016 0.0 0 1 0 -2.588190e-01 9.659258e-01 1 0
20 2016-01-27 22:19:05.027 Brian Verne 5 hackernoon.com 1 2016 0.0 0 1 0 -5.000000e-01 8.660254e-01 2 0
21 2016-12-14 01:15:02.122 Morgan Courtney 11 hackernoon.com 12 2016 0.0 0 0 1 2.588190e-01 9.659258e-01 2 0
22 2016-09-05 22:02:40.326 Jarrett Carter Sr. 4 medium.com 9 2016 0.0 0 1 0 -5.000000e-01 8.660254e-01 0 0
23 2016-12-13 17:59:40.527 thrace 8 medium.com 12 2016 0.0 1 0 0 -9.659258e-01 -2.588190e-01 1 0
24 2017-05-02 17:28:39.120 JakeElman 8 medium.com 5 2017 0.0 1 0 0 -9.659258e-01 -2.588190e-01 1 0
25 2016-08-30 23:43:24.940 Hanna Fogel 2 medium.com 8 2016 0.0 0 1 0 -2.588190e-01 9.659258e-01 1 0
26 2017-04-26 02:50:29.511 Asaeda 9 medium.com 4 2017 0.0 0 0 1 5.000000e-01 8.660254e-01 2 0
27 2016-06-18 06:54:10.331 Dr. Syed Jamal Hasan 11 medium.com 6 2016 0.0 0 0 1 1.000000e+00 6.123234e-17 5 1
28 2016-05-17 17:52:00.960 tiffany jernigan 7 medium.com 5 2016 0.0 1 0 0 -9.659258e-01 -2.588190e-01 1 0
29 2017-04-17 16:29:28.306 Richy Chacon 4 medium.com 4 2017 0.0 1 0 0 -8.660254e-01 -5.000000e-01 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
62283 2017-06-26 20:07:57.240 Jacqueline Bashaw 4 NaN 6 2017 0.0 0 1 0 -8.660254e-01 5.000000e-01 0 0
62284 2017-05-23 23:24:15.931 Lily Herman 5 NaN 5 2017 0.0 0 1 0 -2.588190e-01 9.659258e-01 1 0
62285 2017-02-02 17:52:00.430 Angel Powell 5 NaN 2 2017 0.0 1 0 0 -9.659258e-01 -2.588190e-01 3 0
62286 2017-06-14 15:17:18.712 Lily Herman 6 NaN 6 2017 0.0 1 0 0 -7.071068e-01 -7.071068e-01 2 0
62287 2014-11-12 19:00:00.000 The Hairpin 12 NaN 11 2014 0.0 0 1 0 -9.659258e-01 2.588190e-01 2 0
62288 2014-03-18 10:55:54.000 The Billfold 14 NaN 3 2014 1.0 0 0 0 5.000000e-01 -8.660254e-01 1 0
62289 2012-05-03 19:00:03.000 The Awl 4 NaN 5 2012 0.0 0 1 0 -9.659258e-01 2.588190e-01 3 0
62290 2015-11-02 06:12:22.782 Josh Fruhlinger 18 NaN 11 2015 0.0 0 0 1 1.000000e+00 6.123234e-17 0 0
62291 2012-11-02 00:00:54.000 The Awl 9 NaN 11 2012 0.0 0 0 1 0.000000e+00 1.000000e+00 4 0
62292 2012-11-29 20:00:42.000 David Roth 6 NaN 11 2012 0.0 0 1 0 -8.660254e-01 5.000000e-01 3 0
62293 2012-11-28 18:00:10.000 The Awl 8 NaN 11 2012 0.0 1 0 0 -1.000000e+00 -1.836970e-16 2 0
62294 2016-06-09 16:19:34.121 Cecília Olliveira 3 NaN 6 2016 0.0 1 0 0 -8.660254e-01 -5.000000e-01 3 0
62295 2016-06-23 17:39:16.171 Amy Hawman 8 NaN 6 2016 0.0 1 0 0 -9.659258e-01 -2.588190e-01 3 0
62296 2016-08-23 00:33:48.276 Orlando Trott 5 NaN 8 2016 0.0 0 0 1 0.000000e+00 1.000000e+00 1 0
62297 2015-07-20 15:16:40.169 Transifex 6 NaN 7 2015 0.0 1 0 0 -7.071068e-01 -7.071068e-01 0 0
62298 2015-12-31 22:06:54.772 LA BioMed 3 NaN 12 2015 0.0 0 1 0 -5.000000e-01 8.660254e-01 3 0
62299 2017-01-05 16:19:59.807 Jessica Chen Riolfi 7 NaN 1 2017 0.0 1 0 0 -8.660254e-01 -5.000000e-01 3 0
62300 2016-03-21 18:48:18.079 Pierre @ L’Escapadou 7 NaN 3 2016 0.0 1 0 0 -1.000000e+00 -1.836970e-16 0 0
62301 2017-02-07 18:34:31.427 Nick Troiano 6 NaN 2 2017 0.0 1 0 0 -1.000000e+00 -1.836970e-16 1 0
62302 2016-06-29 02:49:57.853 Amanda L. 9 NaN 6 2016 0.0 0 0 1 5.000000e-01 8.660254e-01 2 0
62303 2016-10-04 12:22:51.674 Mayank Agarwal 4 NaN 10 2016 0.0 1 0 0 1.224647e-16 -1.000000e+00 1 0
62304 2016-10-10 04:17:03.477 Mayank Agarwal 9 NaN 10 2016 0.0 0 0 1 8.660254e-01 5.000000e-01 0 0
62305 2016-10-21 06:30:55.281 Mayank Agarwal 5 NaN 10 2016 0.0 0 0 1 1.000000e+00 6.123234e-17 4 0
62306 2017-05-23 04:37:28.709 Randi Gloss 7 NaN 5 2017 0.0 0 0 1 8.660254e-01 5.000000e-01 1 0
62307 2016-04-05 23:01:22.486 Heather Nann 3 NaN 4 2016 0.0 0 1 0 -2.588190e-01 9.659258e-01 1 0
62308 2016-01-28 01:03:08.798 Heather Nann 4 NaN 1 2016 0.0 0 0 1 2.588190e-01 9.659258e-01 3 0
62309 2016-01-14 13:28:30.277 Heather Nann 5 NaN 1 2016 0.0 1 0 0 -2.588190e-01 -9.659258e-01 3 0
62310 2016-03-06 06:51:45.307 Heather Nann 3 NaN 3 2016 0.0 0 0 1 1.000000e+00 6.123234e-17 6 1
62311 2017-01-15 17:45:22.836 Nick Todorov 7 NaN 1 2017 0.0 1 0 0 -9.659258e-01 -2.588190e-01 6 1
62312 2016-01-25 03:20:33.005 Heather Nann 5 NaN 1 2016 0.0 0 0 1 7.071068e-01 7.071068e-01 0 0

62313 rows × 14 columns

Dask ML

Dask ML provides scalable machine learning algorithms in Python that are compatible with scikit-learn. Let us first look at how scikit-learn handles the computations, and then at how Dask performs the same operations differently. See the dask-ml tutorials: Examples from dask ml

You need to install dask-ml first

There are two main parts in dask ml:

- approaches to handle big datasets 
- approaches to handle big models

Handle big model with dask distributed

The biggest model in our course was the random forest on text data, from the week with the Random Forest assignment. Below I reproduce part of that assignment with a reduced nrows and a reduced max_features in CountVectorizer; you can also check it with the original parameters.

In [55]:
# Load data
df = pd.read_csv("../../data/movie_reviews_train.csv", nrows=5000)

# Separate the texts (features) from the labels (target)
X_text = df["text"]
y_text = df["label"]

# Classes counts
df.label.value_counts()
Out[55]:
1    3060
0    1940
Name: label, dtype: int64
In [56]:
from sklearn.model_selection import StratifiedKFold,GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Split on 3 folds
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

# The pipeline vectorizes the text and then trains a logistic regression
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=500, ngram_range=(1, 3))),
    ('clf', LogisticRegression(random_state=17))])
In [57]:
%%time
parameters = {'clf__C': (0.1, 1, 10, 100)}
grid_search = GridSearchCV(classifier, parameters, scoring ='roc_auc', cv=skf)
grid_search = grid_search.fit(X_text, y_text)
CPU times: user 8.34 s, sys: 139 ms, total: 8.47 s
Wall time: 8.47 s
In [58]:
grid_search.best_score_
Out[58]:
0.7042233630808542

Replace joblib with dask

With this approach, all we need to do is switch the joblib backend to dask distributed: initialize a distributed client and change the backend.

In [59]:
%%time
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from dask.distributed import Client
client = Client()
parameters = {'clf__C': (0.1, 1, 10, 100)}
grid_search = GridSearchCV(classifier, parameters, scoring ='roc_auc', cv=skf)

t_start = time.time()

with joblib.parallel_backend('dask'):
    grid_search.fit(X_text, y_text)
t_end = time.time()
print('Elapsed time for grid_search with joblib replace (s):', round((t_end - t_start)))    
Elapsed time for grid_search with joblib replace (s): 5
CPU times: user 1.39 s, sys: 142 ms, total: 1.53 s
Wall time: 5.87 s
In [60]:
grid_search.best_score_
Out[60]:
0.7042233630808542

Replace Grid search with dask

As a counterpart to GridSearchCV in sklearn, Dask provides a library called Dask-SearchCV (it is now included in Dask ML). It merges identical steps so that there are fewer repeated computations. The installation command is below; we need to install it separately.

In [61]:
#pip3 install dask-searchcv
import dask_searchcv as dcv
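
The idea of merging repeated steps can be illustrated with a simple count. This is a conceptual sketch only, not how Dask-SearchCV is implemented: it assumes a pipeline of a vectorizer followed by a classifier, 3 folds and 4 values of C, and counts how often the vectorizer would be fitted with and without reusing identical fits.

```python
from itertools import product

folds = [0, 1, 2]              # 3 cross-validation folds
c_values = [0.1, 1, 10, 100]   # 4 candidate values for clf__C


def count_vectorizer_fits(merge_identical_steps):
    """Count how often the (expensive) vectorizer step has to be fitted."""
    fitted = set()   # (step name, fold) pairs that were already fitted
    n_fits = 0
    for fold, _c in product(folds, c_values):
        key = ("vectorizer", fold)   # the vectorizer does not depend on C
        if merge_identical_steps and key in fitted:
            continue                 # reuse the result of the earlier fit
        fitted.add(key)
        n_fits += 1
    return n_fits


print(count_vectorizer_fits(merge_identical_steps=False))  # 12
print(count_vectorizer_fits(merge_identical_steps=True))   # 3
```

The more expensive the shared preprocessing steps are, compared with the final estimator, the more such merging can save.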

We can use pipelines in the dask grid search, and according to the documentation dask is most useful for pipelines with many operations that can be parallelized, especially ones that include a feature union. I tried this, but got an error as a result... In any case, time-consuming operations such as CountVectorizer could not be parallelized, so here the dask grid search is used only for the classifier (see the documentation).

In [62]:
%%time
vect = CountVectorizer(max_features=500, ngram_range=(1, 3))
Xvect = vect.fit_transform(X_text)
CPU times: user 762 ms, sys: 30.8 ms, total: 793 ms
Wall time: 788 ms
In [63]:
lr = LogisticRegression()
parameters = {'C': (0.1, 1, 10, 100)}
t_start = time.time()
grid_search = dcv.GridSearchCV(lr, parameters, scoring ='roc_auc', cv=skf)
grid_search.fit(Xvect, y_text)
t_end = time.time()
print(f'Elapsed time for grid_search (excluding the time spent on vectorization): {round((t_end - t_start))} s')
Elapsed time for grid_search (excluding the time spent on vectorization): 0 s
In [64]:
grid_search.best_score_
Out[64]:
0.7020017187686919

I tried to see how well dask would do with a random forest with the original parameters, but sometimes this raises an error after execution (OSError: [Errno 24] Too many open files), and I couldn't fix it. Sometimes it works fine, and for small data it works in most cases, but if you re-run this notebook several times there is a good chance of getting this error. So I believe that dask-ml is very useful, but for now I don't know how it should be used properly.

In [65]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=17)
min_samples_leaf = [1, 2, 3]
max_features = [0.3, 0.5, 0.7]
max_depth = [None]

parameters = {'max_features': max_features,
              'min_samples_leaf': min_samples_leaf,
              'max_depth': max_depth}
grid_search = dcv.GridSearchCV(rf, parameters,  scoring ='roc_auc', cv=skf)
t_start = time.time()
grid_search.fit(Xvect, y_text)
t_end = time.time()
print(f'Elapsed time for dask grid_search for Random Forest {round((t_end - t_start))} (s):')
Elapsed time for dask grid_search for Random Forest 3 (s):
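
Regarding the "Too many open files" error mentioned above: Errno 24 means that the process has reached its limit on open file descriptors. The sketch below only inspects that limit with the standard library (Unix only). Which part of the notebook exhausts it is not established here. One possible cause, stated as an assumption, is that re-running the cell with `Client()` starts new workers while the old ones are still open, in which case calling `client.close()` before re-running may help.

```python
import resource  # standard library, available on Unix only

# Soft limit: the value currently enforced for this process
# Hard limit: the ceiling up to which the soft limit may be raised
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit for this process: soft={soft_limit}, hard={hard_limit}")
```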

Handle model with big data

A number of models have been rewritten in dask so that they can take dask objects (huge arrays) and fit on them. You can read more in the dask documentation. Below is an example with KMeans, but there are also dask versions of linear models and preprocessing functions. The API is very similar to scikit-learn, so it should be easy to use.

In [66]:
from dask_ml import datasets
from dask_ml.cluster import KMeans
In [67]:
X, y = datasets.make_blobs(n_samples=10000000,
                                   chunks=1000000,
                                   random_state=0,
                                   centers=3)
# persist() computes the chunks and keeps them in memory; the result is still a dask array
X = X.persist()
X
Out[67]:
dask.array<concatenate, shape=(10000000, 2), dtype=float64, chunksize=(1000000, 2)>
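
The effect of `persist()` can be checked on a small array. A minimal sketch with made-up sizes: the persisted object is still a dask array with the same chunk layout, but its chunks are already computed, so later operations reuse them.

```python
import dask.array as da

# Lazy array: 4 chunks of 250 rows, nothing is computed yet
x = da.random.random((1000, 2), chunks=(250, 2))

# persist() computes the chunks and keeps them in memory
x_persisted = x.persist()

print(type(x_persisted))       # still a dask array, not a NumPy array
print(x_persisted.numblocks)   # the chunk layout is unchanged

# Later computations start from the stored chunks
total = x_persisted.sum().compute()
print(total)
```

By contrast, `compute()` would return a single NumPy array, which is only an option when the whole result fits into memory.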
In [68]:
km = KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
km.fit(X)
Out[68]:
KMeans(algorithm='full', copy_x=True, init='k-means||', init_max_iter=2,
    max_iter=300, n_clusters=3, n_jobs=1, oversampling_factor=10,
    precompute_distances='auto', random_state=None, tol=0.0001)

Actually, I read an article about dask a couple of days ago and decided that writing a tutorial would be a good way to get acquainted with the library. So please don't be too strict if I have misunderstood something :))