Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don’t fit into main memory. Dask’s high-level collections are alternatives to NumPy and Pandas for large datasets.
They are most useful when the problem size is close to the limits of RAM but still fits on disk.
This notebook is based mainly on three sources.
import psutil, os
import numpy as np
import pandas as pd
from dask import delayed
import gc
import time
import warnings
warnings.filterwarnings("ignore")
Let's write a small function that tracks how much memory the Python process uses
def memory_footprint():
    # Resident set size (RSS) of the current process, converted from bytes to MB
    mem = psutil.Process(os.getpid()).memory_info().rss
    return (mem / 1024 ** 2)
before = memory_footprint()
print(f'Memory used before is {round(before,2)} MB')
Memory used before is 77.39 MB
N = (1024 ** 2) // 8
x = np.random.randn(50*N)
after = memory_footprint()
print(f'Memory used after is {round(after,2)} MB')
Memory used after is 127.43 MB
The next expression is computed but not bound to a variable, yet it still allocates extra memory for the result
x ** 2
after1 = memory_footprint()
print(f' Extra memory obtained after computation {round(after1,2)} MB')
Extra memory obtained after computation 177.43 MB
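The memory numbers above can be checked with a little arithmetic. The sketch below is only an illustration: it uses `np.zeros` instead of random data (an assumption made here, since only the size matters) and reports the size NumPy itself records for the array and for the result of `x ** 2`.

```python
import numpy as np

N = (1024 ** 2) // 8      # number of 8-byte float64 values in 1 MB (1024 ** 2 bytes)
x = np.zeros(50 * N)      # zeros instead of random data; only the size matters here

size_mb = x.nbytes / 1024 ** 2
print(f"{x.dtype} array with {x.size} elements occupies {size_mb} MB")

# x ** 2 builds a second array of the same size, bound to a name or not
squared = x ** 2
print(f"result of x ** 2 occupies {squared.nbytes / 1024 ** 2} MB")
```

Both arrays hold 50 MB of data, which is consistent with the two jumps of roughly 50 MB measured above.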
Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores. We coordinate these blocked algorithms using Dask graphs. See the Dask Array documentation.
import dask.array as da
y = da.from_array(x, chunks=len(x)//4)
print('Dask arrays require little memory:', memory_footprint()-after1)
Dask arrays require little memory: 2.984375
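To make the idea of a blocked algorithm concrete, here is a minimal pure-Python sketch of a mean computed chunk by chunk. It is not Dask's implementation; the helper names `chunked` and `blocked_mean` are hypothetical and exist only for this illustration.

```python
def chunked(values, chunk_size):
    """Yield consecutive slices of `values` holding at most `chunk_size` items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def blocked_mean(values, chunk_size):
    """Compute the mean from per-chunk partial sums and counts."""
    # Each chunk is reduced independently; these reductions could run in parallel
    partials = [(sum(chunk), len(chunk)) for chunk in chunked(values, chunk_size)]
    # A final, cheap step combines the small partial results
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


data = list(range(1, 101))                  # 1, 2, ..., 100
result = blocked_mean(data, chunk_size=25)
print(result)                               # 50.5
```

Only the partial sums and counts have to be kept around for the final step, which is why this pattern also works when the full array does not fit in memory.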
import time
t_start = time.time()
x.mean()
t_end = time.time()
print('Compute mean value of this numpy array \n')
print('Elapsed time for compute mean of numpy array (ms):', round((t_end - t_start) * 1000))
Compute mean value of this numpy array
Elapsed time for compute mean of numpy array (ms): 4
t_start = time.time()
y.mean().compute()
t_end = time.time()
print('Compute the same with dask \n')
print('Elapsed time for compute mean of dask array (ms):', round((t_end - t_start) * 1000))
Compute the same with dask
Elapsed time for compute mean of dask array (ms): 21
In practice you would not use Dask like this: if the NumPy array is already in memory, partitioning it only adds scheduling overhead and, as here, increases computation time. But if you need to process data stored in HDF5 or NetCDF files, or a large collection of NumPy files on disk, Dask arrays can be extremely useful.
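As a rough sketch of the on-disk case, the example below writes a few small `.npy` files to a temporary directory and then builds one lazy Dask array from them with `dask.array.from_delayed`. The file names, sizes, and contents are assumptions made for this illustration; with real data the files would already exist and the shape of each one would have to be known in advance.

```python
import os
import tempfile

import numpy as np
import dask.array as da
from dask import delayed

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create four hypothetical chunk files, filled with 0.0, 1.0, 2.0 and 3.0
    paths = []
    for i in range(4):
        path = os.path.join(tmp_dir, f"chunk_{i}.npy")
        np.save(path, np.full(1000, float(i)))
        paths.append(path)

    # Wrap each load in `delayed`, so no file is read until compute() is called
    lazy_chunks = [
        da.from_delayed(delayed(np.load)(path), shape=(1000,), dtype=np.float64)
        for path in paths
    ]
    stacked = da.concatenate(lazy_chunks)

    # The files are read only here, one task per file
    mean_value = stacked.mean().compute()

print(stacked.shape, mean_value)
```

The computation has to run while the files still exist, which is why `compute()` is called inside the `with` block.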
Dask can also be useful for small data through delayed computation, which makes it easy to parallelize independent steps. Let's look at an example with our previous NumPy array.
def f(z):
    return np.sqrt(z + 4)
def g(y):
    return y - 3
def h(x):
    return x ** 2
time_start = time.time()
x = np.random.randn(50*N)
y = h(x)
z = g(x)
w = f(z + y)
time_end = time.time()
print('Elapsed time for compute complex functions with numpy array (ms):', round((time_end - time_start) * 1000))
Elapsed time for compute complex functions with numpy array (ms): 426
y = delayed(h)(x)
z = delayed(g)(x)
w = delayed(f)(z+y)
print('After we get dask delayed object', w)
time_start = time.time()
w.compute()
time_end = time.time()
print('Elapsed time for compute complex functions with numpy array with dask delayed (ms):', round((time_end - time_start) * 1000))
After we get dask delayed object Delayed('f-10fe1849-e5f7-4f12-97df-e728a4123d43')
Elapsed time for compute complex functions with numpy array with dask delayed (ms): 98
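The difference has two sources. In the task graph, h(x) and g(x) do not depend on each other, so Dask can run them in parallel. Note also that the NumPy timing above includes generating the random array, while the delayed timing does not, so the two numbers are not directly comparable. Let's build the same graph the second way, with the delayed decorator.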
@delayed
def f(z):
    return np.sqrt(z + 4)
@delayed
def g(y):
    return y - 3
@delayed
def h(x):
    return x ** 2
y = h(x)
z = g(x)
w = f(z+y)
w.visualize()
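The two ways of creating delayed computations used above, wrapping a call with `delayed(func)(...)` and decorating a function with `@delayed`, can be mixed freely. The following sketch uses small scalar functions (`inc`, `double`, and `add` are hypothetical names chosen for this example) so that the result is easy to verify by hand.

```python
from dask import delayed


def inc(v):
    return v + 1


def double(v):
    return v * 2


@delayed
def add(a, b):
    return a + b


# Nothing is executed here; each call only adds a node to the task graph
a = delayed(inc)(10)
b = delayed(double)(10)
total = add(a, b)

# compute() walks the graph; `a` and `b` are independent and may run in parallel
result = total.compute()
print(result)  # 31
```

Until `compute()` is called, `total` is only a description of the work to be done, which is what `visualize()` draws.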
Dask DataFrames coordinate many Pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These Pandas objects may live on disk or on other machines. [See documentation](http://docs.dask.org/en/latest/dataframe.html)
import dask.dataframe as dd
print('Let\'s return to start of our ML journey\n')
print('Load olympic dataset \n')
PATH = '../../data/athlete_events.csv'
Let's return to start of our ML journey Load olympic dataset
df = pd.read_csv(PATH)
df.head()
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
m1=memory_footprint()
dask_df = dd.read_csv(PATH)
m2 = memory_footprint()
print('Dask does not allocate memory after creation:', m2-m1)
Dask does not allocate memory after creation: -5.16015625
print('But we could see data as in pandas dataframe:')
dask_df.head()
But we could see data as in pandas dataframe:
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
# building delayed computation
print('We can do many operations the same way as in pandas, but without loading all data in memory \n')
sex_distr = dask_df.loc[dask_df['Games'].str.contains('1996')].groupby('Sex')['Age'].min()
We can do many operations the same way as in pandas, but without loading all data in memory
print('Here we did the selection and aggregation exactly the same way as in pandas \n')
print('But no computation has happened yet; we only created a dask structure \n')
sex_distr
Here we did the selection and aggregation exactly the same way as in pandas
But no computation has happened yet; we only created a dask structure
Dask Series Structure: npartitions=1 float64 ... Name: Age, dtype: float64 Dask Name: series-groupby-min-agg, 8 tasks
%%time
print('Computation is time consuming, but remember that we don\'t need to load all data in memory for this computation \n')
print(sex_distr.compute())
Computation is time consuming, but remember that we don't need to load all data in memory for this computation
Sex
F    12.0
M    14.0
Name: Age, dtype: float64
CPU times: user 665 ms, sys: 82.8 ms, total: 748 ms
Wall time: 746 ms
%%time
print('Pandas is of course faster when the data fits in memory \n')
print(df.loc[df['Games'].str.contains('1996')].groupby('Sex')['Age'].min())
Pandas is of course faster when the data fits in memory
Sex
F    12.0
M    14.0
Name: Age, dtype: float64
CPU times: user 156 ms, sys: 3.07 ms, total: 159 ms
Wall time: 158 ms
PATH_TO_DATA = '../../data/capstone_user_identification'
print('We can load all files into a single dataframe \n')
print('You don\'t need this in Alica project, just an example \n')
user10dask = dd.read_csv(os.path.join(PATH_TO_DATA,
                                      '10users/*.csv'))
We can load all files into a single dataframe
You don't need this in Alica project, just an example
print('We can look at the data')
print(user10dask)
user10dask.tail()
We can look at the data Dask DataFrame Structure: timestamp site npartitions=10 object object ... ... ... ... ... ... ... ... ... Dask Name: from-delayed, 30 tasks
| | timestamp | site |
|---|---|---|
| 5327 | 2014-03-26 15:43:56 | www.google.com |
| 5328 | 2014-03-26 15:43:57 | plus.google.com |
| 5329 | 2014-03-26 15:43:57 | mail.google.com |
| 5330 | 2014-03-26 15:43:58 | accounts.google.com |
| 5331 | 2014-03-26 15:43:58 | accounts.youtube.com |
print('Let\'s see what happens if we want to count all sites (it could be seen as one more way to create a dictionary) \n')
count_sites = user10dask.groupby('site')['site'].count()
Let's see what happens if we want to count all sites (it could be seen as one more way to create a dictionary)
print('If we visualize this structure we\'ll see the picture of computation \n')
count_sites.visualize()
If we visualize this structure we'll see the picture of computation
%%time
count_sites.compute().sort_values(ascending=False)[:20]
CPU times: user 196 ms, sys: 43.8 ms, total: 240 ms Wall time: 177 ms
site
s.youtube.com                           8300
www.google.fr                           7813
www.google.com                          5441
mail.google.com                         4158
www.facebook.com                        4141
apis.google.com                         3758
r3---sn-gxo5uxg-jqbe.googlevideo.com    3244
r1---sn-gxo5uxg-jqbe.googlevideo.com    3094
plus.google.com                         2630
accounts.google.com                     2089
r2---sn-gxo5uxg-jqbe.googlevideo.com    1939
fr-mg42.mail.yahoo.com                  1868
www.youtube.com                         1804
r4---sn-gxo5uxg-jqbe.googlevideo.com    1702
clients1.google.com                     1493
download.jboss.org                      1441
s-static.ak.facebook.com                1388
static.ak.facebook.com                  1265
i1.ytimg.com                            1232
twitter.com                             1204
Name: site, dtype: int64
Dask Bag implements operations like map, filter, fold, and groupby on collections of Python objects. It does this in parallel with a small memory footprint using Python iterators. It is similar to a parallel version of PyToolz or a Pythonic version of the PySpark RDD. See the Dask Bag documentation.
Dask bags are often used to parallelize simple computations on unstructured or semi-structured data like text data, log files, JSON records, or user defined Python objects.
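As a small illustration of these operations, the sketch below builds a bag from a few JSON lines held in memory and chains `map`, `filter`, and `pluck`. The records are invented for this example; only the field names `domain` and `title` mirror the Medium data used in this notebook.

```python
import json

import dask.bag as db

# Invented JSON lines standing in for a file with one record per line
lines = [
    '{"domain": "medium.com", "title": "A"}',
    '{"domain": "example.org", "title": "B"}',
    '{"domain": "medium.com", "title": "C"}',
]

bag = db.from_sequence(lines, npartitions=2)

# Parse each line, keep only medium.com records, then extract the titles
records = bag.map(json.loads)
titles_bag = records.filter(lambda r: r["domain"] == "medium.com").pluck("title")

# The synchronous scheduler keeps this toy run simple and deterministic;
# omit the argument to let Dask use its default parallel scheduler
titles = titles_bag.compute(scheduler="synchronous")
print(titles)  # ['A', 'C']
```

As with the other collections, each step only extends the task graph, and the work happens when `compute()` is called.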
Let's look at an example with our Medium data
import dask.bag as db
import json
print('Path to our medium data \n')
PATH = '../../data/kaggle_medium'
print(PATH)
Path to our medium data ../../data/kaggle_medium
print('Wrap train json to dask bag format \n')
items = db.read_text(os.path.join(PATH,'train.json'))
items
Wrap train json to dask bag format
dask.bag<bag-fro..., npartitions=1>
%%time
print('Let\'s look at one example \n')
print(items.take(1))
Let's look at one example
('{"_id": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "_timestamp": 1520035195.282891, "_spider": "medium", "url": "https://medium.com/policy/medium-terms-of-service-9db0094a1e0f", "domain": "medium.com", "published": {"$date": "2012-08-13T22:54:53.510Z"}, "title": "Medium Terms of Service \\u2013 Medium Policy \\u2013 Medium", "content": "<div>[... full HTML of the article, truncated here ...]</div>", "author": {"name": null, "url": "https://medium.com/@Medium", "twitter": "@Medium"}, "image_url": null, "tags": [], "link_tags": {[... truncated ...]}, "meta_tags": {[... truncated ...]}}\n',)
CPU times: user 16.9 ms, sys: 26.1 ms, total: 43 ms
Wall time: 42.7 ms
print('We can parse the data with the json library and get dict-like objects \n')
dict_items = items.map(json.loads)
print(type(dict_items))
We can parse the data with the json library and get dict-like objects <class 'dask.bag.core.Bag'>
dict_items.take(1)
({'_id': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f', '_timestamp': 1520035195.282891, '_spider': 'medium', 'url': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f', 'domain': 'medium.com', 'published': {'$date': '2012-08-13T22:54:53.510Z'}, 'title': 'Medium Terms of Service – Medium Policy – Medium', 'content': '<div><header class="container u-maxWidth740"><div class="uiScale uiScale-ui--regular uiScale-caption--regular postMetaHeader u-paddingBottom10 row"><div class="col u-size12of12 js-postMetaLockup"><div class="uiScale uiScale-ui--regular uiScale-caption--regular postMetaLockup postMetaLockup--authorWithBio u-flexCenter js-postMetaLockup"><div class="u-flex0"><a class="link u-baseColor--link avatar" href="https://medium.com/@Medium?source=post_header_lockup" data-action="show-user-card" data-action-source="post_header_lockup" data-action-value="504c7870fdb6" data-action-type="hover" data-user-id="504c7870fdb6" dir="auto"><div class="u-relative u-inlineBlock u-flex0"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png" class="avatar-image avatar-image--small" alt="Go to the profile of Medium"><div class="avatar-halo u-absolute u-textColorGreenNormal svgIcon" style="width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px"><svg viewbox="0 0 114 114" xmlns="http://www.w3.org/2000/svg"><path d="M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z"></path><path d="M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 35.3710985,113.504153 57,113.504153 C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 
L106.33077,81.661427 Z"></path></svg></div></div></a></div><div class="u-flex1 u-paddingLeft15 u-overflowHidden"><div class="u-lineHeightTightest"><a class="ds-link ds-link--styleSubtle ui-captionStrong u-inlineBlock link link--darken link--darker" href="https://medium.com/@Medium?source=post_header_lockup" data-action="show-user-card" data-action-source="post_header_lockup" data-action-value="504c7870fdb6" data-action-type="hover" data-user-id="504c7870fdb6" dir="auto">Medium</a><span class="followState js-followState" data-user-id="504c7870fdb6"></span></div><div class="ui-caption ui-xs-clamp2 postMetaInline">Everyone’s stories and ideas</div><div class="ui-caption postMetaInline js-testPostMetaInlineSupplemental"><time datetime="2012-08-13T22:54:53.510Z">Aug 13, 2012</time><span class="middotDivider u-fontSize12"></span><span class="readingTime" title="5 min read"></span></div></div></div></div></div></header><div class="postArticle-content js-postField js-notesSource js-trackedPost" data-post-id="9db0094a1e0f" data-source="post_page" data-collection-id="675ebe56ac25" data-tracking-context="postPage"><section name="bb8c" class="section section--body section--first section--last"><div class="section-divider"><hr class="section-divider"></div><div class="section-content"><div class="section-inner sectionLayout--insetColumn"><h1 name="title" id="title" class="graf graf--h2 graf--leading graf--title">Medium Terms of\xa0Service</h1><p name="571b" id="571b" class="graf graf--p graf-after--h2"><strong class="markup--strong markup--p-strong">Effective: March 7, 2016</strong></p><p name="c90b" id="c90b" class="graf graf--p graf-after--p">These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”).</p><p name="238b" id="238b" class="graf graf--p graf-after--p">By using Medium, you agree to these Terms. 
If you don’t agree to any of the Terms, you can’t use Medium.</p><p name="7769" id="7769" class="graf graf--p graf-after--p">We can change these Terms at any time. We keep a <a href="https://github.com/Medium/medium-policy" data-href="https://github.com/Medium/medium-policy" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">historical</a> record of all changes to our Terms on GitHub. If a change is material, we’ll let you know before they take effect. By using Medium on or after that effective date, you agree to the new Terms. If you don’t agree to them, you should delete your account before they take effect, otherwise your use of the site and content will be subject to the new Terms.</p><h4 name="8c81" id="8c81" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Content rights & responsibilities</strong></h4><p name="ac74" id="ac74" class="graf graf--p graf-after--h4">You own the rights to the content you create and post on Medium.</p><p name="651b" id="651b" class="graf graf--p graf-after--p">By posting content to Medium, you give us a nonexclusive license to publish it on Medium Services, including anything reasonably related to publishing it (like storing, displaying, reformatting, and distributing it). In consideration for Medium granting you access to and use of the Services, you agree that Medium may enable advertising on the Services, including in connection with the display of your content or other information. We may also use your content to promote Medium, including its products and content. We will never sell your content to third parties without your explicit permission.</p><p name="2584" id="2584" class="graf graf--p graf-after--p">You’re responsible for the content you post. 
This means you assume all risks related to it, including someone else’s reliance on its accuracy, or claims relating to intellectual property or other legal rights.</p><p name="c207" id="c207" class="graf graf--p graf-after--p">You’re welcome to post content on Medium that you’ve published elsewhere, as long as you have the rights you need to do so. By posting content to Medium, you represent that doing so doesn’t conflict with any other agreement you’ve made.</p><p name="0372" id="0372" class="graf graf--p graf-after--p">By posting content you didn’t create to Medium, you are representing that you have the right to do so. For example, you are posting a work that’s in the public domain, used under license (including a free license, such as <a href="https://creativecommons.org/licenses/" data-href="https://creativecommons.org/licenses/" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">Creative Commons</a>), or a fair use.</p><p name="0472" id="0472" class="graf graf--p graf-after--p">We can remove any content you post for any reason.</p><p name="db2b" id="db2b" class="graf graf--p graf-after--p">You can delete any of your posts, or your account, anytime. Processing the deletion may take a little time, but we’ll do it as quickly as possible. We may keep backup copies of your deleted post or account on our servers for up to 14 days after you delete it.</p><h4 name="baf1" id="baf1" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Our content and\xa0services</strong></h4><p name="adc7" id="adc7" class="graf graf--p graf-after--h4">We reserve all rights in Medium’s look and feel. Some parts of Medium are licensed under third-party open source licenses. We also make some of our own code available under open source licenses. 
As for other parts of Medium, you may not copy or adapt any portion of our code or visual design elements (including logos) without express written permission from Medium unless otherwise permitted by law.</p><p name="20e4" id="20e4" class="graf graf--p graf-after--p">You may not do, or try to do, the following: (1) access or tamper with non-public areas of the Services, our computer systems, or the systems of our technical providers; (2) access or search the Services by any means other than the currently available, published interfaces (e.g., APIs) that we provide; (3) forge any TCP/IP packet header or any part of the header information in any email or posting, or in any way use the Services to send altered, deceptive, or false source-identifying information; or (4) interfere with, or disrupt, the access of any user, host, or network, including sending a virus, overloading, flooding, spamming, mail-bombing the Services, or by scripting the creation of content or accounts in such a manner as to interfere with or create an undue burden on the Services.</p><p name="f5dd" id="f5dd" class="graf graf--p graf-after--p">Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.</p><p name="71a8" id="71a8" class="graf graf--p graf-after--p">We may change, terminate, or restrict access to any aspect of the service, at any time, without notice.</p><h4 name="12f1" id="12f1" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">No children</strong></h4><p name="2ce7" id="2ce7" class="graf graf--p graf-after--h4">Medium is only for people 13 years old and over. By using Medium, you affirm that you are over 13. 
If we learn someone under 13 is using Medium, we’ll terminate their account.</p><h4 name="531c" id="531c" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Security</strong></h4><p name="3155" id="3155" class="graf graf--p graf-after--h4">If you find a security vulnerability on Medium, tell us. We have a <a href="https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2" data-href="https://medium.com/policy/medium-s-bug-bounty-disclosure-program-34b1c80764c2" class="markup--anchor markup--p-anchor" target="_blank">bug bounty disclosure program</a>.</p><h4 name="05cc" id="05cc" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Incorporated rules and\xa0policies</strong></h4><p name="5207" id="5207" class="graf graf--p graf-after--h4">By using the Services, you agree to let Medium collect and use information as detailed in our <a href="https://medium.com/p/f03bf92035c9" data-href="https://medium.com/p/f03bf92035c9" class="markup--anchor markup--p-anchor" target="_blank">Privacy Policy</a>. If you’re outside the United States, you consent to letting Medium transfer, store, and process your information (including your personal information and content) in and out of the United States.</p><p name="6230" id="6230" class="graf graf--p graf-after--p">To enable a functioning community, we have <a href="https://medium.com/policy/medium-rules-30e5502c4eb4" data-href="https://medium.com/policy/medium-rules-30e5502c4eb4" class="markup--anchor markup--p-anchor" target="_blank">Rules</a>. To ensure usernames are distributed and used fairly, we have a <a href="https://medium.com/@Medium/medium-username-policy-7054a77fb04f" data-href="https://medium.com/@Medium/medium-username-policy-7054a77fb04f" class="markup--anchor markup--p-anchor" target="_blank">Username Policy</a>. 
Under our <a href="https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695" data-href="https://medium.com/policy/mediums-copyright-and-dmca-policy-d126f73695" class="markup--anchor markup--p-anchor" target="_blank">DMCA Policy</a>, we’ll remove material after receiving a valid takedown notice. Under our <a href="https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7" data-href="https://medium.com/policy/mediums-trademark-policy-e3bb53df59a7" class="markup--anchor markup--p-anchor" target="_blank">Trademark Policy</a>, we’ll investigate any use of another’s trademark and respond appropriately.</p><p name="21ad" id="21ad" class="graf graf--p graf-after--p">By using Medium, you agree to follow these Rules and Policies. If you don’t, we may remove content, or suspend or delete your account.</p><h4 name="a2a2" id="a2a2" class="graf graf--h4 graf-after--p"><strong class="markup--strong markup--h4-strong">Miscellaneous</strong></h4><p name="b7da" id="b7da" class="graf graf--p graf-after--h4"><em class="markup--em markup--p-em">Disclaimer of warranty.</em> Medium provides the Services to you as is. You use them at your own risk and discretion. That means they don’t come with any warranty. None express, none implied. No implied warranty of merchantability, fitness for a particular purpose, availability, security, title or non-infringement.</p><p name="7073" id="7073" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Limitation of Liability</em>. Medium won’t be liable to you for any damages that arise from your using the Services. This includes if the Services are hacked or unavailable. This includes all types of damages (indirect, incidental, consequential, special or exemplary). 
And it includes all kinds of legal claims, such as breach of contract, breach of warranty, tort, or any other loss.</p><p name="3d70" id="3d70" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">No waiver.</em> If Medium doesn’t exercise a particular right under these Terms, that doesn’t waive it.</p><p name="ab04" id="ab04" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Severability</em>. If any provision of these terms is found invalid by a court of competent jurisdiction, you agree that the court should try to give effect to the parties’ intentions as reflected in the provision and that other provisions of the Terms will remain in full effect.</p><p name="bde8" id="bde8" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Choice of law and jurisdiction.</em> These Terms are governed by California law, without reference to its conflict of laws provisions. You agree that any suit arising from the Services must take place in a court located in San Francisco, California.</p><p name="bbb3" id="bbb3" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Entire agreement.</em> These Terms (including any document incorporated by reference into them) are the whole agreement between Medium and you concerning the Services.</p><p name="dbf1" id="dbf1" class="graf graf--p graf-after--p"><em class="markup--em markup--p-em">Government use.</em> If you’re \u200busing \u200bMedium for the U.S. Government, <a href="https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7" data-href="https://medium.com/@Medium/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7" class="markup--anchor markup--p-anchor" target="_blank">this Amendment</a> to \u200bMedium’s Terms of Service \u200bapplies to you\u200b.</p><p name="3318" id="3318" class="graf graf--p graf-after--p graf--trailing">Questions? 
Let us know at <a href="mailto:%20legal@medium.com" data-href="mailto:%20legal@medium.com" class="markup--anchor markup--p-anchor" target="_blank">legal@medium.com</a>.</p></div></div></section></div><footer class="u-paddingTop10"><div class="container u-maxWidth740"><div class="row"><div class="col u-size12of12"></div></div><div class="row"><div class="col u-size12of12 js-postTags"><div class="u-paddingBottom10"><ul class="tags tags--postTags tags--borderless"><li><a class="link u-baseColor--link" href="https://medium.com/tag/terms-and-conditions?source=post" data-action-source="post">Terms And Conditions</a></li><li><a class="link u-baseColor--link" href="https://medium.com/tag/terms?source=post" data-action-source="post">Terms</a></li><li><a class="link u-baseColor--link" href="https://medium.com/tag/medium?source=post" data-action-source="post">Medium</a></li></ul></div></div></div><section class="uiScale uiScale-ui--small uiScale-caption--regular u-borderTopLightest u-marginTop10 u-paddingTop20"><div class="ui-h3 u-textColorDarker u-fontSize22">One clap, two clap, three clap, forty?</div><p class="ui-body u-marginBottom20 u-textColorDark u-fontSize16">By clapping more or less, you can signal to us which stories really stand out.</p></section><div class="postActions js-postActionsFooter"><div class="u-flexCenter"><div class="u-flex1"><div class="multirecommend js-actionMultirecommend u-flexCenter u-width60" data-post-id="9db0094a1e0f" data-is-icon-29px="true" data-is-circle="true" data-has-recommend-list="true" data-source="post_actions_footer-----9db0094a1e0f---------------------clap_footer"><div class="u-relative u-foreground"><div class="clapUndo u-width60 u-round u-height32 u-absolute u-borderBox u-paddingRight5 u-transition--transform200Spring u-background--brandSageLighter js-clapUndo" style="top: 14px; padding: 2px;"></div></div><span class="u-textAlignCenter u-relative u-background js-actionMultirecommendCount u-marginLeft10"></span></div></div><div 
class="buttonSet u-flex0"></div></div></div></div><div class="u-maxWidth740 u-paddingTop20 u-marginTop20 u-borderTopLightest container u-paddingBottom20 u-xs-paddingBottom10 js-postAttributionFooterContainer"><div class="row js-postFooterInfo"><div class="col u-size6of12 u-xs-size12of12"><li class="uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardUser"><div class="u-marginLeft20 u-floatRight"><span class="followState js-followState" data-user-id="504c7870fdb6"></span></div><div class="u-tableCell"><a class="link u-baseColor--link avatar" href="https://medium.com/@Medium?source=footer_card" title="Go to the profile of Medium" aria-label="Go to the profile of Medium" data-action-source="footer_card" data-user-id="504c7870fdb6" dir="auto"><div class="u-relative u-inlineBlock u-flex0"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png" class="avatar-image avatar-image--small" alt="Go to the profile of Medium"><div class="avatar-halo u-absolute u-textColorGreenNormal svgIcon" style="width: calc(100% + 12px); height: calc(100% + 12px); top:-6px; left:-6px"><svg viewbox="0 0 114 114" xmlns="http://www.w3.org/2000/svg"><path d="M7.66922967,32.092726 C17.0070768,13.6353618 35.9421928,1.75 57,1.75 C78.0578072,1.75 96.9929232,13.6353618 106.33077,32.092726 L107.66923,31.4155801 C98.0784505,12.4582656 78.6289015,0.25 57,0.25 C35.3710985,0.25 15.9215495,12.4582656 6.33077033,31.4155801 L7.66922967,32.092726 Z"></path><path d="M106.33077,81.661427 C96.9929232,100.118791 78.0578072,112.004153 57,112.004153 C35.9421928,112.004153 17.0070768,100.118791 7.66922967,81.661427 L6.33077033,82.338573 C15.9215495,101.295887 35.3710985,113.504153 57,113.504153 C78.6289015,113.504153 98.0784505,101.295887 107.66923,82.338573 L106.33077,81.661427 Z"></path></svg></div></div></a></div><div class="u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15"><h3 class="ui-h3 u-fontSize18 u-lineHeightTighter"><a class="link 
link--primary u-accentColor--hoverTextNormal" href="https://medium.com/@Medium" property="cc:attributionName" title="Go to the profile of Medium" aria-label="Go to the profile of Medium" rel="author cc:attributionUrl" data-user-id="504c7870fdb6" dir="auto">Medium</a></h3><div class="ui-caption u-textColorGreenNormal u-fontSize13 u-tintSpectrum u-accentColor--textNormal u-marginBottom7">Medium member since Aug 2017</div><p class="ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4">Everyone’s stories and ideas</p></div></li></div><div class="col u-size6of12 u-xs-size12of12 u-xs-marginTop30"><li class="uiScale uiScale-ui--small uiScale-caption--regular u-block u-paddingBottom18 js-cardCollection"><div class="u-marginLeft20 u-floatRight"></div><div class="u-tableCell "><a class="link u-baseColor--link avatar avatar--roundedRectangle" href="https://medium.com/policy?source=footer_card" title="Go to Medium Policy" aria-label="Go to Medium Policy" data-action-source="footer_card"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png" class="avatar-image u-size60x60" alt="Medium Policy"></a></div><div class="u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15"><h3 class="ui-h3 u-fontSize18 u-lineHeightTighter u-marginBottom4"><a class="link link--primary u-accentColor--hoverTextNormal" href="https://medium.com/policy?source=footer_card" rel="collection" data-action-source="footer_card">Medium Policy</a></h3><p class="ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4">The Fine Print</p><div class="buttonSet"></div></div></li></div></div></div><div class="js-postFooterPlacements"></div><div class="u-padding0 u-clearfix u-backgroundGrayLightest u-print-hide supplementalPostContent js-responsesWrapper"></div><div class="supplementalPostContent js-heroPromo"></div></footer></div>', 'author': {'name': None, 'url': 'https://medium.com/@Medium', 'twitter': '@Medium'}, 'image_url': None, 'tags': 
[], 'link_tags': {'canonical': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f', 'publisher': 'https://plus.google.com/103654360130207659246', 'author': 'https://medium.com/@Medium', 'search': '/osd.xml', 'alternate': 'android-app://com.medium.reader/https/medium.com/p/9db0094a1e0f', 'stylesheet': 'https://cdn-static-1.medium.com/_/fp/css/main-branding-base.Ch8g7KPCoGXbtKfJaVXo_w.css', 'icon': 'https://cdn-static-1.medium.com/_/fp/icons/favicon-rebrand-medium.3Y6xpZ-0FSdWDnPM3hSBIA.ico', 'apple-touch-icon': 'https://cdn-images-1.medium.com/fit/c/120/120/1*6_fgYnisCa9V21mymySIvA.png', 'mask-icon': 'https://cdn-static-1.medium.com/_/fp/icons/monogram-mask.KPLCSFEZviQN0jQ7veN2RQ.svg'}, 'meta_tags': {'viewport': 'width=device-width, initial-scale=1', 'title': 'Medium Terms of Service – Medium Policy – Medium', 'referrer': 'unsafe-url', 'description': 'These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…', 'theme-color': '#000000', 'og:title': 'Medium Terms of Service – Medium Policy – Medium', 'og:url': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f', 'fb:app_id': '542599432471018', 'og:description': 'These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using…', 'twitter:description': 'These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). 
By using…', 'author': 'Medium', 'og:type': 'article', 'twitter:card': 'summary', 'article:publisher': 'https://www.facebook.com/medium', 'article:author': 'https://medium.com/@Medium', 'robots': 'index, follow', 'article:published_time': '2012-08-13T22:54:53.510Z', 'twitter:creator': '@Medium', 'twitter:site': '@Medium', 'og:site_name': 'Medium', 'twitter:label1': 'Reading time', 'twitter:data1': '5 min read', 'twitter:app:name:iphone': 'Medium', 'twitter:app:id:iphone': '828256236', 'twitter:app:url:iphone': 'medium://p/9db0094a1e0f', 'al:ios:app_name': 'Medium', 'al:ios:app_store_id': '828256236', 'al:android:package': 'com.medium.reader', 'al:android:app_name': 'Medium', 'al:ios:url': 'medium://p/9db0094a1e0f', 'al:android:url': 'medium://p/9db0094a1e0f', 'al:web:url': 'https://medium.com/policy/medium-terms-of-service-9db0094a1e0f'}},)
print('We can take any key from all records \n')
title_bag = dict_items.pluck('title')
print('With the take method we receive a tuple of objects \n')
print(title_bag.take(3))
We can take any key from all records With the take method we receive a tuple of objects ('Medium Terms of Service – Medium Policy – Medium', 'Amendment to Medium Terms of Service Applicable to U.S. Government Users', '走入山與海之間:閩東大刀會和兩岸走私 – Yun-Chen Chien(簡韻真) – Medium')
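To see the map and pluck steps in isolation, here is a minimal sketch on three made-up JSON strings. The `records` list and its contents are assumptions for illustration, not the Medium data. It also shows that `pluck` accepts a `default` for records that lack the key.

```python
import json
import dask
import dask.bag as db

# Single-process scheduler keeps this toy example easy to run anywhere
dask.config.set(scheduler='synchronous')

# Hypothetical records standing in for lines of the JSON file
records = [
    '{"title": "First post", "domain": "medium.com"}',
    '{"title": "Second post", "domain": "medium.com"}',
    '{"domain": "medium.com"}',  # this record has no title key
]

bag = db.from_sequence(records, npartitions=2).map(json.loads)

# Without a default, pluck fails on a record that lacks the key
titles = bag.pluck('title', default=None)

first_two = titles.take(2)     # tuple, taken from the first partition by default
all_titles = titles.compute()  # list with one entry per record
print(first_two)
print(all_titles)
```

Nothing is parsed until `take` or `compute` is called; `map` and `pluck` only extend the task graph.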
We can write any function to process the data and apply it to every record with the map method
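def clean_title(text):
    import string
    # Characters to strip: ASCII punctuation plus a few typographic symbols
    cut_set = set(string.punctuation)
    cut_set.update(['”','—','…', "“",'⌘','❤','+','®','➜','¬','–'])
    # Replace every character in cut_set with a space
    text = text.translate(text.maketrans(''.join(cut_set)," " * len(cut_set)))
    text = text.lower()
    return text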
title_bag = dict_items.pluck('title').map(clean_title)
title_bag.take(3)
('medium terms of service medium policy medium', 'amendment to medium terms of service applicable to u s government users', '走入山與海之間:閩東大刀會和兩岸走私 yun chen chien(簡韻真) medium')
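The translation table in clean_title is rebuilt on every call. A minimal sketch of a variant that builds it once and also collapses repeated spaces is shown below. The names `EXTRA`, `CUT_CHARS`, `TABLE` and `clean_title_fast` are hypothetical and not part of the original notebook.

```python
import string

# Extra symbols copied from the clean_title function above
EXTRA = ['”', '—', '…', '“', '⌘', '❤', '+', '®', '➜', '¬', '–']
CUT_CHARS = ''.join(sorted(set(string.punctuation) | set(EXTRA)))

# Build the table once; every listed character maps to a space
TABLE = str.maketrans(CUT_CHARS, ' ' * len(CUT_CHARS))

def clean_title_fast(text):
    """Hypothetical variant of clean_title that also collapses repeated spaces."""
    return ' '.join(text.translate(TABLE).lower().split())

print(clean_title_fast('Medium Terms of Service – Medium Policy – Medium'))
```

Because the function only uses module-level data and the standard library, it can be passed to `map` on a bag in the same way as clean_title.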
Process meta_tags
meta_tags_bag = dict_items.pluck('meta_tags')
test_meta = meta_tags_bag.take(3)
test_meta[1]
{'viewport': 'width=device-width, initial-scale=1', 'title': 'Amendment to Medium Terms of Service Applicable to U.S. Government Users', 'referrer': 'origin', 'description': 'This agreement (“Amendment”) is an amendment to Medium’s Terms. It is between Medium and the U.S. Government and applies to the use of Medium Services by the Government. The reason for this Amendment…', 'theme-color': '#000000', 'og:title': 'Amendment to Medium Terms of Service Applicable to U.S. Government Users', 'og:url': 'https://medium.com/policy/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7', 'fb:app_id': '542599432471018', 'og:description': 'This agreement (“Amendment”) is an amendment to Medium’s Terms. It is between Medium and the U.S. Government and applies to the use of…', 'twitter:description': 'This agreement (“Amendment”) is an amendment to Medium’s Terms. It is between Medium and the U.S. Government and applies to the use of…', 'author': 'Medium', 'og:type': 'article', 'twitter:card': 'summary', 'article:publisher': 'https://www.facebook.com/medium', 'article:author': 'https://medium.com/@Medium', 'robots': 'noindex, follow', 'article:published_time': '2015-08-03T07:44:50.331Z', 'twitter:creator': '@Medium', 'twitter:site': '@Medium', 'og:site_name': 'Medium', 'twitter:label1': 'Reading time', 'twitter:data1': '7 min read', 'twitter:app:name:iphone': 'Medium', 'twitter:app:id:iphone': '828256236', 'twitter:app:url:iphone': 'medium://p/fccb00db67d7', 'al:ios:app_name': 'Medium', 'al:ios:app_store_id': '828256236', 'al:android:package': 'com.medium.reader', 'al:android:app_name': 'Medium', 'al:ios:url': 'medium://p/fccb00db67d7', 'al:android:url': 'medium://p/fccb00db67d7', 'al:web:url': 'https://medium.com/policy/amendment-to-medium-terms-of-service-applicable-to-u-s-government-users-fccb00db67d7'}
def clean_meta_tags(meta):
    author = meta['author'].strip()
    min_reads = int(meta['twitter:data1'].split()[0])
    return {'author': author, 'min_reads': min_reads}
meta_tags_bag = meta_tags_bag.map(clean_meta_tags)
meta_tags_bag.take(1)
({'author': 'Medium', 'min_reads': 5},)
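clean_meta_tags assumes that every record has both an 'author' and a 'twitter:data1' field. If some records lack them, a tolerant variant along these lines may help. `clean_meta_tags_safe` is a hypothetical name, and returning None for missing values is an assumption about what the downstream code can handle.

```python
def clean_meta_tags_safe(meta):
    """Hypothetical tolerant variant: missing or malformed fields become None."""
    author = (meta.get('author') or '').strip() or None
    reading = str(meta.get('twitter:data1', '')).split()
    # Expect strings such as '5 min read'; anything else yields None
    min_reads = int(reading[0]) if reading and reading[0].isdigit() else None
    return {'author': author, 'min_reads': min_reads}

print(clean_meta_tags_safe({'author': ' Medium ', 'twitter:data1': '5 min read'}))
print(clean_meta_tags_safe({'author': 'Medium'}))
```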
%%time
#content_bag = dict_items.pluck('content').map(clean_content)
title_bag = dict_items.pluck('title').map(clean_title)
published_bag = dict_items.pluck('published').map(lambda x: x['$date'])
meta_bag = dict_items.pluck('meta_tags').map(clean_meta_tags)
domain_bag = dict_items.pluck('domain')
CPU times: user 779 µs, sys: 248 µs, total: 1.03 ms Wall time: 1.03 ms
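@delayed
def combine_to_df(list_dict):
    # Bags passed to a delayed function arrive here as plain lists
    list_df = [pd.DataFrame(dict_) for dict_ in list_dict]
    # Put the per-bag frames side by side as columns
    return pd.concat(list_df, axis=1)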
combined = combine_to_df([published_bag, meta_bag, domain_bag])
combined.visualize()
# This takes around a minute
from dask.diagnostics import ProgressBar
with ProgressBar():
    df = combined.compute()
df.columns = ['published', 'Author','min_reads','domain']
df.head()
[########################################] | 100% Completed | 59.9s
| | published | Author | min_reads | domain |
|---|---|---|---|---|
| 0 | 2012-08-13T22:54:53.510Z | Medium | 5 | medium.com |
| 1 | 2015-08-03T07:44:50.331Z | Medium | 7 | medium.com |
| 2 | 2017-02-05T13:08:17.410Z | Yun-Chen Chien(簡韻真) | 2 | medium.com |
| 3 | 2017-05-06T08:16:30.776Z | Vaibhav Khulbe | 3 | medium.com |
| 4 | 2017-06-04T14:46:25.772Z | Vaibhav Khulbe | 4 | medium.com |
print('We can create a Dask DataFrame from a pandas DataFrame \n')
dd_no_content = dd.from_pandas(df, npartitions=4)
We can create a Dask DataFrame from a pandas DataFrame
dd_no_content
| | published | Author | min_reads | domain |
|---|---|---|---|---|
| npartitions=4 | | | | |
| 0 | object | object | int64 | object |
| 15579 | ... | ... | ... | ... |
| 31158 | ... | ... | ... | ... |
| 46737 | ... | ... | ... | ... |
| 62312 | ... | ... | ... | ... |
%%time
print('Transform published column to datetime as we did with pandas, it will by slightly slowly than in pandas \n')
df['published'] = pd.to_datetime(df.published, format='%Y-%m-%dT%H:%M:%S.%fZ')
Transform published column to datetime as we did with pandas, it will by slightly slowly than in pandas CPU times: user 277 ms, sys: 2.14 ms, total: 279 ms Wall time: 277 ms
%%time
print('Transform the published column to datetime with dask; on this small dataset it takes about as long as pandas \n')
dd_no_content['published'] = dd.to_datetime(dd_no_content.published, format='%Y-%m-%dT%H:%M:%S.%fZ').compute()
Transform the published column to datetime with dask; on this small dataset it takes about as long as pandas CPU times: user 273 ms, sys: 6.49 ms, total: 279 ms Wall time: 274 ms
dd_no_content.head()
| | published | Author | min_reads | domain |
---|---|---|---|---|
0 | 2012-08-13 22:54:53.510 | Medium | 5 | medium.com |
1 | 2015-08-03 07:44:50.331 | Medium | 7 | medium.com |
2 | 2017-02-05 13:08:17.410 | Yun-Chen Chien(簡韻真) | 2 | medium.com |
3 | 2017-05-06 08:16:30.776 | Vaibhav Khulbe | 3 | medium.com |
4 | 2017-06-04 14:46:25.772 | Vaibhav Khulbe | 4 | medium.com |
print('A function with mixed transformations written for a pandas DataFrame can be applied to a dask DataFrame without changes \n')
def additional_time_features_df(df, to_cat_cols = ['Author','domain', 'month', 'year', 'day_of_week']):
    df['month'] = df['published'].apply(lambda ts: ts.month)
    df['year'] = df['published'].apply(lambda ts: ts.year)
    hour = df['published'].apply(lambda ts: ts.hour)
    df['hour'] = hour
    # part-of-day indicator flags
    df['morning'] = ((hour >= 7) & (hour <= 11)).astype('float64')
    df['day'] = ((hour >= 12) & (hour <= 18)).astype('int')
    df['evening'] = ((hour >= 19) & (hour <= 23)).astype('int')
    df['night'] = ((hour >= 0) & (hour <= 6)).astype('int')
    # cyclic encoding of the hour, so that 23:00 and 00:00 end up close together
    df['sin_hour'] = np.sin(2*np.pi*df['hour']/24)
    df['cos_hour'] = np.cos(2*np.pi*df['hour']/24)
    df = df.drop(["hour"], axis=1)
    day_of_week = df['published'].dt.dayofweek.astype('int')
    df['day_of_week']=day_of_week
    df['weekend'] = (day_of_week >= 5).astype('int')
    # turn to categorical
    df[to_cat_cols] = df[to_cat_cols].astype('category')
    return df
A function with mixed transformations written for a pandas DataFrame can be applied to a dask DataFrame without changes
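The `sin_hour` and `cos_hour` features place each hour on a unit circle, so hours on either side of midnight stay close together, whereas the raw values 23 and 0 look far apart. Here is a short numeric check of that property. The helper names `encode_hour` and `cyclic_distance` are introduced only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point (sin, cos) on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour) / 24
    return np.sin(angle), np.cos(angle)

def cyclic_distance(h1, h2):
    """Euclidean distance between two hours in the encoded space."""
    s1, c1 = encode_hour(h1)
    s2, c2 = encode_hour(h2)
    return float(np.hypot(s1 - s2, c1 - c2))

d_23_0 = cyclic_distance(23, 0)   # across midnight
d_0_1 = cyclic_distance(0, 1)     # ordinary neighbouring hours
d_0_12 = cyclic_distance(0, 12)   # opposite sides of the clock

print(d_23_0, d_0_1, d_0_12)
```

Both neighbouring pairs are the same distance apart, while midnight and noon are at the maximum distance of 2.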
%%time
df_medium_train = additional_time_features_df(df.copy())
CPU times: user 694 ms, sys: 15.2 ms, total: 709 ms Wall time: 707 ms
dd_medium_train = additional_time_features_df(dd_no_content)
%%time
dd_medium_train.compute()
CPU times: user 884 ms, sys: 52.9 ms, total: 937 ms Wall time: 861 ms
| | published | Author | min_reads | domain | month | year | morning | day | evening | night | sin_hour | cos_hour | day_of_week | weekend |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-08-13 22:54:53.510 | Medium | 5 | medium.com | 8 | 2012 | 0.0 | 0 | 1 | 0 | -5.000000e-01 | 8.660254e-01 | 0 | 0 |
1 | 2015-08-03 07:44:50.331 | Medium | 7 | medium.com | 8 | 2015 | 1.0 | 0 | 0 | 0 | 9.659258e-01 | -2.588190e-01 | 0 | 0 |
2 | 2017-02-05 13:08:17.410 | Yun-Chen Chien(簡韻真) | 2 | medium.com | 2 | 2017 | 0.0 | 1 | 0 | 0 | -2.588190e-01 | -9.659258e-01 | 6 | 1 |
3 | 2017-05-06 08:16:30.776 | Vaibhav Khulbe | 3 | medium.com | 5 | 2017 | 1.0 | 0 | 0 | 0 | 8.660254e-01 | -5.000000e-01 | 5 | 1 |
4 | 2017-06-04 14:46:25.772 | Vaibhav Khulbe | 4 | medium.com | 6 | 2017 | 0.0 | 1 | 0 | 0 | -5.000000e-01 | -8.660254e-01 | 6 | 1 |
5 | 2017-04-02 16:21:15.171 | Kate Reed Petty | 7 | medium.com | 4 | 2017 | 0.0 | 1 | 0 | 0 | -8.660254e-01 | -5.000000e-01 | 6 | 1 |
6 | 2016-08-15 04:16:02.103 | exedre | 12 | medium.com | 8 | 2016 | 0.0 | 0 | 0 | 1 | 8.660254e-01 | 5.000000e-01 | 0 | 0 |
7 | 2015-01-14 21:31:07.568 | Raghav Haran | 5 | medium.com | 1 | 2015 | 0.0 | 0 | 1 | 0 | -7.071068e-01 | 7.071068e-01 | 2 | 0 |
8 | 2014-02-11 04:11:54.771 | Francine Lee | 4 | medium.com | 2 | 2014 | 0.0 | 0 | 0 | 1 | 8.660254e-01 | 5.000000e-01 | 1 | 0 |
9 | 2015-10-25 02:58:05.551 | Raghav Haran | 8 | medium.com | 10 | 2015 | 0.0 | 0 | 0 | 1 | 5.000000e-01 | 8.660254e-01 | 6 | 1 |
10 | 2016-08-15 15:31:13.601 | E² | 4 | medium.com | 8 | 2016 | 0.0 | 1 | 0 | 0 | -7.071068e-01 | -7.071068e-01 | 0 | 0 |
11 | 2016-08-09 21:01:06.303 | One Month | 9 | medium.com | 8 | 2016 | 0.0 | 0 | 1 | 0 | -7.071068e-01 | 7.071068e-01 | 1 | 0 |
12 | 2016-09-08 15:47:57.336 | Frank DeGeorge | 7 | hackernoon.com | 9 | 2016 | 0.0 | 1 | 0 | 0 | -7.071068e-01 | -7.071068e-01 | 3 | 0 |
13 | 2016-09-30 18:05:35.950 | Gregório Jung | 8 | medium.com | 9 | 2016 | 0.0 | 1 | 0 | 0 | -1.000000e+00 | -1.836970e-16 | 4 | 0 |
14 | 2017-06-27 15:49:22.909 | Stephen Hays | 7 | hackernoon.com | 6 | 2017 | 0.0 | 1 | 0 | 0 | -7.071068e-01 | -7.071068e-01 | 1 | 0 |
15 | 2015-07-13 06:52:44.618 | Andy Raskin | 5 | medium.com | 7 | 2015 | 0.0 | 0 | 0 | 1 | 1.000000e+00 | 6.123234e-17 | 0 | 0 |
16 | 2017-05-01 13:22:43.785 | Stephen Hays | 8 | hackernoon.com | 5 | 2017 | 0.0 | 1 | 0 | 0 | -2.588190e-01 | -9.659258e-01 | 0 | 0 |
17 | 2016-08-31 17:11:24.263 | Andy Raskin | 7 | medium.com | 8 | 2016 | 0.0 | 1 | 0 | 0 | -9.659258e-01 | -2.588190e-01 | 2 | 0 |
18 | 2017-06-30 07:55:55.103 | Mohit Mamoria | 16 | hackernoon.com | 6 | 2017 | 1.0 | 0 | 0 | 0 | 9.659258e-01 | -2.588190e-01 | 4 | 0 |
19 | 2016-12-13 23:29:35.556 | Oscar Boyson | 6 | medium.com | 12 | 2016 | 0.0 | 0 | 1 | 0 | -2.588190e-01 | 9.659258e-01 | 1 | 0 |
20 | 2016-01-27 22:19:05.027 | Brian Verne | 5 | hackernoon.com | 1 | 2016 | 0.0 | 0 | 1 | 0 | -5.000000e-01 | 8.660254e-01 | 2 | 0 |
21 | 2016-12-14 01:15:02.122 | Morgan Courtney | 11 | hackernoon.com | 12 | 2016 | 0.0 | 0 | 0 | 1 | 2.588190e-01 | 9.659258e-01 | 2 | 0 |
22 | 2016-09-05 22:02:40.326 | Jarrett Carter Sr. | 4 | medium.com | 9 | 2016 | 0.0 | 0 | 1 | 0 | -5.000000e-01 | 8.660254e-01 | 0 | 0 |
23 | 2016-12-13 17:59:40.527 | thrace | 8 | medium.com | 12 | 2016 | 0.0 | 1 | 0 | 0 | -9.659258e-01 | -2.588190e-01 | 1 | 0 |
24 | 2017-05-02 17:28:39.120 | JakeElman | 8 | medium.com | 5 | 2017 | 0.0 | 1 | 0 | 0 | -9.659258e-01 | -2.588190e-01 | 1 | 0 |
25 | 2016-08-30 23:43:24.940 | Hanna Fogel | 2 | medium.com | 8 | 2016 | 0.0 | 0 | 1 | 0 | -2.588190e-01 | 9.659258e-01 | 1 | 0 |
26 | 2017-04-26 02:50:29.511 | Asaeda | 9 | medium.com | 4 | 2017 | 0.0 | 0 | 0 | 1 | 5.000000e-01 | 8.660254e-01 | 2 | 0 |
27 | 2016-06-18 06:54:10.331 | Dr. Syed Jamal Hasan | 11 | medium.com | 6 | 2016 | 0.0 | 0 | 0 | 1 | 1.000000e+00 | 6.123234e-17 | 5 | 1 |
28 | 2016-05-17 17:52:00.960 | tiffany jernigan | 7 | medium.com | 5 | 2016 | 0.0 | 1 | 0 | 0 | -9.659258e-01 | -2.588190e-01 | 1 | 0 |
29 | 2017-04-17 16:29:28.306 | Richy Chacon | 4 | medium.com | 4 | 2017 | 0.0 | 1 | 0 | 0 | -8.660254e-01 | -5.000000e-01 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
62283 | 2017-06-26 20:07:57.240 | Jacqueline Bashaw | 4 | NaN | 6 | 2017 | 0.0 | 0 | 1 | 0 | -8.660254e-01 | 5.000000e-01 | 0 | 0 |
62284 | 2017-05-23 23:24:15.931 | Lily Herman | 5 | NaN | 5 | 2017 | 0.0 | 0 | 1 | 0 | -2.588190e-01 | 9.659258e-01 | 1 | 0 |
62285 | 2017-02-02 17:52:00.430 | Angel Powell | 5 | NaN | 2 | 2017 | 0.0 | 1 | 0 | 0 | -9.659258e-01 | -2.588190e-01 | 3 | 0 |
62286 | 2017-06-14 15:17:18.712 | Lily Herman | 6 | NaN | 6 | 2017 | 0.0 | 1 | 0 | 0 | -7.071068e-01 | -7.071068e-01 | 2 | 0 |
62287 | 2014-11-12 19:00:00.000 | The Hairpin | 12 | NaN | 11 | 2014 | 0.0 | 0 | 1 | 0 | -9.659258e-01 | 2.588190e-01 | 2 | 0 |
62288 | 2014-03-18 10:55:54.000 | The Billfold | 14 | NaN | 3 | 2014 | 1.0 | 0 | 0 | 0 | 5.000000e-01 | -8.660254e-01 | 1 | 0 |
62289 | 2012-05-03 19:00:03.000 | The Awl | 4 | NaN | 5 | 2012 | 0.0 | 0 | 1 | 0 | -9.659258e-01 | 2.588190e-01 | 3 | 0 |
62290 | 2015-11-02 06:12:22.782 | Josh Fruhlinger | 18 | NaN | 11 | 2015 | 0.0 | 0 | 0 | 1 | 1.000000e+00 | 6.123234e-17 | 0 | 0 |
62291 | 2012-11-02 00:00:54.000 | The Awl | 9 | NaN | 11 | 2012 | 0.0 | 0 | 0 | 1 | 0.000000e+00 | 1.000000e+00 | 4 | 0 |
62292 | 2012-11-29 20:00:42.000 | David Roth | 6 | NaN | 11 | 2012 | 0.0 | 0 | 1 | 0 | -8.660254e-01 | 5.000000e-01 | 3 | 0 |
62293 | 2012-11-28 18:00:10.000 | The Awl | 8 | NaN | 11 | 2012 | 0.0 | 1 | 0 | 0 | -1.000000e+00 | -1.836970e-16 | 2 | 0 |
62294 | 2016-06-09 16:19:34.121 | Cecília Olliveira | 3 | NaN | 6 | 2016 | 0.0 | 1 | 0 | 0 | -8.660254e-01 | -5.000000e-01 | 3 | 0 |
62295 | 2016-06-23 17:39:16.171 | Amy Hawman | 8 | NaN | 6 | 2016 | 0.0 | 1 | 0 | 0 | -9.659258e-01 | -2.588190e-01 | 3 | 0 |
62296 | 2016-08-23 00:33:48.276 | Orlando Trott | 5 | NaN | 8 | 2016 | 0.0 | 0 | 0 | 1 | 0.000000e+00 | 1.000000e+00 | 1 | 0 |
62297 | 2015-07-20 15:16:40.169 | Transifex | 6 | NaN | 7 | 2015 | 0.0 | 1 | 0 | 0 | -7.071068e-01 | -7.071068e-01 | 0 | 0 |
62298 | 2015-12-31 22:06:54.772 | LA BioMed | 3 | NaN | 12 | 2015 | 0.0 | 0 | 1 | 0 | -5.000000e-01 | 8.660254e-01 | 3 | 0 |
62299 | 2017-01-05 16:19:59.807 | Jessica Chen Riolfi | 7 | NaN | 1 | 2017 | 0.0 | 1 | 0 | 0 | -8.660254e-01 | -5.000000e-01 | 3 | 0 |
62300 | 2016-03-21 18:48:18.079 | Pierre @ L’Escapadou | 7 | NaN | 3 | 2016 | 0.0 | 1 | 0 | 0 | -1.000000e+00 | -1.836970e-16 | 0 | 0 |
62301 | 2017-02-07 18:34:31.427 | Nick Troiano | 6 | NaN | 2 | 2017 | 0.0 | 1 | 0 | 0 | -1.000000e+00 | -1.836970e-16 | 1 | 0 |
62302 | 2016-06-29 02:49:57.853 | Amanda L. | 9 | NaN | 6 | 2016 | 0.0 | 0 | 0 | 1 | 5.000000e-01 | 8.660254e-01 | 2 | 0 |
62303 | 2016-10-04 12:22:51.674 | Mayank Agarwal | 4 | NaN | 10 | 2016 | 0.0 | 1 | 0 | 0 | 1.224647e-16 | -1.000000e+00 | 1 | 0 |
62304 | 2016-10-10 04:17:03.477 | Mayank Agarwal | 9 | NaN | 10 | 2016 | 0.0 | 0 | 0 | 1 | 8.660254e-01 | 5.000000e-01 | 0 | 0 |
62305 | 2016-10-21 06:30:55.281 | Mayank Agarwal | 5 | NaN | 10 | 2016 | 0.0 | 0 | 0 | 1 | 1.000000e+00 | 6.123234e-17 | 4 | 0 |
62306 | 2017-05-23 04:37:28.709 | Randi Gloss | 7 | NaN | 5 | 2017 | 0.0 | 0 | 0 | 1 | 8.660254e-01 | 5.000000e-01 | 1 | 0 |
62307 | 2016-04-05 23:01:22.486 | Heather Nann | 3 | NaN | 4 | 2016 | 0.0 | 0 | 1 | 0 | -2.588190e-01 | 9.659258e-01 | 1 | 0 |
62308 | 2016-01-28 01:03:08.798 | Heather Nann | 4 | NaN | 1 | 2016 | 0.0 | 0 | 0 | 1 | 2.588190e-01 | 9.659258e-01 | 3 | 0 |
62309 | 2016-01-14 13:28:30.277 | Heather Nann | 5 | NaN | 1 | 2016 | 0.0 | 1 | 0 | 0 | -2.588190e-01 | -9.659258e-01 | 3 | 0 |
62310 | 2016-03-06 06:51:45.307 | Heather Nann | 3 | NaN | 3 | 2016 | 0.0 | 0 | 0 | 1 | 1.000000e+00 | 6.123234e-17 | 6 | 1 |
62311 | 2017-01-15 17:45:22.836 | Nick Todorov | 7 | NaN | 1 | 2017 | 0.0 | 1 | 0 | 0 | -9.659258e-01 | -2.588190e-01 | 6 | 1 |
62312 | 2016-01-25 03:20:33.005 | Heather Nann | 5 | NaN | 1 | 2016 | 0.0 | 0 | 0 | 1 | 7.071068e-01 | 7.071068e-01 | 0 | 0 |
62313 rows × 14 columns
Dask-ML provides scalable machine learning algorithms in Python that are compatible with scikit-learn. Let us first look at how scikit-learn handles the computations, and then at how Dask performs these operations differently. See the dask-ml tutorials: Examples from dask ml
You need to install dask-ml first.
There are two main parts in Dask-ML:
- approaches to handle big datasets
- approaches to handle big models
The biggest model from our course was a random forest on text data, from the week with the Random Forest assignment. Below I reproduce part of that assignment with reduced nrows and a reduced max_features in CountVectorizer, but you can try it with the original parameters.
# Load data
df = pd.read_csv("../../data/movie_reviews_train.csv", nrows=5000)
# Separate the text features from the target labels
X_text = df["text"]
y_text = df["label"]
# Classes counts
df.label.value_counts()
1    3060
0    1940
Name: label, dtype: int64
from sklearn.model_selection import StratifiedKFold,GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Split on 3 folds
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
# In Pipeline we will modify the text and train logistic regression
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=500, ngram_range=(1, 3))),
    ('clf', LogisticRegression(random_state=17))])
%%time
parameters = {'clf__C': (0.1, 1, 10, 100)}
grid_search = GridSearchCV(classifier, parameters, scoring ='roc_auc', cv=skf)
grid_search = grid_search.fit(X_text, y_text)
CPU times: user 8.34 s, sys: 139 ms, total: 8.47 s Wall time: 8.47 s
grid_search.best_score_
0.7042233630808542
In this approach all we need to do is replace the joblib backend with dask distributed. We need to initialize a distributed client and change the backend.
%%time
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
from dask.distributed import Client
client = Client()
parameters = {'clf__C': (0.1, 1, 10, 100)}
grid_search = GridSearchCV(classifier, parameters, scoring ='roc_auc', cv=skf)
t_start = time.time()
with joblib.parallel_backend('dask'):
    grid_search.fit(X_text, y_text)
t_end = time.time()
print('Elapsed time for grid_search with joblib replace (s):', round((t_end - t_start)))
Elapsed time for grid_search with joblib replace (s): 5 CPU times: user 1.39 s, sys: 142 ms, total: 1.53 s Wall time: 5.87 s
grid_search.best_score_
0.7042233630808542
Parallel to GridSearchCV in sklearn, Dask provides a library called dask-searchcv (it is now included in Dask-ML). It merges shared steps so that there are fewer repeated computations. Below is the installation step for dask-searchcv; it needs to be installed separately.
#pip3 install dask-searchcv
import dask_searchcv as dcv
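To see what "merging steps" buys, consider a pipeline grid search where only the classifier parameter varies. The vectorizer fit is identical for every value of `C` within a fold, so it only needs to run once per fold. The toy sketch below counts fits with and without that reuse. It is a plain-Python illustration of the idea, not dask's actual implementation, and `fit_vectorizer` and `fit_classifier` are stand-in names.

```python
from functools import lru_cache
from itertools import product

fit_calls = {"vectorizer": 0, "clf": 0}

@lru_cache(maxsize=None)
def fit_vectorizer(fold):
    # Cached per fold: repeated requests for the same fold reuse the result.
    fit_calls["vectorizer"] += 1
    return f"features-for-fold-{fold}"

def fit_classifier(features, C):
    # Depends on C, so it has to run for every (fold, C) combination.
    fit_calls["clf"] += 1
    return (features, C)

folds = range(3)
C_values = (0.1, 1, 10, 100)

results = [fit_classifier(fit_vectorizer(fold), C)
           for fold, C in product(folds, C_values)]

naive_vectorizer_fits = len(folds) * len(C_values)  # without any reuse
print(fit_calls, naive_vectorizer_fits)
```

With 3 folds and 4 values of `C`, the shared step runs 3 times instead of 12, while the classifier still runs 12 times.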
We can use pipelines in the dask grid search, and according to the documentation dask is most useful for pipelines with many operations that can be parallelized, especially ones that include a feature union, but when I tried it I got an error. In any case, time-consuming operations such as CountVectorizer can't be parallelized, so here the dask grid search is applied only to the classifier (see the documentation).
%%time
vect = CountVectorizer(max_features=500, ngram_range=(1, 3))
Xvect = vect.fit_transform(X_text)
CPU times: user 762 ms, sys: 30.8 ms, total: 793 ms Wall time: 788 ms
lr = LogisticRegression()
parameters = {'C': (0.1, 1, 10, 100)}
t_start = time.time()
grid_search = dcv.GridSearchCV(lr, parameters, scoring ='roc_auc', cv=skf)
grid_search.fit(Xvect, y_text)
t_end = time.time()
print(f'Elapsed time for grid_search (excluding time spent on vectorization): {round((t_end - t_start))} s')
Elapsed time for grid_search (excluding time spent on vectorization): 0 s
grid_search.best_score_
0.7020017187686919
I tried to see how well dask would handle a random forest with the original parameters, but sometimes this raises an error after execution: "OSError: [Errno 24] Too many open files", and I couldn't fix it. Sometimes it works fine; for small data it works in most cases, but if you re-run this notebook several times there is a good chance of hitting this error. So I believe dask-ml is very useful, but for now I don't know how it should be used properly.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=17)
min_samples_leaf = [1, 2, 3]
max_features = [0.3, 0.5, 0.7]
max_depth = [None]
parameters = {'max_features': max_features,
              'min_samples_leaf': min_samples_leaf,
              'max_depth': max_depth}
grid_search = dcv.GridSearchCV(rf, parameters, scoring ='roc_auc', cv=skf)
t_start = time.time()
grid_search.fit(Xvect, y_text)
t_end = time.time()
print(f'Elapsed time for dask grid_search for Random Forest {round((t_end - t_start))} (s):')
Elapsed time for dask grid_search for Random Forest 3 (s):
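The "Too many open files" error mentioned above (Errno 24) means the process reached its limit on open file descriptors. One common workaround on Unix systems is to raise the soft limit for the current process. This is only a sketch, and whether it resolves the error in this notebook is an assumption. The helper name `raise_open_file_limit` and the target of 4096 are made up for illustration.

```python
import resource  # Unix-only standard library module

def raise_open_file_limit(target=4096):
    """Raise the soft open-file limit towards `target`, never above the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, leave it unchanged
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target

new_soft_limit = raise_open_file_limit()
print("Soft limit on open files:", new_soft_limit)
```

Each re-run of the notebook also creates a new `Client`, so calling `client.close()` when finished may help as well, although that is a guess rather than a confirmed fix.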
A number of models have been rewritten in dask so that they can take dask objects (huge arrays) and fit models on them. You can read more in the dask documentation. Below is an example with KMeans, but there are also dask versions of linear models and preprocessing functions. The API is very similar to scikit-learn, so it should be easy to use.
from dask_ml import datasets
from dask_ml.cluster import KMeans
X, y = datasets.make_blobs(n_samples=10000000,
                           chunks=1000000,
                           random_state=0,
                           centers=3)
# persist() computes the chunks and keeps them in memory; it still returns a dask array
X = X.persist()
X
dask.array<concatenate, shape=(10000000, 2), dtype=float64, chunksize=(1000000, 2)>
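`persist()` differs from `compute()`: it computes the chunks and returns another dask array backed by those results, rather than one concrete NumPy value. Here is a short sketch on a small random array. The array size and chunking are arbitrary choices for this example.

```python
import dask.array as da

x = da.random.random((1000, 2), chunks=(250, 2))

# persist() evaluates the chunks (blocking with the default local scheduler)
# and returns a dask array with the same shape and chunking.
x_mem = x.persist()

# compute() returns a concrete result; both calls reuse the persisted chunks.
total_a = float(x_mem.sum().compute())
total_b = float(x_mem.sum().compute())

print(type(x_mem).__name__, x_mem.chunks, total_a == total_b)
```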
km = KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
km.fit(X)
KMeans(algorithm='full', copy_x=True, init='k-means||', init_max_iter=2, max_iter=300, n_clusters=3, n_jobs=1, oversampling_factor=10, precompute_distances='auto', random_state=None, tol=0.0001)
Actually, I read an article about dask a couple of days ago and decided that writing a tutorial would be a good way to get acquainted with the library. So I ask you not to be too strict if I misunderstood something :))