Time to Post on Hacker News

by Anton Tarasenko

Starting Up

In [1]:
import pandas as pd
import numpy as np
import json, datetime
from urllib.parse import urlparse

def get_domain(url):
    # Reduce a URL to a bare domain: strip a leading 'www.' and keep
    # at most the last three labels (e.g. 'news.bbc.co.uk' -> 'bbc.co.uk').
    try:
        domain = urlparse(url).netloc
        if domain.startswith('www.'):
            domain = domain[len('www.'):]
        domain = '.'.join(domain.split('.')[-3:])
    except Exception:  # non-string input, e.g. NaN for threads without a URL
        domain = ''
    return domain
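
A quick sanity check of the helper (illustrative, not a cell from the original notebook):

assert get_domain('https://www.example.com/post/1') == 'example.com'
assert get_domain('http://news.bbc.co.uk/article') == 'bbc.co.uk'
assert get_domain(float('nan')) == ''  # threads without a URL carry NaN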

Data

You can jump to the next section if you're not interested in reproducing the results.

alexa_top.csv

File alexa_top.csv lists the 500 most visited websites according to Alexa.

It comes from Alexa's Top One Million Websites (top-1m.csv.zip). Copy the first 500 lines from that CSV file:

$ sed -n '1,500p; 501q' top-1m.csv > alexa_top.csv
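
Equivalently, staying in Python (a sketch; it assumes the unzipped top-1m.csv sits in the working directory):

import pandas as pd

# top-1m.csv has no header: column 0 is the rank, column 1 the domain
pd.read_csv('top-1m.csv', header=None, nrows=500) \
  .to_csv('alexa_top.csv', header=False, index=False)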

hn_yr.csv

File hn_yr.csv is a one-year sample of Hacker News threads.

You'll need file HNStoriesAll.1perline.stripped.json.bz2 from https://mega.co.nz/#F!YohlwD7R!wec0yNO86SeaNGIYQBOR0A. Extract HNStoriesAll.1perline.stripped.json and export all records from 2013 with jq:

$ cat HNStoriesAll.1perline.stripped.json | jq -c -r '. | select(.created_at | startswith("2013")) | [.[]] | @csv' > hn_yr.csv
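
If jq isn't handy, here is a minimal Python equivalent (a sketch; it assumes one JSON object per line and writes fields in the CSV_HEADER order used below):

import csv, json

FIELDS = ["author", "created_at", "created_at_i", "num_comments",
          "objectID", "points", "title", "url"]

with open('HNStoriesAll.1perline.stripped.json') as src, \
     open('hn_yr.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for line in src:
        record = json.loads(line)
        # keep only stories created in 2013
        if str(record.get('created_at', '')).startswith('2013'):
            writer.writerow(record.get(field) for field in FIELDS)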

Hat tips to the redditors in this thread, in particular to w3m2d, who cleaned up and repacked the Hacker News dataset.

In [2]:
ALEXA_TOP_CSV = 'alexa_top.csv'
CSV_INPUT = 'hn_yr.csv'
CSV_HEADER = ["author",
              "created_at",
              "created_at_i",
              "num_comments",
              "objectID",
              "points",
              "title",
              "url"]
In [3]:
alexa_top = pd.read_csv(ALEXA_TOP_CSV, header=None)[1].tolist()  # column 1 holds the domain
threads = pd.read_csv(CSV_INPUT, header=None, names=CSV_HEADER)

threads['ln_points'] = threads['points'].map(np.log)

# Unix timestamps are UTC; one vectorized conversion replaces five row-wise maps
created = pd.to_datetime(threads['created_at_i'], unit='s')
threads['year'] = created.dt.year
threads['month'] = created.dt.month
threads['day'] = created.dt.day
threads['hour'] = created.dt.hour
threads['dow'] = created.dt.weekday  # Monday=0, Sunday=6

threads['len_title'] = threads['title'].map(len)

threads['domain'] = threads['url'].map(get_domain)
threads['is_top'] = threads['domain'].isin(alexa_top)
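
A quick peek at the engineered columns (illustrative, not a cell from the original notebook):

print(threads[['created_at', 'dow', 'hour', 'len_title', 'domain', 'is_top']].head())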

Results

Important notes:

  • The data covers all 2013 Hacker News thread submissions (without comments). 2013 is the last year the dataset covers in full.
  • Variable dow stands for "day of week": 0 is Monday, 6 is Sunday.
  • Dates and times are UTC (GMT). EST is UTC-05, PST is UTC-08. The usual disclaimer about daylight saving time applies; a quick hour shift is sketched below.
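
For readers who think in US time zones, a small convenience not in the original notebook (it ignores daylight saving time, per the note above):

# Shift the UTC hour to US Eastern (UTC-05); PST would subtract 8 instead
threads['hour_est'] = (threads['hour'] - 5) % 24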
In [4]:
%matplotlib inline

Upvotes

Hard work on weekdays: 50% more submissions than on weekends

In [5]:
threads.groupby(['dow', 'hour'])['objectID'].count().plot(title='Total number of threads by hour and day of week')
Out[5]:
<matplotlib.axes.AxesSubplot at 0x11c29a780>
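
The 50% figure in the heading can be checked directly (a sketch, counting Saturday and Sunday as the weekend):

per_day = threads.groupby('dow')['objectID'].count()
weekday_avg = per_day.loc[0:4].mean()  # Mon-Fri
weekend_avg = per_day.loc[5:6].mean()  # Sat-Sun
print(weekday_avg / weekend_avg)       # roughly 1.5 if the heading holds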

Perhaps this makes it harder to earn upvotes on weekdays

In [6]:
threads.groupby(['dow', 'hour'])['points'].mean().plot(title='Upvotes per thread by hour and day of week')
Out[6]:
<matplotlib.axes.AxesSubplot at 0x11487ce80>

Weekend upvoting is different: is it the timing or the quality of submissions?

In [7]:
threads[threads['dow'] >= 5].groupby(['dow', 'hour'])['points'].\
    agg('mean').plot(kind='kde', alpha=0.7, color='blue')
threads[threads['dow'] < 5].groupby(['dow', 'hour'])['points'].\
    agg('mean').plot(kind='kde', alpha=0.7, color='green',
                     title='Upvotes density for weekday (green) and weekend (blue) threads')
Out[7]:
<matplotlib.axes.AxesSubplot at 0x11491ecc0>
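
A blunt numeric companion to the densities above (illustrative):

is_weekend = threads['dow'] >= 5
print(threads.groupby(is_weekend)['points'].mean())  # False = weekday, True = weekend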

Does a good mood on weekend evenings lead to upvoting? Not quite: HN has users in different time zones

In [8]:
threads[threads['dow'] >= 5].groupby(['hour'])['points'].\
    agg('mean').plot()
threads[threads['dow'] < 5].groupby(['hour'])['points'].\
    agg('mean').plot(title='Upvotes per thread for weekday (green) and weekend (blue) hours')
Out[8]:
<matplotlib.axes.AxesSubplot at 0x11dd18518>

The source of the material doesn't matter for upvotes

In [9]:
threads[threads['is_top'] == True].groupby(['dow', 'hour'])['points'].\
    agg('mean').plot(alpha=0.5, color='blue')
threads[threads['is_top'] == False].groupby(['dow', 'hour'])['points'].\
    agg('mean').plot(color='red',
                     title='Upvotes per thread for Top 500 Alexa (blue) and the rest (red)')
Out[9]:
<matplotlib.axes.AxesSubplot at 0x114597978>
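
The same claim as a two-row summary (illustrative):

print(threads.groupby('is_top')['points'].mean())  # True = domains in Alexa's Top 500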

Neither does the title length

In [10]:
threads[threads['len_title'] < 160].groupby('len_title')['len_title'].\
    agg('count').plot(title='Total threads by title length')
Out[10]:
<matplotlib.axes.AxesSubplot at 0x114593940>

Though a small subset of short titles is upvoted more frequently

In [11]:
threads[threads['len_title'] < 160].groupby('len_title')['points'].\
    agg('mean').plot(title='Upvotes per thread by title length')
Out[11]:
<matplotlib.axes.AxesSubplot at 0x1156fce10>
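
To put a rough number on the short-title effect (a sketch; the 20-character cutoff is mine, not the author's):

short = threads['len_title'] < 20
print(threads.groupby(short)['points'].mean())  # True = titles under 20 characters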

Comments

Comments behave somewhat similarly

In [12]:
threads.groupby(['dow', 'hour'])['num_comments'].sum().plot(title='Total comments by hour and day of week')
Out[12]:
<matplotlib.axes.AxesSubplot at 0x115525b38>
In [13]:
threads.groupby(['dow', 'hour'])['num_comments'].mean().plot()
Out[13]:
<matplotlib.axes.AxesSubplot at 0x115508e48>
In [14]:
threads[threads['dow'] >= 5].groupby(['hour'])['num_comments'].\
    agg('mean').plot()
threads[threads['dow'] < 5].groupby(['hour'])['num_comments'].\
    agg('mean').plot(title='Comments per thread for weekday (green) and weekend (blue) hours')
Out[14]:
<matplotlib.axes.AxesSubplot at 0x122af4c88>
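
The matching weekday/weekend summary for comments (illustrative):

is_weekend = threads['dow'] >= 5
print(threads.groupby(is_weekend)['num_comments'].mean())  # False = weekday, True = weekend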