Hacker News is an extremely popular site in the tech and startup world. A user can submit a post, which is then voted and commented on, very similar to Reddit. The top posts can receive hundreds of thousands of visitors.
I am aiming to explore two types of posts, Ask HN and Show HN, to find out the following: do Ask HN or Show HN posts receive more comments on average?

In the cell below I have done the following:
- Opened hacker_news.csv and read it in with reader
- Converted the read file to a list of lists using the list() function and assigned it to a variable hn
- Assigned the first row to a variable called headers, so I can easily reference the column titles if needed
- Reassigned hn so that it does not include the header row
- Used the print() function to display headers and the first 5 rows of hn
from csv import reader
file = open("hacker_news.csv")
read = reader(file)
hn = list(read)
headers = hn[0]
hn = hn[1:]
print(headers)
print("")
hn[:5]
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
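As a side note, the file opened above is never closed. The same read can be wrapped in a with block so the file handle is released automatically; a minimal sketch, using an in-memory stand-in for hacker_news.csv (StringIO is here only so the snippet is self-contained):

```python
from csv import reader
from io import StringIO

# Stand-in for open("hacker_news.csv"); a real run would use:
#   with open("hacker_news.csv") as file:
csv_text = "id,title\n1,Ask HN: example\n"
with StringIO(csv_text) as file:
    rows = list(reader(file))

headers = rows[0]   # first row holds the column titles
data = rows[1:]     # remaining rows hold the posts
print(headers)
```

Closing the file automatically means the handle is not leaked if a later cell raises an error.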
In the code cell below, I first made three empty lists in which to store the specific posts I needed. I then looped through each row in hn to find the rows containing the following elements: "ask hn", "show hn", and the remaining posts. I decided to use the string method startswith, and to lowercase the title column with the lower method (assigning it to a variable called title), to ensure there were no issues with the strings in the list of lists being uppercase or lowercase. I then used conditional statements to find the rows that started with the identified string, and the append method to add each matching row to the corresponding list.

In the next two cells, I counted and printed my newly created lists to ensure all went well.
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
ask_posts[:5]
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
show_posts[:5]
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
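One subtlety worth noting in the categorisation step: chaining the checks with elif ensures each row lands in exactly one list (with a second plain if, every Ask post would also fall through to other_posts). A self-contained sketch with made-up rows mirroring the hn layout:

```python
# Made-up rows in the same layout as hn (title at index 1)
hn_sample = [['1', 'Ask HN: a question', '', '2', '6', 'u1', '8/16/2016 9:55'],
             ['2', 'Show HN: a thing', '', '26', '22', 'u2', '11/25/2015 14:03'],
             ['3', 'A regular link post', '', '386', '52', 'u3', '8/4/2016 11:52']]

ask, show, other = [], [], []
for post in hn_sample:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask.append(post)
    elif title.startswith('show hn'):
        show.append(post)
    else:
        other.append(post)

# Every row should end up in exactly one bucket
assert len(ask) + len(show) + len(other) == len(hn_sample)
```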
In this section the aim was to compare the average number of comments for the Ask HN and Show HN posts.
The following tasks were completed in the cells below:

- Used the print function to display headers, to find the right index
- For each list (ask_posts and show_posts), used a for loop to iterate over the rows, turning the num_comments column into an integer using the int function, then adding it to a pre-made total variable named total_ask_comments or total_show_comments
- Divided each total by the length of its list to get avg_ask_comments or avg_show_comments
print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
total_ask_comments = 0
for a in ask_posts:
    num = int(a[4])
    total_ask_comments += num
avg_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of comments for Ask Posts: ", avg_ask_comments)

total_show_comments = 0
for s in show_posts:
    num = int(s[4])
    total_show_comments += num
avg_show_comments = total_show_comments / len(show_posts)
print("The average number of comments for Show Posts: ", avg_show_comments)
The average number of comments for Ask Posts: 14.038417431192661
The average number of comments for Show Posts: 10.31669535283993
From the analysis of the average comments for each list, it was found that Ask posts have more comments on average than Show posts.

This could be due to the desired outcome of an Ask post. If you make an Ask post, you are intending that someone will comment, i.e. answer your question. Show posts, though, do not have a question to answer; viewers simply look at the post. A viewer may wish to comment, but it does not come as naturally as responding to someone asking you a question.
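As an aside, the two averaging loops above are nearly identical, so they could be factored into a single small helper. A minimal sketch (avg_comments is a hypothetical name, not part of the original notebook):

```python
def avg_comments(posts, comments_index=4):
    # num_comments is stored as a string in each row, so convert before summing
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the hn layout (only the comments column at index 4 matters here)
sample = [['1', 'title', '', '1', '10', 'user', 'date'],
          ['2', 'title', '', '1', '20', 'user', 'date']]
print(avg_comments(sample))  # 15.0
```

The same function would then work for ask_posts, show_posts, and other_posts without repeating the loop.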
In this section, I made two dictionaries: counts_by_hour and comments_by_hour.

- counts_by_hour: contains the number of ask posts created during each hour of the day.
- comments_by_hour: contains the corresponding number of comments ask posts created at each hour received.

A summary of the cells below:

- Imported the datetime module as dt
- Created an empty list, result_list, to store two elements from the columns created_at and num_comments
- Looped through ask_posts and appended the two elements to result_list
- Created two empty dictionaries, counts_by_hour and comments_by_hour
- Looped through result_list, parsing each date with the datetime.strptime() method and extracting the hour with the datetime.strftime() method to populate both dictionaries

print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
import datetime as dt
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comment_num = row[1]
    created = row[0]
    created_dt = dt.datetime.strptime(created, '%m/%d/%Y %H:%M')
    created_hour = created_dt.strftime('%H')
    if created_hour in counts_by_hour:
        counts_by_hour[created_hour] += 1
        comments_by_hour[created_hour] += comment_num
    else:
        counts_by_hour[created_hour] = 1
        comments_by_hour[created_hour] = comment_num
print(counts_by_hour)
print("")
print(comments_by_hour)
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
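The format string '%m/%d/%Y %H:%M' used above matches the created_at timestamps (e.g. '8/4/2016 11:52': month, day, four-digit year, then 24-hour time). A quick standalone check of the parse-then-extract-hour step:

```python
import datetime as dt

# Same format string as in the loop above
created_dt = dt.datetime.strptime('8/4/2016 11:52', '%m/%d/%Y %H:%M')
print(created_dt.strftime('%H'))  # '11'
```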
Next I will use the two dictionaries created to calculate the average number of comments for posts created during each hour of the day.
This was done by:

- Creating an empty list, avg_per_hour
- Looping through comments_by_hour, dividing each hour's comment total by the corresponding post count in counts_by_hour
- Rounding each result to two decimal places with the round function and assigning it to a variable named average
- Appending the hour and average to avg_per_hour
avg_per_hour = []
for hour in comments_by_hour:
    average = round(comments_by_hour[hour] / counts_by_hour[hour], 2)  # decided it was best to round the average to two decimal places
    avg_per_hour.append([hour, average])
avg_per_hour
[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]
swap_avg_per_hour = []
for row in avg_per_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_per_hour.append([avg, hour])
swap_avg_per_hour
[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]
sorted_swap = sorted(swap_avg_per_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour_dt = dt.datetime.strptime(row[1], '%H')
    hour_str = hour_dt.strftime('%H:%M')
    pt_hour_dt = hour_dt - dt.timedelta(hours=3)
    pt_hour_str = pt_hour_dt.strftime('%H:%M')
    ct_hour_dt = hour_dt - dt.timedelta(hours=1)
    ct_hour_str = ct_hour_dt.strftime('%H:%M')
    print(' ', '{pst_time} PST, {cst_time} CST, {est_time} EST: {avg:.2f} average comments per post'.format(pst_time=pt_hour_str, cst_time=ct_hour_str, est_time=hour_str, avg=row[0]))
Top 5 Hours for Ask Posts Comments
  12:00 PST, 14:00 CST, 15:00 EST: 38.59 average comments per post
  23:00 PST, 01:00 CST, 02:00 EST: 23.81 average comments per post
  17:00 PST, 19:00 CST, 20:00 EST: 21.52 average comments per post
  13:00 PST, 15:00 CST, 16:00 EST: 16.80 average comments per post
  18:00 PST, 20:00 CST, 21:00 EST: 16.01 average comments per post
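As an aside, the swap step above exists only so that sorted orders by the average; the same top-5 ordering can be had in one step by passing a key function. A sketch with a made-up subset of avg_per_hour:

```python
# Made-up subset of avg_per_hour ([hour, average] pairs)
avg_subset = [['09', 5.58], ['15', 38.59], ['13', 14.74]]

# Sort by the average (index 1) directly, so no swapping is needed
top = sorted(avg_subset, key=lambda row: row[1], reverse=True)
print(top[0])  # ['15', 38.59]
```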
The results showed that posts created between the hours of 3 PM and 4 PM EST had the highest average number of comments per post. It was unclear why this was.

I therefore decided to compare the most populous timezones in the USA (Pacific, Central, and Eastern) to see if a clear indication appeared. The highest averages of comments were found in the middle of the day, possibly when most users would be active. This would explain why these times across the USA are much higher than the other results. In addition, it is important to mention that Hacker News was started by Y Combinator, which is located in Pacific Time.

It would be interesting to see where the most common posts come from in regards to timezone, to see if it matches the above results.

From the results above, if your intention is to create a post that attracts the highest possible number of comments, posting at around 3 PM EST would be recommended.
print("Top 5 Hours for Ask Posts Comments - European Timezone Comparison")
for row in sorted_swap[:5]:
    est_hour_dt = dt.datetime.strptime(row[1], '%H')
    est_hour_str = est_hour_dt.strftime('%H:%M')
    # Central European Summer Time is 7 hours ahead of EST
    cest_hour_dt = est_hour_dt + dt.timedelta(hours=7)
    cest_hour_str = cest_hour_dt.strftime('%H:%M')
    print(' ', '{est_time} EST: {cest_time} CEST: {avg:.2f} average comments per post'.format(est_time=est_hour_str, cest_time=cest_hour_str, avg=row[0]))
Top 5 Hours for Ask Posts Comments - European Timezone Comparison
  15:00 EST: 22:00 CEST: 38.59 average comments per post
  02:00 EST: 09:00 CEST: 23.81 average comments per post
  20:00 EST: 03:00 CEST: 21.52 average comments per post
  16:00 EST: 23:00 CEST: 16.80 average comments per post
  21:00 EST: 04:00 CEST: 16.01 average comments per post
The above results are a comparison between the Eastern US and Central European time zones.

From analysing the results, perhaps another reason why 3 PM EST has a higher average number of comments is that Europe is still active at that time.

It can be concluded that the best time to post, with the intention of gaining the most comments, is between the hours of 3 PM and 4 PM EST. This could be because it is a time when two large populations (North America and Europe) are both active.

A future add-on for this project would be to compare the data collected here with the following: the number of users per country/state, and where the highest number of posts come from, i.e. location. This could provide further detail on when it is best to post, with the possibility of other findings regarding the general use of Hacker News for creating engagement.