Analysing post activities to make Hacker News top rated¶

Introduction¶

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set here

Objective¶

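This analysis focuses on two types of posts: Ask HN posts, in which users ask the Hacker News community a question, and Show HN posts, in which users share a project or product. We'll compare these two types of posts to determine the following: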

Do Ask HN or Show HN receive more comments and points on average?
Do posts created at a certain time receive more comments and points on average?

Summary of Results¶

After analyzing the data, we can say that Ask HN posts receive more comments on average than Show HN posts. Moreover, the time of posting has an impact on the average number of comments: Ask HN posts created at 15:00 receive, on average, the highest number of comments. On the other hand, Show HN posts receive more points on average, with the highest average for posts created at 12:00. For more details, please refer to the full analysis below.

Exploring existing data¶

Import necessary libraries¶

In [1]:

# import csv module to read data
from csv import reader 

# import datetime module to parse dates (used below as dt.datetime.strptime)
import datetime as dt

Load the dataset as list¶

In [2]:

# Open the file and read it with the reader function from the csv module
with open('HN_posts_year_to_Sep_26_2016.csv') as opened_file:
    read_file = reader(opened_file)
    # Store the rows as a list of lists in hn (short for Hacker News)
    hn = list(read_file)

Datasets Short Description¶

Below are descriptions of the columns:

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
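Because each row is a plain list, the cells below access columns by position (for example `row[4]` for `num_comments`). A minimal sketch of a name-to-index lookup, built from the column order described above, makes those positions explicit. The names `HEADER`, `COL` and `sample_row` are illustrative and are not part of the original notebook.

```python
# Column names in the order described above (illustrative constant)
HEADER = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row
COL = {name: index for index, name in enumerate(HEADER)}

# A row in the same shape as the dataset (all values are strings)
sample_row = ['12579005', 'SQLAR  the SQLite Archiver',
              'https://www.sqlite.org/sqlar/doc/trunk/README.md',
              '1', '0', 'blacksqr', '9/26/2016 3:24']

# Look up values by column name instead of a hard-coded index
num_comments = int(sample_row[COL['num_comments']])
created_at = sample_row[COL['created_at']]
print(num_comments, created_at)
```

With this lookup, `row[COL['num_points']]` and `row[3]` refer to the same value, so the rest of the analysis works unchanged either way.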

Display the first five rows of dataset¶

In [3]:

# First 5 rows of dataset including header
for row in hn[:5]:
    print(row)
    print('\n')
    
# Show the number of rows and columns in dataset

print('The data set has '+ str(len(hn))+' rows.')
print("Each row has "+ str(len(hn[0]))+ ' columns.')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


The data set has 293120 rows.
Each row has 7 columns.

Extract the first row of data as headers and remove the header from main dataset¶

In [23]:

# Split off the header row; the check keeps this cell safe to re-run
if hn[0][0] == 'id':
    header, hn = hn[0], hn[1:]
print("The header is : ", header)
print()
print("Rows without header :", hn[:5])

The header is :  ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Rows without header : [['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]

Separate posts beginning with Ask HN and Show HN into two different lists¶

In [6]:

# Create  three empty lists
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
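The prefix test in the loop above can also be written as a small function, which makes the rule easy to check on individual titles. This is only a sketch: `classify_title` is a hypothetical helper name, and the last title in the sample list is made up to show that the match is case-insensitive.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

titles = [
    "Ask HN: What TLD do you use for local development?",
    "Show HN: Finding puns computationally",
    "algorithmic music",
    "ASK HN: an invented title with an upper-case prefix",  # hypothetical
]

labels = [classify_title(title) for title in titles]
print(labels)
```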

The number of posts in each list¶

In [24]:

print("There are "+ str(len(ask_posts))+" Ask HN posts")
print("There are "+ str(len(show_posts))+" Show HN posts")
print("There are "+ str(len(other_posts))+" other posts") 

There are 9139 Ask HN posts
There are 10158 Show HN posts
There are 273822 other posts

We can see that there are slightly more Show HN posts (10,158) than Ask HN posts (9,139), and that both groups are small compared with the 273,822 other posts.

Show first 5 rows of ask and show posts list¶

In [27]:

print("The first five rows of ask posts : \n", ask_posts[:5])

The first five rows of ask posts : 
 [['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]

In [28]:

print("The first five rows of show posts : \n", show_posts[:5])

The first five rows of show posts : 
 [['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'], ['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17'], ['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']]

Let's determine if ask_posts or show_posts receive more comments on average.¶

In [30]:

# calculate total and average ask_post comments

total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments +=num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Total ask post {:,}".format(total_ask_comments))
print("Average ask post ", round(avg_ask_comments,2))
    

Total ask post 94,986
Average ask post  10.39

In [31]:

# calculate total and average show_post comments

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments +=num_comments
avg_show_comments = total_show_comments/len(show_posts)
print("Total show posts {:,}".format(total_show_comments))
print("Average show posts ", round(avg_show_comments,2))

Total show posts 49,633
Average show posts  4.89

HN posts breakdown by comments¶

HN Posts	Total	Average
ask_post	94,986	10.39
show_post	49,633	4.89

The figures above show that Ask HN posts receive more comments on average (10.39) than Show HN posts (4.89).
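As a quick arithmetic check, the averages can be recomputed from the comment totals and post counts printed earlier. The variable names below are illustrative; the numbers are copied from the outputs above.

```python
# Totals and post counts copied from the outputs above
ask_total, ask_count = 94_986, 9_139
show_total, show_count = 49_633, 10_158

ask_avg = ask_total / ask_count
show_avg = show_total / show_count

print("Ask HN average comments :", round(ask_avg, 2))
print("Show HN average comments:", round(show_avg, 2))

# How many times more comments an average Ask HN post receives
ratio = ask_avg / show_avg
print("Ratio:", round(ratio, 2))
```

On these figures, an average Ask HN post receives roughly twice as many comments as an average Show HN post.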

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the number of ask_posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask_posts receive by hour created.
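The two steps above can also be sketched as one helper that builds both tallies in the same pass, so the totals and the counts always come from the same list of posts. This is an alternative sketch, not the approach used in the cells below: `average_by_hour`, `DATE_FORMAT` and `sample_posts` are illustrative names, and the sample rows are four Ask HN rows taken from the dataset preview.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y %H:%M"

def average_by_hour(rows, value_index, date_index=6):
    """Return {hour: average of the value column} for the given rows."""
    totals = defaultdict(int)   # sum of the value column per hour
    counts = defaultdict(int)   # number of posts per hour
    for row in rows:
        hour = datetime.strptime(row[date_index], DATE_FORMAT).strftime("%H")
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]

# Column 4 holds num_comments
avg_comments = average_by_hour(sample_posts, value_index=4)
print(avg_comments)
```

Passing `value_index=3` instead would average `num_points`, so the same helper could serve both the comments and the points analysis without sharing dictionaries between cells.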

In [32]:

# Create an empty list where we load new columns
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])

Count posts and total comments per hour in two dictionaries¶

In [13]:

comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    dt_created = each_row[0]
    num_comment = each_row[1]
    time = dt.datetime.strptime(dt_created, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += num_comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = num_comment
        counts_by_hour[time] = 1

print("Hourly comments number : ", comments_by_hour)
print("\n")
print("Hourly counted number :", counts_by_hour)

Hourly comments number :  {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


Hourly counted number : {'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}

The totals show that Ask HN posts created in the 15:00 hour received by far the most comments overall (18,525).

Show the average number of comments by hour¶

In [33]:

# Calculate the average number of comments by ask_posts
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

Out[33]:

[['02', 14.334928229665072],
 ['01', 8.45748987854251],
 ['22', 8.944297082228116],
 ['21', 10.465116279069768],
 ['19', 7.111510791366906],
 ['17', 7.289093298291721],
 ['15', 22.15909090909091],
 ['14', 7.14367816091954],
 ['13', 11.87704918032787],
 ['11', 6.95771144278607],
 ['10', 9.328173374613003],
 ['09', 4.8907284768211925],
 ['07', 6.716101694915254],
 ['03', 10.45631067961165],
 ['23', 7.200626959247649],
 ['20', 8.49904761904762],
 ['16', 5.575530586766542],
 ['08', 7.474683544303797],
 ['00', 8.25],
 ['18', 7.434451219512195],
 ['12', 8.205426356589147],
 ['04', 12.164948453608247],
 ['06', 8.265625],
 ['05', 10.686046511627907]]

Swap the columns so the average comes first¶

In [34]:

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

[[14.334928229665072, '02'], [8.45748987854251, '01'], [8.944297082228116, '22'], [10.465116279069768, '21'], [7.111510791366906, '19'], [7.289093298291721, '17'], [22.15909090909091, '15'], [7.14367816091954, '14'], [11.87704918032787, '13'], [6.95771144278607, '11'], [9.328173374613003, '10'], [4.8907284768211925, '09'], [6.716101694915254, '07'], [10.45631067961165, '03'], [7.200626959247649, '23'], [8.49904761904762, '20'], [5.575530586766542, '16'], [7.474683544303797, '08'], [8.25, '00'], [7.434451219512195, '18'], [8.205426356589147, '12'], [12.164948453608247, '04'], [8.265625, '06'], [10.686046511627907, '05']]

Show the top 5 hours by average comments¶

In [35]:

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for Ask Posts Comments :", sorted_swap[:5])

Top 5 Hours for Ask Posts Comments : [[22.15909090909091, '15'], [14.334928229665072, '02'], [12.164948453608247, '04'], [11.87704918032787, '13'], [10.686046511627907, '05']]
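Swapping the columns is one way to sort by the average. Another option is to pass a `key` function to `sorted`, which sorts the `[hour, average]` pairs directly. The sketch below uses a handful of pairs rounded from the output above; `avg_by_hour_sample` and `top_three` are illustrative names.

```python
# A few [hour, average] pairs, rounded from the output above
avg_by_hour_sample = [
    ['02', 14.33],
    ['01', 8.46],
    ['15', 22.16],
    ['13', 11.88],
    ['04', 12.16],
    ['05', 10.69],
]

# Sort on the second element (the average), highest first
top_three = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_three:
    print("{}:00 {:.2f}".format(hour, avg))
```

Sorting with a key leaves each pair in its original `[hour, average]` order, so no second list is needed.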

In [36]:

for avg, hr in sorted_swap[:5]:
    print("{}:{:.2f}".format(
        dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))

15:00:22.16
02:00:14.33
04:00:12.16
13:00:11.88
05:00:10.69

Based on the dataset documentation, the timestamps are in US Eastern Time, so 15:00 corresponds to 3:00 pm Eastern. According to the output above, the top 5 hours for average comments on Ask HN posts are 15:00, 02:00, 04:00, 13:00 and 05:00. The hour that receives the most comments on average is 15:00, with about 22 comments per post.

Determine HN posts points on average¶

In [38]:

# calculate total and average ask_post points

total_ask_points= 0
for row in ask_posts:
    num_points = int(row[3])
    total_ask_points +=num_points
avg_ask_points= total_ask_points/len(ask_posts)
print(" Total ask points {:,}".format(total_ask_points))
print("Average ask points :", round(avg_ask_points,2))

 Total ask points 103,378
Average ask points : 11.31

In [39]:

# calculate total and average show_post points

total_show_points = 0
for row in show_posts:
    num_points = int(row[3])
    total_show_points +=num_points
avg_show_points = total_show_points/len(show_posts)
print("Total show posts {:,}".format(total_show_points))
print("Average show posts :", round(avg_show_points,2))

Total show posts 150,781
Average show posts : 14.84

Comparing the two averages, Show HN posts receive more points on average (14.84) than Ask HN posts (11.31), even though Ask HN posts receive more comments on average.

Post type	points(total)	points(avg)
Ask HN	103,378	11.31
Show HN	150,781	14.84

Count posts and total points per hour in two dictionaries¶

Since Show HN posts have the higher average number of points, we'll analyse points by hour for Show HN posts.

In [40]:

# create an empty list where we will load columns we need for analysis
result_list = []

for post in show_posts:
    result_list.append([post[6], int(post[3])])

points_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    dt_created = each_row[0]
    num_points = each_row[1]
    time = dt.datetime.strptime(dt_created, date_format).strftime("%H")
    if time in counts_by_hour:
        points_by_hour[time] += num_points
        counts_by_hour[time] += 1
    else:
        points_by_hour[time] = num_points
        counts_by_hour[time] = 1

print("Hourly points number : ", points_by_hour)
print("\n")
print("Hourly counted number :", counts_by_hour)

Hourly points number :  {'00': 4291, '23': 5060, '20': 6948, '19': 8928, '18': 9935, '16': 11487, '14': 10503, '10': 4303, '09': 3762, '08': 4640, '06': 3071, '03': 2168, '21': 5990, '17': 10563, '15': 11657, '11': 7742, '07': 3303, '04': 2707, '13': 10381, '12': 10787, '01': 2931, '22': 5026, '02': 2764, '05': 1834}


Hourly counted number : {'00': 276, '23': 319, '20': 525, '19': 556, '18': 656, '16': 801, '14': 696, '10': 323, '09': 302, '08': 316, '06': 192, '03': 206, '21': 430, '17': 761, '15': 836, '11': 402, '07': 236, '04': 194, '13': 610, '12': 516, '01': 247, '22': 377, '02': 209, '05': 172}

Average points by hour¶

In [41]:

# Create an empty list for average points by hour
avg_by_hour = []

for hr in points_by_hour:
    avg_by_hour.append([hr, points_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

Out[41]:

[['00', 15.547101449275363],
 ['23', 15.862068965517242],
 ['20', 13.234285714285715],
 ['19', 16.057553956834532],
 ['18', 15.144817073170731],
 ['16', 14.340823970037453],
 ['14', 15.09051724137931],
 ['10', 13.321981424148607],
 ['09', 12.456953642384105],
 ['08', 14.683544303797468],
 ['06', 15.994791666666666],
 ['03', 10.524271844660195],
 ['21', 13.930232558139535],
 ['17', 13.88042049934297],
 ['15', 13.94377990430622],
 ['11', 19.258706467661693],
 ['07', 13.995762711864407],
 ['04', 13.95360824742268],
 ['13', 17.018032786885247],
 ['12', 20.905038759689923],
 ['01', 11.866396761133604],
 ['22', 13.331564986737401],
 ['02', 13.224880382775119],
 ['05', 10.662790697674419]]

The hour that receives the most points for show posts on average is 12:00.

Show the top 5 hours by average points¶

In [22]:

# Create an empty list
points_avg_by_hour = []

# Add columns in that list
for row in avg_by_hour:
    points_avg_by_hour.append([row[1], row[0]])
sorted_list = sorted(points_avg_by_hour, reverse=True)

print("Top 5 hours for points on 'Show HN' posts")
for avg, hr in sorted_list[:5]:
    print(
        "{}: {:.2f}".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 hours for points on 'Show HN' posts
12:00: 20.91
11:00: 19.26
13:00: 17.02
19:00: 16.06
06:00: 15.99

The top 5 hours for average points on Show HN posts are, in descending order, 12:00, 11:00, 13:00, 19:00 and 06:00.

Result Summary¶

This project analysed ask posts and show posts on the hacker news platform to determine which type of post and time period received the most comments and points on average.

Based on the analysis, to maximise the chance of receiving comments, we'd recommend that users create an Ask HN post (a title starting with 'Ask HN') between 15:00 and 16:00 US Eastern Time. In my timezone, this period is between 20:00 and 21:00 WAT.
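The conversion to West Africa Time depends on whether daylight saving time is in effect in the US. The sketch below uses fixed UTC offsets rather than a full time zone database, which is an assumption made to keep the example self-contained: UTC-4 for Eastern Daylight Time, UTC-5 for Eastern Standard Time and UTC+1 for WAT. The helper name `to_wat` is hypothetical, and the date is the most recent one in the dataset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC offsets instead of a time zone database
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4), "EDT")
EASTERN_STANDARD = timezone(timedelta(hours=-5), "EST")
WEST_AFRICA = timezone(timedelta(hours=1), "WAT")

def to_wat(hour, source_tz):
    """Convert an hour on 26 Sep 2016 in source_tz to a WAT 'HH:MM' string."""
    posted = datetime(2016, 9, 26, hour, 0, tzinfo=source_tz)
    return posted.astimezone(WEST_AFRICA).strftime("%H:%M")

print("15:00 EDT ->", to_wat(15, EASTERN_DAYLIGHT), "WAT")
print("15:00 EST ->", to_wat(15, EASTERN_STANDARD), "WAT")
```

Under daylight time the 15:00 posting hour lands at 20:00 WAT; under standard time it lands one hour later.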

Furthermore, comparing both types of posts by the points (upvotes) received, the data shows that Show HN posts earn more points on average than Ask HN posts. This suggests that while Ask HN posts are more likely to receive comments from the Hacker News community, Show HN posts tend to receive more points on average. The table below summarises the figures:

Post type	points(total)	points(avg)	comments(total)	comments(avg)
Ask HN	103,378	11.31	94,986	10.39
Show HN	150,781	14.84	49,633	4.89

Conclusion:¶

We accomplished the following tasks in this guided project:

We set a goal for the project.
We collected and sorted the data.
We reformatted and cleaned the data to prepare it for analysis.
We analyzed the data.