In this project, we look at the popular technology news site Hacker News and analyse posts whose titles begin with either Ask HN or Show HN.
Users submit Ask HN posts to ask the community a specific question, and Show HN posts to show the community a project, product, or something interesting.
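We will compare the two types of posts to determine:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time of day receive more comments on average?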
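The dataset for this project can be downloaded from here. Below is a description of the columns:
- id: the unique identifier of the post
- title: the title of the post
- url: the URL the post links to, if any
- num_points: the number of points the post received
- num_comments: the number of comments on the post
- author: the username of the person who submitted the post
- created_at: the date and time of the submission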
Note: This dataset is based on approximately 20,000 rows randomly sampled from the submissions after removing posts without any comments.
Let's read the dataset and display some posts.
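from csv import reader
# Read the whole file into a list of rows; the with block closes the file afterwards
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]
print(headers)
for post in hn[:5]:
    print(post)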
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
Now, let's separate the posts, as we are only interested in Ask HN and Show HN posts.
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    # Compare in lower case so that the match is case-insensitive
    title = post[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(post)
    elif title.startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
print('Number of Ask HN posts: ',len(ask_posts))
print('Number of Show HN posts: ',len(show_posts))
print('Number of Other posts: ',len(other_posts))
print('\n')
print('='*5,'Ask HN posts','='*5)
print(ask_posts[:5])
print('\n')
print('='*5,'Show HN posts','='*5)
print(show_posts[:5])
Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of Other posts: 17194

===== Ask HN posts =====
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]

===== Show HN posts =====
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
Let's determine the total number of comments and the average number of comments per post for Ask HN and Show HN posts.
def get_comments_stats(posts, print_answers = True):
    total_posts = len(posts)
    total_comments = 0
    for post in posts:
        # num_comments is the fifth column of each row
        total_comments += int(post[4])
    # Compute the average once, after the loop; return 0 for an empty list
    avg_comments = round(total_comments/total_posts,2) if total_posts else 0
    if print_answers:
        print('Total posts = ',total_posts)
        print('Total comments = ',total_comments)
        print('Average comments per post = ',avg_comments)
    return avg_comments, total_comments
print('='*5,'Ask HN','='*5)
avg_ask_comments,total_ask_comments = get_comments_stats(ask_posts)
print('\n')
print('='*5,'Show HN','='*5)
avg_show_comments,total_show_comments = get_comments_stats(show_posts)
===== Ask HN =====
Total posts = 1744
Total comments = 24483
Average comments per post = 14.04

===== Show HN =====
Total posts = 1162
Total comments = 11988
Average comments per post = 10.32
After separating the Ask HN and Show HN posts and calculating the average comments across posts, we can see that the average number of comments per post is higher for Ask HN posts (about 14) than for Show HN posts (about 10).
This answers our first question:
Ask HN posts receive more comments on average than Show HN posts.
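As a quick sanity check, the averages can be recomputed directly from the totals reported in the output above. This is only a small arithmetic sketch; the variable names are illustrative and not part of the original notebook.

```python
# Totals copied from the printed output above
ask_total_comments, ask_total_posts = 24483, 1744
show_total_comments, show_total_posts = 11988, 1162

# Average = total comments / total posts, rounded to two decimals
ask_avg = round(ask_total_comments / ask_total_posts, 2)
show_avg = round(show_total_comments / show_total_posts, 2)

# How many times more comments an average Ask HN post receives
ratio = ask_avg / show_avg

print("Ask HN average:", ask_avg)
print("Show HN average:", show_avg)
print("Ask/Show ratio: {:.2f}".format(ratio))
```

In this sample, an average Ask HN post therefore receives roughly a third more comments than an average Show HN post.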
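Since Ask HN posts are more likely to receive comments, we will focus only on these posts to answer our next question: do Ask HN posts created at a certain time of day receive more comments on average?
To answer this, we need to:
- Count the Ask HN posts created in each hour of the day, along with the number of comments they received.
- Calculate the average number of comments per post for each hour.
For this, we will extract the hour from the created_at datetime field of each post and then create two dictionaries: one for the number of posts and another for the number of comments by hour.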
import datetime as dt
import pytz
result_list = []
for post in ask_posts:
    created_at = dt.datetime.strptime(post[6],'%m/%d/%Y %H:%M')
    num_comments = int(post[4])
    result_list.append([created_at,num_comments])
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created_date = row[0]
    created_hour = created_date.strftime('%H')
    num_comments = row[1]
    if created_hour in counts_by_hour:
        counts_by_hour[created_hour] += 1
        comments_by_hour[created_hour] += num_comments
    else:
        counts_by_hour[created_hour] = 1
        comments_by_hour[created_hour] = num_comments
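The same per-hour tallies could also be built with collections.defaultdict, which removes the need for the if/else branch. The sketch below runs on three made-up rows (sample_posts is hypothetical, laid out in the dataset's column order) rather than on the real data, so the numbers are purely illustrative.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical sample rows in the dataset's column order
sample_posts = [
    ["1", "Ask HN: A?", "", "2", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: B?", "", "5", "4", "user_b", "8/17/2016 9:10"],
    ["3", "Ask HN: C?", "", "1", "10", "user_c", "8/17/2016 15:30"],
]

# Missing keys start at 0, so no explicit membership check is needed
counts = defaultdict(int)
comments = defaultdict(int)

for post in sample_posts:
    created_at = dt.datetime.strptime(post[6], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")  # zero-padded hour, e.g. "09"
    counts[hour] += 1
    comments[hour] += int(post[4])

print(dict(counts))
print(dict(comments))
```

Both approaches produce the same dictionaries; the explicit if/else version simply spells out what defaultdict does behind the scenes.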
Now that we have the number of posts and the total number of comments for each hour of the day, we will calculate the average number of comments per post by the hour of creation and display the top 5 hours during which Ask HN posts receive the most comments.
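The ranking step described here can be sketched compactly with sorted and a key function. The averages in avg_by_hour_sample are made-up values for illustration, and top_hours is a hypothetical helper rather than a function from this notebook.

```python
# Hypothetical per-hour averages keyed by zero-padded hour strings
avg_by_hour_sample = {"09": 5.5, "15": 38.5, "02": 23.75, "20": 21.5}

def top_hours(avg_table, n=5):
    """Return the n (hour, average) pairs with the highest averages."""
    ranked = sorted(avg_table.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

top_three = top_hours(avg_by_hour_sample, n=3)
for hour, avg in top_three:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Sorting on the value through a key function avoids building intermediate (value, key) tuples, and slicing keeps only the first n entries.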
Note: According to the documentation, the dataset's timezone is US/Eastern, so we will display the results both in this timezone and in the local timezone, which in my case is Europe/London.
from IPython.display import display, Markdown
# Return (average, hour) tuples sorted from the highest to the lowest average
def sorted_result(freq_tbl):
    table_display = []
    for key in freq_tbl:
        key_val_as_tuple = (freq_tbl[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    return table_sorted
# Sort and display the formatted results. Shows the top 5 in US/Eastern by default;
# set display_in_local_time = True and pass local_timezone to convert the hours
def display_result(freq_tbl, display_top_n = 5, display_in_local_time = False, local_timezone = 'Europe/London'):
    # Dataset timezone per documentation
    us_eastern = pytz.timezone('US/Eastern')
    # Target timezone used when display_in_local_time is True
    local_tz = pytz.timezone(local_timezone)
    if display_in_local_time:
        zone_name = local_tz.zone
    else:
        zone_name = us_eastern.zone
    heading = 'Top {} Hours ({}) for Ask Posts Comments'.format(display_top_n, zone_name)
    display(Markdown('**' + heading + '**'))
    table_sorted = sorted_result(freq_tbl)
    display_str_template = "{}: {:.2f} average comments per post"
    # Slicing keeps only the top entries, so no row counter is needed
    for entry in table_sorted[:display_top_n]:
        hour = dt.datetime.strptime(entry[1], '%H')
        if display_in_local_time:
            # Caveat: attaching a pytz zone with replace(tzinfo=...) can pick up a
            # historical local-mean-time offset, which is the likely reason the
            # converted times print as :55 instead of :00; the pytz documentation
            # recommends localize() on a real date instead
            hour = hour.replace(tzinfo=us_eastern)
            hour = hour.astimezone(local_tz)
        hour_fmt = hour.strftime('%H:%M')
        print(display_str_template.format(hour_fmt, entry[0]))
avg_by_hour = {}
for hour in comments_by_hour:
    posts = counts_by_hour[hour]
    num_comments = comments_by_hour[hour]
    avg_comments = num_comments/posts
    avg_by_hour[hour] = avg_comments
display_result(freq_tbl = avg_by_hour,display_in_local_time = False)
print('\n')
display_result(freq_tbl = avg_by_hour,display_in_local_time = True,local_timezone = 'Europe/London')
Top 5 Hours (US/Eastern) for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

Top 5 Hours (Europe/London) for Ask Posts Comments
19:55: 38.59 average comments per post
06:55: 23.81 average comments per post
00:55: 21.52 average comments per post
20:55: 16.80 average comments per post
01:55: 16.01 average comments per post
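The :55 minutes in the Europe/London output are most likely an artefact of how the timezone was attached (the pytz documentation warns that passing a pytz zone through tzinfo can apply a historical local-mean-time offset), not a real five-minute shift. The sketch below shows the whole-hour result one would expect. It is a simplification that assumes fixed standard-time offsets (EST = UTC-5, GMT = UTC+0) and ignores daylight saving time; eastern_hour_to_london is a hypothetical helper, not part of the notebook.

```python
import datetime as dt

# Assumption: fixed standard-time offsets; daylight saving time is ignored
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), name="EST")
LONDON_STANDARD = dt.timezone(dt.timedelta(hours=0), name="GMT")

def eastern_hour_to_london(hour_str):
    """Convert an 'HH' hour string from Eastern standard time to London standard time."""
    # The calendar date is arbitrary here because both offsets are fixed
    naive = dt.datetime.strptime(hour_str, "%H").replace(year=2016, month=1, day=15)
    eastern = naive.replace(tzinfo=EASTERN_STANDARD)
    return eastern.astimezone(LONDON_STANDARD).strftime("%H:%M")

# Top five hours from the US/Eastern output above
top_eastern_hours = ["15", "02", "20", "16", "21"]
converted = [eastern_hour_to_london(h) for h in top_eastern_hours]
print(converted)
```

Under these assumptions, 15:00 US/Eastern maps to 20:00 in London, which matches the "around 8 pm" reading of the results.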
This answers our last question:
Ask HN posts created around 3 pm US/Eastern time (around 8 pm Europe/London time) seem to have the highest average number of comments per post.
After analysing the Ask HN and Show HN posts, we conclude that Ask HN posts receive more comments on average and that the best hour of the day for comment activity on Ask HN posts is around 3 pm US/Eastern time (8 pm Europe/London time).