Hacker News is an extremely popular site in the tech and startup world. A user can submit a post, which is then voted and commented on, very similar to Reddit. The top posts can receive hundreds of thousands of visitors.
I am aiming to explore two types of posts, Ask HN and Show HN, to find out the following: do Ask HN or Show HN posts receive more comments on average?

In the cell below I have done the following:
- Opened hacker_news.csv and read it in with reader
- Converted the read file to a list of lists using the list() function and assigned it to a variable hn
- Assigned the first row to a variable called headers, so I can easily reference the column titles if needed
- Reassigned hn so that it does not include the header row
- Used the print() function to display headers and the first 5 rows of hn
from csv import reader
file = open("hacker_news.csv")
read = reader(file)
hn = list(read)
headers = hn[0]
hn = hn[1:]
print(headers)
print("")
hn[:5]
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
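As a side note, the file opened above is never closed. The same read can be wrapped in a with block so the file handle is released automatically; a minimal sketch, using an in-memory stand-in for hacker_news.csv (StringIO is here only so the snippet is self-contained):

```python
from csv import reader
from io import StringIO

# Stand-in for open("hacker_news.csv"); a real run would use:
#   with open("hacker_news.csv") as file:
csv_text = "id,title\n1,Ask HN: example\n"
with StringIO(csv_text) as file:
    rows = list(reader(file))

headers = rows[0]   # first row holds the column titles
data = rows[1:]     # remaining rows hold the posts
print(headers)
```

Closing the file automatically means the handle is not leaked if a later cell raises an error.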
In the code cell below, I first made three empty lists in which to store the specific posts I needed. I then looped through each row in hn to find the rows containing the following elements: "ask hn", "show hn", and the remaining posts. I decided to use the string method startswith, and to lowercase the title column with the lower method (assigning it to a variable called title), to ensure there were no issues with the strings in the list of lists being uppercase or lowercase. I then used conditional statements to find the rows that started with the identified string, and the append method to add each matching row to the corresponding list.

In the next two cells, I counted and printed my newly created lists to ensure all went well.
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
ask_posts[:5]
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
show_posts[:5]
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
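One subtlety worth noting in the categorisation step: chaining the checks with elif ensures each row lands in exactly one list (with a second plain if, every Ask post would also fall through to other_posts). A self-contained sketch with made-up rows mirroring the hn layout:

```python
# Made-up rows in the same layout as hn (title at index 1)
hn_sample = [['1', 'Ask HN: a question', '', '2', '6', 'u1', '8/16/2016 9:55'],
             ['2', 'Show HN: a thing', '', '26', '22', 'u2', '11/25/2015 14:03'],
             ['3', 'A regular link post', '', '386', '52', 'u3', '8/4/2016 11:52']]

ask, show, other = [], [], []
for post in hn_sample:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask.append(post)
    elif title.startswith('show hn'):
        show.append(post)
    else:
        other.append(post)

# Every row should end up in exactly one bucket
assert len(ask) + len(show) + len(other) == len(hn_sample)
```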
In this section the aim was to compare the average number of comments for the Ask HN and Show HN posts.
The following tasks were completed in the cells below:

- Used the print function to display headers, to find the right index
- For each list (ask_posts and show_posts), used a for loop to iterate over the rows, turning the num_comments column into an integer using the int function, then adding it to a pre-made total variable named total_ask_comments or total_show_comments
- Divided each total by the length of its list to get avg_ask_comments or avg_show_comments
print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
total_ask_comments = 0
for a in ask_posts:
    num = int(a[4])
    total_ask_comments += num
avg_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of comments for Ask Posts: ", avg_ask_comments)

total_show_comments = 0
for s in show_posts:
    num = int(s[4])
    total_show_comments += num
avg_show_comments = total_show_comments / len(show_posts)
print("The average number of comments for Show Posts: ", avg_show_comments)
The average number of comments for Ask Posts: 14.038417431192661
The average number of comments for Show Posts: 10.31669535283993
From the analysis of the average comments for each list, it was found that Ask posts have more comments on average than Show posts.

This could be due to the desired outcome of an Ask post. If you make an Ask post, you are intending that someone will comment, i.e. answer your question. Show posts, though, do not have a question to answer; viewers simply look at the post. A viewer may wish to comment, but it does not come as naturally as responding to someone asking you a question.
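As an aside, the two averaging loops above are nearly identical, so they could be factored into a single small helper. A minimal sketch (avg_comments is a hypothetical name, not part of the original notebook):

```python
def avg_comments(posts, comments_index=4):
    # num_comments is stored as a string in each row, so convert before summing
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the hn layout (only the comments column at index 4 matters here)
sample = [['1', 'title', '', '1', '10', 'user', 'date'],
          ['2', 'title', '', '1', '20', 'user', 'date']]
print(avg_comments(sample))  # 15.0
```

The same function would then work for ask_posts, show_posts, and other_posts without repeating the loop.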
In this section, I made two dictionaries: counts_by_hour and comments_by_hour.

- counts_by_hour: contains the number of ask posts created during each hour of the day.
- comments_by_hour: contains the corresponding number of comments ask posts created at each hour received.

A summary of the cells below:

- Imported the datetime module as dt
- Created an empty list, result_list, to store two elements from the columns created_at and num_comments
- Looped through ask_posts and appended the two elements to result_list
- Created two empty dictionaries, counts_by_hour and comments_by_hour
- Looped through result_list, parsing each date with the datetime.strptime() method and extracting the hour with the datetime.strftime() method to populate both dictionaries

print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
import datetime as dt
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comment_num = row[1]
    created = row[0]
    created_dt = dt.datetime.strptime(created, '%m/%d/%Y %H:%M')
    created_hour = created_dt.strftime('%H')
    if created_hour in counts_by_hour:
        counts_by_hour[created_hour] += 1
        comments_by_hour[created_hour] += comment_num
    else:
        counts_by_hour[created_hour] = 1
        comments_by_hour[created_hour] = comment_num
print(counts_by_hour)
print("")
print(comments_by_hour)
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
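The format string '%m/%d/%Y %H:%M' used above matches the created_at timestamps (e.g. '8/4/2016 11:52': month, day, four-digit year, then 24-hour time). A quick standalone check of the parse-then-extract-hour step:

```python
import datetime as dt

# Same format string as in the loop above
created_dt = dt.datetime.strptime('8/4/2016 11:52', '%m/%d/%Y %H:%M')
print(created_dt.strftime('%H'))  # '11'
```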
Next I will use the two dictionaries created to calculate the average number of comments for posts created during each hour of the day.
This was done by:

- Creating an empty list, avg_per_hour
- Looping through comments_by_hour, dividing each hour's comment total by the corresponding post count in counts_by_hour
- Rounding each result to two decimal places with the round function and assigning it to a variable named average
- Appending the hour and average to avg_per_hour
avg_per_hour = []
for hour in comments_by_hour:
    average = round(comments_by_hour[hour] / counts_by_hour[hour], 2)  # decided it was best to round the average to two decimal places
    avg_per_hour.append([hour, average])
avg_per_hour
[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]
swap_avg_per_hour = []
for row in avg_per_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_per_hour.append([avg, hour])
swap_avg_per_hour
[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]
sorted_swap = sorted(swap_avg_per_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour_dt = dt.datetime.strptime(row[1], '%H')
    hour_str = hour_dt.strftime('%H:%M')
    pt_hour_dt = hour_dt - dt.timedelta(hours=3)
    pt_hour_str = pt_hour_dt.strftime('%H:%M')
    ct_hour_dt = hour_dt - dt.timedelta(hours=1)
    ct_hour_str = ct_hour_dt.strftime('%H:%M')
    print(' ', '{pst_time} PST, {cst_time} CST, {est_time} EST: {avg:.2f} average comments per post'.format(pst_time=pt_hour_str, cst_time=ct_hour_str, est_time=hour_str, avg=row[0]))
Top 5 Hours for Ask Posts Comments
  12:00 PST, 14:00 CST, 15:00 EST: 38.59 average comments per post
  23:00 PST, 01:00 CST, 02:00 EST: 23.81 average comments per post
  17:00 PST, 19:00 CST, 20:00 EST: 21.52 average comments per post
  13:00 PST, 15:00 CST, 16:00 EST: 16.80 average comments per post
  18:00 PST, 20:00 CST, 21:00 EST: 16.01 average comments per post
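As an aside, the swap step above exists only so that sorted orders by the average; the same top-5 ordering can be had in one step by passing a key function. A sketch with a made-up subset of avg_per_hour:

```python
# Made-up subset of avg_per_hour ([hour, average] pairs)
avg_subset = [['09', 5.58], ['15', 38.59], ['13', 14.74]]

# Sort by the average (index 1) directly, so no swapping is needed
top = sorted(avg_subset, key=lambda row: row[1], reverse=True)
print(top[0])  # ['15', 38.59]
```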
The results showed that posts created between the hours of 3 PM and 4 PM EST had the highest average number of comments per post. It was unclear why this was.

I therefore decided to compare the most populous timezones in the USA (Pacific, Central, and Eastern) to see if a clear indication appeared. The highest averages of comments were found in the middle of the day, possibly when most users would be active. This would explain why these times across the USA are much higher than the other results. In addition, it is important to mention that Hacker News was started by Y Combinator, which is located in Pacific Time.

It would be interesting to see where the most common posts come from in regards to timezone, to see if it matches the above results.

From the results above, if your intention is to create a post that attracts the highest possible number of comments, posting at around 3 PM EST would be recommended.
print("Top 5 Hours for Ask Posts Comments - European Timezone Comparison")
for row in sorted_swap[:5]:
    est_hour_dt = dt.datetime.strptime(row[1], '%H')
    est_hour_str = est_hour_dt.strftime('%H:%M')
    # Central European Summer Time is 7 hours ahead of EST
    cest_hour_dt = est_hour_dt + dt.timedelta(hours=7)
    cest_hour_str = cest_hour_dt.strftime('%H:%M')
    print(' ', '{est_time} EST: {cest_time} CEST: {avg:.2f} average comments per post'.format(est_time=est_hour_str, cest_time=cest_hour_str, avg=row[0]))
Top 5 Hours for Ask Posts Comments - European Timezone Comparison
  15:00 EST: 22:00 CEST: 38.59 average comments per post
  02:00 EST: 09:00 CEST: 23.81 average comments per post
  20:00 EST: 03:00 CEST: 21.52 average comments per post
  16:00 EST: 23:00 CEST: 16.80 average comments per post
  21:00 EST: 04:00 CEST: 16.01 average comments per post
The above results are a comparison between the Eastern US and Central European time zones.

From analysing the results, perhaps another reason why 3 PM EST has a higher average number of comments is that Europe is still active at that time.

It can be concluded that the best time to post, with the intention of gaining the most comments, is between the hours of 3 PM and 4 PM EST. This could be because it is a time when two large populations (North America and Europe) are both active.

A future add-on for this project would be to compare the data collected here with the following: the number of users per country/state, and where the highest number of posts come from, i.e. location. This could provide further detail on when it is best to post, with the possibility of other findings regarding the general use of Hacker News for creating engagement.