So, I actually have a question that lingers in my mind, and Hacker News (HN) seems like an appropriate place to ask.
Luckily, this guided project from Dataquest will help me explore the HN posts by analyzing these questions:
Of course, there are other more important factors, such as the relevancy of the posts and whether the topic is general and/or popular or very niche.
In my case, my question is quite niche and unpopular (trust me, I've searched all of HN and found next to nothing), so I don't really expect answers if I create an 'Ask HN' post. I've even considered gauging users' interest in the topic first by creating a 'Show HN' post and, if the points are high, following up with an 'Ask HN' post. Anyway! That sounds like me getting ahead of myself. Let's keep things simple first, weigh the time factor, and work from there.
The dataset for this project can be found here. It has been reduced from almost 300,000 rows to approximately 20,000 rows by removing submissions that received no comments and then randomly sampling from the remainder.
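For the curious, that reduction could be reproduced roughly like this. This is just a sketch on a tiny hypothetical in-memory dataset with the same column order as `hacker_news.csv` (index 4 is `num_comments`); the real preprocessing was done by Dataquest, not by me:

```python
import random

# Hypothetical rows in the same column order as hacker_news.csv;
# index 4 is num_comments.
rows = [
    ["1", "t1", "u", "10", "0", "a", "1/1/2016 0:00"],
    ["2", "t2", "u", "5", "3", "b", "1/2/2016 0:00"],
    ["3", "t3", "u", "2", "1", "c", "1/3/2016 0:00"],
]

commented = [row for row in rows if int(row[4]) > 0]  # drop no-comment posts
k = min(2, len(commented))                            # sample size, capped
sample = random.sample(commented, k)                  # random subset
print(len(sample))  # 2
```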
And now, since I'm blabbering too much already, let's begin!
...oh, by the way, my question is: "What are the best Massive Open Online Courses (MOOCs) related to clinical/medical/healthcare informatics?" Please contact me if you happen to know the answer, thanks :)
import csv
opened_file = open('hacker_news.csv')
hn = list(csv.reader(opened_file))
hn[:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
Before I dive in further, let's separate that distracting header row from the data.
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Since I'm interested in analyzing posts that contain 'Ask HN' and 'Show HN', let's create lists to separate these posts and see the distribution of the posts and their comments.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("There are " + str((len(ask_posts))) + " 'Ask HN' posts, "
+ str((len(show_posts))) + " 'Show HN' posts, and "
+ str((len(other_posts))) + " other HN posts.")
There are 1744 'Ask HN' posts, 1162 'Show HN' posts, and 17194 other HN posts.
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print("Average 'Ask HN' comments is {:.2f}".format(avg_ask_comments)
+ ", while the average 'Show HN' comments is {:.2f}.".format(avg_show_comments)
)
Average 'Ask HN' comments is 14.04, while the average 'Show HN' comments is 10.32.
It seems 'Ask HN' posts received more love than 'Show HN' posts (about 14 vs. 10 average comments). Well, it makes total sense, doesn't it? People ask questions, people give answers. OK then, this indicates that I should just create the 'Ask HN' post rather than starting with a 'Show HN'.
Let's see if 'Ask HN' posts created at a certain time are more likely to attract comments by performing the following analyses:
import datetime as dt
result_list = []
for row in ask_posts:
    created = row[6]
    comments = int(row[4])
    result_list.append([created, comments])
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
comments_by_hour
{'00': 447, '01': 683, '02': 1381, '03': 421, '04': 337, '05': 464, '06': 397, '07': 267, '08': 492, '09': 251, '10': 793, '11': 641, '12': 687, '13': 1253, '14': 1416, '15': 4477, '16': 1814, '17': 1146, '18': 1439, '19': 1188, '20': 1722, '21': 1745, '22': 479, '23': 543}
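As a side note, the two-dictionary tally pattern used here can also be written with `collections.defaultdict`, which removes the explicit membership check. A sketch with a few hard-coded rows standing in for `result_list`:

```python
from collections import defaultdict
import datetime as dt

# Hypothetical stand-in for result_list: [created_at, num_comments] pairs.
sample = [["8/4/2016 11:52", 52], ["8/4/2016 11:10", 3], ["1/26/2016 19:30", 10]]

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)
for created, comments in sample:
    hour = dt.datetime.strptime(created, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] += 1        # defaultdict starts missing keys at 0
    comments_by_hour[hour] += comments

print(dict(counts_by_hour))    # {'11': 2, '19': 1}
print(dict(comments_by_hour))  # {'11': 55, '19': 10}
```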
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour
[['02', 23.810344827586206], ['23', 7.985294117647059], ['20', 21.525], ['18', 13.20183486238532], ['07', 7.852941176470588], ['10', 13.440677966101696], ['12', 9.41095890410959], ['06', 9.022727272727273], ['13', 14.741176470588234], ['00', 8.127272727272727], ['14', 13.233644859813085], ['05', 10.08695652173913], ['09', 5.5777777777777775], ['01', 11.383333333333333], ['03', 7.796296296296297], ['11', 11.051724137931034], ['22', 6.746478873239437], ['21', 16.009174311926607], ['15', 38.5948275862069], ['19', 10.8], ['17', 11.46], ['04', 7.170212765957447], ['08', 10.25], ['16', 16.796296296296298]]
Um, this is a little confusing. Let's just sort it by the average value and swap the hour and the average.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
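Swapping the columns just so `sorted()` can order by its default first element works fine, but a `key` function achieves the same result without the swap. A sketch on an abbreviated sample:

```python
# Abbreviated sample of [hour, average] pairs.
avg_by_hour = [["02", 23.81], ["15", 38.59], ["20", 21.52]]

# key=lambda row: row[1] sorts by the average, keeping [hour, average] order.
top = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)
print(top[0])  # ['15', 38.59]
```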
Now, let's see the top 5 hours for 'Ask HN' posts comments.
print("Top 5 Hours for 'Ask HN' Posts Comments")
for each_average, each_hour in sorted_swap[:5]:
    top_5 = "{}: {:.2f} average comments per post"
    print(top_5.format(
        dt.datetime.strptime(each_hour, "%H").strftime("%H:%M"), each_average
    ))
Top 5 Hours for 'Ask HN' Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
Since the documentation pointed out that the timezone used is EST, and I live on the other side of the world (GMT +7), this will determine whether I can get my beauty sleep or not. Hopefully so.
Now, let's create a function to convert the timezone (I adapted this awesome code from Stack Overflow).
import datetime
import pytz
def convert_datetime_timezone(dt_str, tz1, tz2):
    # Convert an hour string (e.g. "15") from timezone tz1 to tz2.
    tz1 = pytz.timezone(tz1)
    tz2 = pytz.timezone(tz2)
    dt_obj = datetime.datetime.strptime(dt_str, "%H")
    dt_obj = tz1.localize(dt_obj)
    dt_obj = dt_obj.astimezone(tz2)
    return dt_obj.strftime("%H")

for row in sorted_swap[:5]:
    print(convert_datetime_timezone(row[1], "US/Eastern", "Asia/Jakarta"))
Perfect. I've spent hours understanding this chunk of code and applying it in my for loop. Phew.
Now, let's see the top 5 hours in my local timezone for 'Ask HN' posts comments.
print("Top 5 Hours for 'Ask HN' Posts Comments (GMT +7)")
for each_average, each_hour in sorted_swap[:5]:
    top_5 = "{}: {:.2f} average comments per post"
    newhr = convert_datetime_timezone(each_hour, "US/Eastern", "Asia/Jakarta")
    print(top_5.format(
        dt.datetime.strptime(newhr, "%H").strftime("%H:%M"), each_average
    ))
Top 5 Hours for 'Ask HN' Posts Comments (GMT +7)
03:00: 38.59 average comments per post
14:00: 23.81 average comments per post
08:00: 21.52 average comments per post
04:00: 16.80 average comments per post
09:00: 16.01 average comments per post
Uh, oh. 3 AM?! Bye, beauty sleep :(
The project is actually done on that 'Top 5 Hours'. Yay!
But... yeah, I can't just ignore that 'points' data now that I have some thoughts on it.
As I've mentioned before, 'Ask HN' posts may receive more comments simply because people come there to give answers. On closer look, the dataset provides another interesting column: 'points'.
Maybe 'points' are to 'Show HN' posts what comments are to 'Ask HN' posts: the more interesting the content shared in a 'Show HN' post, the more points it may receive. And while we're at it, let's weigh the time factor as well.
Let's check it out!
# Average points on 'Ask HN' posts
total_ask_points = 0
for row in ask_posts:
    total_ask_points += int(row[3])
avg_ask_points = total_ask_points / len(ask_posts)

# Average points on 'Show HN' posts
total_show_points = 0
for row in show_posts:
    total_show_points += int(row[3])
avg_show_points = total_show_points / len(show_posts)
print("Average 'Ask HN' points is {:.2f}".format(avg_ask_points)
+ ", while the average 'Show HN' points is {:.2f}.".format(avg_show_points))
Average 'Ask HN' points is 15.06, while the average 'Show HN' points is 27.56.
Whoa, my guess was correct! 'Show HN' posts have a better points average than 'Ask HN' posts.
So let's focus only on 'Show HN' posts here and check at what times (in my local time) these posts receive more points, along with the average number of points received at each hour of the day.
# Calculate the amount of 'Show HN' posts created during each hour of day
# and the number of points received.
import datetime as dt
result_list = []
for row in show_posts:
    created = row[6]
    points = int(row[3])
    result_list.append([created, points])

counts_by_hour = {}
points_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    point = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        points_by_hour[time] = point
    else:
        counts_by_hour[time] += 1
        points_by_hour[time] += point
# Calculate the average amount of points `Show HN` posts created
# at each hour of the day receive, sorted and swapped.
avg_by_hour = []
for hour in points_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Now print the 5 hours with the highest average points in local time
print("Top 5 Hours for 'Show HN' Posts Points (GMT +7)")
for each_average, each_hour in sorted_swap[:5]:
    top_5 = "{}: {:.2f} average points per post"
    newhr = convert_datetime_timezone(each_hour, "US/Eastern", "Asia/Jakarta")
    print(top_5.format(
        dt.datetime.strptime(newhr, "%H").strftime("%H:%M"), each_average
    ))
Top 5 Hours for 'Show HN' Posts Points (GMT +7)
11:00: 42.39 average points per post
00:00: 41.69 average points per post
10:00: 40.35 average points per post
12:00: 37.84 average points per post
06:00: 36.31 average points per post
That's a relief. Most of those are lovely daytime hours for me to post 'Show HN' and collect points.
I suspect that 'other HN' posts will score higher in both average comments and points. Let's see.
# Average number of comments `other HN` posts receive.
total_other_comments = 0
for row in other_posts:
    total_other_comments += int(row[4])
avg_other_comments = total_other_comments / len(other_posts)

# Average points on 'other HN' posts
total_other_points = 0
for row in other_posts:
    total_other_points += int(row[3])
avg_other_points = total_other_points / len(other_posts)
print("Average 'Ask HN' comments is {:.2f}.".format(avg_ask_comments))
print("Average 'Show HN' comments is {:.2f}.".format(avg_show_comments))
print("Average 'other HN' comments is {:.2f}.".format(avg_other_comments))
print("\n")
print("Average 'Ask HN' points is {:.2f}.".format(avg_ask_points))
print("Average 'Show HN' points is {:.2f}.".format(avg_show_points))
print("Average 'other HN' points is {:.2f}.".format(avg_other_points))
Average 'Ask HN' comments is 14.04.
Average 'Show HN' comments is 10.32.
Average 'other HN' comments is 26.87.

Average 'Ask HN' points is 15.06.
Average 'Show HN' points is 27.56.
Average 'other HN' points is 55.41.
Yup, for 'other HN' posts, both the average comments and the average points are higher than for 'Ask HN' and 'Show HN' posts.
Since the average points for 'other HN' posts are higher than their average comments, let's focus on the top 5 hours for getting more points on 'other HN' posts.
# Calculate the amount of 'other HN' posts created during each hour of day
# and the number of points received.
import datetime as dt
result_list = []
for row in other_posts:
    created = row[6]
    points = int(row[3])
    result_list.append([created, points])

counts_by_hour = {}
points_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    point = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        points_by_hour[time] = point
    else:
        counts_by_hour[time] += 1
        points_by_hour[time] += point

# Calculate the average amount of points `other HN` posts created
# at each hour of the day receive, sorted and swapped.
avg_by_hour = []
for hour in points_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Now print the 5 hours with the highest average points in local time
print("Top 5 Hours for 'other HN' Posts Points (GMT +7)")
for each_average, each_hour in sorted_swap[:5]:
    top_5 = "{}: {:.2f} average points per post"
    newhr = convert_datetime_timezone(each_hour, "US/Eastern", "Asia/Jakarta")
    print(top_5.format(
        dt.datetime.strptime(newhr, "%H").strftime("%H:%M"), each_average
    ))
Top 5 Hours for 'other HN' Posts Points (GMT +7)
01:00: 62.53 average points per post
02:00: 61.79 average points per post
03:00: 60.54 average points per post
22:00: 60.48 average points per post
07:00: 60.01 average points per post
Ooookay, the top 5 hours are again the enemy of my beauty sleep LOL
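Looking back, the per-hour tallies for comments and points repeat the same pattern three times; if I revisit this, one small helper could handle all of them. A sketch (the function name `avg_by_hour` and the mini-dataset are my own, assuming the project's column layout where index 6 is `created_at`):

```python
import datetime as dt

def avg_by_hour(posts, value_index, date_format="%m/%d/%Y %H:%M"):
    """Average of posts[i][value_index] per creation hour, best hours first."""
    counts, totals = {}, {}
    for row in posts:
        hour = dt.datetime.strptime(row[6], date_format).strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
    # Build (average, hour) pairs so reverse sorting ranks by average.
    return sorted(((totals[h] / counts[h], h) for h in counts), reverse=True)

# Hypothetical mini-dataset in the same column order as hacker_news.csv.
posts = [
    ["1", "Ask HN: a", "", "4", "10", "x", "8/4/2016 15:01"],
    ["2", "Ask HN: b", "", "2", "30", "y", "8/5/2016 15:30"],
    ["3", "Ask HN: c", "", "9", "5", "z", "8/4/2016 2:20"],
]
print(avg_by_hour(posts, 4)[0])  # (20.0, '15') — comments column
print(avg_by_hour(posts, 3)[0])  # (9.0, '02') — points column
```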
Well, I need to do the following if I want to get more comments and more points on HN posts:
I need to learn how to incorporate Daylight Saving Time (DST) into the code. Converting each post's full timestamp, rather than a bare hour, would let the timezone library pick the correct EST/EDT offset for each date (localizing a bare "%H" pins every value to January 1, 1900).
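A sketch of what that DST-aware conversion might look like, using the standard-library `zoneinfo` (Python 3.9+) instead of pytz and assuming the dataset's date format (the function name `to_local_hour` is my own):

```python
import datetime as dt
from zoneinfo import ZoneInfo

def to_local_hour(date_str, date_format="%m/%d/%Y %H:%M"):
    # Attach US/Eastern to the full timestamp so the library can pick
    # the correct EST/EDT offset for that date, then convert to GMT+7.
    naive = dt.datetime.strptime(date_str, date_format)
    eastern = naive.replace(tzinfo=ZoneInfo("US/Eastern"))
    return eastern.astimezone(ZoneInfo("Asia/Jakarta")).strftime("%H")

print(to_local_hour("8/4/2016 11:52"))   # "22" — August is EDT (UTC-4)
print(to_local_hour("1/26/2016 19:30"))  # "07" — January is EST (UTC-5)
```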
...but unfortunately, my 'Ask HN' post received no reply :( That's okay, since I'm only considering one factor (i.e., time) and the topic itself is not a popular/general HN topic.
Anyway, this was a fun project and well-worth a lack of beauty sleep :p