Hacker News is a news site started by the startup incubator Y Combinator. It is a site where you can share a story or ask a question; these submissions (known as "posts") are voted and commented upon, similar to reddit.
Our goal for this project is to determine the following:
Do Ask HN or Show HN posts receive more comments on average?
We will use this data set, which has approximately 20,000 rows.
Let's start by exploring our data set. First we will print the headers and first five rows.
from csv import reader

opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()
headers = hn[0]  # the first row holds the column names
hn = hn[1:]      # keep only the data rows

print(headers, '\n')
for row in hn[:5]:
    print(row, '\n')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
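The loading step can also be written with a `with` block, which closes the file automatically. The following is only a minimal sketch: the helper name `load_rows`, the in-memory sample, and the `encoding="utf-8"` argument in the commented usage are assumptions made for illustration, not part of the original notebook.

```python
import csv
import io


def load_rows(file_obj):
    """Return (headers, rows) from an already opened CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# Hypothetical in-memory sample standing in for hacker_news.csv
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
headers, rows = load_rows(sample)
print(headers)
print(rows[0])

# With the real file, the helper would be used like this (encoding is assumed):
# with open("hacker_news.csv", newline="", encoding="utf-8") as f:
#     headers, hn = load_rows(f)
```

Passing a file object instead of a path keeps the helper easy to try on small samples before running it on the full data set.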
Below are descriptions of the columns:
id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted

Let's start by splitting the data into three parts:
Posts whose title starts with Ask HN
Posts whose title starts with Show HN
Posts that are neither Ask HN nor Show HN
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Ask posts length:", len(ask_posts))
print("Show posts length:", len(show_posts))
print("Other posts length:", len(other_posts))
Ask posts length: 1744
Show posts length: 1162
Other posts length: 17194
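The prefix test used for the split is case-insensitive because the title is lowercased first. A small sketch of that rule as a standalone function; the name `classify` and the first two sample titles are hypothetical and only illustrate the behavior:

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The first two titles are made up; the third appears in the sample rows
for title in ["Ask HN: How do you learn?", "SHOW HN: My project", "Interactive Dynamic Video"]:
    print(title, "->", classify(title))
```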
For our analysis we are only interested in posts starting with Ask HN and Show HN.
Let's start by computing the average number of comments for both Ask HN and Show HN posts.
total_ask_comments = 0
for post in ask_posts:
    comments = int(post[4])
    total_ask_comments += comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments per Ask HN post: {:.2f}".format(avg_ask_comments))
Average number of comments per Ask HN post: 14.04
total_show_comments = 0
for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments
avg_show_comments = total_show_comments / len(show_posts)  # divide by the number of Show HN posts
print("Average number of comments per Show HN post: {:.2f}".format(avg_show_comments))
Average number of comments per Show HN post: 10.32
We find that Ask HN posts receive more comments on average (14.04 per post) than Show HN posts (10.32 per post).
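Both averages follow the same pattern, so the computation can be factored into one helper that always divides by the length of the list it was given. This is only a sketch: `average_comments` and the sample rows are hypothetical, and the column index 4 is taken from the header order shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows; only the num_comments column (index 4) matters here
sample_posts = [
    ["1", "Ask HN: A", "", "3", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "5", "20", "user_b", "8/4/2016 12:10"],
]
print("{:.2f}".format(average_comments(sample_posts)))  # 15.00
```

With the lists built earlier, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.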
With that, we will use the Ask HN posts to determine which hour of the day attracts the most comments.
import datetime as dt

result_list = []  # [created_at, num_comments] pairs for each Ask HN post
for post in ask_posts:
    created_at = post[6]
    comments = post[4]
    result_list.append([created_at, comments])

counts_by_hour = {}    # dictionary: hour -> number of posts created in that hour
comments_by_hour = {}  # dictionary: hour -> total number of comments on those posts

for result in result_list:
    hour = dt.datetime.strptime(result[0], "%m/%d/%Y %H:%M")
    hour = hour.strftime("%I %p")  # 12-hour clock plus the locale's AM/PM marker
    comments = int(result[1])
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

# Sort counts_by_hour by value in descending order
counts_by_hour = {k: v for k, v in sorted(counts_by_hour.items(), key=lambda item: item[1], reverse=True)}
# Sort comments_by_hour by value in descending order
comments_by_hour = {k: v for k, v in sorted(comments_by_hour.items(), key=lambda item: item[1], reverse=True)}

print("Hour | # of post")
for hour, posts_len in counts_by_hour.items():
    print(hour, "| ", posts_len)
Hour | # of post
03 PM |  116
07 PM |  110
09 PM |  109
06 PM |  109
04 PM |  108
02 PM |  107
05 PM |  100
01 PM |  85
08 PM |  80
12 PM |  73
10 PM |  71
11 PM |  68
01 AM |  60
10 AM |  59
02 AM |  58
11 AM |  58
12 AM |  55
03 AM |  54
08 AM |  48
04 AM |  47
05 AM |  46
09 AM |  45
06 AM |  44
07 AM |  34
We see from the result that 3 PM has the most posts (116), followed by 7 PM with 110 posts, then 9 PM and 6 PM with 109 posts each, and so on.
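The hour labels come from parsing each `created_at` string and formatting the result again. A short sketch of that round trip on two timestamps from the sample rows shown earlier; the AM/PM text depends on the locale, so the 12-hour labels in the comments assume an English locale:

```python
import datetime as dt

for created_at in ["8/4/2016 11:52", "6/23/2016 22:20"]:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    print(parsed.hour)               # 24-hour value as an integer: 11, then 22
    print(parsed.strftime("%H"))     # zero-padded 24-hour string: 11, then 22
    print(parsed.strftime("%I %p"))  # 12-hour label, e.g. 11 AM and 10 PM in an English locale
```

The 24-hour forms sort chronologically, which the 12-hour labels do not, so they can be the more convenient dictionary key when the hours need to be listed in order.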
Now, let's check the number of comments per hour.
print("Hour | # of Comments")
for hour, comments_len in comments_by_hour.items():
    print(hour, "| ", comments_len)
Hour | # of Comments
03 PM |  4477
04 PM |  1814
09 PM |  1745
08 PM |  1722
06 PM |  1439
02 PM |  1416
02 AM |  1381
01 PM |  1253
07 PM |  1188
05 PM |  1146
10 AM |  793
12 PM |  687
01 AM |  683
11 AM |  641
11 PM |  543
08 AM |  492
10 PM |  479
05 AM |  464
12 AM |  447
03 AM |  421
06 AM |  397
04 AM |  337
07 AM |  267
09 AM |  251
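We see almost the same result here: 3 PM has the most comments, followed by 4 PM, 9 PM and so on. Because the number of posts differs from hour to hour, we next compute the average number of comments per post for each hour.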
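Totals alone can mislead, because an hour with many posts collects many comments even if each post gets few. As a quick worked check, the snippet below divides comment totals by post counts for three hours, using only numbers copied from the two tables above:

```python
# Post counts and comment totals copied from the two tables above
posts = {"03 PM": 116, "07 PM": 110, "02 AM": 58}
comments = {"03 PM": 4477, "07 PM": 1188, "02 AM": 1381}

for hour in posts:
    # average comments per post for this hour
    print(hour, round(comments[hour] / posts[hour], 2))
```

7 PM has the second-highest number of posts but averages only about 10.8 comments per post, while 2 AM, with roughly half as many posts, averages about 23.81.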
avg_by_hour = []  # [hour, average comments per post] pairs
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

# Sort avg_by_hour by average in descending order
avg_by_hour = sorted(avg_by_hour, key=lambda item: item[1], reverse=True)

print("Top 5 Hours for Ask Posts Comments")
template = "{}: {:.2f} average comments per post."
for hour, average in avg_by_hour[:5]:
    # hour is already a 12-hour label such as "03 PM"
    print(template.format(hour, average))
Top 5 Hours for Ask Posts Comments
03 PM: 38.59 average comments per post.
02 AM: 23.81 average comments per post.
08 PM: 21.52 average comments per post.
04 PM: 16.80 average comments per post.
09 PM: 16.01 average comments per post.
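The counting, summing and averaging steps can also be done in a single pass. The sketch below is one possible variation, not the method used above: it keys on the integer hour, uses `collections.defaultdict`, and runs on hypothetical sample pairs rather than on the data set.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical (created_at, num_comments) pairs, not taken from the data set
sample = [
    ("8/4/2016 15:10", "10"),
    ("8/5/2016 15:45", "30"),
    ("8/5/2016 02:00", "8"),
]

totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]
for created_at, comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    totals[hour][0] += 1
    totals[hour][1] += int(comments)

# Average comments per post for each hour, highest first
averages = {hour: total / count for hour, (count, total) in totals.items()}
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for hour, avg in ranked:
    print("{:02d}:00 {:.2f}".format(hour, avg))
```

Keeping the count and the total together avoids having two dictionaries that must stay in step.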
From our analysis, we conclude that Ask HN posts attract more comments on average than Show HN posts. Based on the top 5 hours by average comments per post, Ask HN posts created around 3 PM, 2 AM and 8 PM receive the most comments.