In this project, we'll work with a data set of submissions to popular technology site Hacker News.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result. You can find the data set here. Note that the file used in this project contains 293,119 posts, which is the sum of the three counts printed after the filtering step below.
We're specifically interested in posts whose titles begin with either "Ask HN" or "Show HN". Users submit "Ask HN" posts to ask the Hacker News community a specific question. Likewise, users submit "Show HN" posts to show the Hacker News community a project, product, or just generally something interesting.
First, we open and read the data. Each row (a data point) is read as a list, and all rows are stored in an outer list, giving a list of lists.
from csv import reader
opened_file = open('hacker_news.csv', encoding = 'UTF-8')
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:5]) #displaying first 5 rows
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
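A minimal sketch of the same read using a `with` block, which closes the file automatically once reading finishes. The helper name `read_rows` and the in-memory two-line sample are assumptions made here so the snippet runs without hacker_news.csv.

```python
import csv
import io


def read_rows(file_obj):
    """Return every CSV row from an open text file as a list of lists."""
    return list(csv.reader(file_obj))


# Hypothetical two-line sample standing in for hacker_news.csv.
# With the real file, the equivalent call would be:
#     with open('hacker_news.csv', encoding='UTF-8', newline='') as f:
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)

with sample as f:
    rows = read_rows(f)

print(rows[0])  # header row
print(rows[1])  # first data row
```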
Let us isolate the header from the rest of the data and display the first five rows of "hn" again to verify that we removed the header row properly.
headers = hn[0]
hn = hn[1:]
print(hn[:5])
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
Now that we've removed the headers from "hn", we're ready to filter our data. Since we're only concerned with post titles beginning with "Ask HN" or "Show HN", we'll create new lists of lists containing just the data for those titles.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title_lowercase = title.lower()
    if title_lowercase.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lowercase.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("Number of data points related to 'Ask HN':",len(ask_posts))
print("Number of data points related to 'Show HN':",len(show_posts))
print("Remaining Posts:",len(other_posts))
Number of data points related to 'Ask HN': 9139
Number of data points related to 'Show HN': 10158
Remaining Posts: 273822
Below are the first five rows in the ask_posts list of lists:
print(ask_posts[:5])
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]
Below are the first five rows in the show_posts list of lists:
print(show_posts[:5])
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'], ['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17'], ['12577142', 'Show HN: Jumble Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']]
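The prefix test used for the filtering can be isolated in a small function, which makes the case-insensitive matching easy to check on its own. This is only a sketch: the function name `classify_title` is hypothetical, and the upper-case title is made up to exercise the lowercasing step.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' using a case-insensitive prefix check."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: What TLD do you use for local development?',
    'SHOW HN: Finding puns computationally',  # upper-case variant, made up for illustration
    'algorithmic music',
]
labels = [classify_title(t) for t in titles]
print(labels)  # ['ask', 'show', 'other']
```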
Next, let's determine whether "ask posts" or "show posts" receive more comments on average.
tot_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    tot_ask_comments += num_comments
avg_ask_comments = tot_ask_comments/len(ask_posts)
print('"Ask posts" avg number of comments:',round(avg_ask_comments,2))
tot_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    tot_show_comments += num_comments
avg_show_comments = tot_show_comments/len(show_posts)
print('"Show posts" avg number of comments:',round(avg_show_comments,2))
"Ask posts" avg number of comments: 10.39
"Show posts" avg number of comments: 4.89
On average, 'Ask HN' posts receive a little more than twice as many comments as 'Show HN' posts (10.39 versus 4.89). Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
For the first step, let us create two dictionaries: 'counts_by_hour', which contains the number of ask posts created during each hour of the day, and 'comments_by_hour', which contains the total number of comments received by the ask posts created in that hour.
import datetime as dt #importing the datetime module
result_list = []
for row in ask_posts:
    time_stamp = row[6]
    num_comments = int(row[4])
    result_list.append([time_stamp, num_comments])
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    time_stamp = row[0]
    num_comments = row[1] #already converted to int above
    time_obj = dt.datetime.strptime(time_stamp,"%m/%d/%Y %H:%M")
    hour = time_obj.hour #"hour" attribute of the datetime object
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
print(counts_by_hour)
print(comments_by_hour)
{2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
{2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 243 * 0 + 2360, 6: 1587, 5: 1838}
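The same per-hour tallies can be sketched with `collections.defaultdict`, which removes the need for the "key already present" branch. The four timestamp and comment pairs are taken from the first four ask posts printed earlier; the variable names `sample`, `counts` and `comments` are assumptions for this illustration.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) for the first four ask posts shown earlier.
sample = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:57', 0],
    ['9/25/2016 22:48', 3],
]

counts = defaultdict(int)    # posts created per hour
comments = defaultdict(int)  # total comments per hour

for stamp, num_comments in sample:
    # strptime parses the string into a datetime; .hour extracts 0-23.
    hour = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {2: 1, 1: 1, 22: 2}
print(dict(comments))  # {2: 7, 1: 3, 22: 3}
```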
For the next step, let us create a list of lists containing the hours during which posts were created and the average number of comments those posts received.
avg_comments_by_hour = []
for key in counts_by_hour:
    element1 = key
    element2 = round(comments_by_hour[key]/counts_by_hour[key], 2) #calculating average & rounding to 2 decimal places
    avg_comments_by_hour.append([element1, element2]) # appending a list of 2 elements
print(avg_comments_by_hour)
[[2, 11.14], [1, 7.41], [22, 8.8], [21, 8.69], [19, 7.16], [17, 9.45], [15, 28.68], [14, 9.69], [13, 16.32], [11, 8.96], [10, 10.68], [9, 6.65], [7, 7.01], [3, 7.95], [23, 6.7], [20, 8.75], [16, 7.71], [8, 9.19], [0, 7.56], [18, 7.94], [12, 12.38], [4, 9.71], [6, 6.78], [5, 8.79]]
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's sort the list of lists and print the five highest values in a format that's easier to read.
swap_avg_comments_by_hour = []
for item in avg_comments_by_hour:
    element1 = item[1]
    element2 = item[0]
    swap_avg_comments_by_hour.append([element1, element2]) #swapping the elements and appending to a new list
sorted_swap = sorted(swap_avg_comments_by_hour, reverse = True)
print("Top 5 Hours for Ask Posts Comments")
for item in sorted_swap[:5]:
    avg_comments = item[0]
    time = str(item[1])
    time_obj = dt.datetime.strptime(time, "%H")
    time_obj += dt.timedelta(hours=9, minutes = 30) #converting US Eastern time to IST; +9:30 is the EDT offset (EST would be +10:30)
    time_format = time_obj.strftime("%H:%M")
    string = f"{time_format} {avg_comments}"
    print(string)
Top 5 Hours for Ask Posts Comments
00:30 28.68
22:30 16.32
21:30 12.38
11:30 11.14
19:30 10.68
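As an alternative to building a swapped copy of the list, `sorted` accepts a `key` function that picks the value to sort on. The sketch below uses six of the [hour, average] pairs from the output above; the name `top_five` is an assumption.

```python
# Six [hour, average comments] pairs copied from the list printed earlier.
avg_comments_by_hour = [
    [2, 11.14], [15, 28.68], [13, 16.32],
    [12, 12.38], [10, 10.68], [9, 6.65],
]

# Sort on the second element (the average), highest first, and keep five.
top_five = sorted(avg_comments_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print(f"{hour:02d}:00 {avg}")
```

One difference from the swapped-list approach: when two hours have the same average, this version keeps them in their original order instead of ordering them by hour.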
From the above analysis, we observe that ask posts created in the hours starting at 00:30, 22:30 and 21:30 IST received the most comments on average, so posting in the late evening IST is most likely to attract comments.
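The 9 hour 30 minute shift can also be expressed with timezone-aware datetimes, which keeps the date correct when the conversion crosses midnight. This is a sketch under two assumptions that the data itself does not confirm: that the timestamps are US Eastern time, and that daylight saving time (UTC-4) is in effect. The fixed offsets and the function name `eastern_daylight_to_ist` are introduced here for illustration.

```python
import datetime as dt

# Fixed offsets (assumed): EDT is UTC-4, IST is UTC+5:30.
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), "IST")


def eastern_daylight_to_ist(stamp):
    """Parse a 'created_at' string as EDT and return the same instant in IST."""
    naive = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")
    return naive.replace(tzinfo=EDT).astimezone(IST)


converted = eastern_daylight_to_ist("9/26/2016 15:00")
print(converted.strftime("%m/%d/%Y %H:%M"))  # 09/27/2016 00:30
```

For timestamps that fall in standard time (EST, UTC-5), the difference to IST would be 10 hours 30 minutes instead.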
From the third block of code, we observe that the "Show HN" and "other" posts (i.e., everything except "Ask HN"), which we separated into the "show_posts" and "other_posts" lists, account for almost 97% of all the posts in our data set. Let us now explore two different attributes, "url" and "num_points" (upvotes), and see what impact the former has on the latter.
show_urls = []
show_blanks = 0
for row in show_posts:
    url = row[2]
    if not url:
        url = 'unknown' #replacing blanks with 'unknown'
        row[2] = url
        show_blanks += 1
    show_urls.append(url)
print('Number of posts without "URL" in the "show_posts" list:',show_blanks)
print('Number of posts with "URL" in the "show_posts" list:',(len(show_urls) - show_blanks))
Number of posts without "URL" in the "show_posts" list: 371
Number of posts with "URL" in the "show_posts" list: 9787
other_urls = []
other_blanks = 0
for row in other_posts:
    url = row[2]
    if not url:
        url = 'unknown'
        row[2] = url
        other_blanks += 1
    other_urls.append(url)
print('Number of posts without "URL" in the "other_posts" list:',other_blanks)
print('Number of posts with "URL" in the "other_posts" list:',(len(other_urls) - other_blanks))
Number of posts without "URL" in the "other_posts" list: 4409
Number of posts with "URL" in the "other_posts" list: 269413
In the above two blocks of code, we replace every blank under the 'url' attribute with the string 'unknown'. Because the rows are modified in place, the blank counts are only correct the first time the blocks are run: the variables "show_blanks" and "other_blanks" do not affect the rest of the code, but if the blocks are run again, both will be zero.
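One way to make the count repeatable is to count blanks without modifying the rows. The sketch below assumes a hypothetical helper `count_blank_urls`; the three sample rows are copied from the outputs shown earlier. Note that the later blocks compare against 'unknown', so they would need to test for an empty string instead if this approach were adopted.

```python
def count_blank_urls(rows, url_index=2):
    """Count rows whose URL field is empty, leaving the rows untouched."""
    return sum(1 for row in rows if not row[url_index])


sample_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'],
    ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'],
]

first_run = count_blank_urls(sample_rows)
second_run = count_blank_urls(sample_rows)  # same answer on a repeat run
print(first_run, second_run)  # 1 1
```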
For the next step, let us calculate the total and the average number of points received by "Show HN" posts with and without a URL attached.
tot_show_points = 0
tot_show_points_unknown = 0
show_url_count = 0
show_url_unknown_count = 0
for row in show_posts:
    url = row[2]
    num_points = int(row[3])
    if url == 'unknown':
        tot_show_points_unknown += num_points
        show_url_unknown_count += 1
    else:
        tot_show_points += num_points
        show_url_count += 1
print("Total number of points for posts without URL:", tot_show_points_unknown)
print("Total number of points for posts with URL:",tot_show_points, "\n")
print("Average number of points for posts without URL:", round(tot_show_points_unknown/show_url_unknown_count, 2))
print("Average number of points for posts with URL:",round(tot_show_points/show_url_count, 2))
Total number of points for posts without URL: 1855
Total number of points for posts with URL: 148926 

Average number of points for posts without URL: 5.0
Average number of points for posts with URL: 15.22
Finally, let us calculate the total and average number of points received by "other" posts with and without a URL attached.
tot_other_points = 0
tot_other_points_unknown = 0
other_url_count = 0
other_url_unknown_count = 0
for row in other_posts:
    url = row[2]
    num_points = int(row[3])
    if url == 'unknown':
        tot_other_points_unknown += num_points
        other_url_unknown_count += 1
    else:
        tot_other_points += num_points
        other_url_count += 1
print("Total number of points for posts without URL:", tot_other_points_unknown)
print("Total number of points for posts with URL:",tot_other_points, "\n")
print("Average number of points for posts without URL:", round(tot_other_points_unknown/other_url_unknown_count, 2))
print("Average number of points for posts with URL:",round(tot_other_points/other_url_count, 2))
Total number of points for posts without URL: 30391
Total number of points for posts with URL: 4119658 

Average number of points for posts without URL: 6.89
Average number of points for posts with URL: 15.29
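We observe a large difference in both the total and the average number of points between posts with and without a URL. The totals mostly reflect how many more posts carry a URL, so the averages are the fairer comparison: posts with a URL average a little more than twice as many points in the case of "other" posts (15.29 versus 6.89) and about three times as many in the case of "Show HN" posts (15.22 versus 5.0).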
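The two nearly identical blocks for "Show HN" and "other" posts could be folded into one function. The following is a sketch only: the name `average_points_by_url` and its parameters are assumptions, and the five sample rows are copied from the outputs shown earlier, so the averages it prints describe the sample, not the full data set.

```python
def average_points_by_url(rows, url_index=2, points_index=3):
    """Return (average without URL, average with URL); None for an empty group."""
    totals = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for row in rows:
        url = row[url_index]
        # Treat both a blank field and the 'unknown' placeholder as "no URL".
        has_url = bool(url) and url != 'unknown'
        totals[has_url] += int(row[points_index])
        counts[has_url] += 1

    def mean(flag):
        return round(totals[flag] / counts[flag], 2) if counts[flag] else None

    return mean(False), mean(True)


sample_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'],
    ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'],
    ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'],
]

result = average_points_by_url(sample_rows)
print(result)  # (5.0, 1.33)
```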