In this project, we'll work with a data set of submissions to popular technology site Hacker News.
Hacker News is a site started by the startup incubator Y Combinator
, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
You can find the dataset Here
from csv import reader
open_file = open("DataQuest\HackerNews.csv",encoding="utf8")
read_file = reader(open_file)
data_file= list(read_file)
hn = data_file[1:]
header = data_file[0]
print("This is the header of the row:")
print(header)
print("\n")
print("First five rows of the dataset:")
print(hn[1:6])
This is the header of the row: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] First five rows of the dataset: [['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']]
Since we are interested only in post that begins with either "Ask HN"
or "Show HN"
, we will filter out those rows where the post title begins with those two keywords.
# we will begin with creating three empty list
ask_posts=[]
show_posts=[]
other_posts=[]
for row in hn:
title = row[1]
if title.startswith("Ask HN"):
ask_posts.append(row)
elif title.startswith("Show HN"):
show_posts.append(row)
else:
other_posts.append(row)
print("Total number of posts starting with \"Ask HN\" is {:,}".format(len(ask_posts)))
print("Total number of posts starting with\"Show HN\" is {:,}".format(len(show_posts)))
print("Total number of other post\"Other HN\" is {:,}".format(len(other_posts)))
Total number of posts starting with "Ask HN" is 9,122 Total number of posts starting with"Show HN" is 10,150 Total number of other post"Other HN" is 273,847
We created three lists:
In the next task, we will find out average numbers of comments in lists - Ask HN
and Show HN
total_ask_comments = 0
for row in ask_posts:
comments_count = int(row[4])
total_ask_comments += comments_count
average_ask_comments = round(total_ask_comments/len(ask_posts),2)
print("Average numbers of comments/post in ask_posts list is {}".format(average_ask_comments))
total_show_comments = 0
for row in show_posts:
comments_count = int(row[4])
total_show_comments += comments_count
average_show_comments = round(total_show_comments/len(show_posts),2)
print("Average numbers of comments/post in show_posts list is {}".format(average_show_comments))
Average numbers of comments/post in ask_posts list is 10.41 Average numbers of comments/post in show_posts list is 4.89
Clearly, Ask posts receive more comments on an avearge (10.41) compared to show posts (4.89). Hence, we shall focus our analysis only on Ask posts.
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
"""As we see clearly, data time stored in ask_posts list is in string class, to work with data and time, we need to first
convert it into datatime objects, so that working with data and times becomes easier."""
print(ask_posts[0][-1])
print(type(ask_posts[0][-1]))
9/26/2016 2:53 <class 'str'>
# Changed the datetime format to "mm/dd/yyyy hh:mm"
for each in ask_posts:
date_time = each[6]
date_time = date_time.replace("-","/")
each[6] = date_time
# We will first import datetime module, so that we can use all classes available within the module.
import datetime as dt # using alias "as" to shorten the name of the module.
# Next we will iterate through the loop and will parse each datetime value stored in list into datetime class to..
#.. instantiate an object- datetime object
result_list = []
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in ask_posts:
result_list.append([row[6],int(row[4])])
for element in result_list:
date_time = element[0]
num_comments = element[1]
datetime_ob = dt.datetime.strptime(date_time, date_format)
hour = datetime_ob.strftime("%H") # pick only hour in string format from datetime object using strftime- its a method.
# use hour as a key to enter the data in the dictionaries
if hour not in counts_by_hour:
counts_by_hour[hour] = 1 # this is no. of post
comments_by_hour[hour] = num_comments # this is no. of comments
else:
counts_by_hour[hour] +=1
comments_by_hour[hour] += num_comments
we created two dictionaries:
counts_by_hour
: contains the number of ask posts created during each hour of the day.comments_by_hour
: contains the corresponding number of comments ask posts created at each hour received.hours_avg_comments=[]
for hour, comments in comments_by_hour.items():
comments_total = comments
posts_total = counts_by_hour[hour]
average_comments_per_post = round((comments_total/posts_total),2)
hours_avg_comments.append((average_comments_per_post, hour)) # created a tuple for each pair of avg comments & hour
# in descending order - use sort
hours_avg_comments.sort(reverse=True)
for each in hours_avg_comments:
print(each[1],":",each[0])
15 : 28.68 13 : 16.35 12 : 12.38 02 : 11.14 10 : 10.68 04 : 9.74 14 : 9.71 17 : 9.45 08 : 9.19 11 : 9.01 22 : 8.82 05 : 8.79 20 : 8.75 21 : 8.72 03 : 7.97 18 : 7.95 16 : 7.72 00 : 7.58 01 : 7.41 19 : 7.18 07 : 7.04 06 : 6.78 23 : 6.7 09 : 6.65
We can also create above result in this way.
avg_by_hour =[]
for hour in comments_by_hour:
posts = counts_by_hour[hour]
comments = comments_by_hour[hour]
average = (comments/posts)
avg_by_hour.append([hour,average])
print(avg_by_hour)
[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.819371727748692], ['21', 8.720930232558139], ['19', 7.176043557168784], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.70703125], ['13', 16.350678733031675], ['11', 9.012903225806452], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.04], ['03', 7.974074074074074], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.717993079584775], ['08', 9.190661478599221], ['00', 7.575250836120401], ['18', 7.954248366013072], ['12', 12.380116959064328], ['04', 9.743801652892563], ['06', 6.782051282051282], ['05', 8.794258373205741]]
swap_avg_by_hour = []
for each in avg_by_hour:
swap_avg_by_hour.append([each[1],each[0]])
#print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
#print(sorted_swap)
for each in sorted_swap[0:5]:
date_ob = dt.datetime.strptime(each[1],"%H")
date_format = date_ob.strftime("%H:%M")
print("{hour}: {comments:.2f} average comments per post".format(hour=date_format,comments=each[0]))
15:00: 28.68 average comments per post 13:00: 16.35 average comments per post 12:00: 12.38 average comments per post 02:00: 11.14 average comments per post 10:00: 10.68 average comments per post
From our above findings, we can say that top five hours during which the posts receive highest numbers of average comments are "15:00", "13:00", "12:00", "02:00", "10:00"
. Hour is in 24 hour format.