Notebook

Hacker News Data Science Project¶

In this project, we'll work with a data set of submissions to popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the dataset Here

In [1]:

from csv import reader
open_file = open("DataQuest\HackerNews.csv",encoding="utf8")
read_file = reader(open_file)
data_file= list(read_file)
hn = data_file[1:]
header = data_file[0]

In [2]:

print("This is the header of the row:")
print(header)
print("\n")


print("First five rows of the dataset:")
print(hn[1:6])

This is the header of the row:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


First five rows of the dataset:
[['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']]

Since we are interested only in post that begins with either "Ask HN" or "Show HN", we will filter out those rows where the post title begins with those two keywords.

In [3]:

# we will begin with creating three empty list
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title = row[1]
    
    if title.startswith("Ask HN"):
        ask_posts.append(row)
        
    elif title.startswith("Show HN"):
        show_posts.append(row)
        
    else:
        other_posts.append(row)
        
print("Total number of posts starting with \"Ask HN\" is {:,}".format(len(ask_posts)))
print("Total number of posts starting with\"Show HN\" is {:,}".format(len(show_posts)))
print("Total number of other post\"Other HN\" is {:,}".format(len(other_posts)))

Total number of posts starting with "Ask HN" is 9,122
Total number of posts starting with"Show HN" is 10,150
Total number of other post"Other HN" is 273,847

We created three lists:

Posts starting with "Ask HN"
Posts starting with "Show HN"
All other posts

In the next task, we will find out average numbers of comments in lists - Ask HN and Show HN

In [4]:

total_ask_comments = 0
for row in ask_posts:
    comments_count = int(row[4])
    total_ask_comments += comments_count

average_ask_comments = round(total_ask_comments/len(ask_posts),2)
print("Average numbers of comments/post in ask_posts list is {}".format(average_ask_comments))

total_show_comments = 0
for row in show_posts:
    comments_count = int(row[4])
    total_show_comments += comments_count

average_show_comments = round(total_show_comments/len(show_posts),2)
print("Average numbers of comments/post in show_posts list is {}".format(average_show_comments))

Average numbers of comments/post in ask_posts list is 10.41
Average numbers of comments/post in show_posts list is 4.89

Clearly, Ask posts receive more comments on an avearge (10.41) compared to show posts (4.89). Hence, we shall focus our analysis only on Ask posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.

In [5]:

"""As we see clearly, data time stored in ask_posts list is in string class, to work with data and time, we need to first
convert it into datatime objects, so that working with data and times becomes easier."""

print(ask_posts[0][-1])
print(type(ask_posts[0][-1]))

9/26/2016 2:53
<class 'str'>

In [42]:

# Changed the datetime format to "mm/dd/yyyy hh:mm"

for each in ask_posts:
    date_time = each[6]
    date_time = date_time.replace("-","/")
    each[6] = date_time

In [53]:

# We will first import datetime module, so that we can use all classes available within the module.

import datetime as dt  # using alias "as" to shorten the name of the module. 

# Next we will iterate through the loop and will parse each datetime value stored in list into datetime class to..
#.. instantiate an object- datetime object

result_list = []
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in ask_posts:
    result_list.append([row[6],int(row[4])])
    
for element in result_list:
    date_time = element[0]
    num_comments = element[1]
    datetime_ob = dt.datetime.strptime(date_time, date_format)
    hour = datetime_ob.strftime("%H")   # pick only hour in string format from datetime object using strftime- its a method.
    
    # use hour as a key to enter the data in the dictionaries
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1         # this is no. of post
        comments_by_hour[hour] = num_comments      # this is no. of comments 
    else:
        counts_by_hour[hour] +=1
        comments_by_hour[hour] += num_comments

we created two dictionaries:

counts_by_hour: contains the number of ask posts created during each hour of the day.
comments_by_hour: contains the corresponding number of comments ask posts created at each hour received.

In [81]:

hours_avg_comments=[]

for hour, comments in comments_by_hour.items():
    comments_total = comments
    posts_total = counts_by_hour[hour]
    average_comments_per_post = round((comments_total/posts_total),2)
    hours_avg_comments.append((average_comments_per_post, hour))   # created a tuple for each pair of avg comments & hour
      

# in descending order - use sort
hours_avg_comments.sort(reverse=True)

for each in hours_avg_comments:
    print(each[1],":",each[0])

We can also create above result in this way.

In [90]:

avg_by_hour =[]

for hour in comments_by_hour:
    posts = counts_by_hour[hour]
    comments = comments_by_hour[hour]
    average = (comments/posts)
    avg_by_hour.append([hour,average])
    
print(avg_by_hour)
    

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.819371727748692], ['21', 8.720930232558139], ['19', 7.176043557168784], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.70703125], ['13', 16.350678733031675], ['11', 9.012903225806452], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.04], ['03', 7.974074074074074], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.717993079584775], ['08', 9.190661478599221], ['00', 7.575250836120401], ['18', 7.954248366013072], ['12', 12.380116959064328], ['04', 9.743801652892563], ['06', 6.782051282051282], ['05', 8.794258373205741]]

In [100]:

swap_avg_by_hour = []

for each in avg_by_hour:
    swap_avg_by_hour.append([each[1],each[0]])
                             
#print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
#print(sorted_swap)

for each in sorted_swap[0:5]:
    date_ob = dt.datetime.strptime(each[1],"%H")
    date_format = date_ob.strftime("%H:%M")
    
    print("{hour}: {comments:.2f} average comments per post".format(hour=date_format,comments=each[0]))
    

15:00: 28.68 average comments per post
13:00: 16.35 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post

From our above findings, we can say that top five hours during which the posts receive highest numbers of average comments are "15:00", "13:00", "12:00", "02:00", "10:00". Hour is in 24 hour format.

In [ ]: