Exploring Hacker News Posts

In this project we'll work with a dataset of submissions to the popular technology site Hacker News. The aim is to determine which type of post (Ask HN or Show HN) receives more comments on average, and at what hour of the day ask posts attract the most comments.

In [80]:

# Opening the hacker_news.csv file
opened_file = open('hacker_news.csv')

# Reading the hacker_news.csv file in as a list of lists
from csv import reader
read_file = reader(opened_file)
hn = list(read_file)

# Displaying the first five rows of hn
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
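The cell above leaves the file handle open. Below is a minimal sketch of the same load using a `with` block, which closes the file automatically. So that it runs on its own, it first writes a tiny stand-in CSV; the file name `sample_hn.csv`, the helper name `load_rows`, and the `sample_*` variables are illustrative assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile

def load_rows(path):
    """Read a CSV file into a list of lists, closing the file afterwards."""
    # newline='' is the setting the csv module documentation recommends
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.reader(f))

# Illustrative stand-in for hacker_news.csv (header plus one real row)
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_hn.csv')
with open(sample_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'title', 'url', 'num_points',
                     'num_comments', 'author', 'created_at'])
    writer.writerow(['12224879', 'Interactive Dynamic Video',
                     'http://www.interactivedynamicvideo.com/',
                     '386', '52', 'ne0phyte', '8/4/2016 11:52'])

sample_rows = load_rows(sample_path)
# Split the header row from the data rows in one step
sample_headers, sample_data = sample_rows[0], sample_rows[1:]
print(sample_headers)
print(sample_data)
```

Pointing `load_rows` at `hacker_news.csv` should give the same list of lists as the cell above.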

In [81]:

# Extracting the first row of data (the column names)
headers = hn[0]

# Remove the first row from hn
hn = hn[1:]

# Display headers
print(headers)

# Display the first five rows of hn to verify that
# the headers row is properly removed
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

In [82]:

# Creating three empty lists
ask_posts = []
show_posts = []
other_posts = []

# iterating through hn
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Checking the number of posts 
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print(ask_posts[:5])

1744
1162
17194
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
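The prefix test in the loop above can also be pulled out into a small function, which makes it easy to check on individual titles. This is only a sketch; the name `classify_post` is hypothetical and does not appear in the original notebook.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    # Lowercase once so the comparison is case-insensitive
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_post('Ask HN: How to improve my personal website?'))  # ask
print(classify_post('Interactive Dynamic Video'))                    # other
```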

In [84]:

# Finding the total number of comments in ask posts
total_ask_comments = 0

# Iterating over the ask posts
for row in ask_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
    
# Computing the average number of comments in ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661

In [85]:

# Finding the total number of comments in show posts
total_show_comments = 0
for row in show_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments

# Computing the average number of comments in show posts 
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993

Ask posts received more comments on average than show posts (about 14.04 versus 10.32 per post), as the results above show.
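The two averaging cells differ only in the list they loop over, so the calculation could be shared. A minimal sketch, assuming a hypothetical helper named `average_comments` and using two rows copied from the `ask_posts` output shown earlier:

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two rows copied from the ask_posts output earlier in the notebook
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
]
print(average_comments(sample_ask))  # (6 + 29) / 2 = 17.5
```

Calling it with the full `ask_posts` and `show_posts` lists should reproduce the two averages printed above.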

In [86]:

# Import datetime as dt
import datetime as dt
result_list = []

# Iterating over ask_posts
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# Create two empty dictionaries:
# counts_by_hour holds the number of ask posts created in each hour,
# comments_by_hour holds the total comments those posts received
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    # Parse the timestamp and keep only the zero-padded hour, e.g. '15'
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

comments_by_hour

Out[86]:

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
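The hour-bucketing step above can be written without the `if`/`else` branch by using `collections.defaultdict`. This is an alternative sketch, not the notebook's own method; the function name `tally_by_hour` and the sample pairs are illustrative (two timestamps come from the `ask_posts` output, the third is made up to show two posts landing in the same hour).

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

def tally_by_hour(pairs):
    """Return (counts_by_hour, comments_by_hour) for [created_at, num_comments] pairs."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # Keep only the zero-padded hour of the parsed timestamp, e.g. '09'
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

sample_pairs = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 9:14', 1]]
sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)    # {'09': 2, '13': 1}
print(sample_comments)  # {'09': 7, '13': 29}
```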

Next, we calculate the average number of comments per post for posts created during each hour of the day.

In [87]:

# Creating a list of lists containing the hours during which the
# posts were created and the average number of comments.
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour   

Out[87]:

[['16', 16.796296296296298],
 ['05', 10.08695652173913],
 ['15', 38.5948275862069],
 ['20', 21.525],
 ['04', 7.170212765957447],
 ['23', 7.985294117647059],
 ['06', 9.022727272727273],
 ['09', 5.5777777777777775],
 ['01', 11.383333333333333],
 ['10', 13.440677966101696],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['03', 7.796296296296297],
 ['07', 7.852941176470588],
 ['13', 14.741176470588234],
 ['14', 13.233644859813085],
 ['18', 13.20183486238532],
 ['11', 11.051724137931034],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['21', 16.009174311926607],
 ['02', 23.810344827586206],
 ['00', 8.127272727272727],
 ['19', 10.8]]

Sorting the list of lists and printing the highest values

In [90]:

# Creating a list that equals avg_by_hour with swapped columns
swap_avg_by_hour = []

# Iterating over the rows of avg_by_hour
for row in avg_by_hour:
    first_elemt = row[1]
    second_elemt = row[0]
    swap_avg_by_hour.append([first_elemt, second_elemt])
    
print(swap_avg_by_hour) 

# sorting swap_avg_by_hour in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[16.796296296296298, '16'], [10.08695652173913, '05'], [38.5948275862069, '15'], [21.525, '20'], [7.170212765957447, '04'], [7.985294117647059, '23'], [9.022727272727273, '06'], [5.5777777777777775, '09'], [11.383333333333333, '01'], [13.440677966101696, '10'], [9.41095890410959, '12'], [11.46, '17'], [7.796296296296297, '03'], [7.852941176470588, '07'], [14.741176470588234, '13'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.051724137931034, '11'], [6.746478873239437, '22'], [10.25, '08'], [16.009174311926607, '21'], [23.810344827586206, '02'], [8.127272727272727, '00'], [10.8, '19']]

Out[90]:

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
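Swapping the columns is one way to sort by the average. Another option, sketched here on three rows copied from the `avg_by_hour` output, is to pass a `key` function to `sorted` so the original `[hour, average]` layout is kept. The names `sample_avg_by_hour` and `sorted_by_avg` are illustrative.

```python
# Three rows copied from the avg_by_hour output above
sample_avg_by_hour = [
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort by the average (index 1), highest first; the hour stays in column 0
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)
print([hour for hour, _ in sorted_by_avg])  # ['15', '02', '16']
```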

Printing the top 5 hours for ask post comments

In [91]:

# Print the string "Top 5 Hours for Ask Posts Comments"
print("Top 5 Hours for Ask Posts Comments")

# Iterating through each average and each hour
for avg,hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

The best hour to create an ask post is 15:00, which has the highest average number of comments per post of any hour (38.59). The time zone used is Eastern Time in the US, so 15:00 is 3:00 pm EST.
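Readers in other time zones may want to convert these hours. The sketch below turns an Eastern hour label into UTC. It assumes a fixed UTC-5 offset (EST) and ignores daylight saving time, so it is only approximate for posts made during the summer months; the helper name `est_hour_to_utc` is hypothetical.

```python
import datetime as dt

# Assumption: a fixed UTC-5 offset (EST); daylight saving time is ignored
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")

def est_hour_to_utc(hour_str):
    """Convert an 'HH' hour label in EST to an 'HH:MM' string in UTC."""
    # The date is arbitrary; only the clock time matters for a fixed offset
    est_time = dt.datetime(2016, 1, 15, int(hour_str), 0, tzinfo=EST)
    return est_time.astimezone(dt.timezone.utc).strftime("%H:%M")

print(est_hour_to_utc("15"))  # 20:00
```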

Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post receives more comments on average, and at what time. Based on our analysis, ask posts receive more comments on average than show posts, and ask posts created at 3:00 pm EST receive the most comments on average.
