
Exploring Hacker News Posts

Introduction

Hacker News is a popular site where users can submit posts, comment on them, and vote.
We are interested in two types of posts: Ask HN and Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, while Show HN posts show the community a project, a product, or just something interesting.

Questions:
1- Do Ask HN or Show HN posts receive more comments on average?
2- Do posts created at a certain time receive more comments on average?

The source of the data set is here

In [1]:

#import the required function
from csv import reader
#open the file; the with block closes it once reading is done
with open('hacker_news.csv') as opened_file:
    #read the file
    read_file = reader(opened_file)
    #convert the csv file into a list of lists
    hn = list(read_file)
#explore the first 5 rows
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]

In [2]:

#extract the first row of data(headers)
headers = hn[0]
#remove the headers from data set hn 
hn = hn[1:]
print('Headers',headers)
#explore the first 5 rows without the header
print('First 5 Rows in The Data Set\n',hn[:5])

Headers ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
First 5 Rows in The Data Set
 [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

We are only concerned with posts whose titles begin with Ask HN or Show HN, so we will separate them into their own lists of lists.
Because the startswith() method is case-sensitive, we first use the lower() method to get a lowercase version of each title, so that every capitalization is matched.
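
The effect of case sensitivity can be checked in isolation. The following is a minimal sketch, not part of the analysis itself: the helper name classify_title and the all-lowercase title are hypothetical and added only for illustration, while the other titles come from the sample rows above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# startswith() is case-sensitive, so the raw title does not match 'ask hn'
raw_match = 'Ask HN: How to improve my personal website?'.startswith('ask hn')
print(raw_match)

titles = [
    'Ask HN: How to improve my personal website?',
    'show hn: a hypothetical all-lowercase title',  # made up for illustration
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in titles]
print(labels)
```

Lowercasing once before the checks keeps the comparison strings simple and avoids listing every capitalization by hand.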

In [3]:

#lists for posts based on their title
ask_posts = []
show_posts = []
other_posts = []

#loop through each row in hn
for row in hn:
    #assign the title of the post to the variable title
    title = row[1]
    #convert the title to lowercase
    title = title.lower()
    
    #check if the title starts with 'ask hn'
    if title.startswith('ask hn'):
        #add the whole row to the list
        ask_posts.append(row)
    #check if the title starts with 'show hn'
    elif title.startswith('show hn'):
        #add the whole row to the list
        show_posts.append(row)
    else:
        #everything else goes to other_posts
        other_posts.append(row)
        
#Check the number of posts in each list       
print('The number of posts with Ask hn title is :',len(ask_posts)) 
print('The number of posts with Show hn title is :',len(show_posts))
print('The number of posts with other title is :',len(other_posts))        
        

The number of posts with Ask hn title is : 1744
The number of posts with Show hn title is : 1162
The number of posts with other title is : 17194

In [4]:

#explore the first 5 rows of ask_posts list
ask_posts[:5]

Out[4]:

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

In [5]:

#explore the first 5 rows of show_posts list
show_posts[:5]

Out[5]:

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

In [6]:

#explore the first 5 rows of other_posts list
other_posts[:5]

Out[6]:

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

Data Analysis

Calculate The Average Number of Comments

In this step, we calculate the average number of comments for Ask HN and Show HN posts, to find out which post type gets the highest average.

In [7]:

#set the running total to 0
total_ask_comments = 0

#loop through the list
for row in ask_posts:
    #read the number of comments and convert it from string to integer
    num_comments = int(row[4])
    #add the value to the running total
    total_ask_comments += num_comments

#calculate the average
avg_ask_comments = total_ask_comments / len(ask_posts)
print('The average number of comments on ask posts is :{:.2f}'.format(avg_ask_comments))

The average number of comments on ask posts is :14.04

In [8]:

#set the running total to 0
total_show_comments = 0

#loop through the list
for row in show_posts:
    #read the number of comments and convert it from string to integer
    num_comments = int(row[4])
    #add the value to the running total
    total_show_comments += num_comments

#calculate the average
avg_show_comments = total_show_comments / len(show_posts)
print('The average number of comments on show posts is :{:.2f}'.format(avg_show_comments))

The average number of comments on show posts is :10.32

As we can see, Ask HN posts receive more comments on average (about 14.04 per post) than Show HN posts (about 10.32 per post).
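
Since the two cells above differ only in the list they loop over, the calculation could be wrapped in one function. This is a minimal sketch under a few assumptions: the name average_comments is hypothetical, the list is assumed to be non-empty, and the two sample lists hold only the five example rows printed earlier (trimmed to id, title, url, points, comments), so their averages are not the full-data results.

```python
def average_comments(posts, comments_index=4):
    """Average of the comment counts stored as strings at comments_index."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# comment counts taken from the first five ask_posts and show_posts rows above
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3'],
    ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17'],
]
sample_show = [
    ['10627194', 'Show HN: Wio Link', 'https://iot.seeed.cc', '26', '22'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102'],
    ['11590768', 'Show HN: Shanhu.io', 'https://shanhu.io', '1', '1'],
    ['12178806', 'Show HN: Webscope', 'http://webscopeapp.com', '3', '3'],
    ['10872799', 'Show HN: GeoScreenshot', 'https://www.geoscreenshot.com/', '1', '9'],
]

sample_ask_avg = average_comments(sample_ask)
sample_show_avg = average_comments(sample_show)
print('{:.2f}'.format(sample_ask_avg))
print('{:.2f}'.format(sample_show_avg))
```

On the full data set the same function would be called as average_comments(ask_posts) and average_comments(show_posts).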

Ask HN Posts Analysis

Since Ask HN posts receive more comments on average, we will focus the rest of the analysis on them to understand more about how members interact on this website.

Finding The Number of Ask Posts and Comments by Hour Created

We will find the number of Ask posts and comments by hour created in two steps.
1- Create a list of lists with two elements: the creation time and the number of comments for the post.
2- Create two dictionaries: one to keep track of the number of posts created in each hour, and one to keep track of the number of comments received by posts created in that hour.

In [9]:

#required package
import datetime as dt

#a list of lists with 2 elements time and the number of comments for the post
result_list = []

#loop
for row  in ask_posts:
    #get time 
    post_time = row[6]
    #get the number of comments
    num_comments = int(row[4])
    #add time and the number of comments as list to the result_list 
    result_list.append([post_time,num_comments])

#two dics to find amount of Ask posts and Comments by Hour Created     
#keep track of the number of posts created by hour  
counts_by_hour = {}
#keep track of the number of comments created by hour
comments_by_hour = {}

#loop 
for item in result_list:
    #get the date
    date = item[0]
    #get datetime object
    date = dt.datetime.strptime(date,'%m/%d/%Y %H:%M')
    #get the hour
    hour = date.hour
    
    #if hour is already a key in the dic
    if hour in counts_by_hour:
        #Increment the value by 1
        counts_by_hour[hour] += 1
        #Increment the value by comment number
        comments_by_hour[hour] += item[1]
    #if not      
    else:
        #set the value to 1
        counts_by_hour[hour] = 1
        #set the value to the comment number
        comments_by_hour[hour] = item[1]        
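
The same tallies can also be built with collections.defaultdict, which removes the if/else branch. The following is a minimal sketch rather than a replacement for the cell above: the names sample_rows, sample_counts, and sample_comments are hypothetical, the first five rows reuse the times and comment counts from the ask_posts sample, and the last row is made up to show two posts falling in the same hour.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs
sample_rows = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['8/17/2016 9:10', 4],  # hypothetical row, only to show accumulation
]

# missing keys start at 0, so no membership check is needed
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample_rows:
    # parse the timestamp and keep only the hour of day
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```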

In [10]:

#explore the number of posts created by hour
counts_by_hour

Out[10]:

{0: 55,
 1: 60,
 2: 58,
 3: 54,
 4: 47,
 5: 46,
 6: 44,
 7: 34,
 8: 48,
 9: 45,
 10: 59,
 11: 58,
 12: 73,
 13: 85,
 14: 107,
 15: 116,
 16: 108,
 17: 100,
 18: 109,
 19: 110,
 20: 80,
 21: 109,
 22: 71,
 23: 68}

In [11]:

#explore the number of comments created by hour
comments_by_hour

Out[11]:

{0: 447,
 1: 683,
 2: 1381,
 3: 421,
 4: 337,
 5: 464,
 6: 397,
 7: 267,
 8: 492,
 9: 251,
 10: 793,
 11: 641,
 12: 687,
 13: 1253,
 14: 1416,
 15: 4477,
 16: 1814,
 17: 1146,
 18: 1439,
 19: 1188,
 20: 1722,
 21: 1745,
 22: 479,
 23: 543}

Calculating The Average Number of Comments for Ask HN Posts by Hour

In this step, we will calculate the average number of comments for posts created during each individual hour. For each hour we will take the total comments and divide it by the total posts from the dictionaries we have created.

In [12]:

#list of lists ,2 elements : hour & avg
avg_by_hour = []

#nested loop through the 2 dics
for comment_h in comments_by_hour:
    for post_h in counts_by_hour:
        #if hour is equal
        if comment_h == post_h:
            #calculate the avg
            avg = comments_by_hour[comment_h]/counts_by_hour[post_h]
            #add the hour & avg to the list
            avg_by_hour.append([post_h,avg])
            
#display the results        
print('The Average Number of Comments per Post in Each Hour')

#loop
for h,avg in avg_by_hour:
    #print hour & average
    print(h,":",avg)

The Average Number of Comments per Post in Each Hour
0 : 8.127272727272727
1 : 11.383333333333333
2 : 23.810344827586206
3 : 7.796296296296297
4 : 7.170212765957447
5 : 10.08695652173913
6 : 9.022727272727273
7 : 7.852941176470588
8 : 10.25
9 : 5.5777777777777775
10 : 13.440677966101696
11 : 11.051724137931034
12 : 9.41095890410959
13 : 14.741176470588234
14 : 13.233644859813085
15 : 38.5948275862069
16 : 16.796296296296298
17 : 11.46
18 : 13.20183486238532
19 : 10.8
20 : 21.525
21 : 16.009174311926607
22 : 6.746478873239437
23 : 7.985294117647059
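
The nested loop in the cell above compares every pair of hours. Because both dictionaries share the same keys, a single pass that looks up each hour directly should give the same averages. A minimal sketch, using three of the per-hour totals shown earlier; the sample_ names are hypothetical:

```python
# three of the per-hour totals shown above, keyed by hour
sample_comments_by_hour = {15: 4477, 2: 1381, 20: 1722}
sample_counts_by_hour = {15: 116, 2: 58, 20: 80}

# one lookup per hour instead of comparing every pair of keys
sample_avg_by_hour = [
    [hour, sample_comments_by_hour[hour] / sample_counts_by_hour[hour]]
    for hour in sample_comments_by_hour
]

for hour, avg in sample_avg_by_hour:
    print(hour, ':', round(avg, 2))
```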

Display The Average in Descending Order

To make the results easier to read and the hour with the highest average easier to spot, we will sort the list by average in descending order.
By default, the sorted() function compares the inner lists element by element, starting with the first, so we swap the columns of avg_by_hour to make the average the first element.
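
An alternative that avoids the swap is the key argument of sorted(), which tells it which element to compare. A minimal sketch on four of the [hour, average] pairs from the output above; the names sample_pairs and sorted_by_avg are hypothetical:

```python
# [hour, average] pairs copied from the output above
sample_pairs = [
    [15, 38.5948275862069],
    [9, 5.5777777777777775],
    [2, 23.810344827586206],
    [20, 21.525],
]

# sort on the second element (the average), highest first
sorted_by_avg = sorted(sample_pairs, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(hour, ': {:.2f}'.format(avg))
```

The pairs keep their original [hour, average] layout, so later code does not need to remember that the columns were swapped.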

In [13]:

#list has swapped columns of avg_by_hour list
swap_avg_by_hour = []

#loop
for item in avg_by_hour:
    #add the swapped columns to the list 
    swap_avg_by_hour.append([item[1],item[0]])
    
#explore    
swap_avg_by_hour   

Out[13]:

[[8.127272727272727, 0],
 [11.383333333333333, 1],
 [23.810344827586206, 2],
 [7.796296296296297, 3],
 [7.170212765957447, 4],
 [10.08695652173913, 5],
 [9.022727272727273, 6],
 [7.852941176470588, 7],
 [10.25, 8],
 [5.5777777777777775, 9],
 [13.440677966101696, 10],
 [11.051724137931034, 11],
 [9.41095890410959, 12],
 [14.741176470588234, 13],
 [13.233644859813085, 14],
 [38.5948275862069, 15],
 [16.796296296296298, 16],
 [11.46, 17],
 [13.20183486238532, 18],
 [10.8, 19],
 [21.525, 20],
 [16.009174311926607, 21],
 [6.746478873239437, 22],
 [7.985294117647059, 23]]

In [14]:

#sort the list based on the average in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
#explore
sorted_swap

Out[14]:

[[38.5948275862069, 15],
 [23.810344827586206, 2],
 [21.525, 20],
 [16.796296296296298, 16],
 [16.009174311926607, 21],
 [14.741176470588234, 13],
 [13.440677966101696, 10],
 [13.233644859813085, 14],
 [13.20183486238532, 18],
 [11.46, 17],
 [11.383333333333333, 1],
 [11.051724137931034, 11],
 [10.8, 19],
 [10.25, 8],
 [10.08695652173913, 5],
 [9.41095890410959, 12],
 [9.022727272727273, 6],
 [8.127272727272727, 0],
 [7.985294117647059, 23],
 [7.852941176470588, 7],
 [7.796296296296297, 3],
 [7.170212765957447, 4],
 [6.746478873239437, 22],
 [5.5777777777777775, 9]]

As we can see, the highest average is 38.59 comments per post, at 15:00. To display the hours in this HH:MM format, we will use the datetime module.

In [15]:

print('Top 5 Hours for Ask Posts Comments')

#loop
for item in sorted_swap[:5]:
    #get hour as string
    hour = str(item[1])
    #get the average
    avg = item[0]
    
    #create a datetime object
    hour = dt.datetime.strptime(hour,'%H')
    #format hour ex:15:00
    hour = hour.strftime('%H:%M')
    
    #display
    print(hour,': {:.2f}'.format(avg))

Top 5 Hours for Ask Posts Comments
15:00 : 38.59
02:00 : 23.81
20:00 : 21.52
16:00 : 16.80
21:00 : 16.01

These are the hours at which Ask HN posts received the most comments on average.
According to the data set documentation, the times are in US Eastern Time. To make this insight more useful for people who share my time zone, we will convert the hours into Asia/Riyadh time.

Find The Difference in Hours Between The Two Time Zones

In [16]:

#required package to deal with time zones
import pytz
#get the current time of 'US/Eastern' time zone
timezone_Eastern = 'US/Eastern'
time_Eastern = dt.datetime.now(pytz.timezone(timezone_Eastern))
print(time_Eastern)

#convert the time from 'US/Eastern' into 'Asia/Riyadh'
myTimezone = time_Eastern.astimezone(pytz.timezone('Asia/Riyadh'))
print(myTimezone)

2020-07-23 09:09:51.084056-04:00
2020-07-23 16:09:51.084056+03:00

As the output shows, US/Eastern was at UTC-4 and Asia/Riyadh at UTC+3 when this cell was run, a difference of 7 hours. We will use a timedelta object to add 7 hours to the top 5 hours to get them in the Asia/Riyadh time zone.

In [17]:

# get the top 5 hours in Asia/Riyadh time zone
print('Top 5 Hours for Ask Posts Comments in Asia/Riyadh Time Zone')

for item in sorted_swap[:5]:
    #get hour as string
    hour = str(item[1])
    #get the average
    avg = item[0]
    
    #create a datetime object
    hour = dt.datetime.strptime(hour,'%H')
    #add 7 hours to the hours that are in US/Eastern time zone
    hour = hour + dt.timedelta(hours = 7)
    #format hour ex:22:00
    hour = hour.strftime('%H:%M')
    
    print(hour,': {:.2f}'.format(avg))

Top 5 Hours for Ask Posts Comments in Asia/Riyadh Time Zone
22:00 : 38.59
09:00 : 23.81
03:00 : 21.52
23:00 : 16.80
04:00 : 16.01
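
The fixed 7-hour shift holds only while US Eastern Time is on daylight saving time (UTC-4). During standard time (UTC-5) the gap to Riyadh, which stays at UTC+3 all year, is 8 hours. The sketch below illustrates this with fixed-offset time zones from the standard library. It is an illustration under stated assumptions, not the method used in this notebook: the function name eastern_hour_to_riyadh and the constant names are hypothetical, and the date inside the function is arbitrary.

```python
import datetime as dt

# fixed UTC offsets; assumed here instead of a full time zone database
RIYADH = dt.timezone(dt.timedelta(hours=3))        # no daylight saving time
EASTERN_DST = dt.timezone(dt.timedelta(hours=-4))  # Eastern Daylight Time
EASTERN_STD = dt.timezone(dt.timedelta(hours=-5))  # Eastern Standard Time

def eastern_hour_to_riyadh(hour, eastern_tz):
    """Convert an hour of day in a fixed-offset Eastern zone to Riyadh time."""
    # the date is arbitrary; with fixed offsets only the hour matters
    eastern_time = dt.datetime(2016, 7, 1, hour, 0, tzinfo=eastern_tz)
    return eastern_time.astimezone(RIYADH).hour

top_hours = [15, 2, 20, 16, 21]  # top 5 hours from the analysis above
summer_hours = [eastern_hour_to_riyadh(h, EASTERN_DST) for h in top_hours]
winter_hours = [eastern_hour_to_riyadh(h, EASTERN_STD) for h in top_hours]
print(summer_hours)  # [22, 9, 3, 23, 4]
print(winter_hours)  # [23, 10, 4, 0, 5]
```

The summer values match the table above. For posts created between November and March, the Riyadh hour would be one hour later.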

Conclusions

Ask HN posts receive more comments on average than Show HN posts. So which hours give an Ask HN post the best chance of receiving comments? Based on this analysis, the top two hours in US Eastern Time are 3:00 p.m., with an average of 38.59 comments per post, and 2:00 a.m., with an average of 23.81. In my time zone, Asia/Riyadh, these correspond to 10:00 p.m. and 9:00 a.m.