
Yacking Hackers

The word *Hacker* can mean many things. Hackers can hack with good intentions or malicious ones.

For this project we will focus on a community of hackers with good intentions by looking at a popular technology platform, Hacker News. The site works much like other online forums such as reddit: users submit posts and receive answers, comments, votes and feedback.

We removed all submissions without comments, which brings the dataset to approximately *20,000* rows. Within the dataset we are specifically interested in posts whose titles begin with either *Ask HN* or *Show HN*.

*Ask HN* - users ask the Hacker News community specific questions
- Ask HN: Career outlook in tech without a CS degree?
- Ask HN: How would you, a software engineer, spec out a new MacBook Pro?
*Show HN* - users post to show the Hacker News community projects or anything interesting
- Show HN: I built a sonar into my surfboard (foobarbecue.github.io)
- Show HN: Dev's Full Stack Nightmare (an accurate representation) (sleepy-meadow-72878.herokuapp.com)

Because Hacker News is extremely popular in hacker culture and among tech startups, a user's post has a chance of receiving hundreds of thousands of visitors.

In this analysis, we will compare *Ask HN* and *Show HN* posts to determine whether one of the two receives more comments on average, and whether posts created at a certain time receive more comments on average.

Let's begin:

In [1]:

### import csv file, open, read ###

import csv

# Open the file in a context manager so it is closed once it has been read
with open('hacker_news.csv') as opened_file:
    read_file = csv.reader(opened_file)
    hn = list(read_file)


### iterate through hn, print first 5 rows (header row included) ###
for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

To continue, we must separate the column header row from the rest of the data:

In [2]:

headers = hn[0]
hn = hn[1:]

In [3]:

print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [4]:

### first 5 rows of data ###
hn[0:5]

Out[4]:

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

The column headers that we will use for our analysis are:

*num_points* - The number of points a post has
*num_comments* - The number of comments a post has
*created_at* - When the post was created
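
The cells that follow refer to columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, the positions could instead be looked up from the header row, so the code does not rely on hard-coded numbers. The names `TITLE`, `NUM_COMMENTS` and `CREATED_AT` are introduced here for illustration only and are not used elsewhere in this notebook.

```python
# Header row as printed above, repeated so this sketch runs on its own.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical constants (not part of the original notebook).
TITLE = headers.index('title')
NUM_COMMENTS = headers.index('num_comments')
CREATED_AT = headers.index('created_at')

# First data row from the output above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

print(TITLE, NUM_COMMENTS, CREATED_AT)
print(sample_row[TITLE], int(sample_row[NUM_COMMENTS]), sample_row[CREATED_AT])
```

Every value in a row is a string, which is why the comment count is passed through `int()` before it is used in arithmetic.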

Analyzing the Data:

Start by extracting Ask HN posts, Show HN posts and Other posts

Create three empty lists, one for each type of post (Ask HN, Show HN, Other).

Then iterate over each row and append it to the appropriate list based on its title.

In [5]:

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1] 
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)


print('Total Number Ask HN posts:', len(ask_posts))
print('Total Number Show HN posts:', len(show_posts))
print('Total Number Other posts:', len(other_posts))    

Total Number Ask HN posts: 1744
Total Number Show HN posts: 1162
Total Number Other posts: 17194
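
The prefix test used in the loop above can also be written as a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper, not part of the original notebook, and the third sample title is taken from the first data row.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' depending on how the title starts."""
    lowered = title.lower()  # compare case-insensitively, as in the loop above
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: Career outlook in tech without a CS degree?',
    'Show HN: I built a sonar into my surfboard',
    'Interactive Dynamic Video',
]
for title in sample_titles:
    print(classify_title(title), '-', title)
```

Note that the check only looks at the start of the title, so a title that mentions "Ask HN" further along is still counted as Other.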

Calculating the Average Number of Comments for Ask HN, Show HN and Other Posts

We can now use the lists we've created above to calculate the average number of comments per post for Ask HN posts, Show HN posts and Other posts.

In [6]:

total_ask_comments = 0

### Ask HN Avg ###
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Total number of Ask Comments:', total_ask_comments)
print('Avg number of Ask Comments:', avg_ask_comments)
print('\n')
print('Clean Avg number of Ask Comments: {:.2f}'.format(avg_ask_comments))
    

Total number of Ask Comments: 24483
Avg number of Ask Comments: 14.038417431192661


Clean Avg number of Ask Comments: 14.04

In [7]:

total_show_comments = 0

### Show HN Avg ###
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print('Total number of Show Comments:', total_show_comments)
print('Avg number of Show Comments:', avg_show_comments)
print('\n')
print('Clean Avg number of Show Comments: {:.2f}'.format(avg_show_comments))

Total number of Show Comments: 11988
Avg number of Show Comments: 10.31669535283993


Clean Avg number of Show Comments: 10.32

In [8]:

total_other_comments = 0

### Other Posts Avg ###
for row in other_posts:
    num_comments = int(row[4])
    total_other_comments += num_comments
    
avg_other_comments = total_other_comments / len(other_posts)
print('Total number of Other Comments:', total_other_comments)
print('Avg number of Other Comments:', avg_other_comments)
print('\n')
print('Clean Avg number of Other Comments: {:.2f}'.format(avg_other_comments))

Total number of Other Comments: 462055
Avg number of Other Comments: 26.8730371059672


Clean Avg number of Other Comments: 26.87

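We can now see that Other posts receive more comments on average than either Show HN or Ask HN posts. At 26.87, their average even exceeds the Ask HN and Show HN averages added together (24.36).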

Other Posts Avg Comments - *26.87*
Ask HN Avg Comments - *14.04*
Show HN Avg Comments - *10.32*

For our analysis, however, we are interested in the comparison between the Ask HN average and the Show HN average. The figures above show that Ask HN posts receive more comments on average than Show HN posts.
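
The three averaging cells above repeat the same loop. As a rough sketch, the calculation could be wrapped in one function and called once per list. `average_comments` is a hypothetical helper, returning 0.0 for an empty list is an assumption made here, and the toy rows are invented purely to exercise the function.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column for a list of rows (lists of strings)."""
    if not posts:
        return 0.0  # assumption: report 0 rather than dividing by zero
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Toy rows in the same column order as the dataset (values are made up).
toy_posts = [
    ['1', 'Ask HN: first question?', '', '3', '52', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second question?', '', '1', '10', 'user_b', '1/26/2016 19:30'],
]
print('{:.2f}'.format(average_comments(toy_posts)))

# The Ask HN figure reported above can be checked from its total and count.
print('{:.2f}'.format(24483 / 1744))
```

With the notebook's own lists this would be called as `average_comments(ask_posts)`, `average_comments(show_posts)` and `average_comments(other_posts)`.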

Calculating the Average Number of Comments per post for posts created during each hour of the day (Ask HN)

Now it's time to determine whether posts created at a certain time are more likely to attract comments.

We'll be working with the data in the created_at column to:

- Calculate the number of ask posts and show posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts and show posts receive by hour created.

In [9]:

import datetime as dt
result_list = []

### Ask HN ###
for row in ask_posts:
    result_list.append([row[6],int(row[4])])
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comments_count = row[1]
    date_string = row[0]
    date_created = dt.datetime.strptime(date_string,"%m/%d/%Y %H:%M")
    hour_created = date_created.hour
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments_count 
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments_count


print('Ask HN posts created by hour:', counts_by_hour)
print('\n')
print('Ask HN comments post per hour:', comments_by_hour)    

Ask HN posts created by hour: {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}


Ask HN comments post per hour: {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
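
The `if ... in ... else` bookkeeping in the cell above is one way to build the two dictionaries. A possible alternative, sketched here with `collections.defaultdict` from the standard library, lets missing hours start at zero automatically. `tally_by_hour` is a hypothetical helper and the toy rows are invented for illustration.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(posts, created_index=6, comments_index=4):
    """Return (posts per hour, comments per hour) as two plain dicts."""
    counts = defaultdict(int)    # a missing hour starts at 0
    comments = defaultdict(int)
    for row in posts:
        created = dt.datetime.strptime(row[created_index], '%m/%d/%Y %H:%M')
        counts[created.hour] += 1
        comments[created.hour] += int(row[comments_index])
    return dict(counts), dict(comments)


# Toy rows in the dataset's column order (values are made up).
toy_posts = [
    ['1', 'Ask HN: a?', '', '1', '4', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: b?', '', '1', '6', 'user_b', '1/26/2016 11:05'],
    ['3', 'Ask HN: c?', '', '1', '2', 'user_c', '6/17/2016 0:01'],
]
counts, comments = tally_by_hour(toy_posts)
print(counts)
print(comments)
```

The format string `'%m/%d/%Y %H:%M'` is the same one used in the notebook; it accepts dates such as `8/4/2016 11:52` without zero padding.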

Next, we create a list of lists containing the hours during which posts were created and the average number of comments those posts received.

In [10]:

avg_by_hour = []

### Avg comments posts receive ###
for key in counts_by_hour:
    avg_posts = comments_by_hour[key]/counts_by_hour[key] 
    avg_by_hour.append([key,avg_posts])

avg_by_hour
    

Out[10]:

[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]

We now have the results we need, but the format above makes it difficult to read and to identify the hours with the highest values. We will swap the elements so that the average number of comments comes first and the hour second.

Let's sort the list of lists and print the 5 highest values in a more presentable format.

Sorting and Printing Values from a List of Lists

In [11]:

swap_avg_by_hour = []

### swapping elements and appending ###
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
   
print(swap_avg_by_hour)

[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]

In [12]:

### time to sort swap_avg_by_hour descending order ###

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    hour_formatted = dt.datetime.strptime(str(row[1]),'%H')
    hour_formatted = hour_formatted.strftime('%H:%M')

    print('{}: {:.2f} average comments per post'.format(hour_formatted,row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
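
Swapping the elements is one way to sort by average. As a minimal alternative sketch, `sorted` also accepts a `key` function, so the `[hour, average]` pairs can be ordered by their second element without building a second list. The pairs below are a subset copied from the Ask HN output above.

```python
# A few [hour, average] pairs taken from the Ask HN results above.
avg_by_hour = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
    [16, 16.796296296296298],
]

# Sort by the average (index 1), highest first; the hour stays at index 0.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```

Formatting the hour with `{:02d}` gives the same `HH:00` text as the `strptime`/`strftime` round trip used in the notebook.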

Now we repeat the same steps to find the top 5 hours for comments on Show HN posts.

Calculating the Average Number of Comments per post for posts created during each hour of the day (Show HN)

In [13]:

import datetime as dt
result_list = []

### Show HN ###
for row in show_posts:
    result_list.append([row[6],int(row[4])])
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comments_count = row[1]
    date_string = row[0]
    date_created = dt.datetime.strptime(date_string,"%m/%d/%Y %H:%M")
    hour_created = date_created.hour
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments_count 
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments_count


print('Show HN posts created by hour:', counts_by_hour)
print('\n')
print('Show HN comments post per hour:', comments_by_hour) 

Show HN posts created by hour: {14: 86, 22: 46, 18: 61, 7: 26, 20: 60, 5: 19, 16: 93, 19: 55, 15: 78, 3: 27, 17: 93, 6: 16, 2: 30, 13: 99, 8: 34, 21: 47, 4: 26, 11: 44, 12: 61, 23: 36, 9: 30, 1: 28, 10: 36, 0: 31}


Show HN comments post per hour: {14: 1156, 22: 570, 18: 962, 7: 299, 20: 612, 5: 58, 16: 1084, 19: 539, 15: 632, 3: 287, 17: 911, 6: 142, 2: 127, 13: 946, 8: 165, 21: 272, 4: 247, 11: 491, 12: 720, 23: 447, 9: 291, 1: 246, 10: 297, 0: 487}

In [14]:

avg_by_hour = []

### Avg comments posts receive ###
for key in counts_by_hour:
    avg_posts = comments_by_hour[key]/counts_by_hour[key] 
    avg_by_hour.append([key,avg_posts])

avg_by_hour

Out[14]:

[[14, 13.44186046511628],
 [22, 12.391304347826088],
 [18, 15.770491803278688],
 [7, 11.5],
 [20, 10.2],
 [5, 3.0526315789473686],
 [16, 11.655913978494624],
 [19, 9.8],
 [15, 8.102564102564102],
 [3, 10.62962962962963],
 [17, 9.795698924731182],
 [6, 8.875],
 [2, 4.233333333333333],
 [13, 9.555555555555555],
 [8, 4.852941176470588],
 [21, 5.787234042553192],
 [4, 9.5],
 [11, 11.159090909090908],
 [12, 11.80327868852459],
 [23, 12.416666666666666],
 [9, 9.7],
 [1, 8.785714285714286],
 [10, 8.25],
 [0, 15.709677419354838]]

In [15]:

swap_avg_by_hour = []

### swapping elements and appending ###
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
   
print(swap_avg_by_hour)

[[13.44186046511628, 14], [12.391304347826088, 22], [15.770491803278688, 18], [11.5, 7], [10.2, 20], [3.0526315789473686, 5], [11.655913978494624, 16], [9.8, 19], [8.102564102564102, 15], [10.62962962962963, 3], [9.795698924731182, 17], [8.875, 6], [4.233333333333333, 2], [9.555555555555555, 13], [4.852941176470588, 8], [5.787234042553192, 21], [9.5, 4], [11.159090909090908, 11], [11.80327868852459, 12], [12.416666666666666, 23], [9.7, 9], [8.785714285714286, 1], [8.25, 10], [15.709677419354838, 0]]

In [17]:

### sort swap_avg_by_hour descending order ###

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('Top 5 Hours for Show Posts Comments')
for row in sorted_swap[:5]:
    hour_formatted = dt.datetime.strptime(str(row[1]),'%H')
    hour_formatted = hour_formatted.strftime('%H:%M')

    print('{}: {:.2f} average comments per post'.format(hour_formatted,row[0]))

Top 5 Hours for Show Posts Comments
18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post
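
The Ask HN and Show HN cells above follow identical steps: tally by hour, average, sort, take the top 5. As a sketch only, those steps could be gathered into one function and applied to either list. `top_hours` is a hypothetical name that does not appear in the original notebook, and the toy rows are invented for illustration.

```python
import datetime as dt


def top_hours(posts, n=5, created_index=6, comments_index=4):
    """Return up to n (hour, average comments) pairs, highest average first."""
    counts = {}
    comments = {}
    for row in posts:
        hour = dt.datetime.strptime(row[created_index], '%m/%d/%Y %H:%M').hour
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + int(row[comments_index])
    averages = [(hour, comments[hour] / counts[hour]) for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:n]


# Toy rows in the dataset's column order (values are made up).
toy_posts = [
    ['1', 'Show HN: a', '', '1', '10', 'user_a', '8/4/2016 15:10'],
    ['2', 'Show HN: b', '', '1', '20', 'user_b', '8/5/2016 15:45'],
    ['3', 'Show HN: c', '', '1', '8', 'user_c', '8/6/2016 2:30'],
    ['4', 'Show HN: d', '', '1', '1', 'user_d', '8/7/2016 9:00'],
]
for hour, avg in top_hours(toy_posts, n=2):
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```

With the notebook's data, `top_hours(ask_posts)` and `top_hours(show_posts)` would stand in for the two sets of cells.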

Conclusion

We now have the hours of the day at which a user can expect the most comments on an Ask HN post or a Show HN post.

Ask HN
- *3:00* pm with an average of *38.59* comments per post
- *2:00* am with an average of *23.81* comments per post
- *8:00* pm with an average of *21.52* comments per post
- *4:00* pm with an average of *16.80* comments per post
- *9:00* pm with an average of *16.01* comments per post
Show HN
- *6:00* pm with an average of *15.77* comments per post
- *12:00* am with an average of *15.71* comments per post
- *2:00* pm with an average of *13.44* comments per post
- *11:00* pm with an average of *12.42* comments per post
- *10:00* pm with an average of *12.39* comments per post
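
The lists above restate the 24-hour values from the code output in 12-hour form. A small sketch of that conversion follows; `to_12_hour` is a hypothetical helper, written by hand rather than with `strftime('%p')` so that the am/pm text does not depend on locale settings.

```python
def to_12_hour(hour):
    """Format an hour from 0 to 23 as, for example, '3:00 pm'."""
    suffix = 'am' if hour < 12 else 'pm'
    # 0 and 12 both display as 12; every other hour wraps modulo 12.
    return '{}:00 {}'.format(hour % 12 or 12, suffix)


for hour in [15, 2, 20, 0, 12]:
    print('{:02d}:00 -> {}'.format(hour, to_12_hour(hour)))
```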

We can see that Ask HN posts receive more comments per post on average than Show HN posts. One possible explanation is that users comment more on question-based posts for the intellectual gratification of helping another user understand the problem they are facing. Problems can often be solved in many different ways, which can lead to debates among commenters and hence more comments.