#!/usr/bin/env python
# coding: utf-8

# # Exploring Hacker News Posts
# 
# ## About Hacker News
# [Hacker News](https://news.ycombinator.com/news) is a popular site started by the startup incubator [Y Combinator](https://www.ycombinator.com), where user-submitted stories or "posts" are voted and commented on. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can receive hundreds of thousands of visitors as a result.
# 
# ## Source
# You can find the [data set here](https://www.kaggle.com/hacker-news/hacker-news-posts). The data has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
# 
# ## Goal
# Users submit **Ask HN** posts to ask the Hacker News community a specific question.
# Users submit **Show HN** posts to show the Hacker News community a project, product, or something interesting.
# For this analysis, we will compare Ask HN posts and Show HN posts to determine:
# 
# 1. Which type of post, Ask HN or Show HN, receives more comments on average?
# 2. Do posts created at a certain time receive more comments on average?

# # Introduction
# 
# We begin by reading in the data set and removing the headers.

# In[1]:


# Read in the data set.
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
hn[:5]


# # Removing Headers from a List of Lists

# In[2]:


# Remove the header row.
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])


# The data set contains the title of each post, the number of comments it received, and the date it was created.

# # Extracting Ask HN and Show HN Posts
# 
# We will create lists for Ask HN, Show HN, and Other HN posts, extract the corresponding posts, and place them in the appropriate list. The Other HN list will contain the commented posts that are neither Ask HN nor Show HN. Then we will determine the length of each list. Filtering the data allows for a more specific and easier analysis moving forward.

# In[3]:


# Extract posts that begin with 'Ask HN' or 'Show HN' and append them to the appropriate list.
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


# # Calculating the Average Number of Comments for Ask HN and Show HN Posts
# 
# We now have the Ask HN and Show HN posts in separate lists. We will calculate the average number of comments received by each.

# In[4]:


# Calculate the average number of comments received by 'Ask HN' posts.
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
print(total_ask_comments)

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)


# In[5]:


# Calculate the average number of comments received by 'Show HN' posts.
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
print(total_show_comments)

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)


# There were more than twice as many Ask HN posts as Show HN posts. On average, Ask HN posts received approximately 14 comments, while Show HN posts received approximately 10. Since Ask HN posts are more likely to receive comments, they will be the focus for the remainder of the analysis.
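# The two averages above follow the same pattern, so the calculation could also be factored into a small reusable helper. The sketch below is illustrative only and is not part of the original analysis; `average_comments` is a hypothetical helper, and it assumes the same row layout as the data set, with the comment count at index 4.

# In[ ]:


# Hypothetical helper (illustrative sketch): mean comment count for a list of post rows.
def average_comments(posts, comments_index=4):
    total = 0
    for post in posts:
        total += int(post[comments_index])
    return total / len(posts)

# Example usage with the lists built above:
# average_comments(ask_posts)   # approximately 14
# average_comments(show_posts)  # approximately 10
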
# # Finding the Number of Ask HN Posts and Comments by Hour Created

# In[6]:


# Count the Ask HN posts created during each hour of the day and the number of comments they received.
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comments = row[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour in counts_by_hour:
        comments_by_hour[hour] += comments
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = comments
        counts_by_hour[hour] = 1

comments_by_hour


# # Calculating the Average Number of Comments for Ask HN Posts by Hour

# In[7]:


# Calculate the average number of comments on 'Ask HN' posts created during each hour of the day.
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour


# # Sorting and Printing Values from a List of Lists

# In[8]:


# Swap each pair so the average comes first, then sort from highest to lowest average number of comments.
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap


# In[9]:


# Print the 5 hours with the highest average number of comments.
print("Top 5 Hours for Ask Post Comments")

for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )


# With an average of 38.59 comments per post, 15:00 (3:00 pm [US Eastern Time](https://www.kaggle.com/hacker-news/hacker-news-posts/home)) has the highest average number of comments per post. That is more than 60% greater than the hour with the second highest average.

# # Conclusion
# 
# The goal of this project was to analyze Ask HN posts and Show HN posts to determine which type of post, and which posting time, received the most comments on average. From our analysis, we determined that Ask HN posts created between 15:00 and 16:00 (3:00 pm - 4:00 pm EST) received the most comments on average.
# 
# However, posts that did not receive comments were excluded, and Ask HN and Show HN posts made up only a small share of the commented posts in our data set compared to Other HN posts. Therefore, considering only the posts *that received comments*, we can say that Ask HN posts received more comments on average, and that Ask HN posts created between 15:00 and 16:00 received the most comments on average.
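# As a quick sanity check (not part of the original write-up), the "more than 60% greater" comparison above can be reproduced directly from `sorted_swap`. This is a minimal sketch and assumes the variables from the earlier cells are still in scope.

# In[ ]:


# Illustrative sketch: compare the top two hourly averages from sorted_swap.
top_avg = sorted_swap[0][0]
second_avg = sorted_swap[1][0]
print("{:.0%} greater".format(top_avg / second_avg - 1))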