#!/usr/bin/env python
# coding: utf-8

# # Exploring Hacker News Posts
#
# ## Introduction
#
# Hacker News is an extremely popular site in the tech and startup world. Users submit posts, which are then voted on and commented on, much like on Reddit. The top posts can receive hundreds of thousands of visitors.
#
# I am aiming to explore two types of posts: `Ask HN` and `Show HN`.
#
# Specifically, I want to find out the following:
#
# * Do `Ask HN` or `Show HN` posts receive more comments on average?
# * Do posts created at a certain time receive more comments on average?

# ## Importing and Reading the Data
#
# In the cell below I have done the following:
#
# * Imported `reader` from the `csv` module
# * Opened and read the file `hacker_news.csv`
# * Turned the file into a list of lists with the `list()` function and assigned it to the variable `hn`
# * Assigned the header row to the variable `headers`, so I can easily reference the column titles if needed
# * Updated the variable `hn` so that it does not include the header
# * Finally, displayed `headers` and the first 5 rows of `hn`

# In[1]:


from csv import reader

# Open the file in a with block so it is closed once the rows are read
with open("hacker_news.csv") as file:
    read = reader(file)
    hn = list(read)

headers = hn[0]
hn = hn[1:]

print(headers)
print("")
hn[:5]


# ## Extracting Ask HN and Show HN Posts
#
# In the code cell below I first made three empty lists in which to store the specific posts I needed.
#
# I then looped through each row in `hn`. I wanted to separate the rows into those whose titles start with "ask hn", those whose titles start with "show hn", and the remaining posts. To make sure differences between uppercase and lowercase did not cause problems, I lowercased the title column with the `lower` method, assigned the result to a variable called `title`, and checked it with the string method `startswith`.
#
# I then used conditional statements to find the rows that started with each identified string.
# If a match was found, I used the `append` method to add that row to the corresponding list.
#
# In the next two cells, I counted and printed my newly created lists to ensure all went well.

# In[2]:


ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    # elif (not a second if) so that Ask HN posts are not also added to other_posts
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)


# In[3]:


print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


# In[4]:


ask_posts[:5]


# In[5]:


show_posts[:5]


# ## Calculating the Average Number of Comments for Ask HN and Show HN Posts

# In this section the aim was to compare the average number of comments for the Ask HN and Show HN posts.
#
# The following tasks were completed in the cells below:
# * Used the `print` function to display the headers to find the right index
# * For each list (`ask_posts` and `show_posts`), used a for loop to iterate over the posts, converting the `num_comments` column to an integer with the `int` function and adding it to a running total named `total_ask_comments` or `total_show_comments`
# * Computed the average for each and assigned it to a variable, either `avg_ask_comments` or `avg_show_comments`

# In[6]:


print(headers)


# In[7]:


total_ask_comments = 0
for a in ask_posts:
    num = int(a[4])
    total_ask_comments += num

avg_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of comments for Ask Posts: ", avg_ask_comments)

total_show_comments = 0
for s in show_posts:
    num = int(s[4])
    total_show_comments += num

avg_show_comments = total_show_comments / len(show_posts)
print("The average number of comments for Show Posts: ", avg_show_comments)


# The analysis of the average number of comments for each list shows that Ask posts receive more comments on average than Show posts.
#
# This could be due to the desired outcome of an Ask post.
# If you create an Ask post, you intend for someone to comment, i.e. answer your question. Show posts, by contrast, do not pose a question to answer; viewers simply look at the post. A viewer may wish to comment, but it is not as natural as responding to someone who has asked a question.

# ## Finding the Number of Ask Posts and Comments by Hour Created
#
# In this section, I made two dictionaries: `counts_by_hour` and `comments_by_hour`.
#
# * `counts_by_hour`: contains the number of ask posts created during each hour of the day.
# * `comments_by_hour`: contains the total number of comments received by ask posts created during each hour.
#
# A summary of the cells below:
#
# * Imported the `datetime` module as `dt`
# * Created an empty list `result_list` to store two elements from each row: the `created_at` and `num_comments` columns
# * Iterated over `ask_posts` and appended the two elements to `result_list`
# * Created two empty dictionaries named `counts_by_hour` and `comments_by_hour`
# * Created a for loop to iterate over `result_list`
# * Before extracting and adding to the dictionaries,
#   I parsed the date and created a datetime object using the `datetime.strptime()` method
# * I only wanted the hour portion of the datetime object, so I used the `datetime.strftime()` method
# * Finally, used conditional statements to either create a new entry for the hour or update the existing entry in each dictionary

# In[8]:


print(headers)


# In[9]:


import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comment_num = row[1]
    created = row[0]
    created_dt = dt.datetime.strptime(created, '%m/%d/%Y %H:%M')
    created_hour = created_dt.strftime('%H')
    if created_hour in counts_by_hour:
        counts_by_hour[created_hour] += 1
        comments_by_hour[created_hour] += comment_num
    else:
        counts_by_hour[created_hour] = 1
        comments_by_hour[created_hour] = comment_num

print(counts_by_hour)
print("")
print(comments_by_hour)


# ## Calculating the Average Number of Comments for Ask HN by Hour
#
# Next I will use the two dictionaries created to calculate the average number of comments for posts created during each hour of the day.
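#
# As a minimal illustration of this calculation, the sketch below uses made-up totals rather than the real dataset. The names `toy_counts_by_hour`, `toy_comments_by_hour`, and `toy_avg_per_hour` are hypothetical and only serve the example.

```python
# Hypothetical toy data, not taken from hacker_news.csv
toy_counts_by_hour = {"09": 2, "15": 4}       # posts created in each hour
toy_comments_by_hour = {"09": 10, "15": 50}   # total comments on those posts

# Average comments per post for each hour: total comments / number of posts
toy_avg_per_hour = {
    hour: round(toy_comments_by_hour[hour] / toy_counts_by_hour[hour], 2)
    for hour in toy_comments_by_hour
}

print(toy_avg_per_hour)
```

# Both dictionaries must share the same keys for the division to work, which holds in the notebook because they are filled in the same loop.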
#
# The averages were computed by:
#
# * Creating an empty list `avg_per_hour`
# * Iterating over the keys of `comments_by_hour`
# * Computing the average number of comments and rounding it to 2 decimal places using the `round` function
# * Finally, appending a two-element list containing the hour and the `average`

# In[10]:


avg_per_hour = []

for hour in comments_by_hour:
    average = round(comments_by_hour[hour] / counts_by_hour[hour], 2)  # decided it was best to round the average to two decimal places
    avg_per_hour.append([hour, average])

avg_per_hour


# ## Sorting and Printing Values from List of Lists

# In[11]:


swap_avg_per_hour = []

# Put the average first so that sorting orders the rows by average
for row in avg_per_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_per_hour.append([avg, hour])

swap_avg_per_hour


# In[12]:


sorted_swap = sorted(swap_avg_per_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour_dt = dt.datetime.strptime(row[1], '%H')
    hour_str = hour_dt.strftime('%H:%M')
    # Pacific time is 3 hours behind Eastern time
    pt_hour_dt = hour_dt - dt.timedelta(hours=3)
    pt_hour_str = pt_hour_dt.strftime('%H:%M')
    # Central time is 1 hour behind Eastern time
    ct_hour_dt = hour_dt - dt.timedelta(hours=1)
    ct_hour_str = ct_hour_dt.strftime('%H:%M')
    print(' ', '{pst_time} PST, {cst_time} CST, {est_time} EST: {avg:.2f} average comments per post'.format(pst_time=pt_hour_str, cst_time=ct_hour_str, est_time=hour_str, avg=row[0]))


# The results showed that posts created between 3 PM and 4 PM EST had the highest average number of comments per post. It was unclear to me why this was.
#
# I therefore decided to compare the most populous time zones in the USA (Pacific, Central, and Eastern) to see whether a clear pattern appeared. The highest averages of comments were found in the middle of the day, possibly when most users would be active. This would explain why the average for these times across the USA is much higher than the other 4 results.
# In addition, it is important to mention that Hacker News was started by Y Combinator, which is located in the Pacific Time zone.
#
# It would be interesting to see where the most common posts come from with regard to time zone, to see if it matches the above results.
#
# From the results above, if your intention is to create a post that attracts the highest possible number of comments, posting at 3 PM EST would be recommended.

# In[13]:


print("Top 5 Hours for Ask Posts Comments - European Timezone Comparison")
for row in sorted_swap[:5]:
    est_hour_dt = dt.datetime.strptime(row[1], '%H')
    est_hour_str = est_hour_dt.strftime('%H:%M')
    # Central European Time (CET, UTC+1) is 6 hours ahead of EST (UTC-5)
    cet_hour_dt = est_hour_dt + dt.timedelta(hours=6)
    cet_hour_str = cet_hour_dt.strftime('%H:%M')
    print(' ', '{est_time} EST, {cet_time} CET: {avg:.2f} average comments per post'.format(est_time=est_hour_str, cet_time=cet_hour_str, avg=row[0]))


# The above results are a comparison between the Eastern and Central European time zones.
#
# From analysing the results, perhaps another reason why 3 PM EST has a higher number of comments on average is that Europe is still active.

# ## Conclusion
#
# It can be concluded that the best time to post with the intention of gaining the most comments is between 3 PM and 4 PM EST.
#
# This could be because it is a time when two large populations (North America and Europe) are most active.
#
# A future addition to this project would be to compare this data with the following: the number of users per country/state, and where the highest number of posts come from, i.e. location. This could provide further detail on when it is best to post, and possibly other findings regarding the general use of Hacker News for creating engagement.

# In[ ]:
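

# The fixed-offset time zone conversions used in the cells above could be gathered into one small helper. The following is only a sketch: `shift_hour`, `OFFSETS_FROM_EST`, and `hour_in_zones` are hypothetical names that do not appear in the original notebook, and the offsets assume standard time everywhere (daylight saving time is ignored).

```python
import datetime as dt

# Assumed standard-time offsets in hours relative to EST (UTC-5); DST is ignored
OFFSETS_FROM_EST = {"PST": -3, "CST": -1, "EST": 0, "CET": 6}


def shift_hour(hour_str, offset_hours):
    """Shift an hour string such as '15' by a whole number of hours.

    Returns the result formatted as 'HH:MM', wrapping around midnight.
    """
    base = dt.datetime.strptime(hour_str, "%H")
    shifted = base + dt.timedelta(hours=offset_hours)
    return shifted.strftime("%H:%M")


def hour_in_zones(hour_str):
    """Return the given Eastern hour expressed in each zone of OFFSETS_FROM_EST."""
    return {
        zone: shift_hour(hour_str, offset)
        for zone, offset in OFFSETS_FROM_EST.items()
    }


print(hour_in_zones("15"))
```

# A conversion that respects daylight saving time would need full dates rather than bare hours, for example with the standard-library `zoneinfo` module (Python 3.9+).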