This notebook presents an analysis of data from the first million page views on my blog, Probably Overthinking It.
Copyright 2015 Allen Downey
MIT License: http://opensource.org/licenses/MIT
%matplotlib inline
import pandas as pd
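def read_table(filename):
    """Read a saved Blogger page and return its table of posts."""
    # pd.read_html returns a list of DataFrames; the post listing is at index 5.
    with open(filename) as fp:
        t = pd.read_html(fp)
    table = t[5]
    return table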
table1 = read_table('blogger1.html')
table1.shape
(100, 8)
table2 = read_table('blogger2.html')
table2.shape
(20, 8)
table = pd.concat([table1, table2], ignore_index=True)
table.shape
(120, 8)
import string
chars = string.ascii_letters + ' '
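def convert(s):
    # Drop trailing letters and spaces, then parse the remaining digits.
    return int(s.rstrip(chars))
def clean(s):
    # Keep the text before 'Edit'; leave the string alone if 'Edit' is absent,
    # since find() returns -1 and s[:-1] would drop the last character.
    i = s.find('Edit')
    return s[:i] if i >= 0 else s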
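A quick check of the two helpers may help here. This is a minimal sketch: the functions are restated (with a guard in `clean` for strings that lack `Edit`) so the snippet runs on its own, and the sample cell strings are made up for illustration, not copied from the Blogger export.

```python
import string

chars = string.ascii_letters + ' '

def convert(s):
    # Drop trailing letters and spaces, then parse the remaining digits.
    return int(s.rstrip(chars))

def clean(s):
    # Keep the text before 'Edit'; return s unchanged if 'Edit' is absent.
    i = s.find('Edit')
    return s[:i] if i >= 0 else s

# Hypothetical cell contents; the real Blogger strings may differ.
count = convert('723 views')
with_marker = clean('Bayes meets FourierEdit | View | Share')
without_marker = clean('Bayes meets Fourier')

# Without the guard, find() returns -1 and the slice drops the last character.
title = 'Bayes meets Fourier'
unguarded = title[:title.find('Edit')]

print(count, repr(with_marker), repr(without_marker), repr(unguarded))
```

The last value shows why the guard matters: a title with no `Edit` marker would otherwise lose its final character.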
table['title'] = table[1].apply(clean)
table.title
0      One million is a lot
1      When will I win the Great Bear Run?
2      Bayes meets Fourier
3      First babies are more likely to be late
4      Bayesian analysis of gluten sensitivity
5      Bayes theorem in real life
6      The Inspection Paradox is Everywhere
7      Orange is the new stat
8      Will Millennials Ever Get Married?
9      Bayesian Billiards
10     The Sleeping Beauty Problem
11     Hypothesis testing is only mostly useless
12     Two hour marathon by 2041 -- probably
13     Bayesian survival analysis for "Game of Thrones"
14     Statistical inference is only mostly wrong
15     Upcoming talk on survival analysis in Python
16     Bayesian analysis of match rates on Tinder
17     Godless freshmen: now more Nones than Catholics
18     Bayesian predictions for Super Bowl XLIX
19     Statistics tutorials at PyCon 2015
20     The Rock Hyrax Problem
21     The World Cup Problem Part 2: Germany v. Argen...
22     The World Cup Problem: Germany v. Brazil
23     On efficient algorithms for finding the goddam...
24     Two hour marathon in 2041
25     Bayesian election forecasting
26     Regression with Python, pandas and StatsModels
27     New study: vaccines prevent disease and death
28     An exercise in hypothesis testing
29     More likely to be killed by a terrorist
...
90     Girl Named Florida solutions
91     The red-haired girl named Florida
92     Somebody bet on the Bayes
93     All your Bayes are belong to us!
94     My favorite Bayes's Theorem problems
95     The Blinky Monty Problem
96     Repeated tests: how bad can it be?
97     The Jimmy Nut Company problem
98     Upcoming webcast: Only One Test
99     News flash: OJ did it.
100    Postcard from NKS Summer Camp
101    A hierarchical Bayesian model of pond scum
102    More hypotheses, less trivia
103    There is only one test!
104    Statistics Workshop
105    Think Stats will be published by O'Reilly in June
106    Two Hour Marathon in 2045
107    Bayesianness is next to Godliness
108    Survival analysis
109    Freshman hordes more godless than ever!
110    Predicting marathon times
111    BQ is unfair to women
112    Moving the goalposts
113    The BQ Effect
114    Are first babies more likely to be late?
115    Yet another reason SAT scores are non-predictive
116    Are you popular? Hint: no.
117    Obesity epidemic cured!
118    Observer effect in relay races
119    Proofiness and elections
Name: title, dtype: object
table['plusses'] = table[4].fillna(0)
table.plusses.head()
0    0
1    1
2    7
3    2
4    9
Name: plusses, dtype: float64
table['comments'] = table[5].apply(convert)
table.comments.head()
0    0
1    1
2    1
3    3
4    1
Name: comments, dtype: int64
table['views'] = table[6].apply(convert)
table.views
0      0
1      723
2      2363
3      944
4      3110
5      2514
6      30484
7      2131
8      589
9      1273
10     2348
11     1816
12     2891
13     32406
14     4666
15     1242
16     7602
17     1491
18     2254
19     1193
20     648
21     1789
22     3040
23     819
24     3090
25     1621
26     6456
27     1834
28     1057
29     1536
...
90     9454
91     1153
92     2332
93     48836
94     34384
95     3367
96     3797
97     1929
98     885
99     0
100    0
101    2162
102    1520
103    4246
104    203
105    1445
106    1745
107    1083
108    2849
109    1379
110    3847
111    815
112    513
113    3066
114    130722
115    17876
116    1468
117    289
118    725
119    396
Name: views, dtype: int64
table['date'] = pd.to_datetime(table[7])
table.date.head()
0    2015-11-01
1    2015-10-26
2    2015-10-23
3    2015-09-23
4    2015-09-01
Name: date, dtype: datetime64[ns]
table = table[table.views > 0]
table.shape
(115, 13)
table.index = range(115, 0, -1)
table.title
115    When will I win the Great Bear Run?
114    Bayes meets Fourier
113    First babies are more likely to be late
112    Bayesian analysis of gluten sensitivity
111    Bayes theorem in real life
110    The Inspection Paradox is Everywhere
109    Orange is the new stat
108    Will Millennials Ever Get Married?
107    Bayesian Billiards
106    The Sleeping Beauty Problem
105    Hypothesis testing is only mostly useless
104    Two hour marathon by 2041 -- probably
103    Bayesian survival analysis for "Game of Thrones"
102    Statistical inference is only mostly wrong
101    Upcoming talk on survival analysis in Python
100    Bayesian analysis of match rates on Tinder
99     Godless freshmen: now more Nones than Catholics
98     Bayesian predictions for Super Bowl XLIX
97     Statistics tutorials at PyCon 2015
96     The Rock Hyrax Problem
95     The World Cup Problem Part 2: Germany v. Argen...
94     The World Cup Problem: Germany v. Brazil
93     On efficient algorithms for finding the goddam...
92     Two hour marathon in 2041
91     Bayesian election forecasting
90     Regression with Python, pandas and StatsModels
89     New study: vaccines prevent disease and death
88     An exercise in hypothesis testing
87     More likely to be killed by a terrorist
86     Bayesian solution to the Lincoln index problem
...
30     Estimating the age of renal tumors
29     Comment on "Racism and Meritocracy"
28     Girl Named Florida solutions
27     The red-haired girl named Florida
26     Somebody bet on the Bayes
25     All your Bayes are belong to us!
24     My favorite Bayes's Theorem problems
23     The Blinky Monty Problem
22     Repeated tests: how bad can it be?
21     The Jimmy Nut Company problem
20     Upcoming webcast: Only One Test
19     A hierarchical Bayesian model of pond scum
18     More hypotheses, less trivia
17     There is only one test!
16     Statistics Workshop
15     Think Stats will be published by O'Reilly in June
14     Two Hour Marathon in 2045
13     Bayesianness is next to Godliness
12     Survival analysis
11     Freshman hordes more godless than ever!
10     Predicting marathon times
9      BQ is unfair to women
8      Moving the goalposts
7      The BQ Effect
6      Are first babies more likely to be late?
5      Yet another reason SAT scores are non-predictive
4      Are you popular? Hint: no.
3      Obesity epidemic cured!
2      Observer effect in relay races
1      Proofiness and elections
Name: title, dtype: object
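The countdown index assigned with `range(115, 0, -1)` numbers the posts in publication order, assuming the rows arrive newest first (which the dates suggest). A small sketch of the same idea on three titles from the top of the listing; the short list and the variable names are only for illustration.

```python
# Three titles in the order Blogger lists them, newest first.
titles_newest_first = [
    'When will I win the Great Bear Run?',
    'Bayes meets Fourier',
    'First babies are more likely to be late',
]

n = len(titles_newest_first)
# range(n, 0, -1) counts down n, n-1, ..., 1, so the newest row gets the
# largest label and the oldest row gets label 1.
labels = list(range(n, 0, -1))
numbered = dict(zip(labels, titles_newest_first))
print(numbered)
```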
dates = table.date.sort_values()
diffs = dates.diff()
diffs.head()
1       NaT
2    6 days
3    7 days
4    7 days
5    9 days
Name: date, dtype: timedelta64[ns]
diffs.dropna().describe()
count                         114
mean      15 days 09:41:03.157894
std       20 days 04:36:55.930513
min               1 days 00:00:00
25%               5 days 00:00:00
50%              10 days 00:00:00
75%              17 days 18:00:00
max             180 days 00:00:00
Name: date, dtype: object
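To make explicit what `sort_values()` followed by `diff()` computes, here is a stdlib-only sketch of the same gap calculation. It uses the five most recent publication dates shown in the date column earlier, so its numbers describe that small sample only, not the full table.

```python
from datetime import date

# The five dates printed by table.date.head(), newest first.
dates = [date(2015, 11, 1), date(2015, 10, 26), date(2015, 10, 23),
         date(2015, 9, 23), date(2015, 9, 1)]

ordered = sorted(dates)
# Days between each post and the one before it. The earliest post has no
# predecessor, which is why pandas reports NaT for the first difference.
gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
mean_gap = sum(gaps) / len(gaps)
print(gaps, mean_gap)
```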
table.sort_values(by=['views'], ascending=False)[['title', 'views', 'date']].head(20)
| | title | views | date |
---|---|---|---|
6 | Are first babies more likely to be late? | 130722 | 2011-02-07 |
25 | All your Bayes are belong to us! | 48836 | 2011-10-27 |
24 | My favorite Bayes's Theorem problems | 34384 | 2011-10-20 |
103 | Bayesian survival analysis for "Game of Thrones" | 32406 | 2015-03-25 |
110 | The Inspection Paradox is Everywhere | 30484 | 2015-08-18 |
41 | Bayesian statistics made simple | 23892 | 2012-03-14 |
5 | Yet another reason SAT scores are non-predictive | 17876 | 2011-02-02 |
72 | Are your data normal? Hint: no. | 16152 | 2013-08-07 |
36 | Freshman hordes even more godless! | 10826 | 2012-01-29 |
34 | Think Complexity | 10670 | 2012-01-23 |
54 | Secularization in America: part six | 9773 | 2012-07-10 |
28 | Girl Named Florida solutions | 9454 | 2011-11-10 |
55 | Secularization in America: part seven | 7705 | 2012-07-11 |
100 | Bayesian analysis of match rates on Tinder | 7602 | 2015-02-10 |
90 | Regression with Python, pandas and StatsModels | 6456 | 2014-09-14 |
57 | Are first babies more likely to be late, revis... | 5776 | 2013-01-08 |
78 | Correlation is evidence of causation | 4911 | 2014-02-20 |
102 | Statistical inference is only mostly wrong | 4666 | 2015-03-02 |
17 | There is only one test! | 4246 | 2011-05-31 |
65 | The Price is Right Problem | 4062 | 2013-04-22 |
table.sort_values(by=['views'], ascending=True)[['title', 'views', 'date']].head(20)
| | title | views | date |
---|---|---|---|
16 | Statistics Workshop | 203 | 2011-05-17 |
3 | Obesity epidemic cured! | 289 | 2011-01-17 |
1 | Proofiness and elections | 396 | 2011-01-04 |
45 | Fog warning system: part two | 504 | 2012-04-20 |
8 | Moving the goalposts | 513 | 2011-02-24 |
108 | Will Millennials Ever Get Married? | 589 | 2015-07-13 |
96 | The Rock Hyrax Problem | 648 | 2014-12-04 |
62 | Belly Button Biodiversity: Part Four | 675 | 2013-03-22 |
46 | Fog warning system: part three | 704 | 2012-04-25 |
115 | When will I win the Great Bear Run? | 723 | 2015-10-26 |
2 | Observer effect in relay races | 725 | 2011-01-10 |
60 | Belly Button Biodiversity: Part Two | 783 | 2013-02-08 |
9 | BQ is unfair to women | 815 | 2011-03-02 |
93 | On efficient algorithms for finding the goddam... | 819 | 2014-10-04 |
70 | Belly Button Biodiversity: The End Game | 839 | 2013-05-30 |
20 | Upcoming webcast: Only One Test | 885 | 2011-08-16 |
50 | Secularization in America: part three | 927 | 2012-06-22 |
61 | Belly Button Biodiversity: Part Three | 932 | 2013-02-18 |
113 | First babies are more likely to be late | 944 | 2015-09-23 |
32 | Frank is a scoundrel, probably | 947 | 2012-01-05 |
import thinkstats2
import thinkplot
cdf = thinkstats2.Cdf(table.views)
thinkplot.PrePlot(1)
thinkplot.Cdf(cdf, complement=True)
thinkplot.Config(xlabel='Number of page views', xscale='log',
                 ylabel='CCDF', yscale='log',
                 legend=False)
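The plot above shows the complementary CDF (CCDF) of page views on log-log axes. As I understand it, `complement=True` plots one minus the CDF, that is, the fraction of posts with more than a given number of views. A minimal plain-Python sketch of that quantity, using five view counts copied from the views column; the `ccdf` helper is hypothetical and not part of `thinkstats2`.

```python
def ccdf(values):
    """Return (x, fraction of values strictly greater than x) for each distinct x."""
    n = len(values)
    return [(x, sum(v > x for v in values) / n) for x in sorted(set(values))]

# View counts for five recent posts, copied from the views column.
sample_views = [723, 2363, 944, 3110, 2514]
points = ccdf(sample_views)
for x, p in points:
    print(x, p)
```

The largest value always maps to 0, which cannot be drawn on a log y-axis. A CCDF that looks roughly straight on log-log axes is a common sign of a heavy-tailed distribution.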
table.sort_values(by=['comments'], ascending=False)[['title', 'comments', 'date']].head(5)
| | title | comments | date |
---|---|---|---|
25 | All your Bayes are belong to us! | 56 | 2011-10-27 |
106 | The Sleeping Beauty Problem | 53 | 2015-06-12 |
28 | Girl Named Florida solutions | 25 | 2011-11-10 |
110 | The Inspection Paradox is Everywhere | 23 | 2015-08-18 |
54 | Secularization in America: part six | 14 | 2012-07-10 |
table.sort_values(by=['plusses'], ascending=False)[['title', 'plusses', 'date']].head(5)
| | title | plusses | date |
---|---|---|---|
110 | The Inspection Paradox is Everywhere | 909 | 2015-08-18 |
25 | All your Bayes are belong to us! | 59 | 2011-10-27 |
103 | Bayesian survival analysis for "Game of Thrones" | 54 | 2015-03-25 |
67 | Software engineering practices for graduate st... | 34 | 2013-05-06 |
102 | Statistical inference is only mostly wrong | 31 | 2015-03-02 |