Data Analysis with Pandas¶

Data can often be stored in a multiple of file formats:

Excel spreadsheets (.xlsx);
Comma seperated files (.csv);
Json (.json);
...

Simlarly you might want to store data in any of the above data formats. This is where the Pandas library can be useful:

"... easy-to-use data structures and data analysis tools for the Python programming language."

In this section we will see how to:

Read in data files;
Query those data files;
Write to data files.

Reading in data files¶

Consider the file goldbach.xlsx which contains rows of data confirming the Goldbach conjecture:

Every even integer greater than 2 can be expressed as the sum of two primes.

The data is made up of 3 columns: $N$, $a$ and $b$ and there is a row for every possible expression of $N = a + b$ for $a \leq b$ prime. Note that this data was made using [Appendix A](A1 - Appendix creating Goldbach data.ipynb).

Let us start by importing Pandas and reading in the data file. To do this you will need to know the path to the file on your computer:

In [1]:

import pandas as pd

In [2]:

df = pd.read_excel("data/goldbach.xlsx")

This reads in the excel file as a Pandas data frame. Let us take a look at the first few rows of the data frame (the head):

In [3]:

df.head()

Out[3]:

	N	a	b
0	4	2	2
1	6	3	3
2	8	3	5
3	10	3	7
4	10	5	5

Let us look at the last few rows (the tail):

In [4]:

df.tail()

Out[4]:

	N	a	b
2591	500	151	349
2592	500	163	337
2593	500	193	307
2594	500	223	277
2595	500	229	271

Querying our data¶

A quick way of getting a summary of the data we have is with .describe():

In [5]:

df.describe()

Out[5]:

	N	a	b
count	2596.000000	2596.000000	2596.000000
mean	311.466872	72.297766	239.169106
std	127.817425	58.087254	111.180401
min	4.000000	2.000000	2.000000
25%	214.000000	23.000000	157.000000
50%	329.000000	59.000000	239.000000
75%	420.000000	107.000000	317.000000
max	500.000000	241.000000	491.000000

Although in this particular data set that is not terribly useful, that's more useful in the case of statistical analysis.

Let us take a closer look at a specific number ($N=322$) and the ways it can be written as the sum of two primes.

In [6]:

df[df['N'] == 322]

Out[6]:

	N	a	b
1250	322	5	317
1251	322	11	311
1252	322	29	293
1253	322	41	281
1254	322	53	269
1255	322	59	263
1256	322	71	251
1257	322	83	239
1258	322	89	233
1259	322	131	191
1260	322	149	173

To briefly explain what is happening there: df['N'] == 322 is returning a series of True and False, identify the positions in our data where $N=322$. We directly pass that series to the data frame to get out those numbers df[df['N'] == 322].

Let us find out how many decompositions exist for each number in our data frame. We will do this using the value_counts() statement:

In [7]:

s = df['N'].value_counts()
s

Out[7]:

420    30
480    29
462    28
450    27
390    27
456    24
468    24
330    24
486    23
474    23
498    23
360    22
492    22
378    22
414    21
444    21
300    21
438    21
426    21
396    21
324    20
408    20
354    20
490    19
336    19
294    19
210    19
270    19
432    19
384    19
       ..
58      4
80      4
88      4
92      4
152     4
42      4
26      3
30      3
24      3
40      3
22      3
98      3
52      3
56      3
62      3
128     3
44      3
38      2
18      2
10      2
14      2
16      2
20      2
68      2
28      2
32      2
12      1
8       1
6       1
4       1
Name: N, Length: 249, dtype: int64

The above is a series (ordered by count), let us rename the count variable and create a new dataframe:

In [8]:

df = pd.DataFrame(s.rename('counts'))
df.head()

Out[8]:

	counts
420	30
480	29
462	28
450	27
390	27

We have created a dataframe with an index given by N. Let us create a new variable which is the normalised count:

In [9]:

df["normalised_counts"] = df['counts'] / df.index

In [10]:

df.head()

Out[10]:

	counts	normalised_counts
420	30	0.071429
480	29	0.060417
462	28	0.060606
450	27	0.060000
390	27	0.069231

We could now look a bit closer at this using describe:

In [11]:

df.describe()

Out[11]:

	counts	normalised_counts
count	249.000000	249.000000
mean	10.425703	0.051769
std	5.897948	0.029921
min	1.000000	0.017588
25%	6.000000	0.030973
50%	9.000000	0.044776
75%	13.000000	0.062500
max	30.000000	0.250000

We can also directly plot our data, but just as when we used Sympy we need to tell Jupyter to display the plots in the notebook:

In [12]:

%matplotlib inline

Here is a quick histogram of the counts:

In [13]:

df['counts'].hist();

The normalised counts:

In [14]:

df['normalised_counts'].hist();

Writing our data to files¶

Finally let us write this data to a 'comma seperated value' (.csv) in case we wanted to look at it later:

In [15]:

df.to_csv("data/goldbach_counts.csv")

Summary¶

This section briefly showed us how to use pandas to read, manipulate and write data. We also took a brief look at plotting with pandas but this should only be used in an exploratary way. For higher quality plots matplotlib is recommended.