First, we import pandas as pd for short. This allows us to use pandas under the abbreviation pd, which saves us some keystrokes.

import pandas as pd
One basic data type in pandas is the Series, i.e. a column. It is a list of observations (= entries) for one variable (= characteristic).
series = pd.Series(["A", "B", "C", "D"])
series
0    A
1    B
2    C
3    D
dtype: object
We can give the Series a name.
series.name = "letter"
series
0    A
1    B
2    C
3    D
Name: letter, dtype: object
A Series has an index (in many cases just a list of numbers in ascending order) on the left side, which we can access as well.
index = series.index
index
RangeIndex(start=0, stop=4, step=1)
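The index does not have to be numeric. As a sketch, we can supply a custom index when creating a Series (the labels here are made up for illustration):

```python
import pandas as pd

# a Series with an explicit, non-numeric index
grades = pd.Series([1, 2, 3], index=["Alice", "Bob", "Carol"])

# the index now consists of the given labels ...
print(grades.index.tolist())  # ['Alice', 'Bob', 'Carol']

# ... and values can be looked up by label
print(grades["Bob"])  # 2
```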
A DataFrame is a collection of Series. We can create a DataFrame from our Series.
dataframe = pd.DataFrame(series)
dataframe
| | letter |
|---|---|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
We can add an additional Series.
dataframe['a'] = pd.Series([1,2,3])
dataframe['b'] = pd.Series([4,4,4])
dataframe
| | letter | a | b |
|---|---|---|---|
| 0 | A | 1.0 | 4.0 |
| 1 | B | 2.0 | 4.0 |
| 2 | C | 3.0 | 4.0 |
| 3 | D | NaN | NaN |
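The NaN values in the last row appear because the assigned Series are shorter than the DataFrame: pandas aligns them by index and marks the missing position as NaN. A small sketch of this behavior:

```python
import pandas as pd

df = pd.DataFrame({"letter": ["A", "B", "C", "D"]})

# a shorter Series is aligned by its index;
# the missing position (row 3) becomes NaN
df["a"] = pd.Series([1, 2, 3])

print(df["a"].isna().sum())  # 1
```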
Accessing a Series.
dataframe['letter']
0    A
1    B
2    C
3    D
Name: letter, dtype: object
Accessing multiple Series.
dataframe[['a', 'b']]
| | a | b |
|---|---|---|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 4.0 |
| 2 | 3.0 | 4.0 |
| 3 | NaN | NaN |
Accessing a row by an integer-based index.
dataframe.iloc[0]
letter    A
a         1
b         4
Name: 0, dtype: object
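The difference between iloc (integer position) and loc (index label) only becomes visible with a non-default index; a small sketch with made-up labels:

```python
import pandas as pd

s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

# iloc selects by integer position, loc by index label
print(s.iloc[0])  # 'x', the first row
print(s.loc[10])  # 'x', the row labeled 10
print(s.loc[30])  # 'z'
```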
Accessing multiple rows by their index labels.
dataframe.loc[[0,2]]
| | letter | a | b |
|---|---|---|---|
| 0 | A | 1.0 | 4.0 |
| 2 | C | 3.0 | 4.0 |
Displaying only the first (two) rows of a DataFrame.
dataframe.head(2)
| | letter | a | b |
|---|---|---|---|
| 0 | A | 1.0 | 4.0 |
| 1 | B | 2.0 | 4.0 |
We can use basic Python-like operations on the Series in a DataFrame.
dataframe['a'] * dataframe['b']
0     4.0
1     8.0
2    12.0
3     NaN
dtype: float64
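The same element-wise behavior applies to other operators and to aggregations as well. A sketch (the Series here mirror the columns above):

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([4, 4, 4])

# operators work element-wise on whole Series
product = a * b
difference = b - a

print(product.tolist())     # [4, 8, 12]
print(difference.tolist())  # [3, 2, 1]
print(product.sum())        # 24
```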
If we convert between data types, we can also combine Series of different types.
dataframe['letter'] + dataframe['a'].astype(str)
0    A1.0
1    B2.0
2    C3.0
3    Dnan
dtype: object
On string values, we can apply various methods.
# this code is just for demonstration purposes and not needed in an analysis
[x for x in dir(dataframe['letter'].str) if not x.startswith("_")]
['capitalize', 'casefold', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'wrap', 'zfill']
E.g., we could convert the values in the letter Series to lower case with lower().
dataframe['letter'].str.lower()
0    a
1    b
2    c
3    d
Name: letter, dtype: object
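A few more of these string methods in action, sketched on a made-up Series of names:

```python
import pandas as pd

names = pd.Series(["Kevin", "Phillip", "Mike"])

# str methods are applied element-wise to every entry
upper = names.str.upper()
starts_with_m = names.str.startswith("M")
lengths = names.str.len()

print(upper.tolist())          # ['KEVIN', 'PHILLIP', 'MIKE']
print(starts_with_m.tolist())  # [False, False, True]
print(lengths.tolist())        # [5, 7, 4]
```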
data = pd.DataFrame(
    ["1", 10, None, "3.0", "hi"],
    columns=["num"])
data
| | num |
|---|---|
| 0 | 1 |
| 1 | 10 |
| 2 | None |
| 3 | 3.0 |
| 4 | hi |
Converting data to numeric while coercing invalid entries to NaN.
number_data = pd.to_numeric(data['num'], errors="coerce")
number_data
0     1.0
1    10.0
2     NaN
3     3.0
4     NaN
Name: num, dtype: float64
With dropna(), we remove any entries that have NaN values.
numbers = number_data.dropna()
numbers
0     1.0
1    10.0
3     3.0
Name: num, dtype: float64
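Instead of removing missing entries, we could also replace them with a default value using fillna(); a sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 10.0, np.nan, 3.0, np.nan])

# fillna replaces NaN values instead of dropping the rows
filled = s.fillna(0)

print(filled.tolist())  # [1.0, 10.0, 0.0, 3.0, 0.0]
```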
We can create a boolean condition that is True for every entry that fulfills it.
is_one_digit = (numbers < 10) & (numbers >= 0)
is_one_digit
0     True
1    False
3     True
Name: num, dtype: bool
We can then take this result as input for selecting data that fulfills that condition.
one_digit_numbers = numbers[is_one_digit]
one_digit_numbers
0    1.0
3    3.0
Name: num, dtype: float64
We can also negate the condition with the ~ operator.
other_numbers = numbers[~is_one_digit]
other_numbers
1    10.0
Name: num, dtype: float64
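Conditions can also be combined: & means "and", | means "or", and each part needs its own parentheses. A sketch with the numbers from above:

```python
import pandas as pd

numbers = pd.Series([1.0, 10.0, 3.0], index=[0, 1, 3])

# combine two conditions with | (or); each condition needs parentheses
small_or_large = (numbers < 2) | (numbers >= 10)

print(numbers[small_or_large].tolist())   # [1.0, 10.0]
print(numbers[~small_or_large].tolist())  # [3.0]
```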
We can transform data.
teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"]
})
teams
| | team | name |
|---|---|---|
| 0 | A | Kevin |
| 1 | A | Phillip |
| 2 | B | Mike |
stack() puts the data from the various columns on top of each other.
teams.stack()
0  team          A
   name      Kevin
1  team          A
   name    Phillip
2  team          B
   name       Mike
dtype: object
unstack() "pulls apart" the DataFrame.
teams.unstack()
team  0          A
      1          A
      2          B
name  0      Kevin
      1    Phillip
      2       Mike
dtype: object
We can unstack multiple times until everything is widened.
teams.unstack().unstack()
| | 0 | 1 | 2 |
|---|---|---|---|
| team | A | A | B |
| name | Kevin | Phillip | Mike |
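Since unstack() reverses stack(), applying both once gets us back to the original shape; a sketch:

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"]
})

# stack() narrows the frame into a Series,
# unstack() widens it back into a DataFrame
restored = teams.stack().unstack()

print(restored.shape)           # (3, 2)
print(restored.loc[0, "name"])  # Kevin
```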
Let's say we have more data in our DataFrame.
teams['working_hours'] = [30,20,40]
teams.head()
| | team | name | working_hours |
|---|---|---|---|
| 0 | A | Kevin | 30 |
| 1 | A | Phillip | 20 |
| 2 | B | Mike | 40 |
With pivot_table, we can transform the data to get another view on it.
hours_per_team = teams.pivot_table("working_hours", "name", "team", fill_value=0)
hours_per_team
| name \ team | A | B |
|---|---|---|
| Kevin | 30 | 0 |
| Mike | 0 | 40 |
| Phillip | 20 | 0 |
This is helpful if we want to calculate results for specific characteristics of our data.
hours_per_team.sum()
team
A    50
B    40
dtype: int64
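pivot_table can also aggregate while reshaping, via the aggfunc parameter. A sketch that computes the same per-team totals in one step:

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"],
    "working_hours": [30, 20, 40]
})

# aggregate while pivoting: total hours per team
totals = teams.pivot_table("working_hours", index="team", aggfunc="sum")

print(totals.loc["A", "working_hours"])  # 50
print(totals.loc["B", "working_hours"])  # 40
```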
Pandas can read various data sources. A common format is CSV (comma-separated values). The read_csv function can read this data.
changes = pd.read_csv("datasets/change_history.csv")
changes.head()
| | timestamp |
|---|---|
| 0 | 2017-01-01 11:11:39 |
| 1 | 2017-01-01 13:18:26 |
| 2 | 2017-01-01 16:01:37 |
| 3 | 2017-01-01 19:02:45 |
| 4 | 2017-01-01 20:47:01 |
Pandas can save DataFrames and Series in various ways, too.
changes.to_csv("/tmp/mychanges.csv")
pd.read_csv("/tmp/mychanges.csv").head()
| | Unnamed: 0 | timestamp |
|---|---|---|
| 0 | 0 | 2017-01-01 11:11:39 |
| 1 | 1 | 2017-01-01 13:18:26 |
| 2 | 2 | 2017-01-01 16:01:37 |
| 3 | 3 | 2017-01-01 19:02:45 |
| 4 | 4 | 2017-01-01 20:47:01 |
Setting the index parameter to None avoids exporting the index.
changes.to_csv("/tmp/mychanges.csv", index=None)
pd.read_csv("/tmp/mychanges.csv").head()
| | timestamp |
|---|---|
| 0 | 2017-01-01 11:11:39 |
| 1 | 2017-01-01 13:18:26 |
| 2 | 2017-01-01 16:01:37 |
| 3 | 2017-01-01 19:02:45 |
| 4 | 2017-01-01 20:47:01 |
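read_csv has many helpful parameters; e.g. parse_dates converts a column like timestamp into real datetime values on read. A sketch using an in-memory CSV (io.StringIO stands in for a file on disk):

```python
import io

import pandas as pd

# an in-memory stand-in for a CSV file on disk
csv_file = io.StringIO(
    "timestamp\n"
    "2017-01-01 11:11:39\n"
    "2017-01-01 13:18:26\n")

# parse_dates turns the column into datetime values instead of strings
changes = pd.read_csv(csv_file, parse_dates=["timestamp"])

print(changes["timestamp"].dt.hour.tolist())  # [11, 13]
```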
Datasets can also be joined. Let's take a look at this, too.
commits = pd.DataFrame({
    "commit_id": ["twq3", "23ae", "aead", "hqd2", "fg3d"],
    "author": ["Kevin", "Phillip", "Mike", "Kevin", "Mike"]})
commits
| | commit_id | author |
|---|---|---|
| 0 | twq3 | Kevin |
| 1 | 23ae | Phillip |
| 2 | aead | Mike |
| 3 | hqd2 | Kevin |
| 4 | fg3d | Mike |
For the data that we want to join, we set the column that matches the information in the other DataFrame as the index.
name_teams = teams.set_index("name")
name_teams
| name | team | working_hours |
|---|---|---|
| Kevin | A | 30 |
| Phillip | A | 20 |
| Mike | B | 40 |
We can then join the former dataset with the other one that has the same values in a specific column (here: author).
teams_commits = commits.join(name_teams, on="author")
teams_commits
| | commit_id | author | team | working_hours |
|---|---|---|---|---|
| 0 | twq3 | Kevin | A | 30 |
| 1 | 23ae | Phillip | A | 20 |
| 2 | aead | Mike | B | 40 |
| 3 | hqd2 | Kevin | A | 30 |
| 4 | fg3d | Mike | B | 40 |
I hope you find this useful! Have fun with Pandas and take a look at the other Cheatbooks as well!