First, we import pandas as pd for short. This allows us to use pandas under the abbreviation pd, which saves us some keystrokes.

import pandas as pd
One basic data type in pandas is the Series, i.e. a column. It is a list of observations (= entries) for one variable (= characteristic).
series = pd.Series(["A", "B", "C", "D"])
series
0    A
1    B
2    C
3    D
dtype: object
We can give the Series a name.
series.name = "letter"
series
0    A
1    B
2    C
3    D
Name: letter, dtype: object
A Series has an index (in many cases just a list of numbers in ascending order) on the left side, which we can access as well.
index = series.index
index
RangeIndex(start=0, stop=4, step=1)
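The index does not have to be numeric. As a sketch, we can supply a custom index when creating a Series (the labels here are made up for illustration):

```python
import pandas as pd

# a Series with an explicit, non-numeric index
grades = pd.Series([1, 2, 3], index=["Alice", "Bob", "Carol"])

# the index now consists of the given labels ...
print(grades.index.tolist())  # ['Alice', 'Bob', 'Carol']

# ... and values can be looked up by label
print(grades["Bob"])  # 2
```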
A DataFrame is a collection of Series. We can create a DataFrame from our Series.
dataframe = pd.DataFrame(series)
dataframe
| | letter |
|---|---|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
We can add an additional Series.
dataframe['a'] = pd.Series([1,2,3])
dataframe['b'] = pd.Series([4,4,4])
dataframe
| | letter | a | b |
|---|---|---|---|
| 0 | A | 1.0 | 4.0 |
| 1 | B | 2.0 | 4.0 |
| 2 | C | 3.0 | 4.0 |
| 3 | D | NaN | NaN |
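The NaN values in the last row appear because the assigned Series are shorter than the DataFrame: pandas aligns them by index and marks the missing position as NaN. A small sketch of this behavior:

```python
import pandas as pd

df = pd.DataFrame({"letter": ["A", "B", "C", "D"]})

# a shorter Series is aligned by its index;
# the missing position (row 3) becomes NaN
df["a"] = pd.Series([1, 2, 3])

print(df["a"].isna().sum())  # 1
```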
Accessing a Series.
dataframe['letter']
0    A
1    B
2    C
3    D
Name: letter, dtype: object
Accessing multiple Series.
dataframe[['a', 'b']]
| | a | b |
|---|---|---|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 4.0 |
| 2 | 3.0 | 4.0 |
| 3 | NaN | NaN |
Accessing a row by an integer-based index.
dataframe.iloc[0]
letter    A
a         1
b         4
Name: 0, dtype: object
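The difference between iloc (integer position) and loc (index label) only becomes visible with a non-default index; a small sketch with made-up labels:

```python
import pandas as pd

s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

# iloc selects by integer position, loc by index label
print(s.iloc[0])  # 'x', the first row
print(s.loc[10])  # 'x', the row labeled 10
print(s.loc[30])  # 'z'
```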
Accessing multiple rows by their index labels.
dataframe.loc[[0,2]]
| | letter | a | b |
|---|---|---|---|
| 0 | A | 1.0 | 4.0 |
| 2 | C | 3.0 | 4.0 |
Displaying only the first (two) rows of a DataFrame.
dataframe.head(2)
| | letter | a | b |
|---|---|---|---|
| 0 | A | 1.0 | 4.0 |
| 1 | B | 2.0 | 4.0 |
We can use basic Python-like operations on the Series in a DataFrame.
dataframe['a'] * dataframe['b']
0     4.0
1     8.0
2    12.0
3     NaN
dtype: float64
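The same element-wise behavior applies to other operators and to aggregations as well. A sketch (the Series here mirror the columns above):

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([4, 4, 4])

# operators work element-wise on whole Series
product = a * b
difference = b - a

print(product.tolist())     # [4, 8, 12]
print(difference.tolist())  # [3, 2, 1]
print(product.sum())        # 24
```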
If we convert between data types, we can also combine Series of different types.
dataframe['letter'] + dataframe['a'].astype(str)
0    A1.0
1    B2.0
2    C3.0
3    Dnan
dtype: object
On string values, we can apply various methods.
# this code is just for demonstration purposes and not needed in an analysis
[x for x in dir(dataframe['letter'].str) if not x.startswith("_")]
['capitalize', 'casefold', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'wrap', 'zfill']
E.g., we could convert the values in the letter Series to lower case with lower().
dataframe['letter'].str.lower()
0    a
1    b
2    c
3    d
Name: letter, dtype: object
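A few more of these string methods in action, sketched on a made-up Series of names:

```python
import pandas as pd

names = pd.Series(["Kevin", "Phillip", "Mike"])

# str methods are applied element-wise to every entry
upper = names.str.upper()
starts_with_m = names.str.startswith("M")
lengths = names.str.len()

print(upper.tolist())          # ['KEVIN', 'PHILLIP', 'MIKE']
print(starts_with_m.tolist())  # [False, False, True]
print(lengths.tolist())        # [5, 7, 4]
```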
data = pd.DataFrame(
    ["1", 10, None, "3.0", "hi"],
    columns=["num"])
data
| | num |
|---|---|
| 0 | 1 |
| 1 | 10 |
| 2 | None |
| 3 | 3.0 |
| 4 | hi |
Converting data to numeric while coercing invalid entries to NaN.
number_data = pd.to_numeric(data['num'], errors="coerce")
number_data
0     1.0
1    10.0
2     NaN
3     3.0
4     NaN
Name: num, dtype: float64
With dropna(), we remove any entries that have NaN values.
numbers = number_data.dropna()
numbers
0     1.0
1    10.0
3     3.0
Name: num, dtype: float64
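Instead of removing missing entries, we could also replace them with a default value using fillna(); a sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 10.0, np.nan, 3.0, np.nan])

# fillna replaces NaN values instead of dropping the rows
filled = s.fillna(0)

print(filled.tolist())  # [1.0, 10.0, 0.0, 3.0, 0.0]
```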
We can create a boolean condition that is True for every entry that fulfills it.
is_one_digit = (numbers < 10) & (numbers >= 0)
is_one_digit
0     True
1    False
3     True
Name: num, dtype: bool
We can then take this result as input for selecting data that fulfills that condition.
one_digit_numbers = numbers[is_one_digit]
one_digit_numbers
0    1.0
3    3.0
Name: num, dtype: float64
We can also negate the condition with the ~ operator.
other_numbers = numbers[~is_one_digit]
other_numbers
1    10.0
Name: num, dtype: float64
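Conditions can also be combined: & means "and", | means "or", and each part needs its own parentheses. A sketch with the numbers from above:

```python
import pandas as pd

numbers = pd.Series([1.0, 10.0, 3.0], index=[0, 1, 3])

# combine two conditions with | (or); each condition needs parentheses
small_or_large = (numbers < 2) | (numbers >= 10)

print(numbers[small_or_large].tolist())   # [1.0, 10.0]
print(numbers[~small_or_large].tolist())  # [3.0]
```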
We can transform data.
teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"]
})
teams
| | team | name |
|---|---|---|
| 0 | A | Kevin |
| 1 | A | Phillip |
| 2 | B | Mike |
stack() puts the data from the various columns on top of each other.
teams.stack()
0  team          A
   name      Kevin
1  team          A
   name    Phillip
2  team          B
   name       Mike
dtype: object
unstack() "pulls apart" the DataFrame.
teams.unstack()
team  0          A
      1          A
      2          B
name  0      Kevin
      1    Phillip
      2       Mike
dtype: object
We can unstack multiple times until everything is widened.
teams.unstack().unstack()
| | 0 | 1 | 2 |
|---|---|---|---|
| team | A | A | B |
| name | Kevin | Phillip | Mike |
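Since unstack() reverses stack(), applying both once gets us back to the original shape; a sketch:

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"]
})

# stack() narrows the frame into a Series,
# unstack() widens it back into a DataFrame
restored = teams.stack().unstack()

print(restored.shape)           # (3, 2)
print(restored.loc[0, "name"])  # Kevin
```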
Let's say we have more data in our DataFrame.
teams['working_hours'] = [30,20,40]
teams.head()
| | team | name | working_hours |
|---|---|---|---|
| 0 | A | Kevin | 30 |
| 1 | A | Phillip | 20 |
| 2 | B | Mike | 40 |
With pivot_table, we can transform the data to get another view on it.
hours_per_team = teams.pivot_table("working_hours", "name", "team", fill_value=0)
hours_per_team
| name \ team | A | B |
|---|---|---|
| Kevin | 30 | 0 |
| Mike | 0 | 40 |
| Phillip | 20 | 0 |
This is helpful if we want to calculate results for specific characteristics of our data.
hours_per_team.sum()
team
A    50
B    40
dtype: int64
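pivot_table can also aggregate while reshaping, via the aggfunc parameter. A sketch that computes the same per-team totals in one step:

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"],
    "working_hours": [30, 20, 40]
})

# aggregate while pivoting: total hours per team
totals = teams.pivot_table("working_hours", index="team", aggfunc="sum")

print(totals.loc["A", "working_hours"])  # 50
print(totals.loc["B", "working_hours"])  # 40
```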
Pandas can read various data sources. A common format is CSV (comma-separated values). The read_csv function can read this data.
changes = pd.read_csv("datasets/change_history.csv")
changes.head()
| | timestamp |
|---|---|
| 0 | 2017-01-01 11:11:39 |
| 1 | 2017-01-01 13:18:26 |
| 2 | 2017-01-01 16:01:37 |
| 3 | 2017-01-01 19:02:45 |
| 4 | 2017-01-01 20:47:01 |
Pandas can save DataFrames and Series in various ways, too.
changes.to_csv("/tmp/mychanges.csv")
pd.read_csv("/tmp/mychanges.csv").head()
| | Unnamed: 0 | timestamp |
|---|---|---|
| 0 | 0 | 2017-01-01 11:11:39 |
| 1 | 1 | 2017-01-01 13:18:26 |
| 2 | 2 | 2017-01-01 16:01:37 |
| 3 | 3 | 2017-01-01 19:02:45 |
| 4 | 4 | 2017-01-01 20:47:01 |
Setting the index parameter to None avoids exporting the index.
changes.to_csv("/tmp/mychanges.csv", index=None)
pd.read_csv("/tmp/mychanges.csv").head()
| | timestamp |
|---|---|
| 0 | 2017-01-01 11:11:39 |
| 1 | 2017-01-01 13:18:26 |
| 2 | 2017-01-01 16:01:37 |
| 3 | 2017-01-01 19:02:45 |
| 4 | 2017-01-01 20:47:01 |
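read_csv has many helpful parameters; e.g. parse_dates converts a column like timestamp into real datetime values on read. A sketch using an in-memory CSV (io.StringIO stands in for a file on disk):

```python
import io

import pandas as pd

# an in-memory stand-in for a CSV file on disk
csv_file = io.StringIO(
    "timestamp\n"
    "2017-01-01 11:11:39\n"
    "2017-01-01 13:18:26\n")

# parse_dates turns the column into datetime values instead of strings
changes = pd.read_csv(csv_file, parse_dates=["timestamp"])

print(changes["timestamp"].dt.hour.tolist())  # [11, 13]
```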
Datasets can also be joined. Let's take a look at this, too.
commits = pd.DataFrame({
    "commit_id": ["twq3", "23ae", "aead", "hqd2", "fg3d"],
    "author": ["Kevin", "Phillip", "Mike", "Kevin", "Mike"]})
commits
| | commit_id | author |
|---|---|---|
| 0 | twq3 | Kevin |
| 1 | 23ae | Phillip |
| 2 | aead | Mike |
| 3 | hqd2 | Kevin |
| 4 | fg3d | Mike |
For the data that we want to join, we set the column that matches the information in the other DataFrame as the index.
name_teams = teams.set_index("name")
name_teams
| name | team | working_hours |
|---|---|---|
| Kevin | A | 30 |
| Phillip | A | 20 |
| Mike | B | 40 |
We can then join the former dataset with the other one that has the same values in a specific column (here: author).
teams_commits = commits.join(name_teams, on="author")
teams_commits
| | commit_id | author | team | working_hours |
|---|---|---|---|---|
| 0 | twq3 | Kevin | A | 30 |
| 1 | 23ae | Phillip | A | 20 |
| 2 | aead | Mike | B | 40 |
| 3 | hqd2 | Kevin | A | 30 |
| 4 | fg3d | Mike | B | 40 |
I hope you find this useful! Have fun with Pandas and take a look at the other Cheatbooks as well!