SoNAR (IDH) - HNA Curriculum

Notebook 1: Jupyter and Python

This curriculum serves as a guide and introductory resource to using the SoNAR (IDH) network database with the help of Python and Jupyter notebooks.

It uses Jupyter Notebooks to document and explain how to do historical network analysis with the SoNAR (IDH) data. This first notebook provides a high-level overview of the basic functionality of Jupyter notebooks as well as a quick introduction to basic Python. All following notebooks build upon the fundamentals explained in this notebook.

If you are already familiar with Jupyter Notebooks and Python you can directly jump to Notebook 2 - Historical Network Analysis.

Jupyter Notebooks¶

Jupyter Notebooks are essentially a kind of code editor you can use inside your web browser. This means you can write, edit and run code from your browser; the code is interpreted immediately and the results are visible directly under the executed code.

What makes Jupyter notebooks so versatile is that you can mix code blocks and text blocks within the same document and thus create interactive, transparent and reproducible documentation or guides.

Project Jupyter¶

Project Jupyter is the name of the non-profit and open-source project that develops the Jupyter notebook technology this curriculum is based on. Project Jupyter is maintained and developed by a big community with a focus on making scientific computing accessible, easy and free. The name Jupyter is derived from the names of the three programming languages Julia, Python and R. The main objective of Jupyter is supporting interactive data science and scientific computing across all programming languages. All notebooks in this curriculum use the Python programming language.

Operating Jupyter¶

This curriculum consists of five Jupyter Notebooks. If you are new to Jupyter Notebooks, you can take the "User Interface Tour" in the Help menu at the top of the screen. The screenshot below shows where to find the interactive tour:

Notebooks

Jupyter notebooks are documents that combine live runnable code with narrative text (Markdown), equations, images, interactive visualizations and other rich output.

The Toolbar

When you open up a notebook you can see the toolbar at the very top of the notebook. This is what the toolbar can do for you:

Button	Function
	Save the current state of the notebook
	Insert a new cell below the current selection
	Cut the selected cell
	Copy the selected cell
	Paste the copied or cut cell
	Move cell up or down
	Run the selected cell
	Stop the execution of the selected cell
	Restart the kernel of the notebook
	Restart the kernel and execute all code cells chronologically
	Select the cell type (Code, Markdown, Raw)

The Cells

Jupyter Notebooks consist of cells that are arranged vertically. All the text you have read so far is written inside cells. When you double-click any image or text in this notebook, the respective cell switches into "edit mode" and you can change the contents to your liking. When you hit the Run button in the toolbar after you are done editing, the cell is executed and switches back to what is called the "command mode" (double-click this cell to try it).

Hint: You can only edit cells when you are using the notebook in an interactive environment (either a local setup or on Binder).

Jupyter Notebooks provide two relevant cell types. The first one is Code and the second one is Markdown. Here is how they differ from each other:

*Code Cell*

Code cells let you write and execute programming code. Make sure you select the Code cell type in the cell type drop-down in the toolbar.

When you type code in a code cell and hit the run button in the toolbar, the code is executed and the output appears beneath the code cell:

Executed code is held in the memory of the kernel. That means that your running notebook holds some kind of state, depending on which code blocks you have already executed. This will come in handy when we discuss variables later on.

*Markdown Cell*

Markdown is a markup language you can use to format text. Markdown has a very simple syntax you can use for formatting text, tables or lists. You can also embed images, HTML blocks and other media formats with Markdown.

Hint: Double-click this cell to change it into edit mode and see the underlying markdown syntax.

A quick example of how Markdown works can be found in the images beneath. The left hand side shows the markdown syntax during editing mode of a cell. The right hand side shows the executed Markdown cell after you hit the run button.

raw markdown	rendered markdown

A good overview of how Markdown in Jupyter notebooks works can be found here.

📝 Exercise¶

Now, try to reproduce the example from above on your own following the steps below:

Insert a new code cell below by selecting Code as the cell type and clicking on the + icon.
Type the code print("Hello world!") inside this cell.
Execute the code cell (either by clicking on the run button in the toolbar or by using the keyboard shortcut Shift + Enter).

Congratulations on your first line of code! 🎉

Python¶

Jupyter Notebooks can be used with a multitude of programming languages. This curriculum uses Python, a language well known both for its friendliness towards beginners and its maturity as a professional tool. The following sections provide a quick introduction to Python.

For a more in depth introduction you can check out these resources:

Structured, interactive beginner's guide to Python: Learn Python
Big selection of free Python tutorials: Real Python Tutorials
Best practice guide for Python: The Hitchhiker’s Guide to Python!

Arithmetic & logical operators¶

Some very basic commands you need to know when coding with Python are arithmetic operators. You can use these operators to execute basic calculations. See the table below for an overview of the basic operators Python provides:

Operator	Description	Example
`+`	addition	1 `+` 1 is 2
`-`	subtraction	1 `-` 1 is 0
`*`	multiplication	1 `*` 1 is 1
`/`	division (always returns a float)	2 `/` 2 is 1.0
`**`	exponentiation	2 `**` 2 is 4
`%`	modulo operator (returns the remainder)	5 `%` 2 is 1
`//`	integer division (drops the remainder)	5 `//` 2 is 2
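
As a small illustration of how integer division and modulo relate to each other, the sketch below splits a number into quotient and remainder and puts both parts back together. The variable names are chosen for this example only.

```python
# Integer division and modulo split a number into quotient and remainder
dividend = 5
divisor = 2

quotient = dividend // divisor    # integer division drops the remainder
remainder = dividend % divisor    # modulo returns the remainder

print(quotient, remainder)        # prints 2 1

# Putting both parts back together gives the original number
print(quotient * divisor + remainder)   # prints 5

# Regular division returns a float, even when the result is a whole number
print(2 / 2)                      # prints 1.0
```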

Additionally, Python provides a set of comparison and logical operators. Comparisons always evaluate to True or False, and the logical operators combine or reverse such results. See the table below:

Operator	Description	Example
`<`	less than	2 `<` 3 is `True`
`<=`	less than or equal to	2 `<=` 3 is `True`
`>`	greater than	2 `>` 3 is `False`
`>=`	greater than or equal to	2 `>=` 3 is `False`
`==`	equal	2 `==` 3 is `False`
`!=`	not equal	2 `!=` 3 is `True`
`not`	not x; reverses result	`not` (2 `!=` 3) is `False`
`or`	x OR y must be true	2 `==` 3 `or` 2 `!=` 3 is `True`
`and`	x AND y must be true	2 `==` 3 `and` 2 `!=` 3 is `False`

Let's use these operators:

Hint: Remember that you execute code by selecting the cell and then either hitting the run button in the notebook toolbar or by using the keyboard shortcut Shift + Enter

Calculate 2*7:

In [25]:

2*7

Out[25]:

14

You can also use variables in Python to hold and manipulate values. Creating a variable is very easy: all you need to do is come up with a name for your variable and assign something to it. You can freely choose the name of the variable, and you can also overwrite the data in a variable. Let's check out what that means:

In the code block below we create two variables. The first one we name variable_1 and we assign the result of the calculation 1 + 1 to it. The second variable we name variable_2 and we store the result of 2*7 in it.

After that we can manipulate the variables further or do some calculations with them. E.g. we can calculate variable_1 - variable_2.

In [22]:

variable_1 = 1+1
variable_2 = 2*7

variable_1 - variable_2

Out[22]:

-12

Hint: Assigning values to a variable is done by using the = sign.
Make sure not to confuse the single = (assignment operator) with the double == (comparison operator).

Let's check whether variable_2 is greater than variable_1.

In [23]:

variable_2 > variable_1

Out[23]:

True

Now let's check whether variable_2 is 14 and variable_1 is 3 (both conditions must be true for the result to be True, otherwise the result is False).

In [24]:

variable_2 == 14 and variable_1 == 3

Out[24]:

False

Hint: Jupyter Notebooks follow a linear execution logic. This means you can use variables that were created in cells you already executed. However, when using variables of cells that you did not execute yet, Python will throw an error.
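
To illustrate the hint above, here is a minimal sketch of what happens when a variable is used before it has been created. The name not_yet_defined is hypothetical, and the try/except construct (not covered in this notebook) is only used here so the example keeps running instead of stopping with the error.

```python
# 'not_yet_defined' is a hypothetical name that was never assigned a value
try:
    print(not_yet_defined)
except NameError as error:
    # Python raises a NameError for unknown variable names
    message = str(error)
    print("Python reports:", message)
```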

Data types¶

In the previous section, numbers were used to calculate things. However, Python can handle a variety of other data types as well. In this section we check out three of the most important categories of data types Python can handle. This list is not exhaustive though. You can find a complete overview of data types Python natively supports here.

We will cover three categories of data types in this section, namely:

Category	Data Types
`Text Type`	`str`
`Numeric Types`	`int`, `float`
`Sequence & Mapping Types`	`list`, `range`, `dict`

Text type¶

The text type is used for character input. Whenever you want to work with character strings (words & text) Python uses the text type for doing so. The differentiation between different types is crucial since there are operations that are meaningful for text but not for numbers (e.g. capitalizing letters, splitting at line breaks).

In the following example we will assign a text to a variable. Afterwards we will "ask" Python about the data type of this variable.

We use the following sentence:

Ada Lovelace was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine.

Text taken from: https://en.wikipedia.org/wiki/Ada_Lovelace

Hint: When you want to assign text to a variable, make sure your text is wrapped inside quotation marks

In [23]:

# We create a variable called 'ada_description'.
# By using the "=" sign, we assign the character string on the right side to the new variable
ada_description = "Ada Lovelace was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine"

The cell above produces no output because it only assigns a value to a new variable. We can print the content of the variable by using the print() function mentioned earlier. We can also just type the variable name into a new cell and Jupyter will show us what's in the variable:

In [8]:

ada_description

Out[8]:

"Ada Lovelace was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine"

Let's check the data type of ada_description by using a function called type()

In [9]:

type(ada_description)

Out[9]:

str

Python returns str as the type of ada_description. This means that there is text (a character string) stored inside the ada_description object.

Hint: Keep in mind that Python will only treat textual inputs as character strings when you put them into quotation marks (e.g. "hello"). If you do not wrap the textual input into quotation marks, Python will interpret the input as a variable name (e.g. hello). This can lead to errors or unwanted results.
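
The text-specific operations mentioned at the beginning of this section (capitalizing letters, splitting at line breaks) can be sketched as follows. The example string is made up for illustration; upper(), title() and split() are built-in string methods.

```python
# A short example string containing a line break (\n)
text = "ada lovelace\nanalytical engine"

print(text.upper())         # all letters in upper case
print(text.title())         # first letter of every word capitalized

lines = text.split("\n")    # split the string at the line break
print(lines)                # prints ['ada lovelace', 'analytical engine']

print(len(text))            # number of characters, including the line break
```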

Numeric types¶

Integers

Integers are whole numbers.

In [76]:

age = -46

print(age)
type(age)

-46

Out[76]:

int

Float

Floating point real values represent real numbers and are written with a decimal point.

In [74]:

temperature_celsius = 12.876

print(temperature_celsius)
type(temperature_celsius)

12.876

Out[74]:

float

Sequence & mapping types¶

Python provides different built-in ways to store multiple values/items inside a single variable. These sequence and mapping types are ubiquitous when working with Python.

Lists¶

Lists can be considered as the most basic form of item collections in Python. Items inside lists can be of different types; the items can be changed and lists can contain duplicate values. Lists are notated with square brackets []. More about lists can be found here.

In [4]:

sequence_1 = [1, "2", 3.5, "6"]

print(sequence_1)
type(sequence_1)

[1, '2', 3.5, '6']

Out[4]:

list
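
The properties described above (mixed types, changeable items, duplicates) can be tried out with a few lines. This is only a sketch; the variable name sequence is chosen for this example.

```python
sequence = [1, "2", 3.5, "6"]

print(sequence[0])       # first item (counting starts at 0): 1
print(sequence[-1])      # last item: 6

sequence[1] = "two"      # items can be changed in place
sequence.append(1)       # duplicate values are allowed

print(sequence)          # prints [1, 'two', 3.5, '6', 1]
print(len(sequence))     # prints 5
```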

Ranges¶

Ranges return a sequence of numbers. A range object itself only stores the information at which position the range starts (defaults to 0), where it ends and the step size of the range (defaults to 1).

Ranges are very useful when creating iterations. See the section about loops for more details on that. General documentation about the usage of ranges can be found here.

In [77]:

sequence_2 = range(0, 100)

print(sequence_2)
type(sequence_2)

range(0, 100)

Out[77]:

range
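
Because printing a range only shows its start and end, it can help to convert it into a list to see the numbers it produces. The following sketch shows the effect of the start, stop and step values; note that the stop value itself is never included.

```python
# Only a stop value: start defaults to 0, step defaults to 1
print(list(range(5)))            # prints [0, 1, 2, 3, 4]

# Start and stop value
print(list(range(2, 6)))         # prints [2, 3, 4, 5]

# Start, stop and a step size of 2
evens = list(range(0, 10, 2))
print(evens)                     # prints [0, 2, 4, 6, 8]

# A range knows its length without storing every number
print(len(range(0, 100)))        # prints 100
```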

Dictionaries¶

Dictionaries are used when key-value pairs are needed. Key-value pairs basically bind two values to each other, one being the key the other one being the value. The key usually represents something like a category or a class and the value is a form or characteristic the key can take.

More details on dictionaries can be found here.

In [1]:

mapping_ada_1 = {"name": "Ada Lovelace",
                 "birth_year": 1815}

print(mapping_ada_1)
type(mapping_ada_1)

{'name': 'Ada Lovelace', 'birth_year': 1815}

Out[1]:

dict
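
Once a dictionary exists, values are looked up by their key. The sketch below shows a few common operations; the key "profession" and the fallback value "unknown" are assumptions made for this example.

```python
mapping_ada = {"name": "Ada Lovelace",
               "birth_year": 1815}

# Look up a value by its key
print(mapping_ada["name"])                    # prints Ada Lovelace

# Add a new key-value pair
mapping_ada["profession"] = "mathematician"

# get() returns a fallback value when the key does not exist
print(mapping_ada.get("gender", "unknown"))   # prints unknown

# List all keys of the dictionary
print(list(mapping_ada.keys()))
```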

Nested dictionaries¶

Dictionaries can be nested arbitrarily. So the value of a key-value pair can itself be a dictionary. This is very useful for describing complex data structures.

In [2]:

mapping_2 = {"Ada Lovelace": { "birth_year": 1815,
                               "gender": "female" },
             "Alan Turing": { "birth_year": 1912,
                              "gender": "male",
                              "cause_of_death": "homophobia" }
            }

print(mapping_2)
type(mapping_2)

{'Ada Lovelace': {'birth_year': 1815, 'gender': 'female'}, 'Alan Turing': {'birth_year': 1912, 'gender': 'male', 'cause_of_death': 'homophobia'}}

Out[2]:

dict

Hint: Nesting also applies to lists. So you can have a list in a list. Even a list in a dictionary and vice versa is possible! Feel free to try it out by editing the code cell above.
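
Values in nested structures are accessed by chaining the keys or positions from the outside to the inside. A minimal sketch, assuming a small made-up structure (the key "professions" is not part of the example above):

```python
# A dictionary that contains a dictionary that contains a list
mapping_nested = {"Ada Lovelace": {"birth_year": 1815,
                                   "professions": ["mathematician", "writer"]}}

# Outer key first, then inner key
print(mapping_nested["Ada Lovelace"]["birth_year"])       # prints 1815

# Outer key, inner key, then the position inside the list
print(mapping_nested["Ada Lovelace"]["professions"][0])   # prints mathematician

# A list inside a list works the same way
nested_list = [[1, 2], [3, 4]]
print(nested_list[1][0])                                  # prints 3
```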

Loops & if/else statements¶

For many computational tasks we need to define conditions and iterations, so the computer is able to do more than just executing a list of instructions from first to last.

Loops¶

Loops can be used to execute statements a desired number of times. This can be very helpful in reducing the amount of code needed for a specific task.

For loops¶

Let's start with a simple example. At first we create a list of five names. Afterwards we create a loop that outputs a personal greeting to each name:

In [2]:

names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]

for name in names:
    print("Hi", name, "!")

Hi Ada !
Hi Cornélie !
Hi Stanisław !
Hi Mathew !
Hi Liss !

Let's break that down:

Part of Command	Meaning
`for name in names:`	We ask Python to do something for each element `name` in the variable `names`. The singular `name` is an arbitrary choice to make this iteration more comprehensible. This is called an iteration variable. You can use any term you like for the iteration variable.
`print("Hi", name, "!")`	Here we tell Python to print the word `Hi` along with the respective `name` element of the `names` list and a `!` afterwards.

We can also loop over ranges. Let's create a range from 0 up to (but not including) 5, do a simple calculation and print meaningful outputs. This time we use i (short for iteration) as the name of the iteration variable:

In [26]:

for i in range(0, 5):
    print("Current number is", i)
    print("Let's divide it by 3")
    print(i, "divided by 3 is", i/3)
    print("") #this empty string results in a blank line after each iteration in the output below.

Current number is 0
Let's divide it by 3
0 divided by 3 is 0.0

Current number is 1
Let's divide it by 3
1 divided by 3 is 0.3333333333333333

Current number is 2
Let's divide it by 3
2 divided by 3 is 0.6666666666666666

Current number is 3
Let's divide it by 3
3 divided by 3 is 1.0

Current number is 4
Let's divide it by 3
4 divided by 3 is 1.3333333333333333

While loops¶

Another kind of loop you can do is the while loop.

The while loop works conditionally. This means you can define a condition and as long as this condition is true, the loop proceeds.

Let's tell Python to count up from zero as long as the count is smaller than 4.

In [6]:

count_variable = 0

while count_variable < 4:
    print(count_variable)
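    count_variable += 1 # This line raises the count variable by 1 after each iteration

0
1
2
3

At the end of each iteration we use the += operator: count_variable += 1 is equivalent to count_variable = count_variable + 1. Once count_variable reaches 4, the condition is no longer true and the loop stops.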

If ... else statements¶

While loops are not the only place where your code needs conditions. In many other scenarios you can use if/else statements, which let you run any code conditionally. With if/else statements you define what code Python should run when a condition is true and what should happen when it is not.

Let's use the names list again from the for loop example above.

This time we only want to greet every second person. This means we need to embed an If/Else statement within the for loop.

In [2]:

names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
greet = True

for name in names:
    if greet:
        print("Hi", name, "!")
    else:
        print("No greeting for", name)
    greet = not greet

Hi Ada !
No greeting for Cornélie
Hi Stanisław !
No greeting for Mathew
Hi Liss !

Let's break that down:

Part of Command	Meaning
`greet`	The variable `greet` is set to the initial value `True` before we enter the loop.
`if greet:`	Here we check whether `greet` is `True`. If it is, the line `print("Hi", name, "!")` is executed.
`else:`	The `else` part of the code defines what happens when `greet` is not `True`. In this case the respective person is not greeted.
`greet = not greet`	After each iteration we use the `not` operator to reverse the current state of the `greet` variable. When `greet` is `True` it becomes `False` and the other way around. This way `greet` changes between `True` and `False` in every iteration.

📝 Exercises¶

Create a loop that counts from 100 to 115

In [ ]:

Use the following list of names and create a loop that adds the greeting "Good Morning" to every name:

names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]

Desired output:

"Good Morning Ada"
"Good Morning Cornélie"
"Good Morning Stanisław"
"Good Morning Mathew"
"Good Morning Liss"

In [ ]:

Use your for loop from above and print "Good Night" instead of "Good Morning" for every list item at an even position (list indices 0, 2 and 4).

In [ ]:

Functions¶

Generally speaking, functions in Python are blocks of code that only run when they are called. Functions always follow the pattern:

function_name(arguments)

We already used some of the built-in functions of Python like print(), type() or range().

However, you can not only use the functions that already exist in Python, you can also write your own.

Let's write a small function and see how it works.

In [27]:

# with >to="Everyone"< we define a default value for the greeting. This way the function greets everyone by default.
def hello(to="Everyone"):  
    return "Hello {}, how are you?".format(to)


hello()

Out[27]:

'Hello Everyone, how are you?'

In [28]:

# Now we pass in a different argument to the function, so it greets "Ada" and not "Everyone"

hello(to="Ada")

Out[28]:

'Hello Ada, how are you?'

First, let's quickly summarise what happened in the previous code blocks.
We wrote a function called hello(). This function has one argument called to and the default value of this argument is Everyone. The function itself inserts this argument into a character string so the final return value of the function is "Hello Everyone, how are you?"

Now, let's have a more detailed view at what happened:

1. def

When you want to create a new function you need to use def to let Python know you want to define a new function.

2. hello(to='Everyone'):

This part of the code defines the function name (hello()) as well as the argument it accepts (to="Everyone").
An argument is any kind of data or information you want to pass on from outside the function into the function. It is not mandatory to use any arguments, nor is it mandatory to define a default value for an argument.

In the example above we defined a default value for the to= argument. This means that when calling the function it is not necessary to define the to= argument unless you want to use another value than the default Everyone.
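
To make the different cases concrete, here is a minimal sketch with three hypothetical functions: one without arguments, one with two mandatory arguments and one that mixes a mandatory argument with a default value. The function names are made up for this example.

```python
# A function without any arguments
def say_hi():
    return "Hi!"

# A function with two mandatory arguments (no default values)
def add(first, second):
    return first + second

# A function with one mandatory argument and one default value
def power(base, exponent=2):
    return base ** exponent

print(say_hi())               # prints Hi!
print(add(1, 2))              # prints 3
print(power(3))               # uses the default exponent: prints 9
print(power(3, exponent=3))   # overrides the default: prints 27
```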

3. return "Hello {}, how are you?".format(to)

This is the logic of the function. return defines what the function is supposed to return. In this case it returns the character string Hello {}, how are you?.
Right after the character string there is .format(to). This is called a method. As discussed in the section on the text type, what you can do with a variable depends on its data type or class. Character strings in Python have plenty of methods (a kind of sub-function that only works for character strings). One of these methods is format(). format() can be used to replace a placeholder in a character string with a specified value.

Inside the return string "Hello {}, how are you?" there are curly brackets ({}). This is the placeholder that format() replaces with the value passed to it as an argument. In the example above we pass the argument to into format(). to has the default value Everyone, and thus the string our function returns when calling hello() is Hello Everyone, how are you?

Hint: What is the difference between a method and a function?
A method is a function that is bound to an object. A function is not bound and might be applied to any object. Methods are called by attaching a . to the object name, followed by the method name: object_name.method_name(argument)
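
The difference can be seen side by side in the following sketch. len() and sorted() are built-in functions that take the object as an argument, whereas upper(), format() and sort() are methods attached to an object with a dot. The example values are made up.

```python
word = "network"

# Function: the object is passed in as an argument
print(len(word))                                # prints 7

# Method: called on the object using a dot
print(word.upper())                             # prints NETWORK
print("Hello {}, how are you?".format("Ada"))   # prints Hello Ada, how are you?

numbers = [3, 1, 2]
print(sorted(numbers))   # function, returns a new sorted list: [1, 2, 3]
numbers.sort()           # method, sorts the list itself
print(numbers)           # prints [1, 2, 3]
```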

Libraries¶

So far we only used the base functionality of Python. But you also can extend Python by using additional packages/libraries. Throughout this curriculum we will use quite a variety of different libraries.

Since Python is an open-source programming language, there are thousands of libraries written by the Python community. This means whenever you face an issue you want to solve with Python, the likelihood is rather high that there is a library that is exactly made for solving your specific problem.

The installation of new libraries is very easy. You just need to run pip install PACKAGE_NAME from your command line.

Hint: When you want to install a new package from within a Jupyter notebook, you need to put a ! at the beginning of the code cell, to inform Jupyter about your intention to run the command as a terminal command and not as a Python command.

Let's try that out by installing a package called pandas - pandas is the most popular Python library for working with tabular data.

In [2]:

!pip install pandas

Collecting pandas
  Downloading pandas-1.2.3-cp38-cp38-manylinux1_x86_64.whl (9.7 MB)
     |████████████████████████████████| 9.7 MB 2.2 MB/s eta 0:00:01
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/site-packages (from pandas) (2.8.1)
Collecting numpy>=1.16.5
  Downloading numpy-1.20.1-cp38-cp38-manylinux2010_x86_64.whl (15.4 MB)
     |████████████████████████████████| 15.4 MB 22.6 MB/s eta 0:00:01
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Installing collected packages: numpy, pandas
Successfully installed numpy-1.20.1 pandas-1.2.3

After successfully installing pandas we can load the library in a code cell and use it in our code.

You can import libraries into your project by using the import statement. Additionally, you have some options when loading libraries: you can define a new name (an alias) under which the library is available in your project, and you can import just a part of the library in case you do not need all of its functionality.

Usually when using pandas there is a convention to abbreviate the library name to pd. So whenever you load the library and you want to stick to the convention you would load it as:

In [3]:

import pandas as pd
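
The import options described above can be sketched with modules from Python's standard library, so nothing needs to be installed. The alias stats is an arbitrary choice for this example, not an established convention.

```python
import math                   # import a full library under its own name
import statistics as stats    # import a full library under a shorter alias
from math import sqrt         # import only one function of a library

print(math.pi)                     # access via the library name
print(stats.mean([1815, 1912]))    # access via the alias: prints 1863.5
print(sqrt(16))                    # directly available: prints 4.0
```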

Data Frames (with Pandas)¶

Scientific computing often involves tabular data representation. The advantage of tabular data representation is the clear structure. This data structure is often referred to as data frame or data table. The terminology around data frames differs slightly, depending on the scientific field of the speaker/author.

This table provides a quick overview of different terms for the same concepts:

Element	Term
Column	Feature, Variable, Dimension
Row	Instance, Observation
Cell	Value, Data Point, Datum

Key advantages of pandas data frames:

Manage and analyse data
Well suited for working with relational and labeled data (e.g.: Excel, CSV, SQL Tables)

Let's create an example data frame with pandas. At first we create a dictionary that contains our data structure. Afterwards we transform the dictionary into a pandas data frame.

This is the example table we want to create:

name	birth_year	gender
Ada	1815	female
Cornélie	1965	female
Stanisław	1987	male
Mathew	1896	male
Liss	1976	non-binary

In [4]:

data_dict = {"name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
             "birth_year": [1815, 1965, 1987, 1896, 1976],
             "gender": ["female", "female", "male", "male", "non-binary"]}

data_frame = pd.DataFrame(data=data_dict)

data_frame

Out[4]:

	name	birth_year	gender
0	Ada	1815	female
1	Cornélie	1965	female
2	Stanisław	1987	male
3	Mathew	1896	male
4	Liss	1976	non-binary

This pandas data frame has some very convenient methods that help us understand what is in the data.

Summarize data¶

Summarizing data is crucial for exploring and understanding a data frame. There are different ways of summarizing or describing a data set.

This section presents four very useful ways pandas offers to summarize data.

Describe¶

The describe() method generates descriptive statistics of the data frame. These statistics include measures of central tendency, dispersion and shape of a dataset's distribution. By default, describe() only generates descriptive statistics for numeric data. However, there is the argument include=. When you pass include="all", descriptive statistics for character variables are included as well.

For a full documentation of the describe() method check out the official documentation.

In [7]:

data_frame.describe()

Out[7]:

	birth_year
count	5.000000
mean	1927.800000
std	72.365047
min	1815.000000
25%	1896.000000
50%	1965.000000
75%	1976.000000
max	1987.000000
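
As a sketch of the include="all" option mentioned above, the following block rebuilds the example data frame so it can run on its own and then requests statistics for all columns. Rows such as unique, top and freq only apply to the character columns, so the numeric column shows NaN there and vice versa.

```python
import pandas as pd

# Rebuild the example data frame from this notebook
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"]})

# include="all" adds statistics for the character columns as well
summary = data_frame.describe(include="all")
print(summary)

# Single statistics can be looked up by row and column label
print(summary.loc["unique", "gender"])   # number of distinct genders
```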

Info¶

The info() method generates a more technical summary of the dataset. The output shows information about the index (the row labels of the dataset), such as its range and type. Furthermore, you get the number of non-null values per column (null in pandas is an umbrella term for any kind of missing value) and the data type (dtype) of each column.

For a full documentation of the info() method check out the official documentation.

In [10]:

data_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        5 non-null      object
 1   birth_year  5 non-null      int64 
 2   gender      5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes

Value counts¶

The value_counts() method generates a table that depicts the counts of unique rows in the dataframe. There is a subset argument you can use to define a list of column names you want to use when counting unique combinations within this subset instead of using all available columns.

For a full documentation of the value_counts() method check out the official documentation.

In [13]:

data_frame.value_counts()

Out[13]:

name       birth_year  gender    
Ada        1815        female        1
Cornélie   1965        female        1
Liss       1976        non-binary    1
Mathew     1896        male          1
Stanisław  1987        male          1
dtype: int64
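
Since every row in the example is unique, all counts above are 1. A sketch of the subset argument, assuming we are only interested in the gender column, gives a more informative result. The data frame is rebuilt here so the block runs on its own.

```python
import pandas as pd

# Rebuild the example data frame from this notebook
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"]})

# Count unique values using only the gender column
gender_counts = data_frame.value_counts(subset=["gender"])
print(gender_counts)
```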

Group by¶

The groupby() method enables you to apply data operations on subgroups of the data by using values of one or multiple variables to define the subgroups. When you use the groupby method, you define subsections of the data and the operation you want to do will be done per subsection. This is a very useful method for data aggregation and data cleaning.

For a full documentation of the groupby() method check out the official documentation.

The example below groups the dataset by the variable gender and counts the occurrence of each gender in the dataset per column.

In [24]:

data_frame.groupby("gender").count()

Out[24]:

	name	birth_year
gender
female	2	2
male	2	2
non-binary	1	1
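
Counting is only one of many operations that can be applied per group. As a further sketch, the block below calculates the mean birth year per gender by selecting the birth_year column after grouping. The data frame is rebuilt here so the block runs on its own.

```python
import pandas as pd

# Rebuild the example data frame from this notebook
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"]})

# Group by gender, select one column and aggregate it per group
mean_birth_year = data_frame.groupby("gender")["birth_year"].mean()
print(mean_birth_year)
```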

Sort and arrange data¶

The sort_values() method enables you to sort a dataset by either one or multiple values. You can sort by numerical and character variables and you can define whether you want the order to be ascending (ascending = True) or descending (ascending = False).

For a full documentation of the sort_values() method check out the official documentation.

In [15]:

data_frame.sort_values(by="birth_year", ascending=True)

Out[15]:

	name	birth_year	gender
0	Ada	1815	female
3	Mathew	1896	male
1	Cornélie	1965	female
4	Liss	1976	non-binary
2	Stanisław	1987	male
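
Sorting by multiple values works by passing lists to by= and ascending=. The sketch below sorts by gender in ascending order first and, within each gender, by birth year in descending order. The data frame is rebuilt here so the block runs on its own.

```python
import pandas as pd

# Rebuild the example data frame from this notebook
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"]})

# One entry in 'ascending' per column listed in 'by'
sorted_frame = data_frame.sort_values(by=["gender", "birth_year"],
                                      ascending=[True, False])
print(sorted_frame)
```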

Transform data¶

Transforming and manipulating data is a very important part of cleaning and tidying up raw data. It is often crucial to change aspects of the data you want to use, for example by creating new variables, altering existing ones or changing naming conventions.

Rename variables¶

The rename() method lets you change column names. You can pass a dictionary to the columns= argument. This dictionary describes the mapping between the present column name(s) and the new column name(s).

For a full documentation of the rename() method check out the official documentation.

The example below changes the column name name to first_name.

In [15]:

data_frame.rename(columns={"name": "first_name"})

Out[15]:

	first_name	birth_year	gender
0	Ada	1815	female
1	Cornélie	1965	female
2	Stanisław	1987	male
3	Mathew	1896	male
4	Liss	1976	non-binary

Assign¶

The assign() method adds new columns to a dataframe. This method is very useful when you want to generate new variables based on existing ones or if you want to add entirely new data as a column to your dataset.

For a full documentation of the assign() method check out the official documentation.

The example below shows how to add an age column to the dataset.

In [19]:

data_frame.assign(age=2021-data_frame["birth_year"])

Out[19]:

	name	birth_year	gender	age
0	Ada	1815	female	206
1	Cornélie	1965	female	56
2	Stanisław	1987	male	34
3	Mathew	1896	male	125
4	Liss	1976	non-binary	45
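
One detail worth knowing: by default, rename() and assign() return a new data frame and leave the original one untouched. To keep the result, store it in a variable. A minimal sketch, with the variable names renamed and with_age chosen for this example and the data frame rebuilt so the block runs on its own:

```python
import pandas as pd

# Rebuild the example data frame from this notebook
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"]})

# Store the renamed data frame in a new variable
renamed = data_frame.rename(columns={"name": "first_name"})

print(list(data_frame.columns))   # original is unchanged
print(list(renamed.columns))      # new data frame has the new column name

# Methods can be applied to the stored result
with_age = renamed.assign(age=2021 - renamed["birth_year"])
print(with_age)
```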

Query & filter data¶

When you work with data you will most likely need to filter your data according to some specific logic depending on your task. There are numerous ways to filter and clean up your data. This section provides a brief introduction to the most versatile ways pandas offers.

Select columns¶

You can easily select a subgroup of columns by passing a list of the column names into square brackets ([]) at the end of the dataset object:

In [31]:

data_frame[["name", "gender"]]

Out[31]:

	name	gender
0	Ada	female
1	Cornélie	female
2	Stanisław	male
3	Mathew	male
4	Liss	non-binary
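
The number of brackets matters here. As a small sketch (re-creating the example dataset so the cell is self-contained, with illustrative variable names), a single column name in single brackets gives a pandas Series, while a list of names gives a data frame:

```python
import pandas as pd

# Re-create the example dataset (assumed to match the notebook's data_frame)
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

name_series = data_frame["name"]    # single brackets: a Series
name_frame = data_frame[["name"]]   # list of names: a DataFrame with one column

print(type(name_series).__name__)
print(type(name_frame).__name__)
```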

Select rows¶

The iloc indexer is short for integer location and lets you select rows by their numeric position (row number). Unlike a method, iloc is used with square brackets. Positions start at 0, so the example below selects the fourth row.

For the full details of iloc, check out the official documentation.

In [27]:

data_frame.iloc[[3]]

Out[27]:

	name	birth_year	gender
3	Mathew	1896	male
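
iloc accepts more than a one-element list. The sketch below shows a few common variants on the example dataset, which is re-created here so the cell runs on its own; the variable names are only illustrative.

```python
import pandas as pd

# Re-create the example dataset (assumed to match the notebook's data_frame)
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

first_row = data_frame.iloc[0]       # a single integer returns the row as a Series
first_three = data_frame.iloc[0:3]   # a slice returns rows 0, 1 and 2 (end is excluded)
last_row = data_frame.iloc[[-1]]     # negative positions count from the end

print(first_three)
```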

Drop duplicates¶

The drop_duplicates() method drops duplicated rows and keeps only the distinct ones.

For the full details of the drop_duplicates() method, check out the official documentation.

The example below selects the column gender and drops all duplicated values in this column.

In [32]:

data_frame[["gender"]].drop_duplicates()

Out[32]:

	gender
0	female
2	male
4	non-binary
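
If you want to keep all columns but compare only some of them, drop_duplicates() has a subset= argument. The following sketch assumes the example dataset from the tables above; the variable names are made up for illustration.

```python
import pandas as pd

# Re-create the example dataset (assumed to match the notebook's data_frame)
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

# Keep the first row for every distinct gender, with all columns intact
first_per_gender = data_frame.drop_duplicates(subset="gender")

# keep="last" keeps the last row of every group of duplicates instead
last_per_gender = data_frame.drop_duplicates(subset="gender", keep="last")

print(first_per_gender)
```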

Sample¶

The sample() method lets you pick a random subset of the rows in your dataset.

For the full details of the sample() method, check out the official documentation.

The example below picks a random sample of size n=3.

In [37]:

data_frame.sample(n=3)

Out[37]:

	name	birth_year	gender
2	Stanisław	1987	male
4	Liss	1976	non-binary
1	Cornélie	1965	female
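
Since the sample is random, the rows you get will usually differ from the ones shown above. If you need a reproducible sample, you can fix the random seed with the random_state= argument. The sketch below re-creates the example dataset; the seed value 42 is an arbitrary choice.

```python
import pandas as pd

# Re-create the example dataset (assumed to match the notebook's data_frame)
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

# The same random_state yields the same sample on every run
sample_a = data_frame.sample(n=3, random_state=42)
sample_b = data_frame.sample(n=3, random_state=42)

print(sample_a.equals(sample_b))
```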

Query¶

More complex filters can be applied with the query() method. Inside query() you can write logical expressions based on the variables in the dataset to define precise filters.

For the full details of the query() method, check out the official documentation.

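The example below applies a filter with the following logic; both conditions must hold:

Gender must be female or non-binary
Birth year must be greater than 1900

The parentheses around the gender conditions are needed because & binds more tightly than | inside query(). Without them, the birth year condition would only apply to the non-binary rows.

In [35]:

data_frame.query(
    "(gender == 'female' | gender == 'non-binary') & birth_year > 1900")

Out[35]:

	name	birth_year	gender
1	Cornélie	1965	female
4	Liss	1976	non-binary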
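
The filter logic described above (gender is female or non-binary, and birth year is greater than 1900) can also be written as a boolean mask instead of a query string. This is a minimal sketch on the re-created example dataset; the names mask, mask_result and query_result are illustrative.

```python
import pandas as pd

# Re-create the example dataset (assumed to match the notebook's data_frame)
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

# Each comparison produces a boolean Series; & combines them row by row
mask = data_frame["gender"].isin(["female", "non-binary"]) & (data_frame["birth_year"] > 1900)
mask_result = data_frame[mask]

# The same logic as a query string, with explicit parentheses
query_result = data_frame.query(
    "(gender == 'female' | gender == 'non-binary') & birth_year > 1900")

print(mask_result)
```

With a boolean mask, every comparison needs its own parentheses, because in plain Python & is evaluated before > and ==.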

Export and import data¶

The last topic covered in this notebook is the process of exporting (saving) data and importing (loading) data. This section focuses on handling tabular data, since this is the most important data format for this curriculum.

Export¶

Exporting a dataset with pandas is straightforward. You can choose from a variety of output formats, and every format has its own method. The table below provides an overview of the most common export formats. You can click on any method name to see the official documentation of the respective method.

Method	Details
`to_csv()`	CSV is the most popular and generic data format for tabular data. Click here for more details.
`to_excel()`	The Excel format is ideal when you want to continue working with the data in Microsoft Excel.
`to_pickle()`	Pickle is a Python-specific data format. Click here for more details.
`to_feather()`	Feather is a data format of Apache Arrow. It is well suited for exchanging data between Python and R. Click here for more details.
`to_parquet()`	Parquet is a columnar data format from the Apache ecosystem that is widely used with Apache Spark. To some extent it is similar to Feather, and it is used extensively in cloud computing environments. Click here for more details.

The example below shows how to save a dataset to a local file called "example_export.csv". The index=False argument leaves the row index out of the file.

In [45]:

data_frame.to_csv("./example_export.csv", index=False)
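
To see what index=False changes, you can call to_csv() without a file path, in which case it returns the CSV content as a string. The sketch below compares the header lines of both variants on the re-created example dataset; the variable names are illustrative.

```python
import pandas as pd

# Re-create the example dataset (assumed to match the notebook's data_frame)
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

# Without a path, to_csv() returns the CSV text instead of writing a file
csv_without_index = data_frame.to_csv(index=False)
csv_with_index = data_frame.to_csv()

# Compare the header lines: the second one starts with an empty index column
print(csv_without_index.splitlines()[0])
print(csv_with_index.splitlines()[0])
```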

Import¶

Importing a dataset into the current Python session is as easy as exporting it. For every export method in the table above, there is a complementary import function. The name of every import function starts with read_.

The file that was exported in the cell above can be imported with the read_csv() function, as shown in the cell below.

In [46]:

example_import = pd.read_csv("./example_export.csv")
example_import

Out[46]:

	name	birth_year	gender
0	Ada	1815	female
1	Cornélie	1965	female
2	Stanisław	1987	male
3	Mathew	1896	male
4	Liss	1976	non-binary
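
A common pitfall when importing is a CSV file that was exported together with its row index. The sketch below simulates this case in memory with io.StringIO instead of a real file, which is a choice made for this illustration only. It shows that read_csv() then creates an extra column called "Unnamed: 0" unless you pass index_col=0.

```python
import io

import pandas as pd

# Re-create the example dataset (assumed to match the notebook's data_frame)
data_frame = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

# Export WITH the index (the default), kept in memory as a string
csv_text = data_frame.to_csv()

# A plain import turns the saved index into an unnamed data column
naive_import = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells read_csv() to use the first column as the index again
clean_import = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive_import.columns))
print(list(clean_import.columns))
```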

📝 Exercises¶

In [ ]:

Create a data frame from a dictionary using pandas. The final data frame needs to look like the table below:

City	Country	Inhabitants	Wikipedia URL
Trincomalee	Sri Lanka	99135	https://en.wikipedia.org/wiki/Trincomalee
Kołobrzeg	Poland	46830	https://en.wikipedia.org/wiki/Ko%C5%82obrzeg
Manali	India	8096	https://en.wikipedia.org/wiki/Manali,_Himachal_Pradesh
St. Paul's Bay	Malta	29097	https://en.wikipedia.org/wiki/St._Paul%27s_Bay

In [ ]:

Order the data frame by inhabitants (descending)

In [ ]:

Draw a random sample of two rows of this data frame and assign them to a new object

In [ ]:

Save this sampled 2-row data frame as a csv file.

This was the introduction to Jupyter notebooks and Python. This notebook introduced you to the basic programming concepts we are going to use in this curriculum. Come back to it whenever you want to refresh some Python basics. In the next notebook we learn the basics of graph theory and analyze a network of Nobel laureates.

Solutions for the exercises¶

This section provides the solutions for the exercises in this notebook.

2.3.3 📝 Exercises¶

Create a loop that counts from 100 to 115

In [4]:

for i in range(16):
    print(100+i)
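
An alternative sketch of the same solution passes the start and stop values to range() directly. The stop value is excluded, so it has to be 116 to include 115. The variable name numbers is only illustrative.

```python
# range(start, stop) counts from start up to, but not including, stop
numbers = list(range(100, 116))

for number in numbers:
    print(number)
```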

Use the following list of names and create a loop that adds the greeting "Good Morning" to every name:

names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]

Desired output:

"Good Morning Ada" "Good Morning Cornélie" "Good Morning Stanisław" "Good Morning Mathew" "Good Morning Liss"

In [6]:

names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]

for name in names:
    print("Good Morning", name)

Good Morning Ada
Good Morning Cornélie
Good Morning Stanisław
Good Morning Mathew
Good Morning Liss

Use your for-loop from above and print "Good Night" instead of "Good Morning" for every list item on an even position (list entries: 0,2,4).

In [8]:

names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
night = True

for name in names:
    if night:
        print("Good Night", name)
    else:
        print("Good Morning", name)
    night = not night

Good Night Ada
Good Morning Cornélie
Good Night Stanisław
Good Morning Mathew
Good Night Liss
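
Another possible solution makes the list position explicit with enumerate() and checks whether it is even. This is a sketch of one alternative, not the only valid answer; the list greetings is added here only so that the result can be inspected afterwards.

```python
names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
greetings = []

# enumerate() yields the position (starting at 0) together with each name
for position, name in enumerate(names):
    if position % 2 == 0:  # even positions: 0, 2, 4
        greeting = "Good Night"
    else:
        greeting = "Good Morning"
    greetings.append(f"{greeting} {name}")
    print(greetings[-1])
```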

2.6.6 📝 Exercises¶

Create a data frame from a dictionary using pandas. The final data frame needs to look like the table below:

City	Country	Inhabitants	Wikipedia URL
Trincomalee	Sri Lanka	99135	https://en.wikipedia.org/wiki/Trincomalee
Kołobrzeg	Poland	46830	https://en.wikipedia.org/wiki/Ko%C5%82obrzeg
Manali	India	8096	https://en.wikipedia.org/wiki/Manali,_Himachal_Pradesh
St. Paul's Bay	Malta	29097	https://en.wikipedia.org/wiki/St._Paul%27s_Bay

In [11]:

import pandas as pd

# Every key becomes a column name, every list holds the values of that column
data_dict = {
    "City": ["Trincomalee", "Kołobrzeg", "Manali", "St. Paul's Bay"],
    "Country": ["Sri Lanka", "Poland", "India", "Malta"],
    "Inhabitants": [99135, 46830, 8096, 29097],
    "Wikipedia URL": [
        "https://en.wikipedia.org/wiki/Trincomalee",
        "https://en.wikipedia.org/wiki/Ko%C5%82obrzeg",
        "https://en.wikipedia.org/wiki/Manali,_Himachal_Pradesh",
        "https://en.wikipedia.org/wiki/St._Paul%27s_Bay",
    ],
}

data_frame = pd.DataFrame(data=data_dict)

data_frame

Out[11]:

	City	Country	Inhabitants	Wikipedia URL
0	Trincomalee	Sri Lanka	99135	https://en.wikipedia.org/wiki/Trincomalee
1	Kołobrzeg	Poland	46830	https://en.wikipedia.org/wiki/Ko%C5%82obrzeg
2	Manali	India	8096	https://en.wikipedia.org/wiki/Manali,_Himachal...
3	St. Paul's Bay	Malta	29097	https://en.wikipedia.org/wiki/St._Paul%27s_Bay

Order the data frame by inhabitants (descending)

In [12]:

data_frame.sort_values(by="Inhabitants", ascending=False)

Out[12]:

	City	Country	Inhabitants	Wikipedia URL
0	Trincomalee	Sri Lanka	99135	https://en.wikipedia.org/wiki/Trincomalee
1	Kołobrzeg	Poland	46830	https://en.wikipedia.org/wiki/Ko%C5%82obrzeg
3	St. Paul's Bay	Malta	29097	https://en.wikipedia.org/wiki/St._Paul%27s_Bay
2	Manali	India	8096	https://en.wikipedia.org/wiki/Manali,_Himachal...

Draw a random sample of two rows of this data frame and assign them to a new object

In [13]:

sampled_rows = data_frame.sample(n=2)
sampled_rows

Out[13]:

	City	Country	Inhabitants	Wikipedia URL
3	St. Paul's Bay	Malta	29097	https://en.wikipedia.org/wiki/St._Paul%27s_Bay
1	Kołobrzeg	Poland	46830	https://en.wikipedia.org/wiki/Ko%C5%82obrzeg

Save this sampled 2-row data frame as a csv file.

In [ ]:

sampled_rows.to_csv("./sampled_rows.csv")