SoNAR (IDH) - HNA Curriculum
Notebook 1: Jupyter and Python
This curriculum serves as a guide and introductory resource to using the SoNAR (IDH) network database with the help of Python and Jupyter notebooks.
It uses Jupyter Notebooks to document and explain how to do historical network analysis with the SoNAR (IDH) data. This first notebook provides a high-level overview of the basic functionality of Jupyter notebooks as well as a quick introduction to basic Python. All the following notebooks build upon the fundamentals explained in this notebook.
If you are already familiar with Jupyter Notebooks and Python you can directly jump to Notebook 2 - Historical Network Analysis.
Jupyter Notebooks are basically a kind of code editor you can use inside your web browser. This means you can write, edit and run code from your browser. The code is interpreted immediately and the results are visible directly under the executed code.
What makes Jupyter notebooks so versatile is that you can mix code blocks and text blocks within the same document and thus create interactive, transparent and reproducible documentations or guides.
Project Jupyter is the name of the non-profit and open-source project that develops the Jupyter notebook technology this curriculum is based on. Project Jupyter is maintained and developed by a big community with a focus on making scientific computing accessible, easy and free. The name Jupyter is derived from the names of the three programming languages Julia, Python and R. The main objective of Jupyter is supporting interactive data science and scientific computing across all programming languages. All notebooks in this curriculum use the Python programming language.
This curriculum consists of five Jupyter Notebooks. If you are new to Jupyter Notebooks you can take the "User Interface Tour" in the help menu at the top of the screen. The screenshot below shows where to find the interactive tour:
Notebooks
Jupyter notebooks are documents that combine live runnable code with narrative text (Markdown), equations, images, interactive visualizations and other rich output.
The Toolbar
When you open up a notebook you can see the toolbar at the very top of the notebook. This is what the toolbar can do for you:
Button | Function |
---|---|
![]() | Save the current state of the notebook |
![]() | Insert a new cell below the current selection |
![]() | Cut the selected cell |
![]() | Copy the selected cell |
![]() | Paste the copied or cut cell |
![]() | Move cell up or down |
![]() | Run the selected cell |
![]() | Stop the execution of the selected cell |
![]() | Restart the kernel of the notebook |
![]() | Restart the kernel and execute all code cells from top to bottom |
![]() | Select the cell type (Code, Markdown, Raw) |
The Cells
Jupyter Notebooks consist of cells that are arranged vertically. Every text you read so far is written inside a cell. When you double click any image or text in this notebook, the respective cell will jump into "edit mode" and you can change the contents to your liking. When you hit the Run button in the toolbar after you are done editing, the cell will be executed and it jumps back to what is called the "command mode" (Double click this cell to try it).
There are two relevant cell types Jupyter Notebooks provide. The first one being Code and the second one is Markdown. Here is how they differ from each other:
*Code Cell*
Code cells let you write and execute programming code. Make sure you select the Code cell type in the cell type drop-down in the toolbar.
When you type code in a code cell and hit on the run button in the toolbar, the code is executed and the output of the code is beneath the code cell:
Executed code is held in the memory of the kernel. That means that your running notebook holds some kind of state, depending on which code blocks you have already executed. This will come in handy when we discuss variables later on.
*Markdown Cell*
Markdown is a markup language you can use to format text. Markdown has a very simple syntax you can use for formatting text, tables or lists. You can also embed images, HTML blocks and other media formats with Markdown.
A quick example of how Markdown works can be found in the images beneath. The left hand side shows the markdown syntax during editing mode of a cell. The right hand side shows the executed Markdown cell after you hit the run button.
raw markdown | rendered markdown |
---|---|
![]() | ![]() |
A good overview of how Markdown in Jupyter notebooks works can be found here.
Now, try to reproduce the example from above on your own following the steps below:
print("Hello world!")
inside this cell.
Congratulations on your first line of code! 🎉
Jupyter Notebooks can be used with a multitude of programming languages. This curriculum uses Python, a language well known both for its friendliness towards beginners and its maturity as a professional tool. The following sections provide a quick introduction to Python.
For a more in depth introduction you can check out these resources:
Some very basic commands you need to know when coding with Python are arithmetic operators. You can use these operators to execute basic calculations. See the table below for an overview of the basic operators Python provides.
Operator | Description | Example |
---|---|---|
+ | addition | 1 + 1 is 2 |
- | subtraction | 1 - 1 is 0 |
* | multiplication | 1 * 1 is 1 |
/ | division (always returns a float) | 2 / 2 is 1.0 |
** | exponentiation | 2 ** 2 is 4 |
% | modulo operator (returns the remainder) | 5 % 2 is 1 |
// | integer division (drops the remainder) | 5 // 2 is 2 |
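The operators from the table can be tried out in a single code cell. The following is a minimal sketch; the variable names `a`, `b` and the result names are arbitrary choices made for this illustration.

```python
# Two example numbers (arbitrary choice for this illustration)
a = 7
b = 2

total = a + b         # addition
difference = a - b    # subtraction
product = a * b       # multiplication
quotient = a / b      # division always returns a float
power = a ** b        # exponentiation
remainder = a % b     # modulo: what is left over after dividing 7 by 2
whole_part = a // b   # integer division: the remainder is dropped

print(total, difference, product, quotient, power, remainder, whole_part)
```

Note that `/` returns a float even when the division has no remainder, while `//` returns a whole number when both operands are integers.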
Additionally, Python provides a set of logical and comparison operators. The evaluation of the expressions below always results in True or False. See the table below:
Operator | Description | Example |
---|---|---|
< | less than | 2 < 3 is True |
<= | less than or equal to | 2 <= 3 is True |
> | greater than | 2 > 3 is False |
>= | greater than or equal to | 2 >= 3 is False |
== | equal | 2 == 3 is False |
!= | not equal | 2 != 3 is True |
not | not x; reverses the result | not (2 != 3) is False |
or | x or y: at least one must be true | 2 == 3 or 2 != 3 is True |
and | x and y: both must be true | 2 == 3 and 2 != 3 is False |
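The comparison and logical operators can be combined freely. The sketch below evaluates the examples from the table with two illustrative variables, `x` and `y`, whose names are assumptions made for this example.

```python
x = 2
y = 3

is_smaller = x < y             # comparison: 2 is less than 3
is_equal = x == y              # comparison: 2 does not equal 3
either = x == y or x != y      # "or" needs at least one true side
both = x == y and x != y       # "and" needs both sides to be true
negated = not (x != y)         # "not" reverses the result

print(is_smaller, is_equal, either, both, negated)
```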
Let's use these operators:
Run a code cell by pressing Shift + Enter.
Calculate 2*7:
2*7
14
You can also use variables in Python to hold and manipulate values. Creating a variable is very easy. All you need to do is come up with a name for your variable and assign something to it. You can freely choose the name of the variable and you can also overwrite the data in a variable. Let's check out what that means:
In the code block below we create two variables. The first one we name variable_1 and we assign the result of the calculation 1 + 1 to it. The second variable we name variable_2 and we store the result of 2*7 in it.
After that we can use the variables to manipulate them further or do some calculations with them. E.g. we can calculate variable_1 - variable_2.
variable_1 = 1+1
variable_2 = 2*7
variable_1 - variable_2
-12
Values are assigned to variables with the = sign.
Do not confuse the single = (assignment operator) with the double == (comparison operator).
Let's check whether variable_2 is greater than variable_1.
variable_2 > variable_1
True
Now let's check whether variable_2 is 14 and variable_1 is 3 (both conditions must be correct for the result to be True, otherwise the result is False).
variable_2 == 14 and variable_1 == 3
False
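Overwriting a variable, as mentioned above, simply means assigning to the same name again. A minimal sketch with a hypothetical variable called `counter`:

```python
counter = 10
print(counter)

# The right-hand side is evaluated first, then the result replaces the old value
counter = counter + 5
print(counter)

# A variable can even be overwritten with a value of a different kind
counter = "fifteen"
print(counter)
```

After the last assignment the earlier numeric values are gone; only the most recently assigned value is kept.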
In the previous section, numbers were used to calculate things. However, Python can handle a variety of other data types as well. In this section we check out three of the most important categories of data types Python can handle. This list is not exhaustive though. You can find a complete overview of data types Python natively supports here.
We will cover three categories of data types in this section, namely:
Category | Data Types |
---|---|
Text Type | str |
Numeric Types | int, float |
Sequence & Mapping Types | list, range, dict |
The text type is used for character input. Whenever you want to work with character strings (words & text) Python uses the text type for doing so. The differentiation between different types is crucial since there are operations that are meaningful for text but not for numbers (e.g. capitalizing letters, splitting at line breaks).
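To illustrate why the distinction matters, here is a short sketch of operations that only make sense for text. The variable `two_lines` and its content are made up for this example.

```python
# A text with a line break (\n) in the middle
two_lines = "ada lovelace\ncharles babbage"

capitalized = two_lines.capitalize()   # first letter upper case, the rest lower case
shouted = two_lines.upper()            # all letters upper case
lines = two_lines.splitlines()         # split the text at line breaks

print(capitalized)
print(shouted)
print(lines)

# The + operator behaves differently for text and for numbers
print("2" + "3")   # text is glued together
print(2 + 3)       # numbers are added
```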
In the following example we will assign a text to a variable. Afterwards we will "ask" Python about the data type of this variable.
We use the following sentence:
Ada Lovelace was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine.
Text taken from: https://en.wikipedia.org/wiki/Ada_Lovelace
# We create a variable called 'ada_description'.
# By using the "=" sign, we assign the character string on the right side to the new variable
ada_description = "Ada Lovelace was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine"
The cell above produces no output because it only assigns a value to a new variable.
We can print the content of the variable by using the print() function mentioned earlier. We can also just type the variable name into a new cell and Jupyter will show us what's in the variable:
ada_description
"Ada Lovelace was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine"
Let's check the data type of ada_description by using a function called type():
type(ada_description)
str
Python returns str as the type of ada_description. This means that there is text (a character string) stored inside the ada_description object.
Text input must be wrapped in quotation marks (e.g. "hello"). If you do not wrap the textual input in quotation marks, Python will interpret the input as a variable name (e.g. hello), which can lead to errors or unwanted results.
Integers
Integers are whole numbers.
age = -46
print(age)
type(age)
-46
int
Float
Floating point real values represent real numbers and are written with a decimal point.
temperature_celsius = 12.876
print(temperature_celsius)
type(temperature_celsius)
12.876
float
Python provides different built-in ways to store multiple values/items inside a single variable. These sequence and mapping types are ubiquitous when working with Python.
sequence_1 = [1, "2", 3.5, "6"]
print(sequence_1)
type(sequence_1)
[1, '2', 3.5, '6']
list
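Lists keep their items in order, so items can be retrieved by position. The sketch below shows a few common list operations; the variable name `sequence` is an arbitrary choice for this illustration.

```python
sequence = [1, "2", 3.5, "6"]

first_item = sequence[0]    # positions are counted from 0
last_item = sequence[-1]    # negative positions count from the end
length = len(sequence)      # number of items in the list

# Lists can be changed after creation, e.g. by adding an item at the end
sequence.append("new item")

print(first_item, last_item, length)
print(sequence)
```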
Ranges return a sequence of numbers. A range object itself only stores the information at which position the range starts (defaults to 0), where it ends and the step size of the range (defaults to 1).
Ranges are very useful when creating iterations. See the section about loops for more details on that. General documentation about the usage of ranges can be found here.
sequence_2 = range(0, 100)
print(sequence_2)
type(sequence_2)
range(0, 100)
range
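Because a range only stores start, end and step size, printing it does not show the individual numbers. Converting it with `list()` makes them visible. A small sketch (the variable names are illustrative):

```python
first_five = list(range(5))          # start defaults to 0, the end value is excluded
evens = list(range(0, 10, 2))        # a step size of 2
countdown = list(range(5, 0, -1))    # a negative step counts downwards

print(first_five)
print(evens)
print(countdown)
```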
Dictionaries are used when key-value pairs are needed. Key-value pairs basically bind two values to each other, one being the key the other one being the value. The key usually represents something like a category or a class and the value is a form or characteristic the key can take.
More details on dictionaries can be found here.
mapping_ada_1 = {"name": "Ada Lovelace",
"birth_year": 1815}
print(mapping_ada_1)
type(mapping_ada_1)
{'name': 'Ada Lovelace', 'birth_year': 1815}
dict
Dictionaries can be nested arbitrarily. So the value of a key-value pair can itself be a dictionary. This is very useful for describing complex data structures.
mapping_2 = {"Ada Lovelace": { "birth_year": 1815,
"gender": "female" },
"Alan Turing": { "birth_year": 1912,
"gender": "male",
"cause_of_death": "homophobia" }
}
print(mapping_2)
type(mapping_2)
{'Ada Lovelace': {'birth_year': 1815, 'gender': 'female'}, 'Alan Turing': {'birth_year': 1912, 'gender': 'male', 'cause_of_death': 'homophobia'}}
dict
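Values in a dictionary are looked up by their key, and in a nested dictionary the lookups can be chained. The following sketch uses a hypothetical dictionary called `people` that mirrors the structure shown above.

```python
people = {
    "Ada Lovelace": {"birth_year": 1815, "gender": "female"},
    "Alan Turing": {"birth_year": 1912, "gender": "male"},
}

# Chain two lookups: first the person, then the attribute
ada_birth_year = people["Ada Lovelace"]["birth_year"]

# New key-value pairs can be added by assignment
people["Ada Lovelace"]["death_year"] = 1852

# .get() returns a fallback value instead of raising an error for a missing key
turing_death_year = people["Alan Turing"].get("death_year", "unknown")

print(ada_birth_year)
print(people["Ada Lovelace"])
print(turing_death_year)
```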
For many computational tasks we need to define conditions and iterations, so the computer is able to do more than just executing a list of instructions from first to last.
Loops can be used to execute statements a desired number of times. This can be very helpful in reducing the amount of code needed for a specific task.
Let's start with a simple example. At first we create a list of five names. Afterwards we create a loop that outputs a personal greeting to each name:
names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
for name in names:
    print("Hi", name, "!")
Hi Ada !
Hi Cornélie !
Hi Stanisław !
Hi Mathew !
Hi Liss !
Let's break that down:
Part of Command | Meaning |
---|---|
for name in names: | We ask Python to do something for each element name in the variable names. The singular name is an arbitrary choice to make this iteration more comprehensible. This is called an iteration variable. You can use any term you like for the iteration variable. |
print("Hi", name, "!") | Here we tell Python to print the word Hi along with the respective name element of the names list and a ! afterwards. |
We can also loop over ranges. Let's create a range from 0 up to (but not including) 5, do a simple calculation and print meaningful outputs. This time we use the name i (short for iteration) as the name for the iteration variable:
for i in range(0, 5):
    print("Current number is", i)
    print("Let's divide it by 3")
    print(i, "divided by 3 is", i/3)
    print("")  # this empty string results in a blank line after each iteration in the output below.
Current number is 0
Let's divide it by 3
0 divided by 3 is 0.0

Current number is 1
Let's divide it by 3
1 divided by 3 is 0.3333333333333333

Current number is 2
Let's divide it by 3
2 divided by 3 is 0.6666666666666666

Current number is 3
Let's divide it by 3
3 divided by 3 is 1.0

Current number is 4
Let's divide it by 3
4 divided by 3 is 1.3333333333333333
Another kind of loop you can do is the while loop.
The while loop works conditionally. This means you can define a condition and as long as this condition is true, the loop proceeds.
Let's tell Python to count up from zero until it reaches 4.
count_variable = 0
while count_variable < 4:
    print(count_variable)
    count_variable += 1  # This line raises the count variable by 1 after each iteration
0
1
2
3
After each iteration we use the += operator, which is equivalent to count_variable = count_variable + 1.
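A while loop is most useful when the number of iterations is not known in advance. The sketch below keeps halving a number until it drops below 1 and counts how many steps that takes. The names `value` and `steps` are made up for this illustration.

```python
value = 100
steps = 0

# The condition is checked before every iteration
while value >= 1:
    value = value / 2
    steps += 1

print("Halved", steps, "times, final value:", value)
```

Make sure the condition eventually becomes false; otherwise the loop runs forever and has to be stopped with the stop button in the toolbar.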
While loops are not the only place where conditions are needed. Often you want a piece of code to run only when a certain condition is met. For this you can use if/else statements. With if/else statements you can define what code Python should run when a condition is true and what should happen when the condition is not true.
Let's use the names list again from the for loop example above.
This time we only want to greet every second person. This means we need to embed an If/Else statement within the for loop.
names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
greet = True
for name in names:
    if greet:
        print("Hi", name, "!")
    else:
        print("No greeting for", name)
    greet = not greet
Hi Ada !
No greeting for Cornélie
Hi Stanisław !
No greeting for Mathew
Hi Liss !
Let's break that down:
Part of Command | Meaning |
---|---|
greet | The variable greet is set to the initial value True before we enter the loop. |
if greet: | Here we check whether greet is True. If it is, the line print("Hi", name, "!") is executed. |
else: | The else part of the code defines what happens when greet is not True. In this case the respective person is not greeted. |
greet = not greet | After each iteration we use the not operator to reverse the current state of the greet variable. When greet is True it becomes False and the other way around. This way greet changes between True and False in every iteration. |
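The condition of an if/else statement does not have to be a variable that flips back and forth; any expression that evaluates to True or False works. As a sketch, the loop below greets only people whose name has more than four characters. The length threshold and the list `greeted` are assumptions made for this example.

```python
names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
greeted = []  # collects everyone who received a greeting

for name in names:
    if len(name) > 4:
        print("Hi", name, "!")
        greeted.append(name)
    else:
        print("No greeting for", name)

print("Greeted:", greeted)
```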
names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
Desired output:
"Good Morning Ada"
"Good Morning Cornélie"
"Good Morning Stanisław"
"Good Morning Mathew"
"Good Morning Liss"
Generally speaking, functions in Python are blocks of code that only run when they are called. Function calls always follow the pattern:
function_name(arguments)
We already used some of the built-in functions of Python like print(), type() or range().
However, you are not limited to the functions that already exist in Python; you can also write your own.
Let's write a small function and then see how it works.
# with >to="Everyone"< we define a default value for the greeting. This way the function greets everyone by default.
def hello(to="Everyone"):
    return "Hello {}, how are you?".format(to)
hello()
'Hello Everyone, how are you?'
# Now we pass in a different argument to the function, so it greets "Ada" and not "Everyone"
hello(to="Ada")
'Hello Ada, how are you?'
First, let's quickly summarise what happened in the previous code blocks.
We wrote a function called hello(). This function has one argument called to and the default value of this argument is Everyone. The function itself inserts this argument into a character string, so the return value of the function is "Hello Everyone, how are you?" by default.
Now, let's have a more detailed view at what happened:
1. def
When you want to create a new function you need to use def to let Python know you want to define a new function.
2. hello(to='Everyone'):
This part of the code defines the function name (hello()) as well as the argument it accepts (to="Everyone").
An argument is any kind of data or information you want to pass on from outside the function into the function. It is not mandatory to use any arguments, nor is it mandatory to define a default value for an argument.
In the example above we defined a default value for the to= argument. This means that when calling the function it is not necessary to pass the to= argument unless you want to use a value other than the default Everyone.
3. return "Hello {}, how are you?".format(to)
This is the logic of the function. return defines what the function is supposed to return. In this case it returns the character string Hello {}, how are you? with the placeholder filled in.
Right after the character string there is .format(to). This is called a method. As discussed in Section 2.2.1, what you can do with a variable depends on its data type or class. Character strings in Python have many different methods (a kind of sub-function that only works for character strings). One of these methods is format(). format() can be used to replace a placeholder in a character string with a specified value.
Inside the return string "Hello {}, how are you?" there are curly brackets ({}). This is the placeholder that format() replaces with the value passed to it as an argument. In the example above we pass the argument to into format(). to has the default value Everyone, and thus the string our function returns when calling hello() is Hello Everyone, how are you?
What is the difference between a method and a function? A method is a function that is bound to an object. A function is not bound and might be applied to any object. Methods are called by attaching a . to the object name, followed by the method name: object_name.method_name(argument)
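The difference in calling style can be seen side by side in the following sketch. The variable `word` and the function `shout()` are hypothetical names introduced only for this example.

```python
word = "network"

# Function call: the object is passed in as an argument
length = len(word)

# Method call: the method is attached to the object with a dot
upper_word = word.upper()

# A self-written function that uses a string method internally
def shout(text="hello"):
    return text.upper() + "!"

print(length)
print(upper_word)
print(shout())
print(shout(text=word))
```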
So far we only used the base functionality of Python. But you also can extend Python by using additional packages/libraries. Throughout this curriculum we will use quite a variety of different libraries.
Since Python is an open-source programming language, there are thousands of libraries written by the Python community. This means whenever you face an issue you want to solve with Python, the likelihood is rather high that there is a library that is exactly made for solving your specific problem.
The installation of new libraries is very easy. You just need to run pip install PACKAGE_NAME from your command line.
When you run the command from within a Jupyter notebook, put a ! at the beginning of the code cell to inform Jupyter about your intention to run the command as a terminal command and not as a Python command.
Let's try that out by installing a package called pandas. pandas is the most popular Python library for working with tabular data.
!pip install pandas
Collecting pandas
  Downloading pandas-1.2.3-cp38-cp38-manylinux1_x86_64.whl (9.7 MB)
     |████████████████████████████████| 9.7 MB 2.2 MB/s eta 0:00:01
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/site-packages (from pandas) (2.8.1)
Collecting numpy>=1.16.5
  Downloading numpy-1.20.1-cp38-cp38-manylinux2010_x86_64.whl (15.4 MB)
     |████████████████████████████████| 15.4 MB 22.6 MB/s eta 0:00:01
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Installing collected packages: numpy, pandas
Successfully installed numpy-1.20.1 pandas-1.2.3
After successfully installing pandas we can load the library in a code cell and use it in our code.
You can import libraries into your project by using the import statement. Additionally, you have some options when loading libraries. You can define a new name under which the library should be available in your project. And you can also import just a part of the library in case you do not need all of its functionality.
By convention, the library name pandas is abbreviated to pd. So whenever you load the library and want to stick to the convention, you load it as:
import pandas as pd
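The import options described above can be sketched with `math`, a module that ships with Python and needs no installation. The alias `m` is an arbitrary choice for this illustration, not an established convention.

```python
# 1. Import the whole library under its own name
import math
print(math.sqrt(16))

# 2. Import the library under a new, shorter name
import math as m
print(m.sqrt(16))

# 3. Import only a single function from the library
from math import sqrt
print(sqrt(16))
```

All three calls run the same function; only the way it is addressed differs.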
Scientific computing often involves tabular data representation. The advantage of tabular data representation is the clear structure. This data structure is often referred to as data frame or data table. The terminology around data frames differs slightly, depending on the scientific field of the speaker/author.
This table provides a quick overview of different terms for the same concepts:
Element | Term |
---|---|
Column | Feature, Variable, Dimension |
Row | Instance, Observation |
Cell | Value, Data Point, Datum |
Key advantages of pandas data frames:
Let's create an example data frame with pandas. At first we create a dictionary that contains our data structure. Afterwards we transform the dictionary into a pandas data frame.
This is the example table we want to create:
name | birth_year | gender |
---|---|---|
Ada | 1815 | female |
Cornélie | 1965 | female |
Stanisław | 1987 | male |
Mathew | 1896 | male |
Liss | 1976 | non-binary |
data_dict = dict({"name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
"birth_year": [1815, 1965, 1987, 1896, 1976],
"gender": ["female", "female", "male", "male", "non-binary"]})
data_frame = pd.DataFrame(data=data_dict)
data_frame
name | birth_year | gender | |
---|---|---|---|
0 | Ada | 1815 | female |
1 | Cornélie | 1965 | female |
2 | Stanisław | 1987 | male |
3 | Mathew | 1896 | male |
4 | Liss | 1976 | non-binary |
This pandas data frame has some very convenient methods helping us to get an understanding what's in the data.
Summarizing data is crucial for exploring and understanding a data frame. There are different ways of summarizing or describing a data set.
This section presents five very useful ways pandas offers to summarize data.
The describe() method generates descriptive statistics of the data frame. These statistics include measures of central tendency, dispersion and shape of a dataset's distribution. By default, describe() only generates descriptive statistics for numeric data. However, there is the argument include=. When you pass include="all", descriptive statistics are generated for character variables as well.
For a full documentation of the describe() method check out the official documentation.
data_frame.describe()
birth_year | |
---|---|
count | 5.000000 |
mean | 1927.800000 |
std | 72.365047 |
min | 1815.000000 |
25% | 1896.000000 |
50% | 1965.000000 |
75% | 1976.000000 |
max | 1987.000000 |
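As a sketch of the include="all" option mentioned above, the example below rebuilds the example table under the hypothetical name `people` so that it runs on its own, and then requests statistics for all columns.

```python
import pandas as pd

# Same data as the example data frame above, rebuilt so this cell runs on its own
people = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

# include="all" adds rows such as "unique", "top" and "freq" for text columns
summary = people.describe(include="all")
print(summary)

# Individual statistics can be looked up by row label and column name
unique_genders = summary.loc["unique", "gender"]
print("Number of distinct gender values:", unique_genders)
```

Statistics that do not apply to a column, such as the mean of a text column, appear as NaN in the output.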
The info() method generates a more technical summary of the dataset. The output shows information about the index (row labels of the dataset) like the range and the type. Furthermore you get information about the columns and the number of non-null values per column (null in pandas is an umbrella term for any kind of missing value). You also get information about the data type (dtype) of each column.
For a full documentation of the info() method check out the official documentation.
data_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   name        5 non-null      object
 1   birth_year  5 non-null      int64
 2   gender      5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
The value_counts() method generates a table that depicts the counts of unique rows in the data frame. There is a subset argument you can use to define a list of column names; unique combinations are then counted within this subset instead of across all available columns.
For a full documentation of the value_counts() method check out the official documentation.
data_frame.value_counts()
name       birth_year  gender
Ada        1815        female        1
Cornélie   1965        female        1
Liss       1976        non-binary    1
Mathew     1896        male          1
Stanisław  1987        male          1
dtype: int64
The groupby() method enables you to apply data operations on subgroups of the data by using values of one or multiple variables to define the subgroups. When you use the groupby() method, you define subsections of the data and the operation you want to do will be done per subsection.
This is a very useful method for data aggregation and data cleaning.
For a full documentation of the groupby() method check out the official documentation.
The example below groups the dataset by the variable gender and counts the occurrence of each gender in the dataset per column.
data_frame.groupby("gender").count()
name | birth_year | |
---|---|---|
gender | ||
female | 2 | 2 |
male | 2 | 2 |
non-binary | 1 | 1 |
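Counting is only one of the operations that can be applied per group. As a sketch, the example below computes the average birth year per gender. The frame is rebuilt under the hypothetical name `people` so the cell runs on its own.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

# Group by gender, pick the birth_year column, then average within each group
mean_birth_year = people.groupby("gender")["birth_year"].mean()
print(mean_birth_year)
```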
The sort_values() method enables you to sort a dataset by one or multiple columns. You can sort by numerical and character variables and you can define whether you want the order to be ascending (ascending=True) or descending (ascending=False).
For a full documentation of the sort_values() method check out the official documentation.
data_frame.sort_values(by="birth_year", ascending=True)
name | birth_year | gender | |
---|---|---|---|
0 | Ada | 1815 | female |
3 | Mathew | 1896 | male |
1 | Cornélie | 1965 | female |
4 | Liss | 1976 | non-binary |
2 | Stanisław | 1987 | male |
Transforming and manipulating data is a very important part of cleaning and tidying up raw data. It is often crucial to change aspects of the data you want to use, for example by creating new variables, altering existing ones or changing naming conventions.
The rename() method lets you change column names. You can pass a dictionary to the columns= argument. This dictionary maps the present column name(s) to the new column name(s).
For a full documentation of the rename() method check out the official documentation.
The example below changes the column name name to first_name.
data_frame.rename(columns={"name": "first_name"})
first_name | birth_year | gender | |
---|---|---|---|
0 | Ada | 1815 | female |
1 | Cornélie | 1965 | female |
2 | Stanisław | 1987 | male |
3 | Mathew | 1896 | male |
4 | Liss | 1976 | non-binary |
The assign() method adds new columns to a data frame. This method is very useful when you want to generate new variables based on existing ones or if you want to add entirely new data as a column to your dataset.
For a full documentation of the assign() method check out the official documentation.
The example below shows how to add an age column to the dataset.
data_frame.assign(age=2021-data_frame["birth_year"])
name | birth_year | gender | age | |
---|---|---|---|---|
0 | Ada | 1815 | female | 206 |
1 | Cornélie | 1965 | female | 56 |
2 | Stanisław | 1987 | male | 34 |
3 | Mathew | 1896 | male | 125 |
4 | Liss | 1976 | non-binary | 45 |
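Note that rename() and assign() return a new data frame and leave the original untouched. To keep the result, it has to be stored in a variable. A minimal sketch, using a shortened version of the example table under the hypothetical name `people`:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Liss"],
    "birth_year": [1815, 1965, 1976],
})

# The results are stored in new variables; "people" itself is not modified
renamed = people.rename(columns={"name": "first_name"})
with_age = people.assign(age=2021 - people["birth_year"])

print(list(people.columns))    # original column names are unchanged
print(list(renamed.columns))
print(list(with_age.columns))
```

Method calls can also be chained, e.g. `people.rename(...).assign(...)`, because each call returns a data frame.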
When you work with data you will most likely need to filter your data according to some specific logic depending on your task. There are numerous ways to filter and clean up your data. This section provides a brief introduction to the most versatile ways pandas offers.
You can easily select a subgroup of columns by passing a list of the column names into square brackets ([]) at the end of the dataset object:
data_frame[["name", "gender"]]
name | gender | |
---|---|---|
0 | Ada | female |
1 | Cornélie | female |
2 | Stanisław | male |
3 | Mathew | male |
4 | Liss | non-binary |
The iloc indexer is short for integer location and lets you select rows by their numeric position (row number). Unlike a method, it is used with square brackets instead of parentheses.
For a full documentation of iloc check out the official documentation.
data_frame.iloc[[3]]
name | birth_year | gender | |
---|---|---|---|
3 | Mathew | 1896 | male |
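What iloc returns depends on what is passed into the square brackets. The sketch below contrasts three variants on a frame rebuilt under the hypothetical name `people`.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

single_row = people.iloc[3]        # a single position returns a Series
one_row_frame = people.iloc[[3]]   # a list of positions returns a DataFrame
first_two = people.iloc[0:2]       # a slice returns rows 0 and 1 (end excluded)

print(type(single_row).__name__, type(one_row_frame).__name__)
print(first_two)
```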
The drop_duplicates() method enables you to drop duplicated values and only keep distinct ones.
For a full documentation of the drop_duplicates() method check out the official documentation.
The example below selects the column gender and drops all duplicated values in this column.
data_frame[["gender"]].drop_duplicates()
gender | |
---|---|
0 | female |
2 | male |
4 | non-binary |
The sample() method enables you to pick a random subgroup of your dataset.
For a full documentation of the sample() method check out the official documentation.
The example below picks a random sample of size n=3.
data_frame.sample(n=3)
name | birth_year | gender | |
---|---|---|---|
2 | Stanisław | 1987 | male |
4 | Liss | 1976 | non-binary |
1 | Cornélie | 1965 | female |
More complex data filters can easily be applied by using the query() method. You can use logical expressions based on the variables in the dataset inside the query() method to define very distinct filters.
For a full documentation of the query() method check out the official documentation.
The example below applies a filter with the following logic:
data_frame.query(
"gender == 'female' | gender == 'non-binary' & birth_year > 1900")
name | birth_year | gender | |
---|---|---|---|
0 | Ada | 1815 | female |
1 | Cornélie | 1965 | female |
4 | Liss | 1976 | non-binary |
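In a query expression, & binds more tightly than |, which is why Ada (born 1815) is part of the result above: the birth year condition only applies to the non-binary rows. Parentheses make the intended grouping explicit. The sketch below compares both groupings on a frame rebuilt under the hypothetical name `people`.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"],
    "birth_year": [1815, 1965, 1987, 1896, 1976],
    "gender": ["female", "female", "male", "male", "non-binary"],
})

# Same grouping as the unparenthesized query: female OR (non-binary AND born after 1900)
default_grouping = people.query(
    "gender == 'female' | (gender == 'non-binary' & birth_year > 1900)")

# Alternative grouping: (female OR non-binary) AND born after 1900
alternative_grouping = people.query(
    "(gender == 'female' | gender == 'non-binary') & birth_year > 1900")

print(default_grouping)
print(alternative_grouping)
```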
The last topic covered in this notebook is the process of exporting (saving) data and importing (loading) data. This section focuses on handling tabular data, since this is the most important data format for this curriculum.
Exporting a dataset with pandas is very easy. There is a variety of different output formats you can choose from. Every output format has a different method. The table below provides an overview of the most common export formats. You can click on any method name to see the official documentation of the respective method.
Method | Details |
---|---|
to_csv() | CSV is the most popular and generic data format for tabular data. Click here for more details. |
to_excel() | The Excel format is ideal when you want to continue working with the data in Microsoft Excel. |
to_pickle() | Pickle is a Python-specific data format. Click here for more details. |
to_feather() | Feather is a data format of Apache Arrow. It is well suited for exchanging data between Python and R. Click here for more details. |
to_parquet() | Parquet (Apache Parquet) is a columnar data format that is widely used with Apache Spark. To some extent it is similar to Feather, and it is used extensively in cloud computing environments. Click here for more details. |
The example below shows how to save a dataset to a local file called "example_export.csv".
data_frame.to_csv("./example_export.csv", index=False)
Importing a dataset into the current Python session is as easy as exporting the data. For every export method in the table above, there is a complementary import function.
The name of every import function starts with read_.
The file that was exported in the cell above can be imported by using the read_csv() function as shown in the cell below.
example_import = pd.read_csv("./example_export.csv")
example_import
name | birth_year | gender | |
---|---|---|---|
0 | Ada | 1815 | female |
1 | Cornélie | 1965 | female |
2 | Stanisław | 1987 | male |
3 | Mathew | 1896 | male |
4 | Liss | 1976 | non-binary |
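The index=False argument used in the export above keeps the row labels out of the file. The sketch below shows what happens with and without it. It writes to a temporary folder, and the file name `roundtrip.csv` and the frame name `people` are made up for this illustration.

```python
import os
import tempfile

import pandas as pd

people = pd.DataFrame({
    "name": ["Ada", "Cornélie", "Liss"],
    "birth_year": [1815, 1965, 1976],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "roundtrip.csv")

    # Export without the index: the file contains only the data columns
    people.to_csv(path, index=False)
    without_index = pd.read_csv(path)

    # Export with the index (the default): the row labels become an extra column
    people.to_csv(path)
    with_index = pd.read_csv(path)

print(list(without_index.columns))
print(list(with_index.columns))
```

When the index is written to the file, read_csv() reads it back as an additional unnamed column, which is usually not wanted for a simple numeric index.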
City | Country | Inhabitants | Wikipedia URL |
---|---|---|---|
Trincomalee | Sri Lanka | 99135 | https://en.wikipedia.org/wiki/Trincomalee |
Kołobrzeg | Poland | 46830 | https://en.wikipedia.org/wiki/Ko%C5%82obrzeg |
Manali | India | 8096 | https://en.wikipedia.org/wiki/Manali,_Himachal_Pradesh |
St. Paul's Bay | Malta | 29097 | https://en.wikipedia.org/wiki/St._Paul%27s_Bay |
This was the introduction to Jupyter notebooks and Python. This notebook introduced you to the basic programming concepts we are going to use in this curriculum. Come back to this notebook in case you want to refresh some Python basics. In the next notebook we learn the basics of graph theory and we are going to analyze a network of Nobel laureates.
This section provides the solutions for the exercises in this notebook.
for i in range(16):
    print(100+i)
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
Desired output:
"Good Morning Ada"
"Good Morning Cornélie"
"Good Morning Stanisław"
"Good Morning Mathew"
"Good Morning Liss"
names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
for name in names:
    print("Good Morning", name)
Good Morning Ada
Good Morning Cornélie
Good Morning Stanisław
Good Morning Mathew
Good Morning Liss
names = ["Ada", "Cornélie", "Stanisław", "Mathew", "Liss"]
night = True
for name in names:
    if night:
        print("Good Night", name)
    else:
        print("Good Morning", name)
    night = not night
Good Night Ada
Good Morning Cornélie
Good Night Stanisław
Good Morning Mathew
Good Night Liss
City | Country | Inhabitants | Wikipedia URL |
---|---|---|---|
Trincomalee | Sri Lanka | 99135 | https://en.wikipedia.org/wiki/Trincomalee |
Kołobrzeg | Poland | 46830 | https://en.wikipedia.org/wiki/Ko%C5%82obrzeg |
Manali | India | 8096 | https://en.wikipedia.org/wiki/Manali,_Himachal_Pradesh |
St. Paul's Bay | Malta | 29097 | https://en.wikipedia.org/wiki/St._Paul%27s_Bay |
import pandas as pd
data_dict = dict({"City": ["Trincomalee", "Kołobrzeg", "Manali", "St. Paul's Bay"],
"Country": ["Sri Lanka", "Poland", "India", "Malta"],
"Inhabitants": [99135, 46830, 8096, 29097],
"Wikipedia URL": ["https://en.wikipedia.org/wiki/Trincomalee", "https://en.wikipedia.org/wiki/Ko%C5%82obrzeg",
"https://en.wikipedia.org/wiki/Manali,_Himachal_Pradesh","https://en.wikipedia.org/wiki/St._Paul%27s_Bay"]})
data_frame = pd.DataFrame(data=data_dict)
data_frame
City | Country | Inhabitants | Wikipedia URL | |
---|---|---|---|---|
0 | Trincomalee | Sri Lanka | 99135 | https://en.wikipedia.org/wiki/Trincomalee |
1 | Kołobrzeg | Poland | 46830 | https://en.wikipedia.org/wiki/Ko%C5%82obrzeg |
2 | Manali | India | 8096 | https://en.wikipedia.org/wiki/Manali,_Himachal... |
3 | St. Paul's Bay | Malta | 29097 | https://en.wikipedia.org/wiki/St._Paul%27s_Bay |
data_frame.sort_values(by="Inhabitants", ascending=False)
City | Country | Inhabitants | Wikipedia URL | |
---|---|---|---|---|
0 | Trincomalee | Sri Lanka | 99135 | https://en.wikipedia.org/wiki/Trincomalee |
1 | Kołobrzeg | Poland | 46830 | https://en.wikipedia.org/wiki/Ko%C5%82obrzeg |
3 | St. Paul's Bay | Malta | 29097 | https://en.wikipedia.org/wiki/St._Paul%27s_Bay |
2 | Manali | India | 8096 | https://en.wikipedia.org/wiki/Manali,_Himachal... |
sampled_rows = data_frame.sample(n=2)
sampled_rows
City | Country | Inhabitants | Wikipedia URL | |
---|---|---|---|---|
3 | St. Paul's Bay | Malta | 29097 | https://en.wikipedia.org/wiki/St._Paul%27s_Bay |
1 | Kołobrzeg | Poland | 46830 | https://en.wikipedia.org/wiki/Ko%C5%82obrzeg |
sampled_rows.to_csv("./sampled_rows.csv")