Pandas is an open-source Python library built on top of NumPy. It allows you to do fast analysis as well as data cleaning and preparation (it is often reported that data scientists spend more than 80% of their time cleaning and preparing data).

A. Installing Pandas

  • Colab users: Pandas comes pre-installed in Google Colab
  • Conda users: run conda install pandas from the command line
  • Non-conda users: run pip install pandas from the command line (make sure your venv is activated if you are using one)
In [1]:
# imports -- make sure you run this cell to follow along with the tutorial

import numpy as np
import pandas as pd

B. Pandas Data Structures

B.1 Series

A Series is a one-dimensional array which is very similar to a NumPy array. As a matter of fact, Series are built on top of NumPy array objects. What differentiates a Series from a NumPy array is that a Series can have axis labels with which it can be indexed.

Here is the basic syntax for creating a Series:

my_series = pd.Series(data, index)

From the above, data can be any object type such as a dictionary, a list, or even a NumPy array, while index signifies the axis labels with which the Series will be indexed.

Here is an example:

In [23]:
countries = ['Kenya', 'Rwanda', 'Tanzania', 'Uganda', 'DRC']
country_codes = ['+254', '+250', '+255', '+256', '+243']
In [4]:
countries_serie = pd.Series(country_codes, countries)
In [5]:
countries_serie
Out[5]:
Kenya       +254
Rwanda      +250
Tanzania    +255
Uganda      +256
DRC         +243
dtype: object

Note: the index is optional; if you leave it out, pandas creates a default integer index (0, 1, 2, ...).

We can also create a Series from a dict or from a NumPy array.

What differentiates a Pandas Series from a NumPy array is that Pandas Series can hold a variety of object types.
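As a small sketch (reusing the country data above): a dict's keys become the index, and a plain NumPy array gets the default integer index.

# from a dict: keys become the index and values become the data
codes_from_dict = pd.Series({'Kenya': '+254', 'Rwanda': '+250', 'Tanzania': '+255'})

# from a NumPy array: no index is given, so pandas uses 0, 1, 2, ...
codes_from_array = pd.Series(np.array(['+254', '+250', '+255']))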

Grabbing information from Series

We grab information from a Series the same way we do from a dictionary:

In [9]:
countries_serie.get('Ethiopia', '0') # .get() returns the default value instead of raising a KeyError, so it is the safest way
Out[9]:
'0'
In [7]:
countries_serie['DRC']
Out[7]:
'+243'

Performing Arithmetic operations on Series

Operations on Series are aligned on the index. When we use any of the arithmetic operations such as -, +, /, *, pandas matches values by index label before computing. If a label exists in only one of the Series, the result for that label is NaN and the dtype is converted to float so that you do not lose any information.

In [12]:
prices1 = pd.Series([10, 23, 34, 35], ['tomato', 'banana', 'avocados', 'beans'])
prices2 = pd.Series([12, 13, 54, 65], ['tomato', 'banana', 'avocados', 'beans'])
In [13]:
prices1 + prices2
Out[13]:
tomato       22
banana       36
avocados     88
beans       100
dtype: int64
In [14]:
prices1 - prices2
Out[14]:
tomato      -2
banana      10
avocados   -20
beans      -30
dtype: int64
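To illustrate the float conversion mentioned above, here is a sketch with two Series whose labels only partially overlap; the unmatched labels become NaN and the result is promoted to float (the exact row order may differ):

prices3 = pd.Series([10, 23], ['tomato', 'banana'])
prices1 + prices3
# avocados     NaN
# banana      46.0
# beans        NaN
# tomato      20.0
# dtype: float64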

B.2. DataFrames

A DataFrame is a two-dimensional data structure in which the data is aligned in a tabular form i.e. in rows and columns. Pandas DataFrames make manipulating your data easy. You can select, replace columns and rows and even reshape your data.

A DataFrame is the core data structure of pandas. You can view it as a collection of Series sharing the same index, as an Excel sheet, a SQL table, or a matrix with labels.

Here is the basic syntax for creating a DataFrame:

pd.DataFrame(data,index)

data can be any structured data type:

- a dictionary where keys are column names and values are lists of values
- a list of Series or a list of NumPy arrays
- a 2D NumPy array
- etc.
In [24]:
countries
country_codes
Out[24]:
['+254', '+250', '+255', '+256', '+243']
In [25]:
capitals = ['Nairobi', 'Kigali', 'Dar es Salaam', 'Kampala', 'Kinshasa']
In [61]:
country_df = pd.DataFrame(data={'capital': capitals, 'codes':country_codes}, index=countries)
country_df
Out[61]:
            capital         codes
Kenya       Nairobi          +254
Rwanda      Kigali           +250
Tanzania    Dar es Salaam    +255
Uganda      Kampala          +256
DRC         Kinshasa         +243

Most of the time in your data science projects you will not create DataFrames by hand, but read them from different data sources.

C. Useful functions

1. Data Input and Output

Using the pd.read_* methods, Pandas allows you to access data from a wide variety of sources such as Excel sheets, CSV files, SQL databases, Google Sheets, HTML pages, etc. (For some formats you need to install additional libraries.)

To reference any of these files, you need to pass the path (or URL) of the file you are reading.
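For instance, a few of the readers look like this (a sketch; the file names are placeholders, and some of these readers need extra libraries such as openpyxl or lxml):

pd.read_csv('my_data.csv')                      # comma-separated values
pd.read_excel('my_data.xlsx')                   # Excel sheets
pd.read_html('https://example.com/table.html')  # returns a list of DataFrames found in the page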

Let us do some data science work. I have created a form that you will fill in with some data, and we are going to work with it.

Let us read the data from the sheet we just created (the sheet needs to be shared so that it is available to everyone with the link).

In [3]:
doc_id = '1bIhLt6BO4byo2VnqdIgzdEWWdfQU-eD2vsZfXeHGHjk'
sheet_id = 809226885
path = 'https://docs.google.com/spreadsheets/d/{}/export?gid={}&format=csv'.format(doc_id, sheet_id)
In [64]:
data = pd.read_csv(path,
                   # use the first column as the index of the data frame
                   index_col=0,
                   # parse_dates=[...] could be passed here to parse date columns to datetime
                  )
In [65]:
data.head()
Out[65]:
Email address First Name Last Name Country of Origin Python proficiency Numpy Proficiency Gender Date of inscription
Timestamp
07/09/2018 20:31:18 [email protected] Espoir Mur Democratic Republic of the Congo 7 7 Male 09/12/2018
08/09/2018 13:42:39 [email protected] Mick Ganza Rwanda 1 1 Male 09/04/2018
08/09/2018 13:42:45 [email protected] Joseph Manzi Rwanda 7 7 Male 30/08/1983
08/09/2018 13:42:45 [email protected] Mizerere Jean Rwanda 5 3 Male 08/09/2018
08/09/2018 13:46:16 [email protected] Anonymous Metasploit South Sudan 10 10 Female 06/05/2005

You can either download the document and read it from your laptop, or read it directly from the URL as we did above. We can use reset_index() to move the current index back into a regular column, and set_index() to pick another column as the index:

In [66]:
data.reset_index(inplace=True)
In [67]:
data.head()
Out[67]:
Timestamp Email address First Name Last Name Country of Origin Python proficiency Numpy Proficiency Gender Date of inscription
0 07/09/2018 20:31:18 [email protected] Espoir Mur Democratic Republic of the Congo 7 7 Male 09/12/2018
1 08/09/2018 13:42:39 [email protected] Mick Ganza Rwanda 1 1 Male 09/04/2018
2 08/09/2018 13:42:45 [email protected] Joseph Manzi Rwanda 7 7 Male 30/08/1983
3 08/09/2018 13:42:45 [email protected] Mizerere Jean Rwanda 5 3 Male 08/09/2018
4 08/09/2018 13:46:16 [email protected] Anonymous Metasploit South Sudan 10 10 Female 06/05/2005
In [69]:
data.set_index('First Name', inplace=True)
In [70]:
data.head()
Out[70]:
Timestamp Email address Last Name Country of Origin Python proficiency Numpy Proficiency Gender Date of inscription
First Name
Espoir 07/09/2018 20:31:18 [email protected] Mur Democratic Republic of the Congo 7 7 Male 09/12/2018
Mick 08/09/2018 13:42:39 [email protected] Ganza Rwanda 1 1 Male 09/04/2018
Joseph 08/09/2018 13:42:45 [email protected] Manzi Rwanda 7 7 Male 30/08/1983
Mizerere 08/09/2018 13:42:45 [email protected] Jean Rwanda 5 3 Male 08/09/2018
Anonymous 08/09/2018 13:46:16 [email protected] Metasploit South Sudan 10 10 Female 06/05/2005

Once we have our DataFrame, we can do the following:

1. Selecting Columns from DataFrames

In [43]:
# e.g. select the 'Last Name' column from our df

Using bracket notation [], we can easily grab objects from a DataFrame the same way it is done with Series. Let's grab a column by name:

Because we grabbed a single column, it returns a Series. Go ahead and confirm the data type returned using type().

In [71]:
data['Last Name']
Out[71]:
First Name
Espoir              Mur
Mick              Ganza
Joseph            Manzi
Mizerere           Jean
Anonymous    Metasploit
Kenneth        Kamurali
Aimable          GAKIRE
John                Doe
Spider              Man
pac                 bob
kiki               Paul
Jay               Brown
Dushime         Aimable
Sylvia          Asiimwe
testa             atset
Tunmise            Raji
stef               kijo
David            Butera
Kenneth        Kamurali
Musafiri     ildephonse
Reponse            Jean
Jessica        Ingabire
Jean          uwumukiza
Karamba          Gaston
Name: Last Name, dtype: object

2. Renaming columns

We can rename columns by passing a dictionary mapping old names to new names, together with the axis, to df.rename():

In [50]:
## rename the column 'Python proficiency' to 'python'
### and 'Numpy Proficiency' to 'numpy'
In [73]:
data.columns
Out[73]:
Index(['Timestamp', 'Email address', 'Last Name', 'Country of Origin',
       'Python proficiency', 'Numpy Proficiency', 'Gender',
       'Date of inscription'],
      dtype='object')
In [75]:
data.rename({'Python proficiency': 'python', 
              'Numpy Proficiency': 'numpy'}, axis=1, inplace=True)

3. Adding Columns to a DataFrame

We can create a new column from scratch or derive one from existing columns:

In [72]:
# e.g. add the python and numpy proficiencies to create a new column.
In [77]:
data['data_proficiency'] = data['python'] + data['numpy']
In [4]:
data.head()

4. Removing rows/columns from a DataFrame

We can remove a row or a column using the .drop() method. In doing this, we have to specify axis=0 for a row and axis=1 for a column.

This is very important to know.

In [82]:
## e.g. remove the data_proficiency column we just created
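A minimal sketch of what this could look like, assuming the data_proficiency column created above:

data.drop('data_proficiency', axis=1, inplace=True)   # axis=1 removes a column
# data.drop('Espoir', axis=0) would remove the row whose index label is 'Espoir'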

5. Selecting Rows in a DataFrame

To select rows, we have to call the location of the rows using .loc[] which takes in the label name or .iloc[] which takes in the index position of the row.


In [46]:
# e.g. get the row that has your name as its index
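One possible answer, assuming 'First Name' is still the index we set earlier:

data.loc['Espoir']    # label-based: the row whose index label is 'Espoir'
data.iloc[0]          # position-based: the first row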

6. Conditional selection

Pandas allows you to perform conditional selection using bracket notation []. For example, data[data['python'] > 0] returns the rows where the python score is greater than 0:

In [47]:
# get the rows for ladies with a python proficiency greater than 6
In [48]:
# get just their name and country of origin

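A sketch of both exercises, assuming the column was renamed to 'python' as above:

# ladies with a python proficiency greater than 6
ladies = data[(data['Gender'] == 'Female') & (data['python'] > 6)]

# keep only their last name and country of origin (the first name is the index)
ladies[['Last Name', 'Country of Origin']]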

6.a Go deeper: find the difference between loc, iloc, at, ix, iat

Hint: loc: works on index labels only

iloc: works on integer positions

ix: this was the most general, supporting both label- and position-based retrieval, but it is deprecated and has been removed from recent pandas versions

at: gets a scalar value; it is a very fast loc

iat: gets a scalar value; it is a very fast iloc
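A quick sketch of the difference on our data (with 'First Name' as the index; the column position is an assumption based on the column order shown above):

data.loc['Espoir', 'Country of Origin']    # label-based
data.iloc[0, 3]                            # position-based: row 0, column 3
data.at['Espoir', 'Country of Origin']     # fast scalar access by label
data.iat[0, 3]                             # fast scalar access by position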

7. The query() method

You can also use the .query() method, where you embed boolean expressions on columns inside a string, for example df.query('one > 0') for a DataFrame with a column named one.
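On our data (with the renamed 'python' and 'numpy' columns), a sketch could look like:

data.query('python > 6')                 # rows where the python proficiency is above 6
data.query('python > 6 and numpy > 6')   # several conditions can be combined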

8. Missing Data

A lot of times, when you are using Pandas to read in data and there are missing points, Pandas will automatically fill in those missing points with a NaN or Null value. Hence, we can either drop those auto-filled values using .dropna() or fill them using .fillna().

Let us find missing data in the non-required columns and either fill it or drop the corresponding rows.

Say you have a large dataset, Pandas has made it very easy to locate null values using .isnull():

In [49]:
# fill empty values in the python and numpy columns
# drop rows with NA in Last Name

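A minimal sketch of the usual workflow (adapt the column names to whichever columns actually contain missing values):

data.isnull().sum()                           # count missing values per column
data['python'] = data['python'].fillna(0)     # fill missing proficiency scores with 0
data = data.dropna(subset=['Last Name'])      # drop rows where 'Last Name' is missing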

9. GroupBy

GroupBy allows you to group rows together based on a column so that you can perform aggregate functions (such as sum, mean, median, standard deviation, etc.) on them.

Using the .groupby() method, we can group rows based on the 'Country of Origin' column, call the aggregate function .mean() on the groups, and get the average proficiency in python and numpy.

We can apply other functions such as count() or describe() (for a statistical description).

In [51]:
## group by country and get the mean score in python
## group by gender and get the highest python score among the ladies

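A sketch of both exercises on our data:

# mean python score per country
data.groupby('Country of Origin')['python'].mean()

# highest python score per gender (look at the 'Female' row for the ladies)
data.groupby('Gender')['python'].max()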

10. The apply() Method

The .apply() method is used to call custom functions on a DataFrame. Imagine we have a function:

In [52]:
## get the square of the proficiency in python

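For instance, with a small custom function (a sketch):

def square(x):
    return x ** 2

data['python'].apply(square)   # or equivalently: data['python'].apply(lambda x: x ** 2)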

11. The map() method

We can apply .map() to change the values of a column:

In [53]:
## map Gender and return M for Male and F for Female

hint: find yourself
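One way to do it is to pass a dictionary to .map() (a sketch):

data['Gender'].map({'Male': 'M', 'Female': 'F'})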

12. Sorting and Ordering a DataFrame

Imagine we wanted to display the DataFrame with a certain column in ascending order; we could easily sort it using .sort_values():

In [55]:
### let us sort our data by country

hint: google is your friend
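A possible answer (a sketch):

data.sort_values(by='Country of Origin')          # ascending by default
data.sort_values(by='python', ascending=False)    # highest python scores first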

Advanced topics

13. Concatenating, Merging, and Joining DataFrames

Concatenation basically glues DataFrames together. When concatenating DataFrames, keep in mind that dimensions should match along the axis you are concatenating on. Given a list of DataFrames:

In [58]:
## let us work with the following DataFrames
In [59]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

hint??: Use google
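A sketch of the three operations on these frames (merging and joining usually need a shared key or overlapping index; here the same frames are reused just to show the calls):

pd.concat([df1, df2, df3])                         # glue the frames on top of each other (axis=0)
pd.concat([df1, df2, df3], axis=1)                 # glue them side by side, aligning on the index
pd.merge(df1, df2, on='A', how='inner')            # SQL-style join on a common column
df1.join(df2, lsuffix='_left', rsuffix='_right')   # join on the index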

14. Pivot tables

15. Take advantage of accessor methods (str, dt, cat)

More info in the pandas documentation on accessors.

16. Working with dates

17. Plotting