Pandas is an open-source Python library built on top of NumPy.
It allows you to do fast analysis as well as data cleaning and preparation.
**(it has been estimated that data scientists spend more than 80% of their time cleaning and preparing data)**

**Colab users:** Pandas comes pre-installed in Google Colab.

**Conda users:** run `conda install pandas` from the command line.

**Non-conda users:** run `pip install pandas` from the command line (make sure your venv is activated if you are using one).

In [1]:

```
# imports: make sure you run this cell to follow along with the tutorial
import numpy as np
import pandas as pd
```

A Series is a one-dimensional array which is very similar to a NumPy array. As a matter of fact, Series are built on top of NumPy array objects. What differentiates a Series from a NumPy array is that a Series can have axis labels with which it can be indexed.

Here is the basic syntax for creating a Series:

`my_series = pd.Series(data, index)`

From the above, data can be any object type such as a dictionary, a list, or even a NumPy array, while index signifies the axis labels with which the Series will be indexed.

Here is an example:

In [23]:

```
countries = ['Kenya', 'Rwanda', 'Tanzania', 'Uganda', 'DRC']
country_codes = ['+254', '+250', '+255', '+256', '+243']
```

In [4]:

```
countries_serie = pd.Series(country_codes, countries)
```

In [5]:

```
countries_serie
```

Out[5]:

Note: the index is optional; it can be inferred from the data.

We can also create a Series from a dict or a NumPy array.

What differentiates a Pandas Series from a NumPy array is that Pandas Series can hold a variety of object types.
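As a small sketch with made-up values: a Series can be built straight from a dict, and it can hold mixed object types:

```python
import pandas as pd

# Building a Series from a dict: keys become the index, values the data
codes = pd.Series({'Kenya': '+254', 'Rwanda': '+250', 'Uganda': '+256'})

# A Series can hold a mix of object types (here an int, a string and a builtin)
mixed = pd.Series([10, 'ten', len])
```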

We grab information from a Series the same way we do from a dictionary:

In [9]:

```
countries_serie.get('Ethiopia', '0') # safest way: returns the default instead of raising a KeyError
```

Out[9]:

In [7]:

```
countries_serie['DRC']
```

Out[7]:

Operations on Series are done based off the index. When we use any of the mathematical operations such as -, +, /, *, pandas aligns the values by matching their index labels. Labels that appear in only one of the Series get a NaN result, and integer results are converted to float so that you do not lose any information.

In [12]:

```
prices1 = pd.Series([10, 23, 34, 35], ['tomato', 'banana', 'avocados', 'beans'])
prices2 = pd.Series([12, 13, 54, 65], ['tomato', 'banana', 'avocados', 'beans'])
```

In [13]:

```
prices1 + prices2
```

Out[13]:

In [14]:

```
prices1 - prices2
```

Out[14]:
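To see the alignment in action, here is a small sketch with made-up prices where the two Series do not share all their labels:

```python
import pandas as pd

s1 = pd.Series([10, 23], ['tomato', 'banana'])
s2 = pd.Series([12, 54], ['tomato', 'avocado'])

# 'tomato' exists in both Series, so the values are added;
# 'banana' and 'avocado' each exist in only one, so the result is NaN
total = s1 + s2
```

Note how the integer inputs become floats in the result.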

A DataFrame is a two-dimensional data structure in which the data is aligned in a tabular form i.e. in rows and columns. Pandas DataFrames make manipulating your data easy. You can select, replace columns and rows and even reshape your data.

A DataFrame is the core data structure of pandas. You can view it as a list of Series sharing the same index, an Excel sheet, a SQL table, or a matrix with labels.

Here is the basic syntax for creating a DataFrame:

`pd.DataFrame(data,index)`

data can be any structured data type:

- a dictionary where keys are column names and values are lists of values
- a list of Series or a list of NumPy arrays
- a 2D NumPy array
- etc.

In [24]:

```
countries
country_codes
```

Out[24]:

In [25]:

```
capitals = ['NBO', 'KG', 'DES', 'KLA', 'KIN']
```

In [61]:

```
country_df = pd.DataFrame(data={'capital': capitals, 'codes':country_codes}, index=countries)
country_df
```

Out[61]:

Most of the time in your data science projects you will not create DataFrames by hand, but read them from different data sources.

Using the pd.read_* methods, Pandas lets you access data from a wide variety of sources such as Excel sheets, CSV files, SQL databases, Google Sheets, HTML, etc. (for some formats you need to install additional libraries).

To reference any of the files, you need to pass the path of the file you are reading

Let us do some data science work: I have created a form for you to fill in with some data, and we are going to work with it.

Let's read the data from the recently created sheet (its sharing settings need to be set so that it is available to everyone with the link).

In [3]:

```
doc_id = '1bIhLt6BO4byo2VnqdIgzdEWWdfQU-eD2vsZfXeHGHjk'
sheet_id = 809226885
path = 'https://docs.google.com/spreadsheets/d/{}/export?gid={}&format=csv'.format(doc_id, sheet_id)
```

In [64]:

```
data = pd.read_csv(path,
                   # set the first column as the index of the DataFrame
                   index_col=0)
```

In [65]:

```
data.head()
```

Out[65]:

You can either download the document and read it from your laptop, or read it directly from the URL as above.

In [66]:

```
data.reset_index(inplace=True)
```

In [67]:

```
data.head()
```

Out[67]:

In [69]:

```
data.set_index('First Name', inplace=True)
```

In [70]:

```
data.head()
```

Out[70]:

Once we have our dataframe we can :

In [43]:

```
# e.g.: select the 'Last Name' column from our df
```

Using bracket notation [], we can easily grab objects from a DataFrame the same way it is done with Series. Let's grab a column by its name.

Because we grabbed a single column, it returns a Series. Go ahead and confirm the data type returned using type().

In [71]:

```
data['Last Name']
```

Out[71]:

We can rename columns by passing df.rename a dictionary mapping old column names to new ones, together with the axis:

In [50]:

```
## rename the 'Python proficiency' column to 'python'
### and the 'Numpy Proficiency' column to 'numpy'
```

In [73]:

```
data.columns
```

Out[73]:

In [75]:

```
data.rename({'Python proficiency': 'python',
             'Numpy Proficiency': 'numpy'}, axis=1, inplace=True)
```

We can create a new column from scratch, or derive one from existing columns:

In [72]:

```
# e.g.: add the python and numpy proficiencies to create a new column.
```

In [77]:

```
data['data_proficiency'] = data['python'] + data['numpy']
```

In [4]:

```
data.head()
```

**Hint: see the pandas documentation**

The following selection and removal operations are very important to know:

In [82]:

```
## e.g.: remove the newly created column
```

In [46]:

```
# e.g.: get the row with your name as the index
```

Pandas allows you to perform conditional selection using bracket notation []. Passing a boolean condition on a column returns only the rows where the condition holds:
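For instance, a minimal sketch on a made-up proficiency table:

```python
import pandas as pd

df = pd.DataFrame({'pandas': [7, 4, 9], 'python': [8, 6, 5]},
                  index=['Alice', 'Bob', 'Carol'])

# Boolean mask: keep only the rows where pandas proficiency is above 6
skilled = df[df['pandas'] > 6]
```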

In [47]:

```
# get the rows for ladies with a pandas proficiency greater than 6
```

In [48]:

```
# get just their name and country of origin
```

hint:
**loc:** works on index labels only

**iloc:** works on positions

**ix:** was the most general, supporting both label- and position-based retrieval, but it is deprecated; use loc or iloc instead

**at:** gets a scalar value; it is a very fast loc

**iat:** gets a scalar value; it is a very fast iloc
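A quick sketch of the difference, using the countries data from earlier (redefined here so the example is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'capital': ['NBO', 'KG'], 'code': ['+254', '+250']},
                  index=['Kenya', 'Rwanda'])

by_label = df.loc['Rwanda']         # row lookup by index label
by_position = df.iloc[1]            # row lookup by position (same row here)
scalar = df.at['Kenya', 'capital']  # fast scalar access by labels
```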

Also, use the query method, where you can embed boolean expressions on columns within quotes, for example `df.query('one > 0')`.
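A minimal sketch of query on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, -2, 3], 'two': [4, 5, 6]})

# keep only the rows where column 'one' is positive
positive = df.query('one > 0')
```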

A lot of times, when you're using Pandas to read in data and there are missing points, Pandas will automatically fill in those missing points with a NaN or null value. Hence, we can either drop those auto-filled values using .dropna() or fill them using .fillna().

Let's find the missing data in the non-required columns and either fill it in or drop the corresponding rows.

Say you have a large dataset, Pandas has made it very easy to locate null values using .isnull():
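A small sketch of all three on a made-up frame with missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'python': [8, np.nan, 5], 'numpy': [np.nan, 6, 7]})

mask = df.isnull()      # True wherever a value is missing
filled = df.fillna(0)   # replace every NaN with 0
dropped = df.dropna()   # keep only the rows without any NaN
```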

In [49]:

```
# fill the empty values in the 'pandas' and 'numpy' columns
# drop the rows with NA in 'Last Name'
```

**Hint: see the documentation for `fillna` and `dropna`**

Groupby allows you to group rows together based off a column so that you can perform aggregate functions (such as sum, mean, median, standard deviation, etc.) on them.

Using the .groupby() method, we can group rows based on the 'country' column and call the aggregate function .mean() on it to get the mean proficiency values in pandas and python:

We can apply other functions such as count or describe (for a statistical description).
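A minimal sketch on a made-up frame (the column names here are assumptions, not the ones from the form data):

```python
import pandas as pd

df = pd.DataFrame({'country': ['Kenya', 'Kenya', 'Rwanda'],
                   'python': [8, 6, 9]})

# mean python score per country
means = df.groupby('country')['python'].mean()

# number of rows per country
counts = df.groupby('country')['python'].count()
```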

In [51]:

```
## group by country and get the mean score in python
## group by gender and get the lady with the max score in python
```

**Hint: see the documentation for `groupby`**

The .apply() method is used to call a custom function on each value of a DataFrame or Series. Imagine we have a function that squares its input:
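A minimal sketch, applying a hypothetical squaring function to a Series:

```python
import pandas as pd

def square(x):
    return x ** 2

s = pd.Series([2, 3, 4])

squared = s.apply(square)                  # named function
also_squared = s.apply(lambda x: x ** 2)   # same thing with a lambda
```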

In [52]:

```
## get the square of the python proficiency
```

**Hint: see the documentation for `apply`**

We can also apply .map() to change the values of a column:
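A minimal sketch with a made-up gender column:

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', 'Male'])

# map each value through a dict of replacements
short = gender.map({'Male': 'M', 'Female': 'F'})
```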

In [53]:

```
## map gender and return M for Male and F for Female
```

**hint: try it yourself**

Imagine we wanted to display the DataFrame with a certain column being displayed in ascending order, we could easily sort it using .sort_values():
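A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'country': ['Uganda', 'Kenya', 'Rwanda'],
                   'python': [6, 8, 9]})

by_country = df.sort_values('country')                # ascending by default
by_score = df.sort_values('python', ascending=False)  # highest score first
```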

In [55]:

```
### let's sort our data by country
```

**hint: google is your friend**

Concatenation basically glues DataFrames together. When concatenating DataFrames, keep in mind that dimensions should match along the axis you are concatenating on. Suppose we have a list of DataFrames:

In [58]:

```
## let's work with the following DataFrames
```

In [59]:

```
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
```
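Frames like the ones above can be glued with pd.concat; a smaller self-contained sketch:

```python
import pandas as pd

a = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
b = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

stacked = pd.concat([a, b])       # glue along rows (axis=0)
side = pd.concat([a, a], axis=1)  # glue along columns, matching on the index
```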

**hint: use Google**

More info in the official pandas documentation.