Replace Not available with NaN value in pandas? (Example.csv)

In [1]:

import pandas as pd

df=pd.read_csv('Example.csv',na_values=['not available'])
df.head()

Out[1]:

	name	class	total marks
0	Mukul	12	454.0
1	Rohan	12	433.0
2	Shivam	11	NaN
3	Ragav	11	NaN
4	Monu	10	456.0

To get the first 3 rows of a given DataFrame.

Sample Python dictionary data and list labels: exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']} labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [8]:

dict_data = {'name': list(df.loc[:2,'name']),'class': list(df.loc[:2,'class'])}
dict_data

Out[8]:

{'name': ['Mukul', 'Rohan', 'Shivam'], 'class': [12, 12, 11]}

In [2]:

import numpy as np
import pandas as pd
dictx= {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
df2=pd.DataFrame(dictx,index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
df2

Out[2]:

	name	score	attempts	qualify
a	Anastasia	12.5	1	yes
b	Dima	9.0	3	no
c	Katherine	16.5	2	yes
d	James	NaN	3	no
e	Emily	9.0	2	no
f	Michael	20.0	3	yes
g	Matthew	14.5	1	yes
h	Laura	NaN	1	no
i	Kevin	8.0	2	no
j	Jonas	19.0	1	yes

In [4]:

df2.iloc[:3]

Out[4]:

	name	score	attempts	qualify
a	Anastasia	12.5	1	yes
b	Dima	9.0	3	no
c	Katherine	16.5	2	yes

To select the 'name' and 'score' columns from frame

In [10]:

df2.loc[:,['name','score']]

Out[10]:

	name	score
a	Anastasia	12.5
b	Dima	9.0
c	Katherine	16.5
d	James	NaN
e	Emily	9.0
f	Michael	20.0
g	Matthew	14.5
h	Laura	NaN
i	Kevin	8.0
j	Jonas	19.0

To select the specified columns and rows from a given data frame.

Select 'name' and 'score' columns in rows 1, 3, 5, 6 from the following data frame.

In [11]:

df2.index =[0,1,2,3,4,5,6,7,8,9]

In [12]:

df2.loc[[1,3,5,6],['name','score']]

Out[12]:

	name	score
1	Dima	9.0
3	James	NaN
5	Michael	20.0
6	Matthew	14.5

In [13]:

df2.loc[:,['name','score']]

Out[13]:

	name	score
0	Anastasia	12.5
1	Dima	9.0
2	Katherine	16.5
3	James	NaN
4	Emily	9.0
5	Michael	20.0
6	Matthew	14.5
7	Laura	NaN
8	Kevin	8.0
9	Jonas	19.0

In [7]:

df2.groupby('name').aggregate({'attempts':max}).reset_index()

Out[7]:

	name	attempts
0	Anastasia	1
1	Dima	3
2	Emily	2
3	James	3
4	Jonas	1
5	Katherine	2
6	Kevin	2
7	Laura	1
8	Matthew	1
9	Michael	3

In [3]:

df2.groupby('attempts').max().reset_index().sort_values(by='attempts',ascending=False).head(1)

Out[3]:

	attempts	name	score	qualify
2	3	Michael	20.0	yes

To select the rows where the number of attempts in the examination is greater than 2.

In [12]:

df2[df2['attempts']>2]['name']

Out[12]:

b       Dima
d      James
f    Michael
Name: name, dtype: object

In [5]:

df2.loc[df2['attempts']>2]

Out[5]:

	name	score	attempts	qualify
b	Dima	9.0	3	no
d	James	NaN	3	no
f	Michael	20.0	3	yes

In [10]:

df2.loc[df2['attempts']>2,['attempts','name']]

Out[10]:

	attempts	name
b	3.0	Dima
d	3.0	James
f	3.0	Michael

To count the number of rows and columns of a DataFrame.

In [16]:

df2.shape[0]

Out[16]:

To select the rows where the score is missing, i.e. is NaN.

In [15]:

df2.loc[df2['score'].isna()==True]

Out[15]:

	name	score	attempts	qualify
d	James	NaN	3.0	no
h	Laura	NaN	1.0	no

In [22]:

df2['score'].value_counts()

Out[22]:

9.0     2
19.0    1
8.0     1
14.5    1
20.0    1
16.5    1
12.5    1
Name: score, dtype: int64

In [29]:

df2['score'].isna().sum()

Out[29]:

In [18]:

df2[df2['score'].isna()]

Out[18]:

	name	score	attempts	qualify
3	James	NaN	3	no
7	Laura	NaN	1	no

To select the rows the score is between 15 and 20 (inclusive).

In [19]:

df2[df2['score'].between(15,20)]

Out[19]:

	name	score	attempts	qualify
2	Katherine	16.5	2	yes
5	Michael	20.0	3	yes
9	Jonas	19.0	1	yes

To select the rows where number of attempts in the examination is less than 2 and score greater than 15.

In [20]:

df2.loc[(df2['attempts']<2) & (df2['score']>15)]

Out[20]:

	name	score	attempts	qualify
9	Jonas	19.0	1	yes

To change the score in row 'd' to 11.5.

In [21]:

df2.loc[4,'score']=11.5
df2

Out[21]:

	name	score	attempts	qualify
0	Anastasia	12.5	1	yes
1	Dima	9.0	3	no
2	Katherine	16.5	2	yes
3	James	NaN	3	no
4	Emily	11.5	2	no
5	Michael	20.0	3	yes
6	Matthew	14.5	1	yes
7	Laura	NaN	1	no
8	Kevin	8.0	2	no
9	Jonas	19.0	1	yes

To calculate the sum of the examination attempts by the students.

In [22]:

df2['attempts'].aggregate(sum)

Out[22]:

To calculate the mean score for each different student in DataFrame.

In [16]:

df2.groupby('name').aggregate({'score':np.mean})

Out[16]:

	score
name
Anastasia	12.5
Dima	9.0
Emily	9.0
James	NaN
Jonas	19.0
Katherine	16.5
Kevin	8.0
Laura	NaN
Matthew	14.5
Michael	20.0

To append a new row 'k' to data frame with given values for each column. Now delete the new row and return the original DataFrame.

In [18]:

df2.loc['k'] = ['smesh',15,1,'yes']
print(df2)

        name  score  attempts qualify
a  Anastasia   12.5         1     yes
b       Dima    9.0         3      no
c  Katherine   16.5         2     yes
d      James    NaN         3      no
e      Emily    9.0         2      no
f    Michael   20.0         3     yes
g    Matthew   14.5         1     yes
h      Laura    NaN         1      no
i      Kevin    8.0         2      no
j      Jonas   19.0         1     yes
k      smesh   15.0         1     yes

In [19]:

df2.loc['k']
df2.drop('k')

Out[19]:

	name	score	attempts	qualify
a	Anastasia	12.5	1	yes
b	Dima	9.0	3	no
c	Katherine	16.5	2	yes
d	James	NaN	3	no
e	Emily	9.0	2	no
f	Michael	20.0	3	yes
g	Matthew	14.5	1	yes
h	Laura	NaN	1	no
i	Kevin	8.0	2	no
j	Jonas	19.0	1	yes

In [26]:

df2['extra']=np.arange(11)
df2

Out[26]:

	name	score	attempts	qualify	extra
0	Anastasia	12.5	1	yes	0
1	Dima	9.0	3	no	1
2	Katherine	16.5	2	yes	2
3	James	NaN	3	no	3
4	Emily	11.5	2	no	4
5	Michael	20.0	3	yes	5
6	Matthew	14.5	1	yes	6
7	Laura	NaN	1	no	7
8	Kevin	8.0	2	no	8
9	Jonas	19.0	1	yes	9
k	smesh	15.0	1	yes	10

In [27]:

df2.drop('extra',axis=1,inplace=True)

To Sort the data frame first by 'name' in descending order, then by 'score' in ascending order:

In [28]:

df2.sort_values(by=['name','score'],ascending=[True,False])

Out[28]:

	name	score	attempts	qualify
0	Anastasia	12.5	1	yes
1	Dima	9.0	3	no
4	Emily	11.5	2	no
3	James	NaN	3	no
9	Jonas	19.0	1	yes
2	Katherine	16.5	2	yes
8	Kevin	8.0	2	no
7	Laura	NaN	1	no
6	Matthew	14.5	1	yes
5	Michael	20.0	3	yes
k	smesh	15.0	1	yes

To replace the 'qualify' column contains the values 'yes' and 'no' with True and False.

In [29]:

df3=df2.replace({'yes':True,'no':False})
df3

Out[29]:

	name	score	attempts	qualify
0	Anastasia	12.5	1	True
1	Dima	9.0	3	False
2	Katherine	16.5	2	True
3	James	NaN	3	False
4	Emily	11.5	2	False
5	Michael	20.0	3	True
6	Matthew	14.5	1	True
7	Laura	NaN	1	False
8	Kevin	8.0	2	False
9	Jonas	19.0	1	True
k	smesh	15.0	1	True

To change the name 'James' to 'Suresh' in name column of the DataFrame.

In [30]:

df3.loc[df3['name']=='James','name']='Suresh'

In [31]:

df2[df2['name']=='James']['name'] ='Suresh'

<ipython-input-31-365e2a0b86da>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2[df2['name']=='James']['name'] ='Suresh'

In [32]:

df3

Out[32]:

	name	score	attempts	qualify
0	Anastasia	12.5	1	True
1	Dima	9.0	3	False
2	Katherine	16.5	2	True
3	Suresh	NaN	3	False
4	Emily	11.5	2	False
5	Michael	20.0	3	True
6	Matthew	14.5	1	True
7	Laura	NaN	1	False
8	Kevin	8.0	2	False
9	Jonas	19.0	1	True
k	smesh	15.0	1	True

To delete the 'attempts' column from the DataFrame.

In [33]:

df3=df3.drop('attempts',axis=1)

To insert a new column in existing DataFrame.

In [34]:

df3['color']=np.nan

Bonus

In [35]:

df3.loc[[True,False,False,False,False,False,False,False,False,False,False]]

Out[35]:

	name	score	qualify	color
0	Anastasia	12.5	True	NaN

Iterating through DataFrame

In [36]:

for key,value in df.iteritems():
    print('key',key)
    print('value',value)

key name
value 0     Mukul
1     Rohan
2    Shivam
3     Ragav
4      Monu
Name: name, dtype: object
key class
value 0    12
1    12
2    11
3    11
4    10
Name: class, dtype: int64
key total marks
value 0    454.0
1    433.0
2      NaN
3      NaN
4    456.0
Name: total marks, dtype: float64

In [37]:

for key,value in df.iterrows():
    print('key',key)
    print('value',value)

key 0
value name           Mukul
class             12
total marks      454
Name: 0, dtype: object
key 1
value name           Rohan
class             12
total marks      433
Name: 1, dtype: object
key 2
value name           Shivam
class              11
total marks       NaN
Name: 2, dtype: object
key 3
value name           Ragav
class             11
total marks      NaN
Name: 3, dtype: object
key 4
value name           Monu
class            10
total marks     456
Name: 4, dtype: object

In [38]:

for value in df.itertuples():
    print('value',value)

value Pandas(Index=0, name='Mukul', _2=12, _3=454.0)
value Pandas(Index=1, name='Rohan', _2=12, _3=433.0)
value Pandas(Index=2, name='Shivam', _2=11, _3=nan)
value Pandas(Index=3, name='Ragav', _2=11, _3=nan)
value Pandas(Index=4, name='Monu', _2=10, _3=456.0)

In [39]:

import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.corr(),cmap='GnBu',annot=np.array([['',''],list(df.corr()['total marks'])]),fmt = 's',linewidths=10,linecolor='r')
plt.show()

In [40]:

df.corr()

Out[40]:

	class	total marks
class	1.000000	-0.566429
total marks	-0.566429	1.000000

In [41]:

df.corr()['total marks']

Out[41]:

class         -0.566429
total marks    1.000000
Name: total marks, dtype: float64

In [42]:

df.corr().loc['class',['total marks']]

Out[42]:

total marks   -0.566429
Name: class, dtype: float64

In [ ]: