ChEn-1070: Introduction to Chemical Engineering Spring 2019 UMass Lowell; Profs. Manohar and de Almeida 26Mar2019
A significant Python
package that supports tabular calculations (spreadsheet) is called: Pandas
we will use Pandas
to build tabular data and perform statistical analysis. We will access Pandas
with the import
keyword
import pandas
'''Import pandas'''
import pandas as pd
It is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. For example, note how to create a pandas
series:
'''Create basic Pandas series'''
data = [1.10134,2.29211,3.383457,4.678454]
s0 = pd.Series(data)
print(s0)
0 1.101340 1 2.292110 2 3.383457 3 4.678454 dtype: float64
Note that the overall type is float64
(i.e. floating point with double precision), and the default labels are added: 0,1,2,3
'''Display number of significant digits'''
pd.options.display.float_format = '{:.2e}'.format
print(s0)
0 1.10e+00 1 2.29e+00 2 3.38e+00 3 4.68e+00 dtype: float64
There are various statistics in pandas
:
'''Statistics'''
print('s0 mean =',s0.mean())
print('s0 sum =',s0.sum())
print('s0 std =',s0.std())
print('s0 var =',s0.var())
print('s0 sem =',s0.sem())
s0 mean = 2.86384025 s0 sum = 11.455361 s0 std = 1.5271004374941803 s0 var = 2.3320357461949173 s0 sem = 0.7635502187470902
Function | Description |
---|---|
count | Number of non-NA observations |
sum | Sum of values |
mean | Mean of values |
mad | Mean absolute deviation |
median | Arithmetic median of values |
min | Minimum |
max | Maximum |
mode | Mode |
abs | Absolute Value |
prod | Product of values |
std | Bessel-corrected sample standard deviation |
var | Unbiased variance |
sem | Standard error of the mean |
skew | Sample skewness (3rd moment) |
kurt | Sample kurtosis (4th moment) |
quantile | Sample quantile (value at %) |
cumsum | Cumulative sum |
cumprod | Cumulative product |
cummax | Cumulative maximum |
cummin | Cumulative minimum |
There is a convenient describe()
function which computes a variety of summary statistics:
'''Describe statistics summary of data'''
s0.describe()
count 4.00e+00 mean 2.86e+00 std 1.53e+00 min 1.10e+00 25% 1.99e+00 50% 2.84e+00 75% 3.71e+00 max 4.68e+00 dtype: float64
'''Data access with index operator'''
print("s0[3] = ",s0[3])
s0[3] = 4.678454
'''Data assignment with index operator'''
s0[2] = 'hello'
print(s0)
0 1.10e+00 1 2.29e+00 2 hello 3 4.68e+00 dtype: object
Note that the overall type is now object
because of the mixed data type.
'''Fast data access with integer index'''
s0.iat[-1]
4.678454
'''Slower than iat[], more generic (see below in Data Frames) access with integer index'''
s0.iloc[-2]
'hello'
'''Create a Pandas series w/ a dictionary'''
data = {'a':1,'b':2,'c':3,'d':4}
s1 = pd.Series(data)
print(s1)
a 1 b 2 c 3 d 4 dtype: int64
The overall type is int64
(i.e integer with double precision), but notice the labels are taken from the data dictionary (the keys are made as labels).
'''Create a Pandas series w/ the same value'''
s2 = pd.Series( 8.0, index=['a', 'b', 'c', 'd'] )
print(s2)
a 8.00e+00 b 8.00e+00 c 8.00e+00 d 8.00e+00 dtype: float64
The data can be a repeated value (a scalar).
'''Data access as a dictionary using the index operator'''
print("s1['b'] = ",s1['b'])
print("s2['d'] = ",s2['d'])
s1['b'] = 2 s2['d'] = 8.0
'''Fast data access with integer index'''
s2.iat[-1]
8.0
'''Create a Pandas series using lists'''
import pandas as pd
labels = ['symbol','atomic number','group','period','mass number']
data = ['H',1,1,1,1] # hydrogen
h1 = pd.Series( data, index=labels)
print(h1)
print('')
data = ['D',1,1,1,2] # deuterium
h2 = pd.Series( data, index=labels)
print(h2)
print('')
data = ['Be-9',4,2,2,9] # beryllium
be_9 = pd.Series( data, index=labels)
print(be_9)
print('')
data = ['Be-10',4,2,2,10] # beryllium
be_10 = pd.Series( data, index=labels)
print(be_10)
symbol H atomic number 1 group 1 period 1 mass number 1 dtype: object symbol D atomic number 1 group 1 period 1 mass number 2 dtype: object symbol Be-9 atomic number 4 group 2 period 2 mass number 9 dtype: object symbol Be-10 atomic number 4 group 2 period 2 mass number 10 dtype: object
The overall type is object
because the data types are mixed. Notice the index labels are more elaborate and provided as a list of str
.
It is a two-dimensional (table) labeled data structure with various data types. The row labels are collectively referred to as the index. The column labels appear as a new structure. For example, note how to create a pandas
data frame:
'''Using existing series as columns'''
data = {'S1':s1,'S2':s2}
df1 = pd.DataFrame( data )
print(df1)
S1 S2 a 1 8.00e+00 b 2 8.00e+00 c 3 8.00e+00 d 4 8.00e+00
df1.columns
Index(['S1', 'S2'], dtype='object')
'''Change column labels'''
df1.columns = ['x','y']
print(df1)
x y a 1 8.00e+00 b 2 8.00e+00 c 3 8.00e+00 d 4 8.00e+00
'''Using a dictionary to create a table'''
hydrogen = {'symbol':'H', 'atomic_number':1, 'group':1, 'period':1, 'isotopes':[1,2,3]}
helium = {'symbol':'He', 'atomic_number':2, 'group':18, 'period':1, 'isotopes':[3,4]}
lithium = {'symbol':'Li', 'atomic_number':3, 'group':1, 'period':2, 'isotopes':[6,7]}
beryllium = {'symbol':'Be', 'atomic_number':4, 'group':2, 'period':2, 'isotopes':[9,10]}
boron = {'symbol':'B', 'atomic_number':5, 'group':13, 'period':2, 'isotopes':[10,11]}
data = {'hydrogen':hydrogen,'helium':helium,'lithium':lithium,'beryllium':beryllium,'boron':boron}
df2 = pd.DataFrame(data)
df2
hydrogen | helium | lithium | beryllium | boron | |
---|---|---|---|---|---|
atomic_number | 1 | 2 | 3 | 4 | 5 |
group | 1 | 18 | 1 | 2 | 13 |
isotopes | [1, 2, 3] | [3, 4] | [6, 7] | [9, 10] | [10, 11] |
period | 1 | 1 | 2 | 2 | 2 |
symbol | H | He | Li | Be | B |
'''Column selection'''
type(df2['helium'])
pandas.core.series.Series
'''Rows of a column'''
df2['helium'][1:-1]
group 18 isotopes [3, 4] period 1 Name: helium, dtype: object
Some data access operation:
Operation | Syntax | Result |
---|---|---|
Select column | df[col] | Series |
Select row by label | df.loc[label] | Series |
Select row by integer | location df.iloc[loc] | Series |
Slice rows df[5:10] | DataFrame | |
Select rows by boolean vector | df[bool_vec] | DataFrame |
There is more on data access and slicing.
'''Access of "isotopes" row by label or also called index'''
df2.loc['isotopes']
hydrogen [1, 2, 3] helium [3, 4] lithium [6, 7] beryllium [9, 10] boron [10, 11] Name: isotopes, dtype: object
'''Access of a table entry using row and column labels'''
df2.loc['isotopes','boron']
[10, 11]
'''Access of "isotopes" row by index'''
df2.iloc[2]
hydrogen [1, 2, 3] helium [3, 4] lithium [6, 7] beryllium [9, 10] boron [10, 11] Name: isotopes, dtype: object
'''Access the last two rows by index'''
df2.iloc[-2:]
hydrogen | helium | lithium | beryllium | boron | |
---|---|---|---|---|---|
period | 1 | 1 | 2 | 2 | 2 |
symbol | H | He | Li | Be | B |
'''Note that iat will not work'''
#df2.iat[-2:]
#df2.iat[-2]
'Note that iat will not work'
'''This pure index access will work'''
df2.iat[1,3] # row index 1, column index 3, that is: second row, fourth column
2