Welcome to this lecture notebook! In this week's graded assignment, you will be using pandas quite often to work with dataframes.
This notebook walks through the main objects of the pandas library, along with their data types.
# import libraries
import pandas as pd
pandas.read_csv takes in a file name, assuming that the file is formatted as comma separated values (csv).
# Read the csv data, setting the 0th column as the row index
data = pd.read_csv("dummy_data.csv", index_col=0)
# Display the data's number of rows and columns
print(f"Data has {data.shape[0]} rows and {data.shape[1]} columns.\n")
data.head()
Data has 50 rows and 5 columns.
(index) | sex | age | obstruct | outcome | TRTMT |
---|---|---|---|---|---|
1 | 0 | 57 | 0 | 1 | True |
2 | 1 | 68 | 0 | 0 | False |
3 | 0 | 72 | 0 | 0 | True |
4 | 0 | 66 | 1 | 1 | True |
5 | 1 | 69 | 0 | 1 | False |
Below is a description of all the fields:
sex (binary): 1 if Male, 0 otherwise
age (int): age of patient at start of the study
obstruct (binary): obstruction of colon by tumor
outcome (binary): 1 if died within 5 years
TRTMT (binary): patient was treated
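If you do not have dummy_data.csv at hand, the loading step can be sketched with an in-memory file. This is a minimal illustration, assuming a CSV whose first, unnamed column holds the row labels; the `csv_text` string and the `sample` name are made up for this example and only reuse the first three rows shown above.

```python
import io

import pandas as pd

# Hypothetical stand-in for dummy_data.csv (assumed layout):
# the first, unnamed column holds the row labels.
csv_text = """,sex,age,obstruct,outcome,TRTMT
1,0,57,0,1,True
2,1,68,0,0,False
3,0,72,0,0,True
"""

# index_col=0 turns the first column into the row index instead of a feature
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # data type inferred for each column
```

Without `index_col=0`, the label column would be read as an extra feature and pandas would add a default 0-based index.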
# show the data type of the dataframe
print(type(data))
<class 'pandas.core.frame.DataFrame'>
You can see that your data is of type DataFrame. A DataFrame is a two-dimensional, labeled data structure with columns that can be of different data types. DataFrames are a great way to organize your data, and are the most common object in pandas. If you are unfamiliar with them, check the official documentation.
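To make "columns of different data types" concrete, here is a small sketch that builds a DataFrame by hand and inspects the type of each column. The `toy` frame and its `weight_kg` column are invented for illustration and are not part of the course data.

```python
import pandas as pd

# Toy data (made up): each column has its own data type
toy = pd.DataFrame(
    {
        "age": [57, 68, 72],              # integers
        "weight_kg": [70.5, 82.0, 64.3],  # floats (hypothetical column)
        "TRTMT": [True, False, True],     # booleans
    },
    index=[1, 2, 3],  # row labels
)

print(toy.dtypes)            # one dtype per column
print(toy.index.tolist())    # row labels
print(toy.columns.tolist())  # column labels
```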
If you are only interested in a single column (or feature) of the data, you can access it with dot notation: write the name of the dataframe, followed by a dot and the name of the column, like this:
data.TRTMT.head()
1     True
2    False
3     True
4     True
5    False
Name: TRTMT, dtype: bool
Notice the head() method. It returns only the first five rows, so the output of the cell can be read quickly and easily. Try removing it and see what happens.
print(type(data.TRTMT))
<class 'pandas.core.series.Series'>
Each column of a DataFrame is of type Series: a one-dimensional, labeled array that can hold any data type, together with its index. Series are similar to lists in Python, with one important difference: each Series has a single data type (dtype) for all of its values.
Many of the methods and operations supported by DataFrames are also supported by Series. When in doubt, always check the documentation!
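The "one data type per Series" point can be checked directly. The sketch below uses made-up values. It shows that pandas picks one dtype for the whole Series and falls back to the generic object dtype when the values have no common numeric type.

```python
import pandas as pd

# A Series of integers, with explicit row labels and a name
ages = pd.Series([57, 68, 72], index=[1, 2, 3], name="age")
print(ages.dtype)

# Mixing ints and a float: every value is stored as a float
mixed_numbers = pd.Series([57, 68.5, 72])
print(mixed_numbers.dtype)

# Mixing numbers and strings: pandas falls back to the generic object dtype
mixed = pd.Series([57, "unknown", 72])
print(mixed.dtype)

# Methods such as head() are available on Series as well as on DataFrames
print(ages.head(2))
```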
There are several ways of accessing a single column of a DataFrame. The methods you're about to see all do the same thing.
# Use dot notation to access the TRTMT column
data.TRTMT
# Use .loc to get all rows using ":", for column TRTMT
data.loc[:,"TRTMT"]
# Use bracket notation to get the TRTMT column
data["TRTMT"]
print(data.TRTMT.equals(data.loc[:,"TRTMT"]))
print(data.TRTMT.equals(data["TRTMT"]))
True
True
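The three access styles agree for a column such as TRTMT, but dot notation has limits worth knowing. The sketch below uses a made-up frame whose column names are chosen to trigger them: a name that clashes with an existing DataFrame attribute, and a name containing a space. Bracket notation works in both cases.

```python
import pandas as pd

# Made-up frame: "size" clashes with the DataFrame.size attribute,
# and "tumor grade" is not a valid Python identifier.
toy = pd.DataFrame({"age": [57, 68], "size": [3, 5], "tumor grade": [1, 2]})

by_brackets = toy["size"]    # the column, as a Series
by_attribute = toy.size      # NOT the column: the number of elements in the frame
spaced = toy["tumor grade"]  # only bracket (or .loc) access works for this name

print(by_brackets.tolist())
print(by_attribute)
print(spaced.tolist())
```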
Most of the time you'll want a subset (or a slice) of the DataFrame that meets some criteria. For example, if you wanted to analyze all of the features for patients who are 50 years old or younger, you can slice the DataFrame like this:
data[data.age <= 50]
(index) | sex | age | obstruct | outcome | TRTMT |
---|---|---|---|---|---|
6 | 1 | 43 | 0 | 1 | True |
15 | 1 | 46 | 1 | 0 | False |
19 | 0 | 34 | 1 | 1 | True |
24 | 0 | 50 | 0 | 0 | True |
32 | 0 | 33 | 1 | 0 | True |
33 | 0 | 49 | 0 | 1 | False |
34 | 1 | 47 | 0 | 0 | False |
42 | 0 | 39 | 1 | 0 | False |
45 | 1 | 40 | 0 | 0 | True |
67 | 1 | 49 | 0 | 0 | True |
70 | 0 | 40 | 0 | 0 | False |
What if you wanted to filter a DataFrame based on multiple conditions?
Use the & operator to combine conditions instead of and.
Use the | operator instead of or.
# Trying to combine two conditions using `and` won't work
data[(data.age <= 50) and (data.TRTMT == True)]
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# Omitting the parentheses also results in an error, because & binds more tightly than <= and ==
data[ data.age <= 50 & data.TRTMT == True]
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# Get patients aged 50 or less who received treatment
data[(data.age <= 50) & (data.TRTMT == True)]
(index) | sex | age | obstruct | outcome | TRTMT |
---|---|---|---|---|---|
6 | 1 | 43 | 0 | 1 | True |
19 | 0 | 34 | 1 | 1 | True |
24 | 0 | 50 | 0 | 0 | True |
32 | 0 | 33 | 1 | 0 | True |
45 | 1 | 40 | 0 | 0 | True |
67 | 1 | 49 | 0 | 0 | True |
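The same pattern extends to "or" and "not". The following sketch uses a four-row toy frame (values borrowed from the tables above; the variable names are made up) and stores each condition in its own variable, which sidesteps the parentheses problem.

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [43, 46, 57, 68], "TRTMT": [True, False, True, False]},
    index=[6, 15, 1, 2],
)

young = toy.age <= 50  # boolean Series, one value per row
treated = toy.TRTMT    # already boolean, so `== True` is not needed

both = toy[young & treated]        # young AND treated
either = toy[young | treated]      # young OR treated
neither = toy[~(young | treated)]  # NOT (young OR treated)

print(both.index.tolist())
print(either.index.tolist())
print(neither.index.tolist())
```

Naming the masks is a style choice; writing the conditions inline with parentheses gives the same result.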
When slicing a DataFrame the resulting type will be a DataFrame as well:
type(data[(data.age <= 50) & (data.TRTMT == True)])
pandas.core.frame.DataFrame
Now let's dive into some useful properties of DataFrames and Series that allow for more advanced calculations.
# Applying len() to the df yields the number of rows
print(f"len: {len(data[(data.age <= 50)])}")
# Accessing the 'shape' attribute of the df yields a tuple of the form (rows, cols)
print(f"shape (rows, cols) {data[(data.age <= 50)].shape}")
# Accessing the 'size' attribute of the df yields the number of elements in the df:
print(f"size: {data[(data.age <= 50)].size}")
len: 11
shape (rows, cols) (11, 5)
size: 55
# Applying len() to the Series yields the number of elements
print(f"{len(data.TRTMT)}")
# Accessing the 'shape' attribute of the Series yields a one-element tuple of the form (rows,)
print(f"{data.TRTMT.shape}")
# Accessing the 'size' attribute of the Series yields the number of elements in the Series:
print(f"{data.TRTMT.size}")
50
(50,)
50
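How the three quantities relate can be verified on a small made-up frame: for a DataFrame, size equals rows times columns, while for a Series, len, size, and the single entry of shape all coincide.

```python
import pandas as pd

toy = pd.DataFrame({"sex": [0, 1, 0], "age": [57, 68, 72]})

n_rows, n_cols = toy.shape  # shape is a (rows, cols) tuple for a DataFrame
print(len(toy), toy.shape, toy.size)

# size counts every cell, so it equals rows * cols
print(toy.size == n_rows * n_cols)

# For a Series, shape has a single entry and size equals len
col = toy.age
print(len(col), col.shape, col.size)
```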
Using what you've seen so far, can you calculate the proportion of the patients who are male?
# Divide the number of male patients (rows) by the total number of patients (rows)
prop_male_patients = len(data[data.sex == 1]) / len(data)
print(f"Your answer: {prop_male_patients:.2f}, Expected: {21/50}")
One handy hack you can use when dealing with binary data is to use the mean() method of a Series to calculate the proportion of occurrences that are equal to 1.
Note this should also work with bool data since Python treats booleans as numbers when applying math operations.
True is treated as the number 1
False is treated as the number 0
# Calculate the proportion of the `sex` column that is `True` (1).
data.sex.mean()
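As a quick check of the mean() trick, the sketch below compares it with explicit counting on two made-up Series, one of 0/1 integers and one of booleans.

```python
import pandas as pd

# Made-up binary data
sex = pd.Series([0, 1, 0, 1])
treated = pd.Series([True, False, True, True])

prop_male = sex.mean()         # mean of 0/1 values = share of ones
prop_treated = treated.mean()  # True counts as 1, False as 0

# The same proportions, computed by counting explicitly
prop_male_counted = (sex == 1).sum() / len(sex)
prop_treated_counted = treated.sum() / len(treated)

print(prop_male, prop_male_counted)
print(prop_treated, prop_treated_counted)
```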
So far you've only accessed values of a DataFrame or Series. Sometimes you may need to update these values.
Let's look at the original DataFrame one more time:
# View dataframe
data.head()
Let's say you detected an error in the data, where the second patient was actually treated.
# Try to access patient 0, and note the error message
data.loc[0,'TRTMT']
KeyError: 0
data.loc[2,'TRTMT']
data.loc[2, "TRTMT"] = True
data.head()
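The KeyError arises because .loc looks rows up by label, and this index starts at 1, so there is no row labeled 0. The sketch below, on a made-up three-row frame, contrasts label-based .loc with position-based .iloc and then applies the same kind of single-cell update.

```python
import pandas as pd

toy = pd.DataFrame({"TRTMT": [True, False, True]}, index=[1, 2, 3])

print(0 in toy.index)  # no row is labeled 0, so toy.loc[0, "TRTMT"] would raise KeyError

before_by_label = bool(toy.loc[2, "TRTMT"])  # row whose label is 2
before_by_position = bool(toy.iloc[1, 0])    # second row, first column (0-based positions)

# Update a single cell, addressed by row label and column name
toy.loc[2, "TRTMT"] = True

print(before_by_label, before_by_position, toy.loc[2, "TRTMT"])
```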
Now, you've found out that there was another issue with the data that needs to be corrected. This study only includes females, so the sex column should be set to 0 for all patients.
You can update the whole column (or Series) using .loc[row, col] once again, but this time using ":" for rows.
data.loc[:, "sex"] = 0
data.head()
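A small sketch of the column-wide update, on made-up data, confirms that the scalar is broadcast to every row and that the other columns are left untouched.

```python
import pandas as pd

toy = pd.DataFrame({"sex": [0, 1, 1], "age": [57, 68, 72]}, index=[1, 2, 3])

# ":" selects every row; the scalar 0 is written into each of them
toy.loc[:, "sex"] = 0

print(toy.sex.tolist())  # every entry is now 0
print(toy.age.tolist())  # the age column is unchanged
```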
You can access a range of rows by specifying start:end, where the end index is included.
# Access patients at index 3 to 4, including 4.
data.loc[3:4,:]
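The inclusive end is specific to label-based slicing with .loc. Position-based slicing with .iloc follows the usual Python convention and excludes the end. The sketch below shows both on a made-up frame whose labels run from 1 to 5.

```python
import pandas as pd

toy = pd.DataFrame({"age": [57, 68, 72, 66, 69]}, index=[1, 2, 3, 4, 5])

by_label = toy.loc[3:4, :]      # labels 3 and 4, end label included
by_position = toy.iloc[2:4, :]  # positions 2 and 3, end position excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
print(toy.iloc[3:4, :].index.tolist())  # a single row: position 3 only
```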
Welcome to the wonderful world of Pandas! You will be using these pandas functions in this week's graded assignment.