Instructions:
biom262_2015_01_29_cleaning_data.zip
file emailed outzip
file (this will create a folder)biom262_2015_01_29_cleaning_data
ipython notebook
into the terminalFor this activity, work in pairs. I will give you a blue and a pink sticky note. Put the pink one on one of your laptops to show that you're stuck or have a question. Put the blue one on your laptop if you and your partner are cruisin' through the exercises and don't want to be bothered.
Whenever you work with any kind of data, such as your own Excel spreadsheets or data downloaded from a paper that you'd like to analyze, 99.9999999999% of the time it's not formatted well, and you have to do a bunch of manual cleaning. This is a real-world example of four different metadata files describing post-mortem RNA-sequencing data from 212 post-mortem subjects and 2 cell lines, from 32 different tissues, for a total of 4500 samples. In other words, this is not a dataset you would enjoy cleaning up in Excel! You can read more about this project at the GTEx portal.
The goal of this exercise is to create a single, easily human-readable table from two .txt
files, and two Excel files. Because this is such a big project, they had to standardize everything and their tables are at least consistently formatted. But there's a lot of jargon in the data that's specific to this session, which we will replace with human-readable terms. And in 2 main tables, GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt
and GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt
, the sample identifiers aren't exactly the same, so we'll have to do extra work to merge them.
One package that doesn't come with the Anaconda Python distribution is seaborn
, so to install it, run this command.
pd.read_table()
, i.e. the ones that have parentheses, you can get it by moving your cursor in between the parentheses, and pressing "Shift" and "Tab". This will pop up a help window next to the parentheses. Press "Tab" once more, and the window will get bigger. Press "Tab" a third time, and it will pop up a big help screen at the bottom.! pip install seaborn
Let's import everything we need for this session.
# Pandas or "Panel Data Analysis" toolkit for Data Frames/Data Tables
import pandas as pd
# Numpy or "Numerical Python"
import numpy as np
# Powerful R-style/statistical plotting
import seaborn as sns
# These styles are my personal preferences
# For more options, see this page: http://web.stanford.edu/~mwaskom/software/seaborn/tutorial/aesthetics.html
sns.set(style='whitegrid', context='notebook')
# Show the figures directly in the IPython notebok
%matplotlib inline
Note: all these commands are meant to be run within** the notebook, not on the terminal**
The unix command cd
moves where the notebook looks for data, and ls
will also list the files. These are some key Unix commands that can be used without the exclamation point "!
", as we will use later.
Change directories to where you downloaded the example data. For me, that's the directory "~/Downloads/biom262_2015_01_29_cleaning_data
". Remember that the character "~
" (pronounced "tilde") indicates your home directory, which for me is /Users/olga
, but I didn't feel like typing that out, so I used the tilde instead.
cd ~/Downloads/biom262_2015_01_29_cleaning_data
And make sure you have all the right files around
ls
Let's start with the SubjectPhenotypes
file, GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt
. The first step is to look at the first 10 lines of the file with the Unix command head
. In the IPython notebook, you can call unix/bash commands by starting the line with an exclamation point, !
.
! head GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt
So there's some generic sample, a gender of 1
or 2
, a range of years, and some kind of DTTHDY
thing. I don't know what that means, but we'll deal with that once we open the file in pandas
. It looks like it ranges from 0
to 4
, but I don't see any entries of 1
. By default, head
outputs the first 10 lines. We can modify the number of lines with the flag -n
and then provide a number. For example, let's look at the first 23 lines of the file instead.
! head -n 23 GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt
Still no entries with a DTHHRDY
of 1. Maybe we need to look at the end of the file. We can do that with the command tail
.
! tail GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt
Ooh, our first 1
in DTHHRDY
! Hmm, some of the rows don't have a value for the DTHHRDY
column! This will come into play in the future.
Let's stop for a brief exercise. How would you look at the last 17 lines of the file?
# Exercise: modify the `tail` command below to look at the last 17 lines of the file.
! tail GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt
Another question we may have about this file is how many lines are in it. We can do this with the Unix command wc
, or "word, line, character and byte counter." Specifically wc -l
will count the number of lines in the file.
! wc -l GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt
# Exercise: do a web search for "unix count number of columns" and
# check the command on the file, GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt
! # Column-counting code goes here
pandas
library in Python¶We will make heavy use of the pandas
library, which is a godsend to Pythonista Data Scientists. It makes working with weirdly formatted data much easier, as you will soon see.
subject_phenotypes = pd.read_table('GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt')
This has created a pandas
DataFrame
variable called subject_phenotypes
. Python's standards are to name variables lowercase_with_underscores
, so we'll stick to that :)
Let's look at the top of subject_phenotypes
, again with a command called head
, but we call it a little differently now that we're in Python and not Unix (notice no "!
" at the beginnings of the lines anymore)
subject_phenotypes.head()
The pandas
head shows the first 5 rows, instead the first 10 rows like Unix head
.
We can also access individual columns in two ways. The first way is by using square brackets around the string of the column name, like this for SUBJID
:
# Run this cell. Try entering 'subjid' and 'subject id' as well.
subject_phenotypes['SUBJID']
If a column name consists of only letters, numbers, and underscores, and starts with a letter, you can also access it with the column name, no quotes, after the dataframe name and a dot.
subject_phenotypes.SUBJID
You may hear the word "series" get tossed around. A "series" is the pandas
-specific technical name for a column of a dataframe.
AGE
?¶Hint: head
and tail
work for series as well as whole dataframes.
# Exercise: Show the last 8 rows.
# Code for looking at the last 8 rows goes here.
Let's get to these GENDER
and DTHHRDY
columns. Open up the file, GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DD.xlsx
in Excel, and see what a gender of 1
and 2
means.
# A python "dictionary", or mapping from one thing to the next.
# In this case, we're mapping the numbers 1 and 2 to strings
# indicating the gender.
gender = {1: 'fillmein',
2: 'fillmein'}
You can access items of a dictionary with square brackets, much like the columns of a dataframe.
# Run this cell. How do you access the other gender?
gender[2]
Now, we can create a new column called 'gender'
(all lowercase, because I feel like ALL CAPS COLUMNS ARE YELLING AT ME), using the gender
dictionary.
In pandas
, you can create a new column by pretending to access an existing column in the dataframe, and assigning it to some value. Here's an example of creating a new column called "don't worry"
with the value "be yonce"
in every cell.
subject_phenotypes["don't worry"] = "be yonce"
Look at the top of the dataframe to see what that did.
# Code to look at the top of the dataframe goes here
# hint: remember the command "head"? How did we use it to look at the dataframe when we first loaded it?
# Code to add a column with programmer's choice of name and value goes here
Another convenient operation is map
on a series, which performs the operation specified on every element of a column. It's as if you wrote a for
-loop to access every item of the GENDER
column, and use that item to access the gender
dictionary, and replace the value.
subject_phenotypes.GENDER.map(gender)
Using map
, it's as if we wrote the following for
-loop, but map
is less code, and more concise.
for g in subject_phenotypes.GENDER:
print gender[g]
Has this changed the dataframe subject_phenotypes
?
# Code to check if the dataframe subject_phenotypes has changed goes here
# You can check if it has changed by looking at it using your favorite body endpoint
The dataframe shouldn't have changed.
map
the dictionary gender
onto the column DTHHRDY
?¶# Code to `map` `gender` onto DTHHRDY goes here
map
on a series¶Your next exercise is to combine these previous two concepts of creating a column and using map
, to create a column called "gender"
(all lowercase) which is the result of using map
with the dictionary gender
on the GENDER
column of subject_phenotypes
.
# Code to create a new column called "gender" in `subject_phenotypes` that is the result of using `map`
# with the `gender` dictionary on the "GENDER" column of `subject_phenotypes`.
# Code to check if the dataframe has changed goes here
Great! Now we have a column called gender
, that makes sense to a human without having to look something up in some other table.
DTHHRDY
column values into human-readable values¶Using the GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DD.xlsx
spreadsheet again, find out what the different numbers in DTHHRDY
mean, make a dictionary mapping numbers to words like we did with gender
, and create a new column with a human-understandable name.
# Code to create a new column in `subject_phenotypes` that is a human-readable version of the column `DTHHRDY` goes here
Now that we have our cleaned-up dataframe, let's do some plotting!
Let's use seaborn
(which we imported as the variable sns
for brevity) to plot this. We will use the function factorplot
which has a bunch of options, but for now we'll just focus on two.
The first argument is the name of the column you want to plot, and then we provide the keyword argument data=subject_phenotypes
, to specify the dataframe we want to get this column from.
sns.factorplot('gender', data=subject_phenotypes)
Nice, so we can now see the distribution of the number of subjects of these two genders.
What if we also want to see how many people of the two genders, have different DTHHRDY
categorizations?
DTHHRDY
¶Change human_variable_DTHHRDY
to the new column you created.
# Edit the argument `hue=...`
sns.factorplot('gender', data=subject_phenotypes, hue=human_readable_DTHHRDY)
You can break this down even more by also plotting the age of the subjects, and showing a separate plot, as below. The argument col='AGE'
means to plot each age group onto a separate column of plots.
sns.factorplot('gender', data=subject_phenotypes, hue='DTHHRDY', col='AGE')
x
variable, and 'gender'
as each column¶# Code goes below
So far, we've been working with one pandas
dataframe, and an external *.xlsx
file. Now we're going to work on combining sample_phenotypes
with a new dataframe, which we will call sample_attributes
.
# Code to look at the top of GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt goes here
# Code to count the number of lines in GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt goes here
# Code to count the number of columns in GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt goes here
# Code to read the table GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt goes here
pd.read_table('GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt')
# Code to look at the top of the `DataFrame` you just created
SAMPID | SMATSSCR | SMCENTER | SMPTHNTS | SMRIN | SMTS | SMTSD | SMTSISCH | SMNABTCH | SMNABTCHT | ... | SME1ANTI | SMSPLTRD | SMBSMMRT | SME1SNSE | SME1PCTS | SMRRNART | SME1MPRT | SMNUM5CD | SMDPMPRT | SME2PCTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GTEX-N7MS-0007-SM-26GME | NaN | C1 | NaN | 8.2 | Blood | Whole Blood | 16-19 hours | BP-16653 | RNA isolation_PAXgene Blood RNA (Manual) | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | GTEX-N7MS-0007-SM-26GMV | NaN | C1 | NaN | 8.2 | Blood | Whole Blood | 16-19 hours | BP-16653 | RNA isolation_PAXgene Blood RNA (Manual) | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | GTEX-N7MS-0007-SM-2D43E | NaN | C1 | NaN | 8.2 | Blood | Whole Blood | 16-19 hours | BP-16653 | RNA isolation_PAXgene Blood RNA (Manual) | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | GTEX-N7MS-0007-SM-2D7W1 | NaN | C1 | NaN | 8.2 | Blood | Whole Blood | 16-19 hours | BP-16653 | RNA isolation_PAXgene Blood RNA (Manual) | ... | 13705136 | 18432744 | 0.002456 | 13447728 | 49.526005 | 0.041526 | 0.835199 | 840 | 0.563503 | 51.361324 |
4 | GTEX-N7MS-0008-SM-4E3JI | NaN | C1 | NaN | 10.0 | Skin | Cells - Transformed fibroblasts | NaN | BP-37581 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 17962165 | 20910366 | 0.004087 | 18012435 | 50.069874 | 0.028395 | 0.948329 | 879 | 0.226835 | 50.270794 |
5 | GTEX-N7MS-0009-SM-2BWY4 | NaN | C1 | NaN | NaN | Blood | Whole Blood | 16-19 hours | BP-16657 | DNA isolation_Whole Blood _QIAGEN Puregene (Ma... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | GTEX-N7MS-0009-SM-2XK1D | NaN | C1 | NaN | NaN | Blood | Whole Blood | 16-19 hours | BP-16657 | DNA isolation_Whole Blood _QIAGEN Puregene (Ma... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | GTEX-N7MS-0011-R10A-SM-2HMJK | NaN | C1, A1 | NaN | 7.1 | Brain | Brain - Frontal Cortex (BA9) | NaN | BP-19253 | RNA isolation_QIAGEN miRNeasy | ... | 18948398 | 12221905 | 0.004294 | 18747238 | 49.733180 | 0.051237 | 0.875680 | 859 | 0.330709 | 50.619534 |
8 | GTEX-N7MS-0011-R10A-SM-2IZJW | NaN | C1, A1 | NaN | 7.1 | Brain | Brain - Frontal Cortex (BA9) | NaN | BP-19253 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | GTEX-N7MS-0011-R11A-SM-2HMJS | NaN | C1, A1 | NaN | 6.6 | Brain | Brain - Cerebellar Hemisphere | NaN | BP-19253 | RNA isolation_QIAGEN miRNeasy | ... | 19024292 | 12200496 | 0.003643 | 18954711 | 49.908398 | 0.016711 | 0.893391 | 851 | 0.193112 | 50.387028 |
10 | GTEX-N7MS-0011-R11A-SM-2IZJZ | NaN | C1, A1 | NaN | 6.6 | Brain | Brain - Cerebellar Hemisphere | NaN | BP-19253 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 | GTEX-N7MS-0011-R1a-SM-2AXVJ | NaN | C1, A1 | NaN | 7.3 | Brain | Brain - Hippocampus | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 | GTEX-N7MS-0011-R1a-SM-2HMJG | NaN | C1, A1 | NaN | 7.3 | Brain | Brain - Hippocampus | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | 15226514 | 8806379 | 0.004332 | 15122489 | 49.828620 | 0.041028 | 0.790736 | 835 | 0.324148 | 50.618090 |
13 | GTEX-N7MS-0011-R2a-SM-2HML6 | NaN | C1, A1 | NaN | 7.0 | Brain | Brain - Substantia nigra | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | 11678187 | 8637924 | 0.004073 | 11575007 | 49.778137 | 0.028304 | 0.628574 | 837 | 0.275110 | 50.561234 |
14 | GTEX-N7MS-0011-R2a-SM-2IZK7 | NaN | C1, A1 | NaN | 7.0 | Brain | Brain - Substantia nigra | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | GTEX-N7MS-0011-R3a-SM-2AXVU | NaN | C1, A1 | NaN | 7.6 | Brain | Brain - Anterior cingulate cortex (BA24) | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | GTEX-N7MS-0011-R3a-SM-2HMKD | NaN | C1, A1 | NaN | 7.6 | Brain | Brain - Anterior cingulate cortex (BA24) | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | 8114816 | 4205357 | 0.004996 | 8085575 | 49.909750 | 0.004037 | 0.151460 | 806 | 0.364988 | 50.379780 |
17 | GTEX-N7MS-0011-R3a-SM-33HC6 | NaN | C1, A1 | NaN | 7.6 | Brain | Brain - Anterior cingulate cortex (BA24) | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | 36663415 | 22956649 | 0.003256 | 36619121 | 49.969780 | 0.033465 | 0.916319 | 875 | 0.341985 | 50.874683 |
18 | GTEX-N7MS-0011-R4a-SM-2AXW2 | NaN | C1, A1 | NaN | 6.2 | Brain | Brain - Amygdala | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
19 | GTEX-N7MS-0011-R4a-SM-2HMKW | NaN | C1, A1 | NaN | 6.2 | Brain | Brain - Amygdala | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | 11651563 | 8137082 | 0.004180 | 11545498 | 49.771380 | 0.035598 | 0.657749 | 811 | 0.334207 | 50.741714 |
20 | GTEX-N7MS-0011-R5a-SM-2AXW7 | NaN | C1, A1 | NaN | 7.8 | Brain | Brain - Caudate (basal ganglia) | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21 | GTEX-N7MS-0011-R5a-SM-2HMK8 | NaN | C1, A1 | NaN | 7.8 | Brain | Brain - Caudate (basal ganglia) | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | 21641531 | 13525503 | 0.004141 | 21325223 | 49.631920 | 0.050445 | 0.889377 | 891 | 0.312833 | 50.695858 |
22 | GTEX-N7MS-0011-R6a-SM-2AXWD | NaN | C1, A1 | NaN | 7.6 | Brain | Brain - Nucleus accumbens (basal ganglia) | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
23 | GTEX-N7MS-0011-R6a-SM-2HMJ4 | NaN | C1, A1 | NaN | 7.6 | Brain | Brain - Nucleus accumbens (basal ganglia) | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | 17668410 | 12313288 | 0.003942 | 17447282 | 49.685143 | 0.040960 | 0.883223 | 854 | 0.262580 | 50.615242 |
24 | GTEX-N7MS-0011-R7a-SM-2AXV5 | NaN | C1, A1 | NaN | 6.4 | Brain | Brain - Putamen (basal ganglia) | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25 | GTEX-N7MS-0011-R7a-SM-2HMKN | NaN | C1, A1 | NaN | 6.4 | Brain | Brain - Putamen (basal ganglia) | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | 10366169 | 5352672 | 0.004739 | 9997472 | 49.094720 | 0.058973 | 0.664810 | 791 | 0.400345 | 51.510887 |
26 | GTEX-N7MS-0011-R8a-SM-2AXVD | NaN | C1, A1 | NaN | 6.8 | Brain | Brain - Hypothalamus | NaN | BP-17395 | RNA isolation_QIAGEN miRNeasy | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
27 | GTEX-N7MS-0011-R8a-SM-2YUMK | NaN | C1, A1 | NaN | 6.8 | Brain | Brain - Hypothalamus | NaN | BP-17395 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 12317489 | 9750793 | 0.002769 | 12167817 | 49.694363 | 0.054535 | 0.817677 | 820 | 0.366260 | 50.655490 |
28 | GTEX-N7MS-0126-SM-3TW8O | 3 | C1 | OK | 9.1 | Testis | Testis | 16-19 hours | BP-16740 | RNA isolation_PAXgene Tissue miRNA | ... | 19539848 | 15703873 | 0.002877 | 19525988 | 49.982260 | 0.051392 | 0.934544 | 937 | 0.185286 | 50.262527 |
29 | GTEX-N7MS-0225-SM-4E3HO | 1 | C1 | OK for analysis | 7.7 | Skin | Skin - Sun Exposed (Lower leg) | 16-19 hours | BP-36182 | RNA isolation_PAXgene Tissue miRNA | ... | 17258030 | 13502100 | 0.004377 | 17381617 | 50.178387 | 0.010061 | 0.937283 | 862 | 0.223826 | 50.288605 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4471 | K-562-SM-3GADY | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 20293042 | 21127232 | 0.003853 | 20085667 | 49.743217 | 0.010052 | 0.912978 | 877 | 0.323958 | 50.618893 |
4472 | K-562-SM-3GAFC | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 19387186 | 19565057 | 0.003505 | 19503152 | 50.149097 | 0.011422 | 0.890707 | 854 | 0.420687 | 50.387650 |
4473 | K-562-SM-3GIKB | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 12498139 | 12769804 | 0.003243 | 12654894 | 50.311604 | 0.010686 | 0.852001 | 849 | 0.388517 | 50.260590 |
4474 | K-562-SM-3GILO | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 18089388 | 18151281 | 0.004954 | 17946903 | 49.802307 | 0.010809 | 0.865095 | 876 | 0.278859 | 50.707745 |
4475 | K-562-SM-3K2BF | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 10709316 | 11153056 | 0.005191 | 10650786 | 49.862990 | 0.017607 | 0.801476 | 841 | 0.225620 | 50.603752 |
4476 | K-562-SM-3LK7S | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 19775330 | 21833559 | 0.002277 | 19642715 | 49.831787 | 0.018965 | 0.922582 | 882 | 0.312202 | 50.353050 |
4477 | K-562-SM-3MJHH | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 22850569 | 25074169 | 0.002691 | 22654052 | 49.784070 | 0.014306 | 0.938719 | 887 | 0.299666 | 50.487100 |
4478 | K-562-SM-3NB3I | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 21923991 | 23322261 | 0.003130 | 22010308 | 50.098236 | 0.022786 | 0.928603 | 872 | 0.299371 | 50.367115 |
4479 | K-562-SM-3NMAP | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 22210968 | 24363787 | 0.003103 | 22339747 | 50.144530 | 0.018574 | 0.936381 | 890 | 0.284208 | 50.140530 |
4480 | K-562-SM-3NMDG | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 20447451 | 22524234 | 0.002659 | 20583281 | 50.165520 | 0.017522 | 0.950120 | 873 | 0.267642 | 50.144337 |
4481 | K-562-SM-3P61Y | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 23221097 | 25541338 | 0.002788 | 23358263 | 50.147243 | 0.022186 | 0.946919 | 894 | 0.257213 | 50.123116 |
4482 | K-562-SM-46MWI | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 12960037 | 14028074 | 0.002139 | 13042438 | 50.158447 | 0.024899 | 0.956672 | 873 | 0.188689 | 50.133953 |
4483 | K-562-SM-47JYY | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 15251988 | 17187136 | 0.002230 | 15395095 | 50.233475 | 0.022935 | 0.957918 | 862 | 0.218666 | 49.976242 |
4484 | K-562-SM-48FEU | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 11679926 | 13207067 | 0.002256 | 11762756 | 50.176662 | 0.028136 | 0.953367 | 832 | 0.393895 | 50.098743 |
4485 | K-562-SM-48TE3 | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 14556993 | 16245672 | 0.001645 | 14501457 | 49.904438 | 0.013772 | 0.964781 | 873 | 0.197786 | 50.232384 |
4486 | K-562-SM-4AD4F | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 14758394 | 16187369 | 0.002012 | 14775511 | 50.028980 | 0.013740 | 0.962448 | 863 | 0.197407 | 50.208282 |
4487 | K-562-SM-4AT3W | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 10916665 | 12078403 | 0.005620 | 10964475 | 50.109250 | 0.013850 | 0.950125 | 828 | 0.172564 | 50.208042 |
4488 | K-562-SM-4B66B | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 22550061 | 24921589 | 0.002292 | 22611795 | 50.068350 | 0.022443 | 0.956932 | 892 | 0.230228 | 50.232353 |
4489 | K-562-SM-4BONS | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 15745843 | 17474655 | 0.003985 | 15727040 | 49.970127 | 0.020865 | 0.956510 | 876 | 0.215475 | 50.267246 |
4490 | K-562-SM-4BRWK | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 12423333 | 13669729 | 0.002084 | 12383723 | 49.920166 | 0.011200 | 0.958494 | 861 | 0.184230 | 50.298664 |
4491 | K-562-SM-4DM4W | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 11350047 | 12454038 | 0.005404 | 11400063 | 50.109924 | 0.029977 | 0.945595 | 828 | 0.181774 | 50.162186 |
4492 | K-562-SM-4EDPU | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 19954703 | 21371627 | 0.003999 | 20055440 | 50.125885 | 0.042664 | 0.944981 | 890 | 0.243209 | 50.225540 |
4493 | K-562-SM-4GICD | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 21863245 | 24097377 | 0.003084 | 21979888 | 50.133026 | 0.012700 | 0.929063 | 893 | 0.242783 | 50.196144 |
4494 | K-562-SM-4IHK7 | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 14231410 | 15796005 | 0.004426 | 14188282 | 49.924120 | 0.022133 | 0.949039 | 870 | 0.214134 | 50.295345 |
4495 | K-562-SM-4JBIQ | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 18279772 | 20403076 | 0.004726 | 18332662 | 50.072230 | 0.014611 | 0.954230 | 871 | 0.226838 | 50.212060 |
4496 | K-562-SM-4KKZ9 | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 17621411 | 19649733 | 0.003199 | 17603160 | 49.974094 | 0.017336 | 0.951078 | 855 | 0.208081 | 50.299175 |
4497 | K-562-SM-4LMI2 | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 16494770 | 18425439 | 0.003787 | 16498971 | 50.006367 | 0.011050 | 0.952641 | 866 | 0.197396 | 50.248615 |
4498 | K-562-SM-4LVKX | NaN | NaN | NaN | 9.5 | Bone Marrow | Cells - Leukemia cell line (CML) | NaN | BP-17177 | RNA isolation_Trizol Manual (Cell Pellet) | ... | 13735552 | 15244841 | 0.002508 | 13719276 | 49.970356 | 0.011353 | 0.952748 | 870 | 0.185469 | 50.244150 |
4499 | NA12878-SM-2XJZN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Cell Line DNA (Derived from Blood Cells) | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4500 | NA12878_C-SM-2VCTR | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Cell Line DNA (Derived from Blood Cells) | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4501 rows × 59 columns
Ugh, this has the UGLIEST column names. What does SMTSISCH
really mean, anyway? Plus, there's 59 columns and I don't want to have to go through and copy/paste something 59 different times.
Turns out the excel file GTEx_Data_2014-01-17_Annotations_SampleAttributesDD.xlsx
has the mapping between these weird names and human-readable concepts. Open it Excel.
Since there's so many things to rename, we'll do it programmatically rather than copy/pasting by hand.
GTEx_Data_2014-01-17_Annotations_SampleAttributesDD.xlsx
using pandas
, and set the first column as the "index" or row names¶Hint 1: index_col
is the argument you want for setting the column number that should be the index
Hint 2: In computer science, we count from 0, so the third column is indicated by the number 2
Hint 3: "how to open excel in pandas" is a great search term :)
# Code to read the excel file GTEx_Data_2014-01-17_Annotations_SampleAttributesDD.xlsx, and set the first column as the "index" goes here
sample_attributes_dd = pd.read_excel('GTEx_Data_2014-01-17_Annotations_SampleAttributesDD.xlsx', index_col=0)
# Code to look at the top of the file goes here
sample_attributes_dd.head()
VARDESC | DOCFILE | TYPE | UNITS | COMMENT1 | COMMENT2 | VALUES | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | ... | Unnamed: 47 | Unnamed: 48 | Unnamed: 49 | Unnamed: 50 | Unnamed: 51 | Unnamed: 52 | Unnamed: 53 | Unnamed: 54 | Unnamed: 55 | Unnamed: 56 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VARNAME | |||||||||||||||||||||
SAMPID | Sample ID, GTEx Public Sample ID | NaN | string | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SMATSSCR | Autolysis Score | PRC Case Summary Report | integer, encoded value | NaN | Autolysis | The destruction of organism cells or tissues b... | 0=None | 1=Mild | 2=Moderate | 3=Severe | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SMNABTCH | Nucleic Acid Isolation Batch ID | LDACC | string | NaN | Generated at LDACC | Batch when DNA/RNA was isolated and extracted ... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SMNABTCHT | Type of nucleic acid isolation batch | LDACC | string | NaN | Generated at LDACC | The process by which DNA/RNA was isolated | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
SMNABTCHD | Date of nucleic acid isolation batch | LDACC | string | NaN | Generated at LDACC | The date on which DNA/RNA was isolated | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 56 columns
Which column do we want to use to rename the weird column names in the other dataframe? Let's print it.
sample_attributes_dd
, that describes the column names in sample_attribute
¶# Code for showing the descriptive column in the dataframe `sample_attributes_dd`
Since the index
of sample_attributes_dd
is the column names of sample_attributes
, any column from sample_attributes_dd
is a mapping from column names to some other values, depending on what column you use.
What's really nice about this is that we can then use one of these columns in sample_attributes_dd
to rename the column names in sample_attributes
. Here's an example of renaming the columns of sample_attributes
using the column 'TYPE'
in sample_attributes_dd
.
sample_attributes.rename(columns=sample_attributes_dd['TYPE'])
sample_attributes
using the variable description column from sample_attribute_dd
.¶# Code to rename the columns of sample_attributes using a column from sample_attributes_dd
sample_attributes = sample_attributes.rename(columns=sample_attributes_dd.VARDESC)
# Code for looking at the top of the new sample_attributes goes here
sample_attributes
sample_attributes.ix[:5, :20]
Excellent! Now we have a dataframe with human-readable column names.
Remember that first dataframe that we created, subject_phenotypes
? We want to unify that first dataframe with this new one. Let's take a gander at it to remember what's in it in the first place.
# Code for looking at the top of `subject_phenotypes` goes here
subject_phenotypes
and sample_attributes
look like they could be matched up?¶Note: they don't have to be exactly the same values, because we can modify them, but look for commonalities.
# Exercise: Code to show a column of `subject_phenotypes` goes here
# Exercise: Code to show a column of `sample_attributes` goes here
Before we can make a new column in sample_attributes
with the corresponding sample id of sample_phenotypes
, we need to go over some string manipulation techniques.
For example, we can take a string and split
it. By default, they will be split on the whitespace (like spaces and new lines)
s = 'you have as many hours in a day as beyonce'
s.split()
If you don't want to split on whitespace, you can specify a specific letter or character to split on, too.
s.split('d')
Complete sentence = it has a subject, object and a verb.
s
above on the letter "a"¶# Code to split `s` goes here
The result of s.split()
is a list
, which is a special name in Python. Look, it even comes in a special color, different from black, so you can see how special it is:
list
We can access elements from the split using a number in square brackets.
s.split()[4]
'day'
from s.split()
¶# Code to get "day" from s.split() goes here
What if we want to access mutiple items at once? We can use the colon ":
" to indicate we want everything up to (but not including) the Nth item. For example, if we want the first 5 words, we can do:
s.split()[:5]
s
on "a", and get the first 3 elements¶# Code to split `s` on "a", and get the first 3 elements
Now we're able to split a string, but what if we want to put it back together? We can join
the results.
' '.join(s.split()[:5])
Notice that we used a space to join the words. We could have used any character to join
them (as well as any character to split
):
'!'.join(s.split('e')[:4])
s
on 3 different characters, and join on 3 different characters.¶# Code for split 1, join 1 goes here
# Code for split 2, join 2 goes here
# Code for split 3, join 3 goes here
We can use what we just learned about manipulating strings, on columns of dataframe using lambda
, which allows us to create small functions. For example, if we wanted to take column 'SUBJID'
from the dataframe subject_phenotypes
, split every item on the dash character '-'
, and get the first item, we would do this:
subject_phenotypes['SUBJID'].map(lambda x: x.split('-')[0])
"Type of nucleic acid isolation batch"
in sample_attributes
on whitespace, and get the 3rd element.¶# Code to split the column 'Type of nucleic acid isolation batch' on whitespace and get the third element goes here
The data may not have a string in every element of a column. For example, this code produces an AttributeError
.
sample_attributes['Tissue Type, more specific detail of tissue type'].map(lambda x: x.split('-')[0])
The above code produces the error,
<ipython-input-157-6714d86271d7> in <lambda>(x)
----> 1 sample_attributes['Tissue Type, more specific detail of tissue type'].map(lambda x: x.split('-')[0])
AttributeError: 'float' object has no attribute 'split'
Which happens because instead of a nice string, there is a float
there, and float
s don't know how to be split
. Why is that? Well, NAs are of type float
, so this indicates that there's an NA there.
To deal with this, we can add an if
statement to our lambda
to make it deal with these situations. We will use the function isinstance
to check if x
is a string (the special word for a string in Python is str
), and replace it with an NA using the numpy
library (which we imported as np
for shorthand) np.nan
.
sample_attributes['Tissue Type, more specific detail of tissue type'].map(lambda x: x.split('-')[0] if isinstance(x, str) else np.nan)
'Code for BSS collection site'
on a comma ",
" and get the first two items of the split, accounting for NAs¶# Code goes here
In addition to split
-ing elements, we can join
within the map
/lambda
combo too!
sample_attributes['Tissue Type, more specific detail of tissue type'].map(lambda x: '_'.join(x.split()[:3])
if isinstance(x, str) else np.nan)
"Type of nucleic acid isolation batch"
in sample_attributes
on underscores, and join the first two elements using a space¶# Code goes here
Now we have all the tools to add a "subject_id"
column to sample_attributes
! Remember, we want to create a column which has exactly the same entries as the column "SUBJID"
in subject_phenotypes
. What was that again? It's been so long that I forgot what those IDs look like. To remind yourself what the subject IDs in subject_phenotypes
and sample ids in sample_attributes
look like, take a look at the top of each of those dataframes.
# Code to look at the top of `subject_phenotypes`
# Code to look at the top of `sample_attributes`
sample_attributes
called "subject_id"
, using one of its existing columns, that matches the "SUBJID"
in subject_phenotypes
exactly¶# Code goes here
Excellent! Now we have a column which exactly matches the rows of subject_phenotypes
to the rows of sample_attributes
. Now we want to merge these two dataframes together. How do we do that? We will use the function merge
, which is a function of the dataframe. Merge is a little complicated, so let's break it down with a few examples.
First, we'll create a couple example dataframes.
dataframe1 = pd.DataFrame([['cucumber', 'watery'], ['broccoli', 'crunchy'], ['kale', 'chewy'],
['mango', 'sweet'] ], columns=['vegetable', 'description'])
dataframe1
dataframe2 = pd.DataFrame([['broccoli', 'harvested', 8], ['broccoli', 'planted', 5],
['kale', 'planted', 6], ['kale', 'harvested', 9],
['cucumber', 'harvested', 7], ['cucumber', 'planted', 4],
['strawberry', 'planted', 10], ['strawberry', 'harvested', 2]],
columns=['crop', 'action', 'number'])
dataframe2
We want to merge
these two dataframes on their common column, which is "vegetable"
in dataframe1
and crop
in dataframe2
. We can do this using merge
, and specifying what we want to merge the left and right dataframes on.
dataframe1.merge(dataframe2, left_on='vegetable', right_on='crop')
Use "Shift"+"Tab" to read the documentation behind merge
. What's the default way that two dataframes are merged?
dataframe1
and dataframe2
, but use dataframe2
as the "left" dataframe, and merge using "outer"
¶# Code goes here
Now we're ready to try this with real data! Let's go back to our sample_attributes
and subject_phenotypes
dataframes. To recap, We added a column to sample_attributes
to match up with subject_phenotypes
. Now, merge the two dataframes together using their columns with common values.
sample_attributes
and subject_phenotypes
on their common column.¶# Code goes here
Congratulations! You have now performed a DATABASE MERGE!!! Now you can't be afraid of databases! Mwahahahah!
All a "database" really is, is a bunch of tables linked together by certain 'keys', aka the values in the columns. What you've done today is manipulate some tables (aka databases), changing column names and adding columns so they're mergable, and merging them together.
Concepts from today (there's a lot!):
head
and tail
to look at the beginings and ends of files-n N
to modify the number of lines output by head
and tail
wc -l
to count the number of lines in a filepandas
, specifically pd.read_table
.head()
and .tail()
to look at the tops and bottoms of dataframes.head(N)
and .tail(N)
to look at the top and bottom N
rowspandas
DataFrames
map
instead of for
-loops to operate on every item of a columnpandas
DataFrames
pandas
pandas
index
(aka row names) when you read in the filesplit
on any characterssplit
is a list[3]
shows the 4th elementN
elements of a list using square brackets, a colon, and the number, e.g. [:5]
shows up to, but not including, the 6th elementjoin
join
can be on any characterlambda
to create an "anonymous" function to use within map
lambda
to split and join strings within a columnfloat
isinstance
lambda
can contain an if
statement for alternative outputselse
statement as wellmap
and lambda
to do a complicated operation on each item of the column