BIOM262 January 29th, 2015: Cleaning data

Instructions:

  1. Download this IPython notebook from NBViewer (upper right corner)
  2. Download the biom262_2015_01_29_cleaning_data.zip file emailed out
  3. Unzip the zip file (this will create a folder)
  4. Open up the terminal and navigate to the folder biom262_2015_01_29_cleaning_data
  5. Start an IPython notebook server by typing ipython notebook into the terminal

For this activity, work in pairs. I will give you a blue and a pink sticky note. Put the pink one on one of your laptops to show that you're stuck or have a question. Put the blue one on your laptop if you and your partner are cruisin' through the exercises and don't want to be bothered.

Background

Whenever you work with any kind of data, such as your own Excel spreadsheets or data downloaded from a paper that you'd like to analyze, 99.9999999999% of the time it's not formatted well, and you have to do a bunch of manual cleaning. This is a real-world example of four different metadata files describing post-mortem RNA-sequencing data from 212 post-mortem subjects and 2 cell lines, from 32 different tissues, for a total of 4500 samples. In other words, this is not a dataset you would enjoy cleaning up in Excel! You can read more about this project at the GTEx portal.

The goal of this exercise is to create a single, easily human-readable table from two .txt files, and two Excel files. Because this is such a big project, they had to standardize everything and their tables are at least consistently formatted. But there's a lot of jargon in the data that's specific to this session, which we will replace with human-readable terms. And in 2 main tables, GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt and GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt, the sample identifiers aren't exactly the same, so we'll have to do extra work to merge them.

One package that doesn't come with the Anaconda Python distribution is seaborn, so to install it, run this command.

IPython Tips:

  • When it says "run this command," it means to press "Shift" and "Enter" together. If there's a cell with some code in it, that implies to run the cell and look at the output.
  • If you want help on any Python function or object, you can type "list??" and it will pop up help at the bottom of the screen.
  • IF you want help on any Python function, like pd.read_table(), i.e. the ones that have parentheses, you can get it by moving your cursor in between the parentheses, and pressing "Shift" and "Tab". This will pop up a help window next to the parentheses. Press "Tab" once more, and the window will get bigger. Press "Tab" a third time, and it will pop up a big help screen at the bottom.
In [ ]:
! pip install seaborn

Let's import everything we need for this session.

In [ ]:
# Pandas or "Panel Data Analysis" toolkit for Data Frames/Data Tables
import pandas as pd

# Numpy or "Numerical Python"
import numpy as np

# Powerful R-style/statistical plotting
import seaborn as sns

# These styles are my personal preferences
# For more options, see this page: http://web.stanford.edu/~mwaskom/software/seaborn/tutorial/aesthetics.html
sns.set(style='whitegrid', context='notebook')

# Show the figures directly in the IPython notebok
%matplotlib inline

Initial inspection of files with Unix commands

Note: all these commands are meant to be run *within* the notebook, not on the terminal

The unix command cd moves where the notebook looks for data, and ls will also list the files. These are some key Unix commands that can be used without the exclamation point "!", as we will use later.

Change directories to where you downloaded the example data. For me, that's the directory "~/Downloads/biom262_2015_01_29_cleaning_data". Remember that the character "~" (pronounced "tilde") indicates your home directory, which for me is /Users/olga, but I didn't feel like typing that out, so I used the tilde instead.

In [ ]:
cd ~/Downloads/biom262_2015_01_29_cleaning_data

And make sure you have all the right files around

In [ ]:
ls

Let's start with the SubjectPhenotypes file, GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt. The first step is to look at the first 10 lines of the file with the Unix command head. In the IPython notebook, you can call unix/bash commands by starting the line with an exclamation point, !.

In [ ]:
! head GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt

So there's some generic sample, a gender of 1 or 2, a range of years, and some kind of DTTHDY thing. I don't know what that means, but we'll deal with that once we open the file in pandas. It looks like it ranges from 0 to 4, but I don't see any entries of 1. By default, head outputs the first 10 lines. We can modify the number of lines with the flag -n and then provide a number. For example, let's look at the first 23 lines of the file instead.

In [ ]:
! head -n 23 GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt

Still no entries with a DTHHRDY of 1. Maybe we need to look at the end of the file. We can do that with the command tail.

In [ ]:
! tail GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt

Ooh, our first 1 in DTHHRDY! Hmm, some of the rows don't have a value for the DTHHRDY column! This will come into play in the future.

Exercise: Look at the last 17 lines of the file.

Let's stop for a brief exercise. How would you look at the last 17 lines of the file?

In [ ]:
# Exercise: modify the `tail` command below to look at the last 17 lines of the file.

! tail GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt

Another question we may have about this file is how many lines are in it. We can do this with the Unix command wc, or "word, line, character and byte counter." Specifically wc -l will count the number of lines in the file.

In [ ]:
! wc -l GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt

Exercise: How do you count the number of columns in a file?

In [ ]:
# Exercise: do a web search for "unix count number of columns" and 
# check the command on the file, GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt

! # Column-counting code goes here

Reading tabular data with the pandas library in Python

We will make heavy use of the pandas library, which is a godsend to Pythonista Data Scientists. It makes working with weirdly formatted data much easier, as you will soon see.

In [ ]:
subject_phenotypes = pd.read_table('GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DS.txt')

This has created a pandas DataFrame variable called subject_phenotypes. Python's standards are to name variables lowercase_with_underscores, so we'll stick to that :)

Let's look at the top of subject_phenotypes, again with a command called head, but we call it a little differently now that we're in Python and not Unix (notice no "!" at the beginnings of the lines anymore)

In [ ]:
subject_phenotypes.head()

The pandas head shows the first 5 rows, instead the first 10 rows like Unix head.

We can also access individual columns in two ways. The first way is by using square brackets around the string of the column name, like this for SUBJID:

In [ ]:
# Run this cell. Try entering 'subjid' and 'subject id' as well.
subject_phenotypes['SUBJID']

If a column name consists of only letters, numbers, and underscores, and starts with a letter, you can also access it with the column name, no quotes, after the dataframe name and a dot.

In [ ]:
subject_phenotypes.SUBJID

You may hear the word "series" get tossed around. A "series" is the pandas-specific technical name for a column of a dataframe.

Exercise: How would you look at the last 8 rows of the column AGE?

Hint: head and tail work for series as well as whole dataframes.

In [ ]:
# Exercise: Show the last 8 rows.

# Code for looking at the last 8 rows goes here.

Converting data types from one to another

Let's get to these GENDER and DTHHRDY columns. Open up the file, GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DD.xlsx in Excel, and see what a gender of 1 and 2 means.

In [ ]:
# A python "dictionary", or mapping from one thing to the next.
# In this case, we're mapping the numbers 1 and 2 to strings
# indicating the gender.
gender = {1: 'fillmein',
          2: 'fillmein'}

You can access items of a dictionary with square brackets, much like the columns of a dataframe.

In [ ]:
# Run this cell. How do you access the other gender?
gender[2]

Now, we can create a new column called 'gender' (all lowercase, because I feel like ALL CAPS COLUMNS ARE YELLING AT ME), using the gender dictionary.

In pandas, you can create a new column by pretending to access an existing column in the dataframe, and assigning it to some value. Here's an example of creating a new column called "don't worry" with the value "be yonce" in every cell.

In [ ]:
subject_phenotypes["don't worry"] = "be yonce"

Look at the top of the dataframe to see what that did.

In [199]:
# Code to look at the top of the dataframe goes here 
# hint: remember the command "head"? How did we use it to look at the dataframe when we first loaded it?

Exercise: Add a column with a name and value of your choice

In [ ]:
# Code to add a column with programmer's choice of name and value goes here

Another convenient operation is map on a series, which performs the operation specified on every element of a column. It's as if you wrote a for-loop to access every item of the GENDER column, and use that item to access the gender dictionary, and replace the value.

In [ ]:
subject_phenotypes.GENDER.map(gender)

Using map, it's as if we wrote the following for-loop, but map is less code, and more concise.

In [ ]:
for g in subject_phenotypes.GENDER:
    print gender[g]

Has this changed the dataframe subject_phenotypes?

In [ ]:
# Code to check if the dataframe subject_phenotypes has changed goes here
# You can check if it has changed by looking at it using your favorite body endpoint

The dataframe shouldn't have changed.

Exercise: What happens when you map the dictionary gender onto the column DTHHRDY?

In [197]:
# Code to `map` `gender` onto DTHHRDY goes here

Exercise: Combine bracket-based column creation and map on a series

Your next exercise is to combine these previous two concepts of creating a column and using map, to create a column called "gender" (all lowercase) which is the result of using map with the dictionary gender on the GENDER column of subject_phenotypes.

In [ ]:
# Code to create a new column called "gender" in `subject_phenotypes` that is the result of using `map` 
# with the `gender` dictionary on the "GENDER" column of `subject_phenotypes`.


# Code to check if the dataframe has changed goes here

Great! Now we have a column called gender, that makes sense to a human without having to look something up in some other table.

Exercise: Convert the DTHHRDY column values into human-readable values

Using the GTEx_Data_2014-01-17_Annotations_SubjectPhenotypes_DD.xlsx spreadsheet again, find out what the different numbers in DTHHRDY mean, make a dictionary mapping numbers to words like we did with gender, and create a new column with a human-understandable name.

  • Test for human-understandable: you could show the column name to someone who doesn't know the data, and they understand it without you doing any extra explaining
In [ ]:
# Code to create a new column in `subject_phenotypes` that is a human-readable version of the column `DTHHRDY` goes here

Now that we have our cleaned-up dataframe, let's do some plotting!

Let's use seaborn (which we imported as the variable sns for brevity) to plot this. We will use the function factorplot which has a bunch of options, but for now we'll just focus on two.

The first argument is the name of the column you want to plot, and then we provide the keyword argument data=subject_phenotypes, to specify the dataframe we want to get this column from.

In [ ]:
sns.factorplot('gender', data=subject_phenotypes)

Nice, so we can now see the distribution of the number of subjects of these two genders.

What if we also want to see how many people of the two genders, have different DTHHRDY categorizations?

Exercise: Plot the distribution of both gender and DTHHRDY

Change human_variable_DTHHRDY to the new column you created.

In [ ]:
# Edit the argument `hue=...`
sns.factorplot('gender', data=subject_phenotypes, hue=human_readable_DTHHRDY)

You can break this down even more by also plotting the age of the subjects, and showing a separate plot, as below. The argument col='AGE' means to plot each age group onto a separate column of plots.

In [ ]:
sns.factorplot('gender', data=subject_phenotypes, hue='DTHHRDY', col='AGE')

Exercise: Plot Age as the main x variable, and 'gender' as each column

In [ ]:
# Code goes below

Combining information across dataframes

So far, we've been working with one pandas dataframe, and an external *.xlsx file. Now we're going to work on combining sample_phenotypes with a new dataframe, which we will call sample_attributes.

Exercise: Inspect the new data table with Unix, and read it in using Python

In [ ]:
# Code to look at the top of GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt goes here
In [ ]:
# Code to count the number of lines in GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt goes here
In [ ]:
# Code to count the number of columns in GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt goes here
In [196]:
# Code to read the table GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt goes here
pd.read_table('GTEx_Data_2014-01-17_Annotations_SampleAttributesDS.txt')

# Code to look at the top of the `DataFrame` you just created
Out[196]:
SAMPID SMATSSCR SMCENTER SMPTHNTS SMRIN SMTS SMTSD SMTSISCH SMNABTCH SMNABTCHT ... SME1ANTI SMSPLTRD SMBSMMRT SME1SNSE SME1PCTS SMRRNART SME1MPRT SMNUM5CD SMDPMPRT SME2PCTS
0 GTEX-N7MS-0007-SM-26GME NaN C1 NaN 8.2 Blood Whole Blood 16-19 hours BP-16653 RNA isolation_PAXgene Blood RNA (Manual) ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 GTEX-N7MS-0007-SM-26GMV NaN C1 NaN 8.2 Blood Whole Blood 16-19 hours BP-16653 RNA isolation_PAXgene Blood RNA (Manual) ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 GTEX-N7MS-0007-SM-2D43E NaN C1 NaN 8.2 Blood Whole Blood 16-19 hours BP-16653 RNA isolation_PAXgene Blood RNA (Manual) ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 GTEX-N7MS-0007-SM-2D7W1 NaN C1 NaN 8.2 Blood Whole Blood 16-19 hours BP-16653 RNA isolation_PAXgene Blood RNA (Manual) ... 13705136 18432744 0.002456 13447728 49.526005 0.041526 0.835199 840 0.563503 51.361324
4 GTEX-N7MS-0008-SM-4E3JI NaN C1 NaN 10.0 Skin Cells - Transformed fibroblasts NaN BP-37581 RNA isolation_Trizol Manual (Cell Pellet) ... 17962165 20910366 0.004087 18012435 50.069874 0.028395 0.948329 879 0.226835 50.270794
5 GTEX-N7MS-0009-SM-2BWY4 NaN C1 NaN NaN Blood Whole Blood 16-19 hours BP-16657 DNA isolation_Whole Blood _QIAGEN Puregene (Ma... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 GTEX-N7MS-0009-SM-2XK1D NaN C1 NaN NaN Blood Whole Blood 16-19 hours BP-16657 DNA isolation_Whole Blood _QIAGEN Puregene (Ma... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 GTEX-N7MS-0011-R10A-SM-2HMJK NaN C1, A1 NaN 7.1 Brain Brain - Frontal Cortex (BA9) NaN BP-19253 RNA isolation_QIAGEN miRNeasy ... 18948398 12221905 0.004294 18747238 49.733180 0.051237 0.875680 859 0.330709 50.619534
8 GTEX-N7MS-0011-R10A-SM-2IZJW NaN C1, A1 NaN 7.1 Brain Brain - Frontal Cortex (BA9) NaN BP-19253 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 GTEX-N7MS-0011-R11A-SM-2HMJS NaN C1, A1 NaN 6.6 Brain Brain - Cerebellar Hemisphere NaN BP-19253 RNA isolation_QIAGEN miRNeasy ... 19024292 12200496 0.003643 18954711 49.908398 0.016711 0.893391 851 0.193112 50.387028
10 GTEX-N7MS-0011-R11A-SM-2IZJZ NaN C1, A1 NaN 6.6 Brain Brain - Cerebellar Hemisphere NaN BP-19253 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 GTEX-N7MS-0011-R1a-SM-2AXVJ NaN C1, A1 NaN 7.3 Brain Brain - Hippocampus NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 GTEX-N7MS-0011-R1a-SM-2HMJG NaN C1, A1 NaN 7.3 Brain Brain - Hippocampus NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... 15226514 8806379 0.004332 15122489 49.828620 0.041028 0.790736 835 0.324148 50.618090
13 GTEX-N7MS-0011-R2a-SM-2HML6 NaN C1, A1 NaN 7.0 Brain Brain - Substantia nigra NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... 11678187 8637924 0.004073 11575007 49.778137 0.028304 0.628574 837 0.275110 50.561234
14 GTEX-N7MS-0011-R2a-SM-2IZK7 NaN C1, A1 NaN 7.0 Brain Brain - Substantia nigra NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 GTEX-N7MS-0011-R3a-SM-2AXVU NaN C1, A1 NaN 7.6 Brain Brain - Anterior cingulate cortex (BA24) NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 GTEX-N7MS-0011-R3a-SM-2HMKD NaN C1, A1 NaN 7.6 Brain Brain - Anterior cingulate cortex (BA24) NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... 8114816 4205357 0.004996 8085575 49.909750 0.004037 0.151460 806 0.364988 50.379780
17 GTEX-N7MS-0011-R3a-SM-33HC6 NaN C1, A1 NaN 7.6 Brain Brain - Anterior cingulate cortex (BA24) NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... 36663415 22956649 0.003256 36619121 49.969780 0.033465 0.916319 875 0.341985 50.874683
18 GTEX-N7MS-0011-R4a-SM-2AXW2 NaN C1, A1 NaN 6.2 Brain Brain - Amygdala NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 GTEX-N7MS-0011-R4a-SM-2HMKW NaN C1, A1 NaN 6.2 Brain Brain - Amygdala NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... 11651563 8137082 0.004180 11545498 49.771380 0.035598 0.657749 811 0.334207 50.741714
20 GTEX-N7MS-0011-R5a-SM-2AXW7 NaN C1, A1 NaN 7.8 Brain Brain - Caudate (basal ganglia) NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 GTEX-N7MS-0011-R5a-SM-2HMK8 NaN C1, A1 NaN 7.8 Brain Brain - Caudate (basal ganglia) NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... 21641531 13525503 0.004141 21325223 49.631920 0.050445 0.889377 891 0.312833 50.695858
22 GTEX-N7MS-0011-R6a-SM-2AXWD NaN C1, A1 NaN 7.6 Brain Brain - Nucleus accumbens (basal ganglia) NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
23 GTEX-N7MS-0011-R6a-SM-2HMJ4 NaN C1, A1 NaN 7.6 Brain Brain - Nucleus accumbens (basal ganglia) NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... 17668410 12313288 0.003942 17447282 49.685143 0.040960 0.883223 854 0.262580 50.615242
24 GTEX-N7MS-0011-R7a-SM-2AXV5 NaN C1, A1 NaN 6.4 Brain Brain - Putamen (basal ganglia) NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 GTEX-N7MS-0011-R7a-SM-2HMKN NaN C1, A1 NaN 6.4 Brain Brain - Putamen (basal ganglia) NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... 10366169 5352672 0.004739 9997472 49.094720 0.058973 0.664810 791 0.400345 51.510887
26 GTEX-N7MS-0011-R8a-SM-2AXVD NaN C1, A1 NaN 6.8 Brain Brain - Hypothalamus NaN BP-17395 RNA isolation_QIAGEN miRNeasy ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 GTEX-N7MS-0011-R8a-SM-2YUMK NaN C1, A1 NaN 6.8 Brain Brain - Hypothalamus NaN BP-17395 RNA isolation_Trizol Manual (Cell Pellet) ... 12317489 9750793 0.002769 12167817 49.694363 0.054535 0.817677 820 0.366260 50.655490
28 GTEX-N7MS-0126-SM-3TW8O 3 C1 OK 9.1 Testis Testis 16-19 hours BP-16740 RNA isolation_PAXgene Tissue miRNA ... 19539848 15703873 0.002877 19525988 49.982260 0.051392 0.934544 937 0.185286 50.262527
29 GTEX-N7MS-0225-SM-4E3HO 1 C1 OK for analysis 7.7 Skin Skin - Sun Exposed (Lower leg) 16-19 hours BP-36182 RNA isolation_PAXgene Tissue miRNA ... 17258030 13502100 0.004377 17381617 50.178387 0.010061 0.937283 862 0.223826 50.288605
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4471 K-562-SM-3GADY NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 20293042 21127232 0.003853 20085667 49.743217 0.010052 0.912978 877 0.323958 50.618893
4472 K-562-SM-3GAFC NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 19387186 19565057 0.003505 19503152 50.149097 0.011422 0.890707 854 0.420687 50.387650
4473 K-562-SM-3GIKB NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 12498139 12769804 0.003243 12654894 50.311604 0.010686 0.852001 849 0.388517 50.260590
4474 K-562-SM-3GILO NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 18089388 18151281 0.004954 17946903 49.802307 0.010809 0.865095 876 0.278859 50.707745
4475 K-562-SM-3K2BF NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 10709316 11153056 0.005191 10650786 49.862990 0.017607 0.801476 841 0.225620 50.603752
4476 K-562-SM-3LK7S NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 19775330 21833559 0.002277 19642715 49.831787 0.018965 0.922582 882 0.312202 50.353050
4477 K-562-SM-3MJHH NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 22850569 25074169 0.002691 22654052 49.784070 0.014306 0.938719 887 0.299666 50.487100
4478 K-562-SM-3NB3I NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 21923991 23322261 0.003130 22010308 50.098236 0.022786 0.928603 872 0.299371 50.367115
4479 K-562-SM-3NMAP NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 22210968 24363787 0.003103 22339747 50.144530 0.018574 0.936381 890 0.284208 50.140530
4480 K-562-SM-3NMDG NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 20447451 22524234 0.002659 20583281 50.165520 0.017522 0.950120 873 0.267642 50.144337
4481 K-562-SM-3P61Y NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 23221097 25541338 0.002788 23358263 50.147243 0.022186 0.946919 894 0.257213 50.123116
4482 K-562-SM-46MWI NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 12960037 14028074 0.002139 13042438 50.158447 0.024899 0.956672 873 0.188689 50.133953
4483 K-562-SM-47JYY NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 15251988 17187136 0.002230 15395095 50.233475 0.022935 0.957918 862 0.218666 49.976242
4484 K-562-SM-48FEU NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 11679926 13207067 0.002256 11762756 50.176662 0.028136 0.953367 832 0.393895 50.098743
4485 K-562-SM-48TE3 NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 14556993 16245672 0.001645 14501457 49.904438 0.013772 0.964781 873 0.197786 50.232384
4486 K-562-SM-4AD4F NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 14758394 16187369 0.002012 14775511 50.028980 0.013740 0.962448 863 0.197407 50.208282
4487 K-562-SM-4AT3W NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 10916665 12078403 0.005620 10964475 50.109250 0.013850 0.950125 828 0.172564 50.208042
4488 K-562-SM-4B66B NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 22550061 24921589 0.002292 22611795 50.068350 0.022443 0.956932 892 0.230228 50.232353
4489 K-562-SM-4BONS NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 15745843 17474655 0.003985 15727040 49.970127 0.020865 0.956510 876 0.215475 50.267246
4490 K-562-SM-4BRWK NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 12423333 13669729 0.002084 12383723 49.920166 0.011200 0.958494 861 0.184230 50.298664
4491 K-562-SM-4DM4W NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 11350047 12454038 0.005404 11400063 50.109924 0.029977 0.945595 828 0.181774 50.162186
4492 K-562-SM-4EDPU NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 19954703 21371627 0.003999 20055440 50.125885 0.042664 0.944981 890 0.243209 50.225540
4493 K-562-SM-4GICD NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 21863245 24097377 0.003084 21979888 50.133026 0.012700 0.929063 893 0.242783 50.196144
4494 K-562-SM-4IHK7 NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 14231410 15796005 0.004426 14188282 49.924120 0.022133 0.949039 870 0.214134 50.295345
4495 K-562-SM-4JBIQ NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 18279772 20403076 0.004726 18332662 50.072230 0.014611 0.954230 871 0.226838 50.212060
4496 K-562-SM-4KKZ9 NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 17621411 19649733 0.003199 17603160 49.974094 0.017336 0.951078 855 0.208081 50.299175
4497 K-562-SM-4LMI2 NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 16494770 18425439 0.003787 16498971 50.006367 0.011050 0.952641 866 0.197396 50.248615
4498 K-562-SM-4LVKX NaN NaN NaN 9.5 Bone Marrow Cells - Leukemia cell line (CML) NaN BP-17177 RNA isolation_Trizol Manual (Cell Pellet) ... 13735552 15244841 0.002508 13719276 49.970356 0.011353 0.952748 870 0.185469 50.244150
4499 NA12878-SM-2XJZN NaN NaN NaN NaN NaN NaN NaN NaN Cell Line DNA (Derived from Blood Cells) ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4500 NA12878_C-SM-2VCTR NaN NaN NaN NaN NaN NaN NaN NaN Cell Line DNA (Derived from Blood Cells) ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

4501 rows × 59 columns

Ugh, this has the UGLIEST column names. What does SMTSISCH really mean, anyway? Plus, there's 59 columns and I don't want to have to go through and copy/paste something 59 different times.

Turns out the excel file GTEx_Data_2014-01-17_Annotations_SampleAttributesDD.xlsx has the mapping between these weird names and human-readable concepts. Open it Excel.

Since there's so many things to rename, we'll do it programmatically rather than copy/pasting by hand.

Exercise: Read the excel file GTEx_Data_2014-01-17_Annotations_SampleAttributesDD.xlsx using pandas, and set the first column as the "index" or row names

Hint 1: index_col is the argument you want for setting the column number that should be the index

Hint 2: In computer science, we count from 0, so the third column is indicated by the number 2

Hint 3: "how to open excel in pandas" is a great search term :)

In [195]:
# Code to read the excel file GTEx_Data_2014-01-17_Annotations_SampleAttributesDD.xlsx, and set the first column as the "index" goes here
sample_attributes_dd = pd.read_excel('GTEx_Data_2014-01-17_Annotations_SampleAttributesDD.xlsx', index_col=0)

# Code to look at the top of the file goes here
sample_attributes_dd.head()
Out[195]:
VARDESC DOCFILE TYPE UNITS COMMENT1 COMMENT2 VALUES Unnamed: 8 Unnamed: 9 Unnamed: 10 ... Unnamed: 47 Unnamed: 48 Unnamed: 49 Unnamed: 50 Unnamed: 51 Unnamed: 52 Unnamed: 53 Unnamed: 54 Unnamed: 55 Unnamed: 56
VARNAME
SAMPID Sample ID, GTEx Public Sample ID NaN string NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
SMATSSCR Autolysis Score PRC Case Summary Report integer, encoded value NaN Autolysis The destruction of organism cells or tissues b... 0=None 1=Mild 2=Moderate 3=Severe ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
SMNABTCH Nucleic Acid Isolation Batch ID LDACC string NaN Generated at LDACC Batch when DNA/RNA was isolated and extracted ... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
SMNABTCHT Type of nucleic acid isolation batch LDACC string NaN Generated at LDACC The process by which DNA/RNA was isolated NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
SMNABTCHD Date of nucleic acid isolation batch LDACC string NaN Generated at LDACC The date on which DNA/RNA was isolated NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 56 columns

Which column do we want to use to rename the weird column names in the other dataframe? Let's print it.

Exercise: Print the descriptive column in sample_attributes_dd, that describes the column names in sample_attribute

In [ ]:
# Code for showing the descriptive column in the dataframe `sample_attributes_dd`

Since the index of sample_attributes_dd is the column names of sample_attributes, any column from sample_attributes_dd is a mapping from column names to some other values, depending on what column you use.

What's really nice about this is that we can then use one of these columns in sample_attributes_dd to rename the column names in sample_attributes. Here's an example of renaming the columns of sample_attributes using the column 'TYPE' in sample_attributes_dd.

In [ ]:
sample_attributes.rename(columns=sample_attributes_dd['TYPE'])

Exercise: Rename the columns of sample_attributes using the variable description column from sample_attribute_dd.

In [ ]:
# Code to rename the columns of sample_attributes using a column from sample_attributes_dd
sample_attributes = sample_attributes.rename(columns=sample_attributes_dd.VARDESC)

# Code for looking at the top of the new sample_attributes goes here
sample_attributes
In [ ]:
sample_attributes.ix[:5, :20]

Excellent! Now we have a dataframe with human-readable column names.

Remember that first dataframe that we created, subject_phenotypes? We want to unify that first dataframe with this new one. Let's take a gander at it to remember what's in it in the first place.

In [ ]:
# Code for looking at the top of `subject_phenotypes` goes here

Exercise: Which columns of subject_phenotypes and sample_attributes look like they could be matched up?

Note: they don't have to be exactly the same values, because we can modify them, but look for commonalities.

In [ ]:
# Exercise: Code to show a column of `subject_phenotypes` goes here
In [ ]:
# Exercise: Code to show a column of `sample_attributes` goes here

Detour: working with strings in Python

Before we can make a new column in sample_attributes with the corresponding sample id of sample_phenotypes, we need to go over some string manipulation techniques.

For example, we can take a string and split it. By default, they will be split on the whitespace (like spaces and new lines)

In [ ]:
s = 'you have as many hours in a day as beyonce'
s.split()

If you don't want to split on whitespace, you can specify a specific letter or character to split on, too.

In [ ]:
s.split('d')

Exercise: What happened to the letter "d"? Write a complete sentence below.

Complete sentence = it has a subject, object and a verb.

Exercise: Split the string s above on the letter "a"

In [ ]:
# Code to split `s` goes here

The result of s.split() is a list, which is a special name in Python. Look, it even comes in a special color, different from black, so you can see how special it is:

In [ ]:
list

We can access elements from the split using a number in square brackets.

In [ ]:
s.split()[4]

Exercise: Get the element 'day' from s.split()

In [ ]:
# Code to get "day" from s.split() goes here

What if we want to access mutiple items at once? We can use the colon ":" to indicate we want everything up to (but not including) the Nth item. For example, if we want the first 5 words, we can do:

In [ ]:
s.split()[:5]
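The colon notation is more general than just "the first N items": you can give both a start and a stop index, and you'll get everything from the start up to (but not including) the stop. A quick sketch with our example string:

In [ ]:
```python
words = 'you have as many hours in a day as beyonce'.split()
print(words[:5])   # first 5 words (indices 0 through 4)
print(words[2:5])  # words at indices 2, 3, and 4 - the stop index is not included
print(words[5:])   # everything from index 5 to the end
```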

Exercise: Split the string s on "a", and get the first 3 elements

In [ ]:
# Code to split `s` on "a", and get the first 3 elements

Now we're able to split a string, but what if we want to put it back together? We can join the results.

In [ ]:
' '.join(s.split()[:5])

Notice that we used a space to join the words. We could have used any character to join them (as well as any character to split):

In [ ]:
'!'.join(s.split('e')[:4])
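One way to convince yourself that join and split are opposites: if you split on a character and then join on that same character, you get the original string back.

In [ ]:
```python
s = 'you have as many hours in a day as beyonce'
roundtrip = ' '.join(s.split(' '))
print(roundtrip == s)  # joining on the same character undoes the split
```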

Exercise: Split the string s on 3 different characters, and join on 3 different characters.

In [ ]:
# Code for split 1, join 1 goes here
In [ ]:
# Code for split 2, join 2 goes here
In [ ]:
# Code for split 3, join 3 goes here

Back to dataframes!

We can use what we just learned about manipulating strings on columns of a dataframe, using lambda, which allows us to create small, unnamed functions. For example, if we wanted to take the column 'SUBJID' from the dataframe subject_phenotypes, split every item on the dash character '-', and get the first item, we would do this:

In [ ]:
subject_phenotypes['SUBJID'].map(lambda x: x.split('-')[0])
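If lambda looks mysterious, it may help to know it's just a compact way to write a small function. Here's a sketch showing an ordinary named function that does the same thing as the lambda above; the example ID 'GTEX-ABC123' is made up, but it has the same dash-separated shape as the real subject IDs.

In [ ]:
```python
def first_piece(x):
    """Same as `lambda x: x.split('-')[0]`, just with a name."""
    return x.split('-')[0]

# Both of these do the same thing to each item (example ID is hypothetical):
print(first_piece('GTEX-ABC123'))
print((lambda x: x.split('-')[0])('GTEX-ABC123'))
```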

Exercise: Split the column "Type of nucleic acid isolation batch" in sample_attributes on whitespace, and get the 3rd element.

In [ ]:
# Code to split the column 'Type of nucleic acid isolation batch' on whitespace and get the third element goes here

The data may not have a string in every element of a column. For example, this code produces an AttributeError.

In [ ]:
sample_attributes['Tissue Type, more specific detail of tissue type'].map(lambda x: x.split('-')[0])

The above code produces the error,

<ipython-input-157-6714d86271d7> in <lambda>(x)
----> 1 sample_attributes['Tissue Type, more specific detail of tissue type'].map(lambda x: x.split('-')[0])

AttributeError: 'float' object has no attribute 'split'

This happens because instead of a nice string, there is a float in that position, and floats don't know how to be split. Why a float? Pandas represents missing values (NAs) as floats, so a float in a string column indicates that there's an NA there.
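You can verify this yourself: pandas uses NumPy's nan ("not a number") to mark missing values, and nan really is a float, not a string.

In [ ]:
```python
import numpy as np

print(type(np.nan))             # nan is a float...
print(isinstance(np.nan, str))  # ...and not a string, which is why .split() fails on it
```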

To deal with this, we can add an if statement to our lambda. We will use the function isinstance to check whether x is a string (the Python name for the string type is str), and if it isn't, replace it with an NA using np.nan from the numpy library (which we imported under the shorthand np).

In [ ]:
sample_attributes['Tissue Type, more specific detail of tissue type'].map(lambda x: x.split('-')[0] if isinstance(x, str) else np.nan)

Exercise: Split the column 'Code for BSS collection site' on a comma "," and get the first two items of the split, accounting for NAs

In [ ]:
# Code goes here

In addition to splitting elements, we can also join within the map/lambda combo!

In [ ]:
sample_attributes['Tissue Type, more specific detail of tissue type'].map(lambda x: '_'.join(x.split()[:3]) 
                                                                          if isinstance(x, str) else np.nan)

Exercise: Split items in the column "Type of nucleic acid isolation batch" in sample_attributes on underscores, and join the first two elements using a space

In [ ]:
# Code goes here

Now we have all the tools to add a "subject_id" column to sample_attributes! Remember, we want to create a column which has exactly the same entries as the column "SUBJID" in subject_phenotypes. What was that again? It's been so long that I forgot what those IDs look like. To remind yourself what the subject IDs in subject_phenotypes and sample ids in sample_attributes look like, take a look at the top of each of those dataframes.

In [ ]:
# Code to look at the top of `subject_phenotypes`
In [ ]:
# Code to look at the top of `sample_attributes`

Exercise: Add a column to sample_attributes called "subject_id", using one of its existing columns, that matches the "SUBJID" in subject_phenotypes exactly

In [ ]:
# Code goes here

Merging dataframes

Excellent! Now we have a column which exactly matches the rows of subject_phenotypes to the rows of sample_attributes. Next, we want to merge these two dataframes together. How do we do that? We will use merge, which is a method of the dataframe. Merge is a little complicated, so let's break it down with a few examples.

First, we'll create a couple example dataframes.

In [ ]:
dataframe1 = pd.DataFrame([['cucumber', 'watery'], ['broccoli', 'crunchy'], ['kale', 'chewy'], 
                           ['mango', 'sweet'] ], columns=['vegetable', 'description'])
dataframe1
In [ ]:
dataframe2 = pd.DataFrame([['broccoli', 'harvested', 8], ['broccoli', 'planted', 5],
                           ['kale', 'planted', 6], ['kale', 'harvested', 9],
                           ['cucumber', 'harvested', 7], ['cucumber', 'planted', 4],
                           ['strawberry', 'planted', 10], ['strawberry', 'harvested', 2]], 
                          columns=['crop', 'action', 'number'])
dataframe2

We want to merge these two dataframes on their common column, which is "vegetable" in dataframe1 and "crop" in dataframe2. We can do this using merge, specifying what we want to merge the left and right dataframes on.

In [ ]:
dataframe1.merge(dataframe2, left_on='vegetable', right_on='crop')

Exercise: What happened to the row "cucumber"? What about to "mango" and "strawberry"? Why?

Use "Shift"+"Tab" to read the documentation behind merge. What's the default way that two dataframes are merged?

Exercise: Merge dataframe1 and dataframe2, but use dataframe2 as the "left" dataframe, and merge using "outer"

In [ ]:
# Code goes here

Now we're ready to try this with real data! Let's go back to our sample_attributes and subject_phenotypes dataframes. To recap, we added a column to sample_attributes to match up with subject_phenotypes. Now, merge the two dataframes together using their columns with common values.

Exercise: Merge sample_attributes and subject_phenotypes on their common column.

In [ ]:
# Code goes here

Congratulations! You have now performed a DATABASE MERGE!!! Now you can't be afraid of databases! Mwahahahah!

All a "database" really is, is a bunch of tables linked together by certain "keys", aka the values in the columns. What you've done today is manipulate some tables (the building blocks of a database), changing column names and adding columns so they're mergeable, and merging them together.

Concepts from today (there's a lot!):

  • Unix
    • Using head and tail to look at the beginnings and ends of files
      • Using -n N to modify the number of lines output by head and tail
    • Using wc -l to count the number of lines in a file
    • Searching the web and finding millions of results for a seemingly simple Unix question
    • Finding a method to count the number of columns in a file
  • Python
    • Pandas
      • Reading a tabular file using pandas, specifically pd.read_table
      • Using .head() and .tail() to look at the tops and bottoms of dataframes
        • Using .head(N) and .tail(N) to look at the top and bottom N rows
      • Accessing columns in pandas DataFrames
      • Python dictionaries as a way of mapping one item to another
      • Using map instead of for-loops to operate on every item of a column
      • Creating new columns in pandas DataFrames
        • Creating new columns as a result of operating on other columns
      • Searching the web for help with pandas
      • Reading an Excel file using pandas
      • Setting one of the columns as the index (aka row names) when you read in the file
      • A column in a DataFrame can be used as a mapping from the row name to the item in the column
      • Renaming column names in one table based on a mapping
    • String operations
      • A string is anything between quotes
      • Strings can be split on any characters
      • The result of a split is a list
      • Lists (and everything else in Python) start counting from 0 (aka "0-based")
      • Get individual elements of a list using square brackets and a number, e.g. [3] shows the 4th element
      • Access the first N elements of a list using square brackets, a colon, and the number, e.g. [:5] shows up to, but not including, the 6th element
      • Strings can be glued together using join
        • The join can be on any character
    • Pandas
      • Use lambda to create an "anonymous" function to use within map
        • Use lambda to split and join strings within a column
      • NAs are of type float
      • To check if a thing is of a certain type, use isinstance
      • A lambda can contain an if statement for alternative outputs
        • But it must also contain an else statement as well
      • Create a new column by combining map and lambda to do a complicated operation on each item of the column
      • Two dataframes can be merged together if they have columns with the same elements
        • For a merge, need to specify the columns to merge on in both dataframes
      • Shift-tab to read documentation for a function
      • Reading documentation is fun!
  • Databases are just tables!