You've run STAR to align and create a Sequence Alignment Map (SAM) file, samtools
to sort and index the SAM file and turn it into a binary SAM file (BAM), featureCounts
to count the number of reads of features, and kallisto
to perform all of the above using quasi-alignment and quantification.
The purpose of this homework is to compare the gene expression quantification strategies (we'll call them "Align-Count" and "Quasialign-Count")
We'll be using five additional libraries in Python:
numpy
- (pronounced "num-pie") which is basis for most scientific packages. It's basically a nice-looking Python interface to C code. It's very fast.pandas
- This is the "DataFrames in Python." (like R's nice dataframes) They're a super convenient form that's based on numpy
so they're fast. And you can do convenient things like calculate mea n and variance very easily.matplotlib
- This is the base plotting library in Python.scipy
- (pronounced "sigh-pie") Containsseaborn
- Statistical plotting library. To be completely honest, R's plotting and graphics capabilities are much better than Python's. However, Python is a really nice langauge to learn and use, it's very memory efficient, can be parallized well, and has a very robust machine learning library, scikit-learn
, which has a very nice and consistent interface. So this is Python's answer to ggplot2
(very popular R library for plotting) to try and make plotting in Python nicer looking and to make statistical plots easier to do.# We're doing "import superlongname as abbrev" for our laziness - this way we don't have to type out the whole thing each time.
# Numerical python library (pronounced "num-pie")
import numpy as np
# Dataframes in Python
import pandas as pd
# Python plotting library
import matplotlib.pyplot as plt
# Statistical plotting library we'll use
import seaborn as sns
sns.set(style='ticks', context='notebook')
# This is necessary to show the plotted figures inside the notebook -- "inline" with the notebook cells
%matplotlib inline
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter. warnings.warn(self.msg_depr % (key, alt_key))
Change directories to where we have our processed data
cd ~/projects/shalek2013/processed_data
[Errno 2] No such file or directory: '/Users/olga/projects/shalek2013/processed_data' /Users/olga/workspace-git/biom262-2016-homework/source/week04
Read the S10 sample kallisto file. The index_col="target_id"
tells us that column named "target_id" should be used as the "index" aka the row names. The head()
command helps us look at the top of the dataframe and shows the first 5 rows by default.
s10_kallisto = pd.read_table('S10_kallisto/abundance.tsv', index_col='target_id')
s10_kallisto.head()
length | eff_length | est_counts | tpm | |
---|---|---|---|---|
target_id | ||||
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634| | 3634 | 3389.58 | 0 | 0 |
ENSMUST00000194992.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127194.1|Rp1-002|Rp1|3047|UTR5:1-54|CDS:55-912|UTR3:913-3047| | 3047 | 2802.58 | 0 | 0 |
ENSMUST00000027032.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127195.2|Rp1-001|Rp1|6869|UTR5:1-127|CDS:128-6415|UTR3:6416-6869| | 6869 | 6624.58 | 0 | 0 |
ENSMUST00000027035.9|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127245.2|Sox17-001|Sox17|3127|UTR5:1-1082|CDS:1083-2342|UTR3:2343-3127| | 3127 | 2882.58 | 0 | 0 |
ENSMUST00000195555.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127249.1|Sox17-005|Sox17|1977|UTR5:1-635|CDS:636-1511|UTR3:1512-1977| | 1977 | 1732.58 | 0 | 0 |
You can also use "tail
" to show the last few rows: (also 5 by default)
s10_kallisto.tail()
length | eff_length | est_counts | tpm | |
---|---|---|---|---|
target_id | ||||
ENSMUST00000084013.1|ENSMUSG00000065947.3|-|-|mt-Nd4l-201|mt-Nd4l|297|CDS:1-297| | 297 | 70.8467 | 2438 | 1204.930 |
ENSMUST00000082414.1|ENSMUSG00000064363.1|-|-|mt-Nd4-201|mt-Nd4|1378|CDS:1-1378| | 1378 | 1133.5800 | 43114 | 1331.720 |
ENSMUST00000082418.1|ENSMUSG00000064367.1|-|-|mt-Nd5-201|mt-Nd5|1824|CDS:1-1824| | 1824 | 1579.5800 | 44220 | 980.222 |
ENSMUST00000082419.1|ENSMUSG00000064368.1|-|-|mt-Nd6-201|mt-Nd6|519|CDS:1-519| | 519 | 274.6980 | 4520 | 576.142 |
ENSMUST00000082421.1|ENSMUSG00000064370.1|-|-|mt-Cytb-201|mt-Cytb|1144|CDS:1-1144| | 1144 | 899.5790 | 120234 | 4679.880 |
.head()
¶Show the first 17 rows of s10_kallisto
.
### BEGIN SOLUTION
s10_kallisto.head(17)
### END SOLUTION
length | eff_length | est_counts | tpm | |
---|---|---|---|---|
target_id | ||||
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634| | 3634 | 3389.580 | 0.00000 | 0.000000 |
ENSMUST00000194992.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127194.1|Rp1-002|Rp1|3047|UTR5:1-54|CDS:55-912|UTR3:913-3047| | 3047 | 2802.580 | 0.00000 | 0.000000 |
ENSMUST00000027032.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127195.2|Rp1-001|Rp1|6869|UTR5:1-127|CDS:128-6415|UTR3:6416-6869| | 6869 | 6624.580 | 0.00000 | 0.000000 |
ENSMUST00000027035.9|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127245.2|Sox17-001|Sox17|3127|UTR5:1-1082|CDS:1083-2342|UTR3:2343-3127| | 3127 | 2882.580 | 0.00000 | 0.000000 |
ENSMUST00000195555.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127249.1|Sox17-005|Sox17|1977|UTR5:1-635|CDS:636-1511|UTR3:1512-1977| | 1977 | 1732.580 | 0.00000 | 0.000000 |
ENSMUST00000192650.5|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127247.2|Sox17-004|Sox17|3242|UTR5:1-1851|CDS:1852-2916|UTR3:2917-3242| | 3242 | 2997.580 | 0.00000 | 0.000000 |
ENSMUST00000116652.7|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127246.1|Sox17-002|Sox17|1512|UTR5:1-249|CDS:250-1509|UTR3:1510-1512| | 1512 | 1267.580 | 0.00000 | 0.000000 |
ENSMUST00000191647.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127267.2|Sox17-007|Sox17|406|UTR5:1-83|CDS:84-406| | 406 | 163.270 | 0.00000 | 0.000000 |
ENSMUST00000191939.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127266.2|Sox17-006|Sox17|840|UTR5:1-329|CDS:330-840| | 840 | 595.579 | 0.00000 | 0.000000 |
ENSMUST00000192913.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127248.2|Sox17-003|Sox17|1506|UTR5:1-997|CDS:998-1506| | 1506 | 1261.580 | 0.00000 | 0.000000 |
ENSMUST00000130201.7|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072660.1|Mrpl15-002|Mrpl15|1894|UTR5:1-33|CDS:34-648|UTR3:649-1894| | 1894 | 1649.580 | 10.00000 | 0.212263 |
ENSMUST00000156816.6|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072659.1|Mrpl15-001|Mrpl15|4203|UTR5:1-62|CDS:63-950|UTR3:951-4203| | 4203 | 3958.580 | 0.00000 | 0.000000 |
ENSMUST00000045689.13|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072661.1|Mrpl15-003|Mrpl15|497|UTR5:1-21|CDS:22-180|UTR3:181-497| | 497 | 252.804 | 0.00000 | 0.000000 |
ENSMUST00000146665.2|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072662.2|Mrpl15-004|Mrpl15|1569|UTR5:1-62|CDS:63-569|UTR3:570-1569| | 1569 | 1324.580 | 0.00000 | 0.000000 |
ENSMUST00000134384.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051163.2|Lypla1-002|Lypla1|1136|UTR5:1-126|CDS:127-801|UTR3:802-1136| | 1136 | 891.579 | 5.42556 | 0.213075 |
ENSMUST00000027036.10|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051162.1|Lypla1-001|Lypla1|2507|UTR5:1-91|CDS:92-784|UTR3:785-2507| | 2507 | 2262.580 | 518.90900 | 8.030360 |
ENSMUST00000150971.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051164.3|Lypla1-003|Lypla1|877|UTR5:1-84|CDS:85-750|UTR3:751-877| | 877 | 632.579 | 0.00000 | 0.000000 |
# "_" is the previous output
assert _.index.shape == (17,)
We can see the number of rows and columns using .shape
, which shows (nrows, ncols):
s10_kallisto.shape
(56504, 4)
This seems like a ton of genes, but don't worry, we'll filter on this.
Read the S13 sample kallisto file.
s13_kallisto = pd.read_table('S13_kallisto/abundance.tsv', index_col='target_id')
s13_kallisto.head()
length | eff_length | est_counts | tpm | |
---|---|---|---|---|
target_id | ||||
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634| | 3634 | 3395.24 | 2 | 0.03967 |
ENSMUST00000194992.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127194.1|Rp1-002|Rp1|3047|UTR5:1-54|CDS:55-912|UTR3:913-3047| | 3047 | 2808.24 | 0 | 0.00000 |
ENSMUST00000027032.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127195.2|Rp1-001|Rp1|6869|UTR5:1-127|CDS:128-6415|UTR3:6416-6869| | 6869 | 6630.24 | 0 | 0.00000 |
ENSMUST00000027035.9|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127245.2|Sox17-001|Sox17|3127|UTR5:1-1082|CDS:1083-2342|UTR3:2343-3127| | 3127 | 2888.24 | 0 | 0.00000 |
ENSMUST00000195555.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127249.1|Sox17-005|Sox17|1977|UTR5:1-635|CDS:636-1511|UTR3:1512-1977| | 1977 | 1738.24 | 0 | 0.00000 |
Use shape
to show how many rows and columns are in s13_kallisto
### BEGIN SOLUTION
s13_kallisto.shape
### END SOLUTION
(56504, 4)
# "_" is the previous output
assert _ == (56504, 4)
Let's plot their correlation to each other using jointplot
in seaborn:
sns.jointplot(s10_kallisto['tpm'], s13_kallisto['tpm'])
<seaborn.axisgrid.JointGrid at 0x10e528160>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
Oh right -- we have expression data and the scales are enormous... Notice the 200,000 maximum on the y-scale.
Let's add 1 to all values and take the log2 of the data. We add one because log(0) is undefined and then all our logged values start from zero too. This
"$\log_2(TPM + 1)$"
is a very common transformation of expression data so it's easier to analyze.
To do that, we'll create a new column in each of the s10_kallisto
and s13_kallisto
dataframes using the existing data. Here's an example of doing something similar but adding 1000.
s10_kallisto['log2_tpm1000'] = np.log2(s10_kallisto['tpm']+10000)
s10_kallisto.head()
length | eff_length | est_counts | tpm | log2_tpm1000 | |
---|---|---|---|---|---|
target_id | |||||
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634| | 3634 | 3389.58 | 0 | 0 | 13.287712 |
ENSMUST00000194992.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127194.1|Rp1-002|Rp1|3047|UTR5:1-54|CDS:55-912|UTR3:913-3047| | 3047 | 2802.58 | 0 | 0 | 13.287712 |
ENSMUST00000027032.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127195.2|Rp1-001|Rp1|6869|UTR5:1-127|CDS:128-6415|UTR3:6416-6869| | 6869 | 6624.58 | 0 | 0 | 13.287712 |
ENSMUST00000027035.9|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127245.2|Sox17-001|Sox17|3127|UTR5:1-1082|CDS:1083-2342|UTR3:2343-3127| | 3127 | 2882.58 | 0 | 0 | 13.287712 |
ENSMUST00000195555.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127249.1|Sox17-005|Sox17|1977|UTR5:1-635|CDS:636-1511|UTR3:1512-1977| | 1977 | 1732.58 | 0 | 0 | 13.287712 |
s13_kallisto
¶### BEGIN SOLUTION
s13_kallisto['log2_tpm'] = np.log2(s13_kallisto['tpm']+1)
s10_kallisto['log2_tpm'] = np.log2(s10_kallisto['tpm']+1)
### END SOLUTION
s13_kallisto.head()
length | eff_length | est_counts | tpm | log2_tpm | |
---|---|---|---|---|---|
target_id | |||||
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634| | 3634 | 3395.24 | 2 | 0.03967 | 0.056126 |
ENSMUST00000194992.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127194.1|Rp1-002|Rp1|3047|UTR5:1-54|CDS:55-912|UTR3:913-3047| | 3047 | 2808.24 | 0 | 0.00000 | 0.000000 |
ENSMUST00000027032.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127195.2|Rp1-001|Rp1|6869|UTR5:1-127|CDS:128-6415|UTR3:6416-6869| | 6869 | 6630.24 | 0 | 0.00000 | 0.000000 |
ENSMUST00000027035.9|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127245.2|Sox17-001|Sox17|3127|UTR5:1-1082|CDS:1083-2342|UTR3:2343-3127| | 3127 | 2888.24 | 0 | 0.00000 | 0.000000 |
ENSMUST00000195555.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127249.1|Sox17-005|Sox17|1977|UTR5:1-635|CDS:636-1511|UTR3:1512-1977| | 1977 | 1738.24 | 0 | 0.00000 | 0.000000 |
gene_id = 'ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634|'
assert s13_kallisto.loc[gene_id, 'log2_tpm'] == 0.056126371311968043
Use sns.jointplot
to plot the correlation between the logged TPMs we just made.
### BEGIN SOLUTION
sns.jointplot(s10_kallisto['log2_tpm'], s13_kallisto['log2_tpm'])
### END SOLUTION
<seaborn.axisgrid.JointGrid at 0x10d679048>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
assert isinstance(_, sns.axisgrid.JointGrid)
Interesting, we have quite a bit of correlation! What if we used rank-based, non-linear correlation such as spearman? We can specify a different statistical function with stat_func=
and specifying spearmanr
from scipy.stats
.
from scipy.stats import spearmanr
sns.jointplot(s10_kallisto['log2_tpm'], s13_kallisto['log2_tpm'], stat_func=spearmanr)
<seaborn.axisgrid.JointGrid at 0x10f136438>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
We'll now create a dataframe containing the two columns ("Series") of the separate s10_kallisto
and s13_kallisto
columns, using pd.concat
to concatenate the two series, and rename them so the names are s13
and s10
, using the keys=['s10', 's13']
. The axis=1
means to glue along the columns (axis=0 is rows, axis=1 is columns), so that we stack horizontally, versus vertically. Otherwise we'd get a really tall series that can't tell the difference between s10 and s13.
kallisto_log2_tpm = pd.concat([s10_kallisto['log2_tpm'], s13_kallisto['log2_tpm']], axis=1, keys=['s10', 's13'])
kallisto_log2_tpm.head()
s10 | s13 | |
---|---|---|
target_id | ||
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634| | 0 | 0.056126 |
ENSMUST00000194992.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127194.1|Rp1-002|Rp1|3047|UTR5:1-54|CDS:55-912|UTR3:913-3047| | 0 | 0.000000 |
ENSMUST00000027032.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127195.2|Rp1-001|Rp1|6869|UTR5:1-127|CDS:128-6415|UTR3:6416-6869| | 0 | 0.000000 |
ENSMUST00000027035.9|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127245.2|Sox17-001|Sox17|3127|UTR5:1-1082|CDS:1083-2342|UTR3:2343-3127| | 0 | 0.000000 |
ENSMUST00000195555.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127249.1|Sox17-005|Sox17|1977|UTR5:1-635|CDS:636-1511|UTR3:1512-1977| | 0 | 0.000000 |
So we have a ton of genes where the expression is near zero. This is not that helpful so let's only use genes with expression greater than one in at least one sample. We'll do this using the boolean (True/False) matrix we get from asking "kallisto_log2_tpm > 1
":
kallisto_log2_tpm > 1
s10 | s13 | |
---|---|---|
target_id | ||
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634| | False | False |
ENSMUST00000194992.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127194.1|Rp1-002|Rp1|3047|UTR5:1-54|CDS:55-912|UTR3:913-3047| | False | False |
ENSMUST00000027032.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127195.2|Rp1-001|Rp1|6869|UTR5:1-127|CDS:128-6415|UTR3:6416-6869| | False | False |
ENSMUST00000027035.9|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127245.2|Sox17-001|Sox17|3127|UTR5:1-1082|CDS:1083-2342|UTR3:2343-3127| | False | False |
ENSMUST00000195555.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127249.1|Sox17-005|Sox17|1977|UTR5:1-635|CDS:636-1511|UTR3:1512-1977| | False | False |
ENSMUST00000192650.5|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127247.2|Sox17-004|Sox17|3242|UTR5:1-1851|CDS:1852-2916|UTR3:2917-3242| | False | False |
ENSMUST00000116652.7|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127246.1|Sox17-002|Sox17|1512|UTR5:1-249|CDS:250-1509|UTR3:1510-1512| | False | False |
ENSMUST00000191647.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127267.2|Sox17-007|Sox17|406|UTR5:1-83|CDS:84-406| | False | False |
ENSMUST00000191939.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127266.2|Sox17-006|Sox17|840|UTR5:1-329|CDS:330-840| | False | False |
ENSMUST00000192913.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127248.2|Sox17-003|Sox17|1506|UTR5:1-997|CDS:998-1506| | False | False |
ENSMUST00000130201.7|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072660.1|Mrpl15-002|Mrpl15|1894|UTR5:1-33|CDS:34-648|UTR3:649-1894| | False | False |
ENSMUST00000156816.6|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072659.1|Mrpl15-001|Mrpl15|4203|UTR5:1-62|CDS:63-950|UTR3:951-4203| | False | False |
ENSMUST00000045689.13|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072661.1|Mrpl15-003|Mrpl15|497|UTR5:1-21|CDS:22-180|UTR3:181-497| | False | False |
ENSMUST00000146665.2|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072662.2|Mrpl15-004|Mrpl15|1569|UTR5:1-62|CDS:63-569|UTR3:570-1569| | False | True |
ENSMUST00000134384.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051163.2|Lypla1-002|Lypla1|1136|UTR5:1-126|CDS:127-801|UTR3:802-1136| | False | True |
ENSMUST00000027036.10|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051162.1|Lypla1-001|Lypla1|2507|UTR5:1-91|CDS:92-784|UTR3:785-2507| | True | True |
ENSMUST00000150971.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051164.3|Lypla1-003|Lypla1|877|UTR5:1-84|CDS:85-750|UTR3:751-877| | False | False |
ENSMUST00000119612.8|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051170.2|Lypla1-009|Lypla1|529|UTR5:1-18|CDS:19-297|UTR3:298-529| | False | False |
ENSMUST00000137887.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051168.2|Lypla1-007|Lypla1|444|UTR5:1-16|CDS:17-444| | False | False |
ENSMUST00000115529.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051169.1|Lypla1-008|Lypla1|930|UTR5:1-3|CDS:4-594|UTR3:595-930| | False | False |
ENSMUST00000131119.1|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051167.3|Lypla1-006|Lypla1|660|UTR5:1-234|CDS:235-660| | False | False |
ENSMUST00000155020.1|ENSMUSG00000104217.1|OTTMUSG00000050100.1|OTTMUST00000127419.1|Gm37988-001|Gm37988|825|UTR5:1-22|CDS:23-211|UTR3:212-825| | False | False |
ENSMUST00000081551.13|ENSMUSG00000033813.15|OTTMUSG00000042348.1|OTTMUST00000111602.1|Tcea1-001|Tcea1|2547|UTR5:1-100|CDS:101-1006|UTR3:1007-2547| | True | True |
ENSMUST00000165720.2|ENSMUSG00000033813.15|OTTMUSG00000042348.1|OTTMUST00000111603.1|Tcea1-002|Tcea1|2854|UTR5:1-370|CDS:371-1309|UTR3:1310-2854| | False | False |
ENSMUST00000002533.14|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072687.1|Rgs20-002|Rgs20|1778|UTR5:1-160|CDS:161-880|UTR3:881-1778| | True | True |
ENSMUST00000118000.7|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072688.2|Rgs20-001|Rgs20|2125|UTR5:1-108|CDS:109-1227|UTR3:1228-2125| | False | False |
ENSMUST00000119256.7|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072706.2|Rgs20-005|Rgs20|883|UTR5:1-184|CDS:185-811|UTR3:812-883| | False | False |
ENSMUST00000170566.1|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000092053.1|Rgs20-007|Rgs20|577|CDS:1-120|UTR3:121-577| | False | False |
ENSMUST00000147158.1|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072705.2|Rgs20-004|Rgs20|707|UTR5:1-105|CDS:106-707| | False | False |
ENSMUST00000192847.5|ENSMUSG00000033793.12|OTTMUSG00000050145.9|OTTMUST00000127492.1|Atp6v1h-003|Atp6v1h|1662|UTR5:1-161|CDS:162-1487|UTR3:1488-1662| | True | False |
... | ... | ... |
ENSMUST00000178889.1|ENSMUSG00000095650.2|OTTMUSG00000042966.1|-|Gm20854-201|Gm20854|1002|UTR5:1-175|CDS:176-859|UTR3:860-1002| | False | False |
ENSMUST00000181549.1|ENSMUSG00000095650.2|OTTMUSG00000042966.1|OTTMUST00000112802.1|Gm20854-001|Gm20854|1156|UTR5:1-335|CDS:336-1019|UTR3:1020-1156| | False | False |
ENSMUST00000188754.1|ENSMUSG00000100240.1|OTTMUSG00000047031.1|OTTMUST00000121805.1|Gm20820-001|Gm20820|921|UTR5:1-58|CDS:59-727|UTR3:728-921| | False | False |
ENSMUST00000189543.6|ENSMUSG00000094399.7|OTTMUSG00000047083.1|OTTMUST00000121876.1|Gm21477-001|Gm21477|924|UTR5:1-58|CDS:59-727|UTR3:728-924| | False | False |
ENSMUST00000179970.1|ENSMUSG00000094399.7|OTTMUSG00000047083.1|-|Gm21477-201|Gm21477|498|CDS:1-498| | False | False |
ENSMUST00000186493.1|ENSMUSG00000099856.1|OTTMUSG00000047138.1|OTTMUST00000121968.1|Gm20906-001|Gm20906|923|UTR5:1-58|CDS:59-727|UTR3:728-923| | False | False |
ENSMUST00000187146.1|ENSMUSG00000101915.1|OTTMUSG00000047149.1|OTTMUST00000121980.1|Gm28102-001|Gm28102|924|UTR5:1-58|CDS:59-727|UTR3:728-924| | False | False |
ENSMUST00000186443.1|ENSMUSG00000102045.1|OTTMUSG00000047309.1|OTTMUST00000122306.1|Gm21294-001|Gm21294|934|UTR5:1-73|CDS:74-742|UTR3:743-934| | False | False |
ENSMUST00000188269.1|ENSMUSG00000100608.1|OTTMUSG00000047316.1|OTTMUST00000122318.1|Gm21996-001|Gm21996|931|UTR5:1-58|CDS:59-727|UTR3:728-931| | False | False |
ENSMUST00000190558.6|ENSMUSG00000096178.7|OTTMUSG00000047352.1|OTTMUST00000122361.1|Gm20837-001|Gm20837|844|UTR5:1-73|CDS:74-640|UTR3:641-844| | False | False |
ENSMUST00000178446.1|ENSMUSG00000096178.7|OTTMUSG00000047352.1|-|Gm20837-201|Gm20837|498|CDS:1-498| | False | False |
ENSMUST00000177893.1|ENSMUSG00000095366.1|-|-|Gm21860-201|Gm21860|309|CDS:1-309| | False | False |
ENSMUST00000179483.7|ENSMUSG00000096768.7|-|-|Erdr1-204|Erdr1|688|UTR5:1-70|CDS:71-688| | False | False |
ENSMUST00000177591.1|ENSMUSG00000096768.7|-|-|Erdr1-201|Erdr1|774|UTR5:1-229|CDS:230-757|UTR3:758-774| | False | False |
ENSMUST00000177671.7|ENSMUSG00000096768.7|-|-|Erdr1-202|Erdr1|708|UTR5:1-159|CDS:160-528|UTR3:529-708| | False | False |
ENSMUST00000179077.1|ENSMUSG00000096768.7|-|-|Erdr1-203|Erdr1|887|UTR5:1-74|CDS:75-512|UTR3:513-887| | False | False |
ENSMUST00000179623.1|ENSMUSG00000096850.1|-|-|Gm21748-201|Gm21748|309|CDS:1-309| | False | False |
ENSMUST00000082392.1|ENSMUSG00000064341.1|-|-|mt-Nd1-201|mt-Nd1|957|CDS:1-957| | True | True |
ENSMUST00000082396.1|ENSMUSG00000064345.1|-|-|mt-Nd2-201|mt-Nd2|1038|CDS:1-1038| | True | True |
ENSMUST00000082402.1|ENSMUSG00000064351.1|-|-|mt-Co1-201|mt-Co1|1545|CDS:1-1545| | True | True |
ENSMUST00000082405.1|ENSMUSG00000064354.1|-|-|mt-Co2-201|mt-Co2|684|CDS:1-684| | True | True |
ENSMUST00000082407.1|ENSMUSG00000064356.3|-|-|mt-Atp8-201|mt-Atp8|204|CDS:1-204| | True | True |
ENSMUST00000082408.1|ENSMUSG00000064357.1|-|-|mt-Atp6-201|mt-Atp6|681|CDS:1-681| | True | True |
ENSMUST00000082409.1|ENSMUSG00000064358.1|-|-|mt-Co3-201|mt-Co3|784|CDS:1-784| | True | True |
ENSMUST00000082411.1|ENSMUSG00000064360.1|-|-|mt-Nd3-201|mt-Nd3|348|CDS:1-348| | True | True |
ENSMUST00000084013.1|ENSMUSG00000065947.3|-|-|mt-Nd4l-201|mt-Nd4l|297|CDS:1-297| | True | True |
ENSMUST00000082414.1|ENSMUSG00000064363.1|-|-|mt-Nd4-201|mt-Nd4|1378|CDS:1-1378| | True | True |
ENSMUST00000082418.1|ENSMUSG00000064367.1|-|-|mt-Nd5-201|mt-Nd5|1824|CDS:1-1824| | True | True |
ENSMUST00000082419.1|ENSMUSG00000064368.1|-|-|mt-Nd6-201|mt-Nd6|519|CDS:1-519| | True | True |
ENSMUST00000082421.1|ENSMUSG00000064370.1|-|-|mt-Cytb-201|mt-Cytb|1144|CDS:1-1144| | True | True |
56504 rows × 2 columns
False == 0
True
True == 1
True
If we use the convenient .sum()
function on the dataframe, we'll get the number of "expressed genes" (defining "Expressed genes" as genes with log2(TPM+1) greater than 1) per sample:
(kallisto_log2_tpm > 1).sum()
s10 6121 s13 6380 dtype: int64
If we sum with axis=1
, then we get the number of samples with expression greater than one for each gene.
(kallisto_log2_tpm > 1).sum(axis=1)
target_id ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634| 0 ENSMUST00000194992.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127194.1|Rp1-002|Rp1|3047|UTR5:1-54|CDS:55-912|UTR3:913-3047| 0 ENSMUST00000027032.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127195.2|Rp1-001|Rp1|6869|UTR5:1-127|CDS:128-6415|UTR3:6416-6869| 0 ENSMUST00000027035.9|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127245.2|Sox17-001|Sox17|3127|UTR5:1-1082|CDS:1083-2342|UTR3:2343-3127| 0 ENSMUST00000195555.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127249.1|Sox17-005|Sox17|1977|UTR5:1-635|CDS:636-1511|UTR3:1512-1977| 0 ENSMUST00000192650.5|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127247.2|Sox17-004|Sox17|3242|UTR5:1-1851|CDS:1852-2916|UTR3:2917-3242| 0 ENSMUST00000116652.7|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127246.1|Sox17-002|Sox17|1512|UTR5:1-249|CDS:250-1509|UTR3:1510-1512| 0 ENSMUST00000191647.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127267.2|Sox17-007|Sox17|406|UTR5:1-83|CDS:84-406| 0 ENSMUST00000191939.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127266.2|Sox17-006|Sox17|840|UTR5:1-329|CDS:330-840| 0 ENSMUST00000192913.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127248.2|Sox17-003|Sox17|1506|UTR5:1-997|CDS:998-1506| 0 ENSMUST00000130201.7|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072660.1|Mrpl15-002|Mrpl15|1894|UTR5:1-33|CDS:34-648|UTR3:649-1894| 0 ENSMUST00000156816.6|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072659.1|Mrpl15-001|Mrpl15|4203|UTR5:1-62|CDS:63-950|UTR3:951-4203| 0 ENSMUST00000045689.13|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072661.1|Mrpl15-003|Mrpl15|497|UTR5:1-21|CDS:22-180|UTR3:181-497| 0 ENSMUST00000146665.2|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072662.2|Mrpl15-004|Mrpl15|1569|UTR5:1-62|CDS:63-569|UTR3:570-1569| 1 ENSMUST00000134384.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051163.2|Lypla1-002|Lypla1|1136|UTR5:1-126|CDS:127-801|UTR3:802-1136| 1 ENSMUST00000027036.10|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051162.1|Lypla1-001|Lypla1|2507|UTR5:1-91|CDS:92-784|UTR3:785-2507| 2 ENSMUST00000150971.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051164.3|Lypla1-003|Lypla1|877|UTR5:1-84|CDS:85-750|UTR3:751-877| 0 ENSMUST00000119612.8|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051170.2|Lypla1-009|Lypla1|529|UTR5:1-18|CDS:19-297|UTR3:298-529| 0 ENSMUST00000137887.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051168.2|Lypla1-007|Lypla1|444|UTR5:1-16|CDS:17-444| 0 ENSMUST00000115529.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051169.1|Lypla1-008|Lypla1|930|UTR5:1-3|CDS:4-594|UTR3:595-930| 0 ENSMUST00000131119.1|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051167.3|Lypla1-006|Lypla1|660|UTR5:1-234|CDS:235-660| 0 ENSMUST00000155020.1|ENSMUSG00000104217.1|OTTMUSG00000050100.1|OTTMUST00000127419.1|Gm37988-001|Gm37988|825|UTR5:1-22|CDS:23-211|UTR3:212-825| 0 ENSMUST00000081551.13|ENSMUSG00000033813.15|OTTMUSG00000042348.1|OTTMUST00000111602.1|Tcea1-001|Tcea1|2547|UTR5:1-100|CDS:101-1006|UTR3:1007-2547| 2 ENSMUST00000165720.2|ENSMUSG00000033813.15|OTTMUSG00000042348.1|OTTMUST00000111603.1|Tcea1-002|Tcea1|2854|UTR5:1-370|CDS:371-1309|UTR3:1310-2854| 0 ENSMUST00000002533.14|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072687.1|Rgs20-002|Rgs20|1778|UTR5:1-160|CDS:161-880|UTR3:881-1778| 2 ENSMUST00000118000.7|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072688.2|Rgs20-001|Rgs20|2125|UTR5:1-108|CDS:109-1227|UTR3:1228-2125| 0 ENSMUST00000119256.7|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072706.2|Rgs20-005|Rgs20|883|UTR5:1-184|CDS:185-811|UTR3:812-883| 0 ENSMUST00000170566.1|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000092053.1|Rgs20-007|Rgs20|577|CDS:1-120|UTR3:121-577| 0 ENSMUST00000147158.1|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072705.2|Rgs20-004|Rgs20|707|UTR5:1-105|CDS:106-707| 0 ENSMUST00000192847.5|ENSMUSG00000033793.12|OTTMUSG00000050145.9|OTTMUST00000127492.1|Atp6v1h-003|Atp6v1h|1662|UTR5:1-161|CDS:162-1487|UTR3:1488-1662| 1 .. ENSMUST00000178889.1|ENSMUSG00000095650.2|OTTMUSG00000042966.1|-|Gm20854-201|Gm20854|1002|UTR5:1-175|CDS:176-859|UTR3:860-1002| 0 ENSMUST00000181549.1|ENSMUSG00000095650.2|OTTMUSG00000042966.1|OTTMUST00000112802.1|Gm20854-001|Gm20854|1156|UTR5:1-335|CDS:336-1019|UTR3:1020-1156| 0 ENSMUST00000188754.1|ENSMUSG00000100240.1|OTTMUSG00000047031.1|OTTMUST00000121805.1|Gm20820-001|Gm20820|921|UTR5:1-58|CDS:59-727|UTR3:728-921| 0 ENSMUST00000189543.6|ENSMUSG00000094399.7|OTTMUSG00000047083.1|OTTMUST00000121876.1|Gm21477-001|Gm21477|924|UTR5:1-58|CDS:59-727|UTR3:728-924| 0 ENSMUST00000179970.1|ENSMUSG00000094399.7|OTTMUSG00000047083.1|-|Gm21477-201|Gm21477|498|CDS:1-498| 0 ENSMUST00000186493.1|ENSMUSG00000099856.1|OTTMUSG00000047138.1|OTTMUST00000121968.1|Gm20906-001|Gm20906|923|UTR5:1-58|CDS:59-727|UTR3:728-923| 0 ENSMUST00000187146.1|ENSMUSG00000101915.1|OTTMUSG00000047149.1|OTTMUST00000121980.1|Gm28102-001|Gm28102|924|UTR5:1-58|CDS:59-727|UTR3:728-924| 0 ENSMUST00000186443.1|ENSMUSG00000102045.1|OTTMUSG00000047309.1|OTTMUST00000122306.1|Gm21294-001|Gm21294|934|UTR5:1-73|CDS:74-742|UTR3:743-934| 0 ENSMUST00000188269.1|ENSMUSG00000100608.1|OTTMUSG00000047316.1|OTTMUST00000122318.1|Gm21996-001|Gm21996|931|UTR5:1-58|CDS:59-727|UTR3:728-931| 0 ENSMUST00000190558.6|ENSMUSG00000096178.7|OTTMUSG00000047352.1|OTTMUST00000122361.1|Gm20837-001|Gm20837|844|UTR5:1-73|CDS:74-640|UTR3:641-844| 0 ENSMUST00000178446.1|ENSMUSG00000096178.7|OTTMUSG00000047352.1|-|Gm20837-201|Gm20837|498|CDS:1-498| 0 ENSMUST00000177893.1|ENSMUSG00000095366.1|-|-|Gm21860-201|Gm21860|309|CDS:1-309| 0 ENSMUST00000179483.7|ENSMUSG00000096768.7|-|-|Erdr1-204|Erdr1|688|UTR5:1-70|CDS:71-688| 0 ENSMUST00000177591.1|ENSMUSG00000096768.7|-|-|Erdr1-201|Erdr1|774|UTR5:1-229|CDS:230-757|UTR3:758-774| 0 ENSMUST00000177671.7|ENSMUSG00000096768.7|-|-|Erdr1-202|Erdr1|708|UTR5:1-159|CDS:160-528|UTR3:529-708| 0 ENSMUST00000179077.1|ENSMUSG00000096768.7|-|-|Erdr1-203|Erdr1|887|UTR5:1-74|CDS:75-512|UTR3:513-887| 0 ENSMUST00000179623.1|ENSMUSG00000096850.1|-|-|Gm21748-201|Gm21748|309|CDS:1-309| 0 ENSMUST00000082392.1|ENSMUSG00000064341.1|-|-|mt-Nd1-201|mt-Nd1|957|CDS:1-957| 2 ENSMUST00000082396.1|ENSMUSG00000064345.1|-|-|mt-Nd2-201|mt-Nd2|1038|CDS:1-1038| 2 ENSMUST00000082402.1|ENSMUSG00000064351.1|-|-|mt-Co1-201|mt-Co1|1545|CDS:1-1545| 2 ENSMUST00000082405.1|ENSMUSG00000064354.1|-|-|mt-Co2-201|mt-Co2|684|CDS:1-684| 2 ENSMUST00000082407.1|ENSMUSG00000064356.3|-|-|mt-Atp8-201|mt-Atp8|204|CDS:1-204| 2 ENSMUST00000082408.1|ENSMUSG00000064357.1|-|-|mt-Atp6-201|mt-Atp6|681|CDS:1-681| 2 ENSMUST00000082409.1|ENSMUSG00000064358.1|-|-|mt-Co3-201|mt-Co3|784|CDS:1-784| 2 ENSMUST00000082411.1|ENSMUSG00000064360.1|-|-|mt-Nd3-201|mt-Nd3|348|CDS:1-348| 2 ENSMUST00000084013.1|ENSMUSG00000065947.3|-|-|mt-Nd4l-201|mt-Nd4l|297|CDS:1-297| 2 ENSMUST00000082414.1|ENSMUSG00000064363.1|-|-|mt-Nd4-201|mt-Nd4|1378|CDS:1-1378| 2 ENSMUST00000082418.1|ENSMUSG00000064367.1|-|-|mt-Nd5-201|mt-Nd5|1824|CDS:1-1824| 2 ENSMUST00000082419.1|ENSMUSG00000064368.1|-|-|mt-Nd6-201|mt-Nd6|519|CDS:1-519| 2 ENSMUST00000082421.1|ENSMUSG00000064370.1|-|-|mt-Cytb-201|mt-Cytb|1144|CDS:1-1144| 2 dtype: int64
We can now use this to get all genes expressed in at least one sample with ">= 1
"
(kallisto_log2_tpm > 1).sum(axis=1) >= 1
target_id ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|UTR5:1-150|CDS:151-2094|UTR3:2095-3634| False ENSMUST00000194992.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127194.1|Rp1-002|Rp1|3047|UTR5:1-54|CDS:55-912|UTR3:913-3047| False ENSMUST00000027032.5|ENSMUSG00000025900.10|OTTMUSG00000049985.2|OTTMUST00000127195.2|Rp1-001|Rp1|6869|UTR5:1-127|CDS:128-6415|UTR3:6416-6869| False ENSMUST00000027035.9|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127245.2|Sox17-001|Sox17|3127|UTR5:1-1082|CDS:1083-2342|UTR3:2343-3127| False ENSMUST00000195555.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127249.1|Sox17-005|Sox17|1977|UTR5:1-635|CDS:636-1511|UTR3:1512-1977| False ENSMUST00000192650.5|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127247.2|Sox17-004|Sox17|3242|UTR5:1-1851|CDS:1852-2916|UTR3:2917-3242| False ENSMUST00000116652.7|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127246.1|Sox17-002|Sox17|1512|UTR5:1-249|CDS:250-1509|UTR3:1510-1512| False ENSMUST00000191647.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127267.2|Sox17-007|Sox17|406|UTR5:1-83|CDS:84-406| False ENSMUST00000191939.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127266.2|Sox17-006|Sox17|840|UTR5:1-329|CDS:330-840| False ENSMUST00000192913.1|ENSMUSG00000025902.13|OTTMUSG00000050014.7|OTTMUST00000127248.2|Sox17-003|Sox17|1506|UTR5:1-997|CDS:998-1506| False ENSMUST00000130201.7|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072660.1|Mrpl15-002|Mrpl15|1894|UTR5:1-33|CDS:34-648|UTR3:649-1894| False ENSMUST00000156816.6|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072659.1|Mrpl15-001|Mrpl15|4203|UTR5:1-62|CDS:63-950|UTR3:951-4203| False ENSMUST00000045689.13|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072661.1|Mrpl15-003|Mrpl15|497|UTR5:1-21|CDS:22-180|UTR3:181-497| False ENSMUST00000146665.2|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072662.2|Mrpl15-004|Mrpl15|1569|UTR5:1-62|CDS:63-569|UTR3:570-1569| True ENSMUST00000134384.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051163.2|Lypla1-002|Lypla1|1136|UTR5:1-126|CDS:127-801|UTR3:802-1136| True ENSMUST00000027036.10|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051162.1|Lypla1-001|Lypla1|2507|UTR5:1-91|CDS:92-784|UTR3:785-2507| True ENSMUST00000150971.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051164.3|Lypla1-003|Lypla1|877|UTR5:1-84|CDS:85-750|UTR3:751-877| False ENSMUST00000119612.8|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051170.2|Lypla1-009|Lypla1|529|UTR5:1-18|CDS:19-297|UTR3:298-529| False ENSMUST00000137887.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051168.2|Lypla1-007|Lypla1|444|UTR5:1-16|CDS:17-444| False ENSMUST00000115529.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051169.1|Lypla1-008|Lypla1|930|UTR5:1-3|CDS:4-594|UTR3:595-930| False ENSMUST00000131119.1|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051167.3|Lypla1-006|Lypla1|660|UTR5:1-234|CDS:235-660| False ENSMUST00000155020.1|ENSMUSG00000104217.1|OTTMUSG00000050100.1|OTTMUST00000127419.1|Gm37988-001|Gm37988|825|UTR5:1-22|CDS:23-211|UTR3:212-825| False ENSMUST00000081551.13|ENSMUSG00000033813.15|OTTMUSG00000042348.1|OTTMUST00000111602.1|Tcea1-001|Tcea1|2547|UTR5:1-100|CDS:101-1006|UTR3:1007-2547| True ENSMUST00000165720.2|ENSMUSG00000033813.15|OTTMUSG00000042348.1|OTTMUST00000111603.1|Tcea1-002|Tcea1|2854|UTR5:1-370|CDS:371-1309|UTR3:1310-2854| False ENSMUST00000002533.14|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072687.1|Rgs20-002|Rgs20|1778|UTR5:1-160|CDS:161-880|UTR3:881-1778| True ENSMUST00000118000.7|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072688.2|Rgs20-001|Rgs20|2125|UTR5:1-108|CDS:109-1227|UTR3:1228-2125| False ENSMUST00000119256.7|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072706.2|Rgs20-005|Rgs20|883|UTR5:1-184|CDS:185-811|UTR3:812-883| False ENSMUST00000170566.1|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000092053.1|Rgs20-007|Rgs20|577|CDS:1-120|UTR3:121-577| False ENSMUST00000147158.1|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072705.2|Rgs20-004|Rgs20|707|UTR5:1-105|CDS:106-707| False ENSMUST00000192847.5|ENSMUSG00000033793.12|OTTMUSG00000050145.9|OTTMUST00000127492.1|Atp6v1h-003|Atp6v1h|1662|UTR5:1-161|CDS:162-1487|UTR3:1488-1662| True ... ENSMUST00000178889.1|ENSMUSG00000095650.2|OTTMUSG00000042966.1|-|Gm20854-201|Gm20854|1002|UTR5:1-175|CDS:176-859|UTR3:860-1002| False ENSMUST00000181549.1|ENSMUSG00000095650.2|OTTMUSG00000042966.1|OTTMUST00000112802.1|Gm20854-001|Gm20854|1156|UTR5:1-335|CDS:336-1019|UTR3:1020-1156| False ENSMUST00000188754.1|ENSMUSG00000100240.1|OTTMUSG00000047031.1|OTTMUST00000121805.1|Gm20820-001|Gm20820|921|UTR5:1-58|CDS:59-727|UTR3:728-921| False ENSMUST00000189543.6|ENSMUSG00000094399.7|OTTMUSG00000047083.1|OTTMUST00000121876.1|Gm21477-001|Gm21477|924|UTR5:1-58|CDS:59-727|UTR3:728-924| False ENSMUST00000179970.1|ENSMUSG00000094399.7|OTTMUSG00000047083.1|-|Gm21477-201|Gm21477|498|CDS:1-498| False ENSMUST00000186493.1|ENSMUSG00000099856.1|OTTMUSG00000047138.1|OTTMUST00000121968.1|Gm20906-001|Gm20906|923|UTR5:1-58|CDS:59-727|UTR3:728-923| False ENSMUST00000187146.1|ENSMUSG00000101915.1|OTTMUSG00000047149.1|OTTMUST00000121980.1|Gm28102-001|Gm28102|924|UTR5:1-58|CDS:59-727|UTR3:728-924| False ENSMUST00000186443.1|ENSMUSG00000102045.1|OTTMUSG00000047309.1|OTTMUST00000122306.1|Gm21294-001|Gm21294|934|UTR5:1-73|CDS:74-742|UTR3:743-934| False ENSMUST00000188269.1|ENSMUSG00000100608.1|OTTMUSG00000047316.1|OTTMUST00000122318.1|Gm21996-001|Gm21996|931|UTR5:1-58|CDS:59-727|UTR3:728-931| False ENSMUST00000190558.6|ENSMUSG00000096178.7|OTTMUSG00000047352.1|OTTMUST00000122361.1|Gm20837-001|Gm20837|844|UTR5:1-73|CDS:74-640|UTR3:641-844| False ENSMUST00000178446.1|ENSMUSG00000096178.7|OTTMUSG00000047352.1|-|Gm20837-201|Gm20837|498|CDS:1-498| False ENSMUST00000177893.1|ENSMUSG00000095366.1|-|-|Gm21860-201|Gm21860|309|CDS:1-309| False ENSMUST00000179483.7|ENSMUSG00000096768.7|-|-|Erdr1-204|Erdr1|688|UTR5:1-70|CDS:71-688| False ENSMUST00000177591.1|ENSMUSG00000096768.7|-|-|Erdr1-201|Erdr1|774|UTR5:1-229|CDS:230-757|UTR3:758-774| False ENSMUST00000177671.7|ENSMUSG00000096768.7|-|-|Erdr1-202|Erdr1|708|UTR5:1-159|CDS:160-528|UTR3:529-708| False ENSMUST00000179077.1|ENSMUSG00000096768.7|-|-|Erdr1-203|Erdr1|887|UTR5:1-74|CDS:75-512|UTR3:513-887| False ENSMUST00000179623.1|ENSMUSG00000096850.1|-|-|Gm21748-201|Gm21748|309|CDS:1-309| False ENSMUST00000082392.1|ENSMUSG00000064341.1|-|-|mt-Nd1-201|mt-Nd1|957|CDS:1-957| True ENSMUST00000082396.1|ENSMUSG00000064345.1|-|-|mt-Nd2-201|mt-Nd2|1038|CDS:1-1038| True ENSMUST00000082402.1|ENSMUSG00000064351.1|-|-|mt-Co1-201|mt-Co1|1545|CDS:1-1545| True ENSMUST00000082405.1|ENSMUSG00000064354.1|-|-|mt-Co2-201|mt-Co2|684|CDS:1-684| True ENSMUST00000082407.1|ENSMUSG00000064356.3|-|-|mt-Atp8-201|mt-Atp8|204|CDS:1-204| True ENSMUST00000082408.1|ENSMUSG00000064357.1|-|-|mt-Atp6-201|mt-Atp6|681|CDS:1-681| True ENSMUST00000082409.1|ENSMUSG00000064358.1|-|-|mt-Co3-201|mt-Co3|784|CDS:1-784| True ENSMUST00000082411.1|ENSMUSG00000064360.1|-|-|mt-Nd3-201|mt-Nd3|348|CDS:1-348| True ENSMUST00000084013.1|ENSMUSG00000065947.3|-|-|mt-Nd4l-201|mt-Nd4l|297|CDS:1-297| True ENSMUST00000082414.1|ENSMUSG00000064363.1|-|-|mt-Nd4-201|mt-Nd4|1378|CDS:1-1378| True ENSMUST00000082418.1|ENSMUSG00000064367.1|-|-|mt-Nd5-201|mt-Nd5|1824|CDS:1-1824| True ENSMUST00000082419.1|ENSMUSG00000064368.1|-|-|mt-Nd6-201|mt-Nd6|519|CDS:1-519| True ENSMUST00000082421.1|ENSMUSG00000064370.1|-|-|mt-Cytb-201|mt-Cytb|1144|CDS:1-1144| True dtype: bool
Now we can use the .loc[]
notation to get the rows we want as a subset. Notice that this outputs the dataframe itself - we haven't assigned it to anything. This is equivalent to extracting your RNA and throwing it on the ground. You can look at it but you didn't put it anywhere useful so you can't do anything with it.
expressed_genes = (kallisto_log2_tpm > 1).sum(axis=1) >= 1
kallisto_log2_tpm.loc[expressed_genes]
s10 | s13 | |
---|---|---|
target_id | ||
ENSMUST00000146665.2|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072662.2|Mrpl15-004|Mrpl15|1569|UTR5:1-62|CDS:63-569|UTR3:570-1569| | 0.000000 | 3.952259 |
ENSMUST00000134384.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051163.2|Lypla1-002|Lypla1|1136|UTR5:1-126|CDS:127-801|UTR3:802-1136| | 0.278669 | 2.769812 |
ENSMUST00000027036.10|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051162.1|Lypla1-001|Lypla1|2507|UTR5:1-91|CDS:92-784|UTR3:785-2507| | 3.174784 | 2.992448 |
ENSMUST00000081551.13|ENSMUSG00000033813.15|OTTMUSG00000042348.1|OTTMUST00000111602.1|Tcea1-001|Tcea1|2547|UTR5:1-100|CDS:101-1006|UTR3:1007-2547| | 6.947351 | 1.058026 |
ENSMUST00000002533.14|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072687.1|Rgs20-002|Rgs20|1778|UTR5:1-160|CDS:161-880|UTR3:881-1778| | 2.795065 | 4.141637 |
ENSMUST00000192847.5|ENSMUSG00000033793.12|OTTMUSG00000050145.9|OTTMUST00000127492.1|Atp6v1h-003|Atp6v1h|1662|UTR5:1-161|CDS:162-1487|UTR3:1488-1662| | 1.774490 | 0.000000 |
ENSMUST00000159906.7|ENSMUSG00000025907.14|OTTMUSG00000033467.12|OTTMUST00000084214.2|Rb1cc1-009|Rb1cc1|656|UTR5:1-375|CDS:376-656| | 0.000000 | 7.568701 |
ENSMUST00000088658.10|ENSMUSG00000025912.16|OTTMUSG00000034736.3|OTTMUST00000088260.1|Mybl1-001|Mybl1|4980|UTR5:1-255|CDS:256-2511|UTR3:2512-4980| | 0.775518 | 1.389170 |
ENSMUST00000052843.11|ENSMUSG00000046101.16|OTTMUSG00000021854.3|OTTMUST00000051859.1|Mcmdc2-001|Mcmdc2|2826|UTR5:1-97|CDS:98-1141|UTR3:1142-2826| | 2.036728 | 0.237117 |
ENSMUST00000118098.2|ENSMUSG00000046101.16|OTTMUSG00000021854.3|OTTMUST00000078010.2|Mcmdc2-003|Mcmdc2|2072|UTR5:1-405|CDS:406-1482|UTR3:1483-2072| | 0.234414 | 2.070753 |
ENSMUST00000183059.1|ENSMUSG00000046101.16|OTTMUSG00000021854.3|OTTMUST00000113822.1|Mcmdc2-005|Mcmdc2|749|CDS:1-77|UTR3:78-749| | 7.079271 | 8.468359 |
ENSMUST00000027050.9|ENSMUSG00000025917.9|OTTMUSG00000029459.3|OTTMUST00000072996.2|Cops5-001|Cops5|1698|UTR5:1-315|CDS:316-1320|UTR3:1321-1698| | 6.723326 | 0.000000 |
ENSMUST00000186528.6|ENSMUSG00000025917.9|OTTMUSG00000029459.3|OTTMUST00000118982.1|Cops5-006|Cops5|962|UTR5:1-129|CDS:130-276|UTR3:277-962| | 4.856129 | 0.246389 |
ENSMUST00000191012.6|ENSMUSG00000056763.16|OTTMUSG00000027401.8|OTTMUST00000118986.2|Cspp1-019|Cspp1|345|UTR5:1-135|CDS:136-345| | 4.375331 | 0.000000 |
ENSMUST00000122156.7|ENSMUSG00000056763.16|OTTMUSG00000027401.8|OTTMUST00000067947.1|Cspp1-006|Cspp1|2535|UTR5:1-117|CDS:118-1077|UTR3:1078-2535| | 3.466614 | 0.000000 |
ENSMUST00000155974.7|ENSMUSG00000056763.16|OTTMUSG00000027401.8|OTTMUST00000067949.4|Cspp1-008|Cspp1|445|UTR5:1-139|CDS:140-445| | 3.035826 | 0.000000 |
ENSMUST00000188449.6|ENSMUSG00000056763.16|OTTMUSG00000027401.8|OTTMUST00000118993.1|Cspp1-024|Cspp1|538|CDS:1-317|UTR3:318-538| | 6.619077 | 0.000000 |
ENSMUST00000088615.10|ENSMUSG00000067851.11|OTTMUSG00000033923.2|OTTMUST00000085577.1|Arfgef1-002|Arfgef1|6978|UTR5:1-176|CDS:177-5717|UTR3:5718-6978| | 6.223523 | 2.219540 |
ENSMUST00000027056.11|ENSMUSG00000048960.13|OTTMUSG00000047927.1|OTTMUST00000123246.1|Prex2-001|Prex2|11053|UTR5:1-328|CDS:329-5125|UTR3:5126-11053| | 1.736847 | 0.056283 |
ENSMUST00000188154.1|ENSMUSG00000048960.13|OTTMUSG00000047927.1|OTTMUST00000123253.1|Prex2-006|Prex2|843|CDS:1-122|UTR3:123-843| | 1.547519 | 5.895499 |
ENSMUST00000171690.8|ENSMUSG00000057715.13|OTTMUSG00000033915.2|-|A830018L16Rik-201|A830018L16Rik|6158|UTR5:1-464|CDS:465-1331|UTR3:1332-6158| | 1.597393 | 2.705801 |
ENSMUST00000188454.6|ENSMUSG00000025938.16|OTTMUSG00000022327.2|OTTMUST00000053228.2|Slco5a1-001|Slco5a1|8463|UTR5:1-591|CDS:592-3144|UTR3:3145-8463| | 0.056284 | 3.506221 |
ENSMUST00000136197.7|ENSMUSG00000025938.16|OTTMUSG00000022327.2|OTTMUST00000053230.2|Slco5a1-003|Slco5a1|5123|UTR5:1-640|CDS:641-2617|UTR3:2618-5123| | 0.027134 | 6.674800 |
ENSMUST00000027068.10|ENSMUSG00000025935.10|OTTMUSG00000049113.1|OTTMUST00000125384.1|Tram1-001|Tram1|2938|UTR5:1-188|CDS:189-1313|UTR3:1314-2938| | 5.431071 | 0.201256 |
ENSMUST00000027071.6|ENSMUSG00000025937.6|OTTMUSG00000021599.3|OTTMUST00000051280.2|Lactb2-001|Lactb2|4494|UTR5:1-96|CDS:97-963|UTR3:964-4494| | 3.940599 | 4.778224 |
ENSMUST00000041447.4|ENSMUSG00000032769.4|OTTMUSG00000049207.1|OTTMUST00000125545.1|Trpa1-001|Trpa1|4263|UTR5:1-27|CDS:28-3405|UTR3:3406-4263| | 0.000000 | 2.086301 |
ENSMUST00000188371.6|ENSMUSG00000025925.14|OTTMUSG00000047984.2|OTTMUST00000123359.2|Terf1-001|Terf1|2268|UTR5:1-32|CDS:33-1298|UTR3:1299-2268| | 0.535847 | 5.333087 |
ENSMUST00000115367.7|ENSMUSG00000067795.13|OTTMUSG00000022165.2|OTTMUST00000052635.1|4930444P10Rik-003|4930444P10Rik|868|UTR5:1-163|CDS:164-721|UTR3:722-868| | 3.591404 | 4.592032 |
ENSMUST00000058437.13|ENSMUSG00000043716.13|OTTMUSG00000016936.3|OTTMUST00000041073.2|Rpl7-001|Rpl7|1163|UTR5:1-284|CDS:285-1097|UTR3:1098-1163| | 9.557749 | 9.656554 |
ENSMUST00000149566.1|ENSMUSG00000043716.13|OTTMUSG00000016936.3|OTTMUST00000041075.2|Rpl7-003|Rpl7|838|CDS:1-838| | 8.027635 | 7.302383 |
... | ... | ... |
ENSMUST00000112172.3|ENSMUSG00000049775.16|OTTMUSG00000019572.3|OTTMUST00000046738.1|Tmsb4x-001|Tmsb4x|768|UTR5:1-225|CDS:226-360|UTR3:361-768| | 13.768288 | 14.830347 |
ENSMUST00000112176.7|ENSMUSG00000049775.16|OTTMUSG00000019572.3|OTTMUST00000046739.2|Tmsb4x-002|Tmsb4x|698|UTR5:1-139|CDS:140-292|UTR3:293-698| | 4.358734 | 6.675759 |
ENSMUST00000112175.1|ENSMUSG00000049775.16|OTTMUSG00000019572.3|OTTMUST00000046740.1|Tmsb4x-003|Tmsb4x|864|UTR5:1-502|CDS:503-655|UTR3:656-864| | 0.653129 | 8.543295 |
ENSMUST00000112137.1|ENSMUSG00000031358.17|OTTMUSG00000019558.2|OTTMUST00000046702.2|Msl3-002|Msl3|2316|UTR5:1-163|CDS:164-1564|UTR3:1565-2316| | 0.094384 | 1.071969 |
ENSMUST00000112129.7|ENSMUSG00000031355.16|OTTMUSG00000019560.4|OTTMUST00000046950.1|Arhgap6-006|Arhgap6|2487|UTR5:1-211|CDS:212-1369|UTR3:1370-2487| | 2.217563 | 2.394953 |
ENSMUST00000154638.8|ENSMUSG00000031352.10|OTTMUSG00000019596.2|OTTMUST00000046816.2|Hccs-003|Hccs|3310|UTR5:1-239|CDS:240-521|UTR3:522-3310| | 0.693446 | 1.291150 |
ENSMUST00000033717.8|ENSMUSG00000031352.10|OTTMUSG00000019596.2|OTTMUST00000046815.1|Hccs-002|Hccs|2320|UTR5:1-207|CDS:208-1026|UTR3:1027-2320| | 5.285036 | 0.090573 |
ENSMUST00000115894.2|ENSMUSG00000069053.11|OTTMUSG00000045273.1|-|Uba1y-201|Uba1y|3981|UTR5:1-102|CDS:103-3279|UTR3:3280-3981| | 2.265662 | 3.909082 |
ENSMUST00000189069.6|ENSMUSG00000056673.14|OTTMUSG00000045277.1|OTTMUST00000118912.1|Kdm5d-004|Kdm5d|2177|UTR5:1-178|CDS:179-1609|UTR3:1610-2177| | 0.000000 | 1.415645 |
ENSMUST00000055032.13|ENSMUSG00000056673.14|OTTMUSG00000045277.1|OTTMUST00000118913.1|Kdm5d-001|Kdm5d|7974|UTR5:1-176|CDS:177-4823|UTR3:4824-7974| | 3.009652 | 3.433922 |
ENSMUST00000186726.6|ENSMUSG00000056673.14|OTTMUSG00000045277.1|OTTMUST00000118916.1|Kdm5d-002|Kdm5d|5916|UTR5:1-128|CDS:129-944|UTR3:945-5916| | 1.715568 | 2.865436 |
ENSMUST00000143958.7|ENSMUSG00000068457.14|OTTMUSG00000032814.5|OTTMUST00000081730.2|Uty-002|Uty|2903|UTR5:1-126|CDS:127-363|UTR3:364-2903| | 1.169130 | 0.000000 |
ENSMUST00000091190.11|ENSMUSG00000069045.11|OTTMUSG00000045279.1|OTTMUST00000118924.1|Ddx3y-001|Ddx3y|4600|UTR5:1-106|CDS:107-2083|UTR3:2084-4600| | 0.420181 | 1.121188 |
ENSMUST00000091178.1|ENSMUSG00000069036.3|OTTMUSG00000045384.2|OTTMUST00000119113.2|Sry-001|Sry|1188|CDS:1-1188| | 2.305509 | 1.958223 |
ENSMUST00000190391.1|ENSMUSG00000096706.2|OTTMUSG00000045647.2|OTTMUST00000119534.2|Gm21854-001|Gm21854|1480|UTR5:1-662|CDS:663-1361|UTR3:1362-1480| | 1.386447 | 0.904086 |
ENSMUST00000185926.1|ENSMUSG00000096036.2|OTTMUSG00000045661.2|OTTMUST00000119556.2|Gm21778-001|Gm21778|1482|UTR5:1-660|CDS:661-1359|UTR3:1360-1482| | 3.411056 | 0.000000 |
ENSMUST00000189201.1|ENSMUSG00000094484.2|OTTMUSG00000045867.2|OTTMUST00000119866.2|Gm21244-001|Gm21244|3115|UTR5:1-405|CDS:406-1089|UTR3:1090-3115| | 6.108443 | 7.190151 |
ENSMUST00000082392.1|ENSMUSG00000064341.1|-|-|mt-Nd1-201|mt-Nd1|957|CDS:1-957| | 10.386110 | 12.218548 |
ENSMUST00000082396.1|ENSMUSG00000064345.1|-|-|mt-Nd2-201|mt-Nd2|1038|CDS:1-1038| | 9.128672 | 11.083306 |
ENSMUST00000082402.1|ENSMUSG00000064351.1|-|-|mt-Co1-201|mt-Co1|1545|CDS:1-1545| | 14.110687 | 13.061822 |
ENSMUST00000082405.1|ENSMUSG00000064354.1|-|-|mt-Co2-201|mt-Co2|684|CDS:1-684| | 13.673254 | 13.407600 |
ENSMUST00000082407.1|ENSMUSG00000064356.3|-|-|mt-Atp8-201|mt-Atp8|204|CDS:1-204| | 13.960626 | 14.381415 |
ENSMUST00000082408.1|ENSMUSG00000064357.1|-|-|mt-Atp6-201|mt-Atp6|681|CDS:1-681| | 13.435566 | 13.835251 |
ENSMUST00000082409.1|ENSMUSG00000064358.1|-|-|mt-Co3-201|mt-Co3|784|CDS:1-784| | 13.366910 | 13.871424 |
ENSMUST00000082411.1|ENSMUSG00000064360.1|-|-|mt-Nd3-201|mt-Nd3|348|CDS:1-348| | 10.291689 | 10.614544 |
ENSMUST00000084013.1|ENSMUSG00000065947.3|-|-|mt-Nd4l-201|mt-Nd4l|297|CDS:1-297| | 10.235930 | 11.969117 |
ENSMUST00000082414.1|ENSMUSG00000064363.1|-|-|mt-Nd4-201|mt-Nd4|1378|CDS:1-1378| | 10.380158 | 11.326160 |
ENSMUST00000082418.1|ENSMUSG00000064367.1|-|-|mt-Nd5-201|mt-Nd5|1824|CDS:1-1824| | 9.938436 | 10.430975 |
ENSMUST00000082419.1|ENSMUSG00000064368.1|-|-|mt-Nd6-201|mt-Nd6|519|CDS:1-519| | 9.172783 | 10.414791 |
ENSMUST00000082421.1|ENSMUSG00000064370.1|-|-|mt-Cytb-201|mt-Cytb|1144|CDS:1-1144| | 12.192564 | 12.669727 |
9383 rows × 2 columns
Now let's actually make another dataframe with only the expressed genes.
kallisto_log2_tpm_expressed = kallisto_log2_tpm.loc[expressed_genes]
print(kallisto_log2_tpm_expressed.shape)
kallisto_log2_tpm_expressed.head()
(9383, 2)
s10 | s13 | |
---|---|---|
target_id | ||
ENSMUST00000146665.2|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072662.2|Mrpl15-004|Mrpl15|1569|UTR5:1-62|CDS:63-569|UTR3:570-1569| | 0.000000 | 3.952259 |
ENSMUST00000134384.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051163.2|Lypla1-002|Lypla1|1136|UTR5:1-126|CDS:127-801|UTR3:802-1136| | 0.278669 | 2.769812 |
ENSMUST00000027036.10|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051162.1|Lypla1-001|Lypla1|2507|UTR5:1-91|CDS:92-784|UTR3:785-2507| | 3.174784 | 2.992448 |
ENSMUST00000081551.13|ENSMUSG00000033813.15|OTTMUSG00000042348.1|OTTMUST00000111602.1|Tcea1-001|Tcea1|2547|UTR5:1-100|CDS:101-1006|UTR3:1007-2547| | 6.947351 | 1.058026 |
ENSMUST00000002533.14|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072687.1|Rgs20-002|Rgs20|1778|UTR5:1-160|CDS:161-880|UTR3:881-1778| | 2.795065 | 4.141637 |
We'll plot the jointplot
again, using a little different syntax. Since the data we want are two columns in the same dataframe, we can specify the names of the columns as "x" (first position) and "y" (second position) and then the dataframe in the third position.
sns.jointplot('s10', 's13', kallisto_log2_tpm_expressed)
<seaborn.axisgrid.JointGrid at 0x10fc49710>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
spearmanr
as the statistical function¶### BEGIN SOLUTION
sns.jointplot('s10', 's13', kallisto_log2_tpm_expressed, stat_func=spearmanr)
### END SOLUTION
<seaborn.axisgrid.JointGrid at 0x1101759e8>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
assert _.ax_joint.legend_.texts[0].get_text() == 'spearmanr = -0.23; p = 7e-113'
featureCounts
¶s10_featurecounts = pd.read_table('s10_featureCounts.txt')
print(s10_featurecounts.shape)
s10_featurecounts.head()
(46984, 1)
# Program:featureCounts v1.5.0; Command:"featureCounts" "-T" "8" "-s" "-B" "--primary" "-a" "/projects/ps-yeolab/biom262-2016/genomes/mm10/gencode/m8/gencode.vM8.basic.annotation.gtf" "-o" "/home/ucsd-train01/projects/shalek2013/processed_data/s10_featureCounts.txt" "/home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam" | ||||||
---|---|---|---|---|---|---|
Geneid | Chr | Start | End | Strand | Length | /home/ucsd-train01/projects/shalek2013/process... |
ENSMUSG00000102693.1 | chr1 | 3073253 | 3074322 | + | 1070 | 0 |
ENSMUSG00000064842.1 | chr1 | 3102016 | 3102125 | + | 110 | 980 |
ENSMUSG00000051951.5 | chr1;chr1;chr1 | 3214482;3421702;3670552 | 3216968;3421901;3671498 | -;-;- | 3634 | 0 |
ENSMUSG00000102851.1 | chr1 | 3252757 | 3253236 | + | 480 | 0 |
s10_featurecounts
table¶### BEGIN SOLUTION
s10_featurecounts = pd.read_table('s10_featureCounts.txt', skiprows=1, index_col='Geneid')
### END SOLUTION
print(s10_featurecounts.shape)
s10_featurecounts.head()
(46983, 6)
Chr | Start | End | Strand | Length | /home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam | |
---|---|---|---|---|---|---|
Geneid | ||||||
ENSMUSG00000102693.1 | chr1 | 3073253 | 3074322 | + | 1070 | 0 |
ENSMUSG00000064842.1 | chr1 | 3102016 | 3102125 | + | 110 | 980 |
ENSMUSG00000051951.5 | chr1;chr1;chr1 | 3214482;3421702;3670552 | 3216968;3421901;3671498 | -;-;- | 3634 | 0 |
ENSMUSG00000102851.1 | chr1 | 3252757 | 3253236 | + | 480 | 0 |
ENSMUSG00000103377.1 | chr1 | 3365731 | 3368549 | - | 2819 | 0 |
assert s10_featurecounts.shape == (46983, 6)
assert (s10_featurecounts.columns[:-1] == pd.Index(['Chr', 'Start', 'End', 'Strand', 'Length'],
dtype='object')).all()
featureCounts
¶Now, featureCounts
outputs the actual number of reads mapped to each gene. You can tell because the datatype is integer, which would only be true if it was raw read counts, and not a transformed value like TPM, which has decimals (i.e. is a "floating-point" type)
To get the transcripts per kilobase mapped (TPM) so we can compare to the kallisto
output, we'll have to do some extra steps.
s10_featurecounts.dtypes
Chr object Start object End object Strand object Length int64 /home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam int64 dtype: object
Let's look at the distribution of the number of reads per feature using sns.distplot
.
sns.distplot(s10_featurecounts['/home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam'])
<matplotlib.axes._subplots.AxesSubplot at 0x113bc2c50>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
For the next exercise ... remember that you can use convenient methods of ".sum()" to get the sum of a column. This sums all the gene lengths in the data.
s10_featurecounts['Length'].sum()
90730267
Like with the log2 TPM, we'll create a new column based on the existing columns. This example assigns the big ole column name to a single variable so it's easier to work with, and creates a new column that's the sum of all lengths times the number of reads, divided by 2000 (2e3 = $2\times 10^3$).
Notice that you can use regular multiplication with "*
" and division with "/
" (addition with "+
" and "-
" also work)
reads = s10_featurecounts['/home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam']
s10_featurecounts['new_column'] = (s10_featurecounts['Length'].sum() * reads)/2e3
s10_featurecounts.head()
Chr | Start | End | Strand | Length | /home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam | new_column | |
---|---|---|---|---|---|---|---|
Geneid | |||||||
ENSMUSG00000102693.1 | chr1 | 3073253 | 3074322 | + | 1070 | 0 | 0.00 |
ENSMUSG00000064842.1 | chr1 | 3102016 | 3102125 | + | 110 | 980 | 44457830.83 |
ENSMUSG00000051951.5 | chr1;chr1;chr1 | 3214482;3421702;3670552 | 3216968;3421901;3671498 | -;-;- | 3634 | 0 | 0.00 |
ENSMUSG00000102851.1 | chr1 | 3252757 | 3253236 | + | 480 | 0 | 0.00 |
ENSMUSG00000103377.1 | chr1 | 3365731 | 3368549 | - | 2819 | 0 | 0.00 |
Using your knowledge about FPKM, add a column called 'fpkm'
to s10_featurecounts
that's the fragments per kilobase mapped. We're doing FPKM first because you can calculate the TPM from the FPKM easily.
(Use the "Length" column provided rather than the "effective length" which is the length minus the read lengths. otherwise we'll get negative FPKMs!)
reads = s10_featurecounts['/home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam']
### BEGIN SOLUTION
s10_featurecounts['fpkm'] = 1e9 * reads/(s10_featurecounts['Length'] * reads.sum())
### END SOLUTION
s10_featurecounts.head()
Chr | Start | End | Strand | Length | /home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam | new_column | fpkm | |
---|---|---|---|---|---|---|---|---|
Geneid | ||||||||
ENSMUSG00000102693.1 | chr1 | 3073253 | 3074322 | + | 1070 | 0 | 0.00 | 0.000000 |
ENSMUSG00000064842.1 | chr1 | 3102016 | 3102125 | + | 110 | 980 | 44457830.83 | 151.953838 |
ENSMUSG00000051951.5 | chr1;chr1;chr1 | 3214482;3421702;3670552 | 3216968;3421901;3671498 | -;-;- | 3634 | 0 | 0.00 | 0.000000 |
ENSMUSG00000102851.1 | chr1 | 3252757 | 3253236 | + | 480 | 0 | 0.00 | 0.000000 |
ENSMUSG00000103377.1 | chr1 | 3365731 | 3368549 | - | 2819 | 0 | 0.00 | 0.000000 |
assert s10_featurecounts.loc['ENSMUSG00000064842.1', 'fpkm'] == 151.9538381109796
Let's look at the new distribution of the FPKMs. Notice that the range is much smaller than the reads.
sns.distplot(s10_featurecounts['fpkm'])
<matplotlib.axes._subplots.AxesSubplot at 0x111469f98>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
Now add a column called 'tpm'
which uses FPKM to calculate the transcripts per million. You'll need to read the "What the FPKM" blog post in detail to get the equation for TPM.
Hint: How would you sum all the FPKMs in the data?
### BEGIN SOLUTION
s10_featurecounts['tpm'] = 1e6 * s10_featurecounts['fpkm']/s10_featurecounts['fpkm'].sum()
### END SOLUTION
s10_featurecounts.head()
Chr | Start | End | Strand | Length | /home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam | new_column | fpkm | tpm | |
---|---|---|---|---|---|---|---|---|---|
Geneid | |||||||||
ENSMUSG00000102693.1 | chr1 | 3073253 | 3074322 | + | 1070 | 0 | 0.00 | 0.000000 | 0.000000 |
ENSMUSG00000064842.1 | chr1 | 3102016 | 3102125 | + | 110 | 980 | 44457830.83 | 151.953838 | 260.827361 |
ENSMUSG00000051951.5 | chr1;chr1;chr1 | 3214482;3421702;3670552 | 3216968;3421901;3671498 | -;-;- | 3634 | 0 | 0.00 | 0.000000 | 0.000000 |
ENSMUSG00000102851.1 | chr1 | 3252757 | 3253236 | + | 480 | 0 | 0.00 | 0.000000 | 0.000000 |
ENSMUSG00000103377.1 | chr1 | 3365731 | 3368549 | - | 2819 | 0 | 0.00 | 0.000000 | 0.000000 |
assert s10_featurecounts.loc['ENSMUSG00000064842.1', 'tpm'] == 260.82736102245372
Let's look at this new distribution of TPM. Notice its range is also smaller than the FPKMs.
sns.distplot(s10_featurecounts['tpm'])
<matplotlib.axes._subplots.AxesSubplot at 0x111cb4908>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
If you plot FPKM vs TPM, you'll see that they're linearly related, just like the equation told us :)
sns.jointplot('fpkm', 'tpm', s10_featurecounts)
<seaborn.axisgrid.JointGrid at 0x112a578d0>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
Add a column called 'log2_tpm'
where you do the log2(TPM+1) transformation that we did before.
### BEGIN SOLUTION
s10_featurecounts['log2_tpm'] = np.log2(s10_featurecounts['tpm'] + 1)
### END SOLUTION
s10_featurecounts.head()
Chr | Start | End | Strand | Length | /home/ucsd-train01/projects/shalek2013/processed_data/S10.Aligned.out.sorted.bam | new_column | fpkm | tpm | log2_tpm | |
---|---|---|---|---|---|---|---|---|---|---|
Geneid | ||||||||||
ENSMUSG00000102693.1 | chr1 | 3073253 | 3074322 | + | 1070 | 0 | 0.00 | 0.000000 | 0.000000 | 0.000000 |
ENSMUSG00000064842.1 | chr1 | 3102016 | 3102125 | + | 110 | 980 | 44457830.83 | 151.953838 | 260.827361 | 8.032472 |
ENSMUSG00000051951.5 | chr1;chr1;chr1 | 3214482;3421702;3670552 | 3216968;3421901;3671498 | -;-;- | 3634 | 0 | 0.00 | 0.000000 | 0.000000 | 0.000000 |
ENSMUSG00000102851.1 | chr1 | 3252757 | 3253236 | + | 480 | 0 | 0.00 | 0.000000 | 0.000000 | 0.000000 |
ENSMUSG00000103377.1 | chr1 | 3365731 | 3368549 | - | 2819 | 0 | 0.00 | 0.000000 | 0.000000 | 0.000000 |
assert s10_featurecounts.loc['ENSMUSG00000064842.1', 'log2_tpm'] == 8.0324720569159176
Remember this kallisto dataframe we made? Let's take a look at it again.
kallisto_log2_tpm_expressed.head()
s10 | s13 | |
---|---|---|
target_id | ||
ENSMUST00000146665.2|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072662.2|Mrpl15-004|Mrpl15|1569|UTR5:1-62|CDS:63-569|UTR3:570-1569| | 0.000000 | 3.952259 |
ENSMUST00000134384.7|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051163.2|Lypla1-002|Lypla1|1136|UTR5:1-126|CDS:127-801|UTR3:802-1136| | 0.278669 | 2.769812 |
ENSMUST00000027036.10|ENSMUSG00000025903.14|OTTMUSG00000021562.4|OTTMUST00000051162.1|Lypla1-001|Lypla1|2507|UTR5:1-91|CDS:92-784|UTR3:785-2507| | 3.174784 | 2.992448 |
ENSMUST00000081551.13|ENSMUSG00000033813.15|OTTMUSG00000042348.1|OTTMUST00000111602.1|Tcea1-001|Tcea1|2547|UTR5:1-100|CDS:101-1006|UTR3:1007-2547| | 6.947351 | 1.058026 |
ENSMUST00000002533.14|ENSMUSG00000002459.17|OTTMUSG00000029338.4|OTTMUST00000072687.1|Rgs20-002|Rgs20|1778|UTR5:1-160|CDS:161-880|UTR3:881-1778| | 2.795065 | 4.141637 |
Notice that its row names ("index
") have all this stuff in them. We really only need the gene ids - everything else is an annotation of the transcript (the length, where the UTR is in the sequence, where the coding region is in the sequence, etc).
Here's an example of using split()
on this string which is one of the IDs.
s = 'ENSMUST00000146665.2|ENSMUSG00000033845.13|OTTMUSG00000029329.3|OTTMUST00000072662.2|Mrpl15-004|Mrpl15|1569|UTR5:1-62|CDS:63-569|UTR3:570-1569|'
s.split('|')
['ENSMUST00000146665.2', 'ENSMUSG00000033845.13', 'OTTMUSG00000029329.3', 'OTTMUST00000072662.2', 'Mrpl15-004', 'Mrpl15', '1569', 'UTR5:1-62', 'CDS:63-569', 'UTR3:570-1569', '']
Now since we're using DataFrames we can use this really convenient function map
which applies the same function to each element of the vector. We'll only get the gene names by "mapping" a small function ("lambda") that splits each item in the row name on the pipe ("|") and takes the 1th item (remember we start counting from 0).
kallisto_log2_tpm_expressed.index.map(lambda x: x.split('|')[1])
array(['ENSMUSG00000033845.13', 'ENSMUSG00000025903.14', 'ENSMUSG00000025903.14', ..., 'ENSMUSG00000064367.1', 'ENSMUSG00000064368.1', 'ENSMUSG00000064370.1'], dtype=object)
We'll now copy the original dataframe and replace the "index
" with this new one, so we can compare to featureCounts
.
kallisto_log2_tpm_expressed_genes = kallisto_log2_tpm_expressed.copy()
kallisto_log2_tpm_expressed_genes.index = kallisto_log2_tpm_expressed.index.map(lambda x: x.split('|')[1])
print(kallisto_log2_tpm_expressed_genes.shape)
kallisto_log2_tpm_expressed_genes.head()
(9383, 2)
s10 | s13 | |
---|---|---|
ENSMUSG00000033845.13 | 0.000000 | 3.952259 |
ENSMUSG00000025903.14 | 0.278669 | 2.769812 |
ENSMUSG00000025903.14 | 3.174784 | 2.992448 |
ENSMUSG00000033813.15 | 6.947351 | 1.058026 |
ENSMUSG00000002459.17 | 2.795065 | 4.141637 |
Notice that we have some duplicate gene ids. This is because there were multiple transcripts per gene id.
This next bit of code takes each gene ID, and for the ones that match, it'll sum the TPMs. This is legal to do because the total number of transcripts has not changed, we're merely summing per gene.
The .groupby
function is *very* useful every time you have one or more tables that have some common key (in this case gene ids) that you want to use. We won't go into it here but it may come in handy to you later.
kallisto_log2_tpm_expressed_genes_summed = kallisto_log2_tpm_expressed_genes.groupby(level=0, axis=0).sum()
print(kallisto_log2_tpm_expressed_genes_summed.shape)
kallisto_log2_tpm_expressed_genes_summed.head()
(6733, 2)
s10 | s13 | |
---|---|---|
ENSMUSG00000000001.4 | 0.273651 | 6.259357 |
ENSMUSG00000000056.7 | 5.218173 | 0.000000 |
ENSMUSG00000000058.6 | 11.609494 | 0.344833 |
ENSMUSG00000000078.6 | 2.723716 | 4.196505 |
ENSMUSG00000000085.16 | 0.000000 | 5.360020 |
Now we'll use a convenient function called align
which will unify the row names of the two columns (Series) we have: the kallisto log2 TPMs and the featurecounts log2 TPMs. We'll assign them to "x
" and "y
" for short (for now)
x = kallisto_log2_tpm_expressed_genes_summed['s10']
y = s10_featurecounts['log2_tpm']
print('before aligning:', x.shape, y.shape)
x, y = x.align(y)
print('after aligning:', x.shape, y.shape)
x.head()
before aligning: (6733,) (46983,) after aligning: (46983,) (46983,)
ENSMUSG00000000001.4 0.273651 ENSMUSG00000000003.15 NaN ENSMUSG00000000028.14 NaN ENSMUSG00000000031.15 NaN ENSMUSG00000000037.16 NaN Name: s10, dtype: float64
So that the plot shows the name of the sample and the algorithm, we'll change the .name
attribute of the series.
x.name = 'S10 kallisto'
y.name = 'S10 featureCounts'
sns.jointplot(x, y)
<seaborn.axisgrid.JointGrid at 0x112e76080>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
What about using spearman correlation?
sns.jointplot(x, y, stat_func=spearmanr)
<seaborn.axisgrid.JointGrid at 0x112f1fbe0>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
Write 3-5 sentences describing why you think kallisto and FeatureCounts have similar, but different results, based on the algorithms used for mapping and (quasi-)alignment. Recall that with kallisto we mapped to protein-coding transcripts, and for FeatureCounts we first had to map with STAR (which has its own biases - like what?) and only after that we used the basic annotation, both from GENCODE Mouse V8.
We'll now do some differential expression analyses to get a sense of how these different algorithms affect the results.
kallisto_diff = kallisto_log2_tpm_expressed_genes_summed['s10'] - kallisto_log2_tpm_expressed_genes_summed['s13']
kallisto_diff.head()
ENSMUSG00000000001.4 -5.985706 ENSMUSG00000000056.7 5.218173 ENSMUSG00000000058.6 11.264661 ENSMUSG00000000078.6 -1.472789 ENSMUSG00000000085.16 -5.360020 dtype: float64
sns.distplot(kallisto_diff)
<matplotlib.axes._subplots.AxesSubplot at 0x1144c5ba8>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
Since the differences between genes is approximately normal, we can find the genes that are overexpressed in S10 (i.e. speceific to S10) by getting the ones that are 2 standard deviations greater than the mean.
We need this line:
kallisto_s10_specific_genes = pd.Series(kallisto_s10_specific.index[kallisto_s10_specific])
Becaus we want to create a new series where the values are the gene ids, rather than the index is the gene ids. This makes writing to a file easier.
kallisto_s10_specific = kallisto_diff > (kallisto_diff.mean() + 2*kallisto_diff.std())
kallisto_s10_specific_genes = pd.Series(kallisto_s10_specific.index[kallisto_s10_specific])
print(kallisto_s10_specific_genes.shape)
kallisto_s10_specific_genes.head()
(218,)
0 ENSMUSG00000000827.18 1 ENSMUSG00000001418.13 2 ENSMUSG00000002699.12 3 ENSMUSG00000002885.14 4 ENSMUSG00000002996.17 dtype: object
Make a series called kallisto_s13_specific_genes
which contains the genes that are specifically expressed in the sample S13.
(Hint: specific to S13 means underexpressed in S10 - so similar to above, but 2 standard deviations less than the mean)
### BEGIN SOLUTION
kallisto_s13_specific = kallisto_diff < (kallisto_diff.mean() - 2*kallisto_diff.std())
kallisto_s13_specific_genes = pd.Series(kallisto_s13_specific.index[kallisto_s13_specific])
### END SOLUTION
print(kallisto_s13_specific_genes.shape)
kallisto_s13_specific_genes.head()
(162,)
0 ENSMUSG00000000386.14 1 ENSMUSG00000000399.10 2 ENSMUSG00000000581.8 3 ENSMUSG00000000753.15 4 ENSMUSG00000001127.12 dtype: object
assert kallisto_s13_specific_genes.shape == (162,)
We'll do a quick Gene Ontology to get a sense of the kind of genes which are over- or under-enriched. To do this, we'll need to save our genes to a file and then copy that to our computer.
The gene ids we have are the GENCODE ids which are the ENSEMBL ids + '.version', e.g. "ENSMUSG00000000386.14" is ensembl ID ENSMUSG00000000386, version 14. Most of the gene ontology programs recognize the ENSEMBL ids but not GENCODE ids, so we'll remove the stuff after the period using .split('.')[0]
, which splits the string up every time it sees a period, and then takes the first (0th - counting from zero) item.
kallisto_s10_specific_genes_ensembl = kallisto_s10_specific_genes.map(lambda x: x.split('.')[0])
kallisto_s13_specific_genes_ensembl = kallisto_s13_specific_genes.map(lambda x: x.split('.')[0])
kallisto_s13_specific_genes_ensembl.head()
0 ENSMUSG00000000386 1 ENSMUSG00000000399 2 ENSMUSG00000000581 3 ENSMUSG00000000753 4 ENSMUSG00000001127 dtype: object
We'll save the gene ids, but not the index/row names (which are integers starting from 0 and are boring)
kallisto_s10_specific_genes_ensembl.to_csv('kallisto_s10_specific_genes.csv', index=False)
kallisto_s13_specific_genes_ensembl.to_csv('kallisto_s13_specific_genes.csv', index=False)
We can look at the data we created with head
:
! head kallisto*specific_genes.csv
==> kallisto_s10_specific_genes.csv <== ENSMUSG00000000827 ENSMUSG00000001418 ENSMUSG00000002699 ENSMUSG00000002885 ENSMUSG00000002996 ENSMUSG00000003662 ENSMUSG00000003721 ENSMUSG00000004266 ENSMUSG00000004609 ENSMUSG00000004610 ==> kallisto_s13_specific_genes.csv <== ENSMUSG00000000386 ENSMUSG00000000399 ENSMUSG00000000581 ENSMUSG00000000753 ENSMUSG00000001127 ENSMUSG00000001289 ENSMUSG00000001794 ENSMUSG00000002845 ENSMUSG00000004296 ENSMUSG00000004530
Use scp
(Mac/Linus)/pscp
(Windows) to secure copy these files to your computer.
Go to the gene ontology website (or your other favorite gene ontology enrichment resource) and get the ontology enrichment of the s10-specific and s13-specific genes. What are the two samples enriched for?
Hint: Which organism are we using?
s13_featureCounts.txt
file### BEGIN SOLUTION
s13_featurecounts = pd.read_table('s13_featureCounts.txt', index_col=0, skiprows=1)
reads = s13_featurecounts['/home/ucsd-train01/projects/shalek2013/processed_data/S13.Aligned.out.sorted.bam']
s13_featurecounts['fpkm'] = reads/(s13_featurecounts['Length']/1000)
s13_featurecounts['tpm'] = 1e6 * s13_featurecounts['fpkm']/s13_featurecounts['fpkm'].sum()
s13_featurecounts['log2_tpm'] = np.log2(s13_featurecounts['tpm']+1)
### END SOLUTION
print(s13_featurecounts.shape)
s13_featurecounts.head()
(46983, 9)
Chr | Start | End | Strand | Length | /home/ucsd-train01/projects/shalek2013/processed_data/S13.Aligned.out.sorted.bam | fpkm | tpm | log2_tpm | |
---|---|---|---|---|---|---|---|---|---|
Geneid | |||||||||
ENSMUSG00000102693.1 | chr1 | 3073253 | 3074322 | + | 1070 | 0 | 0 | 0 | 0 |
ENSMUSG00000064842.1 | chr1 | 3102016 | 3102125 | + | 110 | 0 | 0 | 0 | 0 |
ENSMUSG00000051951.5 | chr1;chr1;chr1 | 3214482;3421702;3670552 | 3216968;3421901;3671498 | -;-;- | 3634 | 0 | 0 | 0 | 0 |
ENSMUSG00000102851.1 | chr1 | 3252757 | 3253236 | + | 480 | 0 | 0 | 0 | 0 |
ENSMUSG00000103377.1 | chr1 | 3365731 | 3368549 | - | 2819 | 0 | 0 | 0 | 0 |
assert len(s13_featurecounts.columns.intersection(['fpkm', 'tpm', 'log2_tpm'])) == 3
assert s13_featurecounts.loc['ENSMUSG00000064370.1', 'log2_tpm'] == 13.15202328657807
Let's compare these two samples now.
sns.jointplot(s10_featurecounts['log2_tpm'], s13_featurecounts['log2_tpm'])
<seaborn.axisgrid.JointGrid at 0x11489f7f0>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
Create a series called featurecounts_diff
that is the difference between S10 and S13 log2 TPM in featurecounts (relative to S10).
### BEGIN SOLUTION
featurecounts_diff = s10_featurecounts['log2_tpm'] - s13_featurecounts['log2_tpm']
### END SOLUTION
print(featurecounts_diff.shape)
featurecounts_diff.head()
(46983,)
Geneid ENSMUSG00000102693.1 0.000000 ENSMUSG00000064842.1 8.032472 ENSMUSG00000051951.5 0.000000 ENSMUSG00000102851.1 0.000000 ENSMUSG00000103377.1 0.000000 Name: log2_tpm, dtype: float64
assert featurecounts_diff['ENSMUSG00000064842.1'] == 8.0324720569159282
--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) <ipython-input-76-df3e426ce533> in <module>() ----> 1 assert featurecounts_diff['ENSMUSG00000064842.1'] == 8.0324720569159282 AssertionError:
Let's look at the distribution of the differences.
sns.distplot(featurecounts_diff)
<matplotlib.axes._subplots.AxesSubplot at 0x116183fd0>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
Yikes, there's a TON of things at zero. Let's get rid of them.
featurecounts_diff_nonzero = featurecounts_diff[featurecounts_diff != 0]
print(featurecounts_diff_nonzero.shape)
sns.distplot(featurecounts_diff_nonzero)
(9470,)
<matplotlib.axes._subplots.AxesSubplot at 0x116085390>
/Users/olga/anaconda3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect. warnings.warn("This figure includes Axes that are not "
Use the "2 standard deviations greater than the mean" and "2 standard deviations less than the mean" concepts from before to get the genes.
Remember you'll need a line like this:
kallisto_s10_specific_genes = pd.Series(kallisto_s10_specific.index[kallisto_s10_specific])
To save the gene ids as a series with the gene ids as the values and integers as the index, which makes writing the gene ids to a file easier.
### BEGIN SOLUTION
featurecounts_s10_specific = featurecounts_diff_nonzero > (featurecounts_diff_nonzero.mean() + 2*featurecounts_diff_nonzero.std())
featurecounts_s10_specific_genes = pd.Series(featurecounts_s10_specific.index[featurecounts_s10_specific])
featurecounts_s13_specific = featurecounts_diff_nonzero < (featurecounts_diff_nonzero.mean() - 2*featurecounts_diff_nonzero.std())
featurecounts_s13_specific_genes = pd.Series(featurecounts_s13_specific.index[featurecounts_s13_specific])
### END SOLUTION
print('featurecounts_s10_specific_genes.shape', featurecounts_s10_specific_genes.shape)
print('featurecounts_s13_specific_genes.shape', featurecounts_s13_specific_genes.shape)
featurecounts_s13_specific_genes.head()
featurecounts_s10_specific_genes.shape (303,) featurecounts_s13_specific_genes.shape (369,)
0 ENSMUSG00000081441.3 1 ENSMUSG00000102664.1 2 ENSMUSG00000079658.9 3 ENSMUSG00000062939.11 4 ENSMUSG00000026094.14 Name: Geneid, dtype: object
## Check S10-specific
assert featurecounts_s10_specific_genes.shape == (303,)
assert featurecounts_s10_specific_genes[0] == 'ENSMUSG00000064842.1'
## Check S13-specific
assert featurecounts_s13_specific_genes.shape == (369,)
assert featurecounts_s13_specific_genes[0] == 'ENSMUSG00000081441.3'
We'll need to remove the gene ID version number again so we have the ENSEMBL version of the ID, not the gencode.
Use the previous code to split the GENCODE ids so you get the period- and version-less ENSEMBL ids.
### BEGIN SOLUTION
featurecounts_s10_specific_genes_ensembl = featurecounts_s10_specific_genes.map(lambda x: x.split('.')[0])
featurecounts_s13_specific_genes_ensembl = featurecounts_s13_specific_genes.map(lambda x: x.split('.')[0])
### END SOLUTION
featurecounts_s13_specific_genes_ensembl.head()
0 ENSMUSG00000081441 1 ENSMUSG00000102664 2 ENSMUSG00000079658 3 ENSMUSG00000062939 4 ENSMUSG00000026094 Name: Geneid, dtype: object
assert featurecounts_s13_specific_genes_ensembl[0] == 'ENSMUSG00000081441'
assert featurecounts_s10_specific_genes_ensembl[0] == 'ENSMUSG00000064842'
featurecounts_s13_specific_genes_ensembl
" and "featurecounts_s10_specific_genes_ensembl
" to fileskallisto
and featureCounts
? (3-5 sentences - answer below)# Code to write series to file.
### BEGIN SOLUTION
featurecounts_s10_specific_genes_ensembl.to_csv('featurecounts_s10_specific_genes.csv', index=False)
featurecounts_s13_specific_genes_ensembl.to_csv('featurecounts_s13_specific_genes.csv', index=False)
### END SOLUTION