flotilla: Data-driven conversations
Open-source Python package for iterative machine learning and visualization | YeoLab/flotilla
Olga Botvinnik | @olgabot | github.com/olgabot | olgabotvinnik.com
PhD Candidate in Bioinformatics at UCSD | Gene Yeo Lab
National Defense Science and Engineering Graduate Fellow
NumFOCUS John Hunter Technical Fellow
April 10th, 2015
flotilla is an open-source Python package for exploring data*.
flotilla is focused on biological data such as single-cell and other large-scale RNA-seq transcriptome analyses.
scikit-learn makes it easy to run individual algorithms,
pandas makes subsetting data a dream,
seaborn makes visualizing computational results a charm, and the IPython notebook makes it possible to string all of these together into a reproducible document.
flotilla does something none of these other packages can.
Disclaimer: I am not a neuroscientist, and my understanding of brain anatomy is very rudimentary, so please bear with me.
We will use the BrainSpan Atlas of the Developing Human Brain, which was an effort to establish molecular profiles of brain regions at varying points of developmental time.
%matplotlib inline
import flotilla
study = flotilla.embark(flotilla._brainspan)
! curl https://s3-us-west-2.amazonaws.com/flotilla/brainspan_batch_corrected_for_amazon_s3/datapackage.json
Principal Component Analysis (PCA) is a dimensionality reduction algorithm that transforms a high-dimensional space, such as gene expression, into a small number of dimensions, such as just two for x-y plotting.
# custom list: https://www.dropbox.com/s/wyjak3oh6z5myfm/cell_cycle_genes.txt?dl=0
study.interactive_pca()
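To make the PCA step more concrete, here is a minimal sketch of the underlying computation using plain NumPy. It does not show how flotilla implements `interactive_pca`; the function name `pca_2d` and the synthetic "expression" matrix are made up for illustration. The data are 20 samples by 50 genes, with the second group of samples shifted in the first 10 genes.

```python
import numpy as np


def pca_2d(expression):
    """Project a samples-by-genes matrix onto its first two principal components.

    Hypothetical helper for illustration; not part of flotilla.
    """
    X = np.asarray(expression, dtype=float)
    # PCA works on mean-centered features (genes)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Fraction of total variance captured by each component
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return coords, explained[:2]


# Synthetic data (an assumption, not BrainSpan): two groups of 10 samples each
rng = np.random.default_rng(0)
group_a = rng.normal(0, 1, size=(10, 50))
group_b = rng.normal(0, 1, size=(10, 50))
group_b[:, :10] += 5.0  # group B has higher "expression" in 10 genes
expression = np.vstack([group_a, group_b])

coords, explained = pca_2d(expression)
print(coords.shape)  # one (x, y) pair per sample
print(explained)     # variance fraction of PC1 and PC2
```

Plotting the two columns of `coords` against each other gives the kind of x-y scatter described above, with the two groups separating along the first component.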
The hippocampus is involved in memory, and we hypothesize that hippocampal cells have a unique molecular profile (set of genes that are expressed). To test this, we will use a classifier on our data to identify genes that separate hippocampal samples from non-hippocampal samples.
flotilla uses an "Extremely Randomised Trees" classifier (ExtraTreesClassifier), which repeatedly takes random subsets of the features and picks random split points to build many decision trees, like this one for deciding whether to play outside:
FOXJ1, C1orf88, and TEKT1 are all involved in the development of cilia, finger-like protrusions from cells. Development of these cilia has been shown to be important in memory formation.
So it looks like our classifier picked up the right things!
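The idea of using a tree ensemble to rank genes can be sketched directly with scikit-learn's `ExtraTreesClassifier`. This is only an illustration under stated assumptions: the data are synthetic, the gene names are placeholders, and the parameters (`n_estimators=200`, `random_state=0`) are chosen for the example rather than taken from flotilla.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic samples-by-genes matrix (an assumption, not BrainSpan data)
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 30
gene_names = [f"gene_{i}" for i in range(n_genes)]  # placeholder names
X = rng.normal(0, 1, size=(n_samples, n_genes))

# Label the first 20 samples as "hippocampal" and raise three genes in them
is_hippocampus = np.array([True] * 20 + [False] * 40)
X[is_hippocampus, :3] += 4.0

# Fit an ensemble of extremely randomized trees on the binary trait
clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(X, is_hippocampus)

# Rank genes by how much they contributed to the trees' splits
ranked = np.argsort(clf.feature_importances_)[::-1]
top_genes = [gene_names[i] for i in ranked[:3]]
print(top_genes)
```

Genes with high `feature_importances_` are the ones the trees relied on most to separate the two groups, which is the same kind of evidence used above to single out the cilia-related genes.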