import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import skimage.io as io
There are 1000 mouse femur bones which have been measured at high resolution, and a number of shape analyses have been run on each sample.
Phenotypical Information (phenoTable.csv): each column represents a metric which was assessed in the images. CORT_DTO__C_TH, for example, is the mean thickness of the cortical bone.
pheno = pd.read_csv('phenoTable.csv')
pheno.head()
Genetic Information (genoTable.csv): each animal has been genotyped at a number of different regions of the genome, called markers (e.g. D1Mit236).
geno = pd.read_csv('genoTable.csv')
geno.head(5)
We want to merge the two data sets using the key 'ID'.
df = pd.merge(pheno,geno, on='ID')
df.head()
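On a toy pair of frames (the values below are made up for illustration), the same merge looks like this; by default pd.merge performs an inner join, so animals missing from either table are dropped:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
pheno_toy = pd.DataFrame({"ID": [1, 2, 3], "BMD": [0.05, 0.06, 0.07]})
geno_toy = pd.DataFrame({"ID": [2, 3, 4], "D1Mit236": ["AA", "AB", "BB"]})

# Inner join on 'ID': only IDs present in both tables survive
merged = pd.merge(pheno_toy, geno_toy, on="ID")
print(merged)  # rows for IDs 2 and 3 only
```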
The key 'FEMALE' is a boolean; we want to rename it to 'GENDER' and change the contents from True/False to 'F'/'M'. Here a lambda function is used for the mapping when we apply an operation to each entry in the GENDER column.
df=df.rename(columns={"FEMALE": "GENDER"})
df['GENDER'] = df['GENDER'].apply(lambda x: 'F' if x else 'M')
df['GENDER'].sample(5)
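An equivalent, slightly more explicit alternative to the lambda is pandas' map() with a dictionary. This sketch uses a small made-up frame so it does not depend on the CSV files being present:

```python
import pandas as pd

# Small synthetic frame standing in for the merged table (hypothetical values)
toy = pd.DataFrame({"ID": [1, 2, 3], "FEMALE": [True, False, True]})
toy = toy.rename(columns={"FEMALE": "GENDER"})
# map() with an explicit dictionary does the same job as the lambda above
toy["GENDER"] = toy["GENDER"].map({True: "F", False: "M"})
print(toy["GENDER"].tolist())  # → ['F', 'M', 'F']
```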
fig,ax=plt.subplots(1,1,figsize=(15,15));
pheno.hist(ax=ax,bins=50);
These are far too many variables to work with, at least for a start, so we have to focus on a few of them.
Explore the data and the correlations between various metrics by using the 'pairplot' plotting component. Examine different variable combinations.
sns.pairplot(pheno, vars = ['BMD', 'CORT_DTO__C_TH', 'CORT_DTO__C_TH_SD']);
Compare the metric CORT_DTO__C_TH between the two genders.
from scipy.stats import ttest_ind
# Extract the CORT_DTO__C_TH column and create a data frame for FEMALE and MALE respectively using the GENDER column
male = df[df['GENDER'] == 'M']['CORT_DTO__C_TH']
female = df[df['GENDER'] == 'F']['CORT_DTO__C_TH']
ttest,pval = ttest_ind(female,male)
print("p-value={:0.4f}".format(pval))
# Insert your test code here
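Before running it on the real columns, it can help to see what ttest_ind returns on purely synthetic samples; this sketch uses made-up normal data with a deliberate mean shift of 0.5:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Two synthetic groups: same spread, means 0.0 and 0.5 (hypothetical data)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.5, scale=1.0, size=200)
tstat, pval = ttest_ind(a, b)
print("t = {:0.2f}, p = {:0.4f}".format(tstat, pval))
# With a true shift of 0.5 and n=200 per group, p comes out well below 0.05
```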
For this example we will compare two real cortical bone samples taken from mice. For the purposes of the analysis, and to keep the data sizes small, we will use Anders' Crazy Camera to simulate the noisy detection process.
The assignment aims to be more integrative: you will combine material from a number of different lectures to get to the final answer.
imgA = (0.0 < io.imread('bone_7H3A_B1.tif')).astype(float)
imgB = (0.0 < io.imread('bone_7H6A_B2.tif')).astype(float)
plt.hist([imgA.ravel(), imgB.ravel()],bins=2);
This is the code to simulate a bad camera that produces degraded (blurred and noisy) images.
import numpy as np
import skimage.filters as filters
def camera(img, blurr=1.0, noise=0.1, illum=0.0):
    # Blur the image, then add a constant illumination offset and Gaussian noise
    res = filters.gaussian(img, sigma=blurr)
    res = res + illum
    res = res + np.random.normal(size=res.shape, loc=0, scale=noise)
    return res

def crappyCamera(img):
    return camera(img, blurr=2.0, noise=0.2, illum=0.0)
cimgA=crappyCamera(imgA)
cimgB=crappyCamera(imgB)
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,7))
ax1.imshow(cimgA[100])
ax2.imshow(cimgB[100])
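To see what the simulated detector does without needing the tif files, the same kind of camera can be run on a synthetic binary disc (the function is redefined here so the block is self-contained; the disc and parameter values are made up):

```python
import numpy as np
import skimage.filters as filters

def camera(img, blurr=1.0, noise=0.1, illum=0.0):
    # Blur, add a constant illumination offset, then Gaussian noise
    res = filters.gaussian(img, sigma=blurr)
    res = res + illum
    return res + np.random.normal(size=res.shape, loc=0, scale=noise)

# Synthetic binary disc standing in for a bone cross-section
yy, xx = np.mgrid[-32:32, -32:32]
disc = (xx**2 + yy**2 < 20**2).astype(float)

np.random.seed(0)
noisy = camera(disc, blurr=2.0, noise=0.2)
# The residual spread quantifies how far the detected image is from the truth
print("std of residual:", (noisy - disc).std())
```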
The task is to determine whether a significant difference can be detected between the two samples given the variation in the detector.
This exercise (a workflow named Statistical Significance Hunter) shows the same results as we discussed in the lecture for finding p-values of significance.
It takes a completely random stream of numbers with mean 0.0 and tests them against a null hypothesis (that they equal 0) in small batches. You can adjust the size of the batches, the number of items, and the confidence interval. The result is a pie chart showing the number of "significant" results found using the standard scientific criteria for common studies.
from scipy.stats import ttest_1samp
data= np.random.normal(size=(100,3),loc=0,scale=1.0)
tset, pval = ttest_1samp(data, 0.0) # Test if data has average = 0
print("p-values",pval)
reject = pval < 0.05
print("Reject : ", reject)
for idx, r in enumerate(reject):
    if r:  # alpha value is 0.05 or 5%
        print("[x] we reject the null hypothesis")
    else:
        print("[v] we fail to reject the null hypothesis")
def statSignificanceHunter(batches=10, samples=10):
    testLevels = [0.05, 0.01]
    data = np.random.normal(size=(samples, batches), loc=0, scale=1.0)
    tset, pval = ttest_1samp(data, 0.0)
    # counts: [not rejected, rejected at 0.05 but not at 0.01, rejected at 0.01]
    counts = [np.sum(testLevels[0] <= pval),
              np.sum(pval < testLevels[0]) - np.sum(pval < testLevels[1]),
              np.sum(pval < testLevels[1])]
    return pval, counts
pvals, counts = statSignificanceHunter(batches=100,samples=100)
plt.figure(figsize=(12,8))
plt.pie(counts,labels=['Accepted','0.05 Reject','0.01 Reject']); plt.legend();
A single run is not sufficient to draw a conclusion; we need to repeat the experiment. Write a program that tests the random data with different batch sizes.
# Your code here
Make the modifications needed.
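One possible sketch, under the assumption that "repeating" means estimating the false-positive rate for several batch sizes (the helper function below is not part of the original material and is re-implemented inline so the block runs on its own):

```python
import numpy as np
from scipy.stats import ttest_1samp

def falsePositiveRate(batches, samples, alpha=0.05, seed=0):
    # Pure-noise data: every rejection is, by construction, a false positive
    rng = np.random.RandomState(seed)
    data = rng.normal(size=(samples, batches), loc=0, scale=1.0)
    _, pval = ttest_1samp(data, 0.0)
    return np.mean(pval < alpha)

for batches in [10, 100, 1000]:
    rate = falsePositiveRate(batches, samples=100)
    print("batches={:5d}  fraction 'significant' = {:0.3f}".format(batches, rate))
# For pure noise, the fraction hovers around alpha (0.05) regardless of batch size
```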
This is a walk-through demonstration using ggplot (via the plotnine package) instead of matplotlib for plotting data.
In the grammar of graphics, making a plot is divided into separate, independent components:
ggplot(<input data frame>, aes(x=<name of x column>, y=<name of y column>)) \
    + geom_point()
from plotnine import *
from plotnine.data import *
pheno = pd.read_csv('phenoTable.csv')
pheno2 = pheno[['ID','BMD','MECHANICS_STIFFNESS','CORT_DTO__C_TH','CORT_DTO__C_TH_SD']]
pheno2.head()
Set up the input table as pheno and the mapping, with the x position mapped to BMD (Bone Mineral Density) and the y position to CT_TH_RAD (Cortical Bone Thickness).
Create the first simple plot by adding a point representation to the plot.
ggplot(pheno,aes(x="BMD",y="CT_TH_RAD")) \
+ geom_point()
ggplot(pheno,aes(x="BMD",y="CT_TH_RAD",color="FEMALE"))+geom_point()
m=pheno
m=m.rename(columns={"FEMALE": "GENDER"})
m['GENDER'] = m['GENDER'].apply(lambda x: 'F' if x else 'M')
ggplot(m,aes(x="BMD",y="CT_TH_RAD",color="GENDER"))+geom_point()
ggplot(m,aes(x="BMD",y="CT_TH_RAD",color="GENDER")) \
+ geom_point() \
+ facet_wrap("GENDER")
For more information and tutorials, see: http://ggplot2.org/
The previous plots are not publication-ready; explore the different plot-grammar components to add decorations such as titles, axis labels, and themes.