Notebook

In [1]:

import scipy
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import estimate_bandwidth
from sklearn.cluster import MeanShift, estimate_bandwidth

import pandas as pd

from scipy import stats
from scipy.stats import beta
from math import sin
from random import randint

import matplotlib.pyplot as plt
import itertools as it

import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
init_notebook_mode(connected=True)

import collections

def recursively_default_dict():
        return collections.defaultdict(recursively_default_dict)

from matplotlib.collections import BrokenBarHCollection
import re

from Lab_modules.Modules_tools import return_fsts

In the previous chapter we explored ways to visualize the overlap of distributions. These we simulated by generating clouds of points directly, with predetermined size and shape (covariance along 3 axes). We then varyied the distances that separated them. At each variation, we captured their pdfs and compared their values at a fixed number of coordinates (our observations).

This exercise, while fun, is only really relevant if your original data consists of three dimensional coordinates. We might be used to see structure in our data, it is another thing to show that it could acquire the shapes that would allow for the visualizations we developped.

In this post we shall create our clouds of points from frequency vectors, each encasing the probabilities of binary occurences. These are our real populations.

We will begin by generating these vectors in the same way we did in 1. Generating Haplotypes (link below), by varying the parameters a and b of the theta distribution and sampling from there.

1. Generate haplotypes

In [17]:

from Lab_modules.Generate_freq_vectors import generate_vectors_Beta

# We must first define the number of populations, the length of the haplotypes desired, and their respective population sizes
L= 100
n= 100
rangeA= [1,2.5]
rangeB = [.1,.6]
steps= 20
n_comp = 100
density= 50


vector_lib, background= generate_vectors_Beta(L,n,rangeA,rangeB,steps,n_comp)

We will now reduce the dimensions of this data. However, in this case we don't expect to find structure. We created our vectors randomly within our ranges. This feature space will be used solely to retrieve new vectors through inverse transformation following operations that are not limited to 3 dimensions. Thus, we will try to keep as much of the information as possible for that transformation.

In [18]:

pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized').fit(vector_lib)
features = pca.transform(vector_lib)

print(features.shape)
print(vector_lib.shape)

(2000, 100)
(2000, 100)

In [19]:

#### Let's plot the first 3 coordinates nonetheless.
####
fig_data= [go.Scatter(
        x = features[:,0],
        y = features[:,1],
        mode= "markers",
        text= ['a: {}; b: {}, L: {}; index = {}'.format(background[k,0],background[k,1],background[k,2], k) for k in range(background.shape[0])],
        marker= {
        'line': {'width': 0},
        'size': 4,
        'symbol': 'circle',
      "opacity": .8
      }
    )]


layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)

fig = go.Figure(data=fig_data, layout=layout)
iplot(fig)

By hovering above the points, we see that variation along the first axis is usually correlated with the a and b parameters of the beta distribution.

[Note] If you played around with these parameters you will have noticed the beta distribution becomes less skewed as they rise. Although this relation is not linear, it will generally translate into vectors with a larger number of higher frequencies.

Other considerations aside, we will chose a number of populations from among these - i have set this number to 3. To change it, add or remove from the list 'Pops' and modify the list 'Sizes' as well.

The next block of code will calculate global and pairwise Fsts among the populations chosen. Modify your choice until you are happy with these values, i don't really have a better way to inform this choice right now.

In [20]:

#### calculating Fsts between populations.
Pops= [32,180]
### return_fsts() in Modules_tools.
Pairwise, Total_fst= return_fsts(vector_lib,Pops)

print('global fst: {}'.format(round(Total_fst,4)))
Pairwise

global fst: 0.0813

Out[20]:

	pops	fst
0	(32, 180)	0.081327

Finally, for capturing regions of overlap between our populations.

We will pick two populations from among our choices and we will make their distance vary in the feature space of frequency vectors. We will do this by once again resorting to the sinusoid of their distance.

At each variation, we will calculate the pairwise fsts among all our populations. we will also store the distances separating our vectors in feature space. We will plot these values against each other afterwards.

We will first need to slightly modify our return_fsts function. Fsts of loci with 0 variation will be set to 0. This will still produce a runtime warning as it first attemps to divide by 0. Stay calm.

In [21]:

#### We'll have to first change our fst function.

def return_fsts2(freq_array):
    pops= range(freq_array.shape[0])
    H= {pop: [1-(freq_array[pop,x]**2 + (1 - freq_array[pop,x])**2) for x in range(freq_array.shape[1])] for pop in range(freq_array.shape[0])}
    Store= []

    for comb in it.combinations(H.keys(),2):
        P= [sum([freq_array[x,i] for x in comb]) / len(comb) for i in range(freq_array.shape[1])]
        HT= [2 * P[x] * (1 - P[x]) for x in range(len(P))]
        per_locus_fst= [[(HT[x] - np.mean([H[p][x] for p in comb])) / HT[x],0][int(HT[x] == 0)] for x in range(len(P))]
        per_locus_fst= np.nan_to_num(per_locus_fst)
        Fst= np.mean(per_locus_fst)

        Store.append([comb,Fst])
    
    
    ### total fst:
    P= [sum([freq_array[x,i] for x in pops]) / len(pops) for i in range(freq_array.shape[1])]
    HT= [2 * P[x] * (1 - P[x]) for x in range(len(P))]
    FST= np.mean([(HT[x] - np.mean([H[p][x] for p in pops])) / HT[x] for x in range(len(P))])
    
    return pd.DataFrame(Store,columns= ['pops','fst'])

In [ ]:

In [22]:

#### We're going to do something different now. We'll have two points get closer together in time.
#first chose two
target= [0,1]


labels= []
Fsts_crawl= []
angle_list= []
Distances_crawl= []

for angle in np.arange(0,10,.05):
    coords= features[Pops,:]
    vector2= coords[target[0]] - coords[target[1]]

    coords[target[0]] = coords[target[0]] + [sin(angle * 3) * x for x in vector2]
    #coords[target[0]] = coords[target[0]] + [angle / 10 * x for x in vector2]
    
    new_freqs= pca.inverse_transform(coords)
    
    #new_freqs[target[0]]= pca.inverse_transform(coords[target[0]])
    new_freqs[new_freqs > 1] = 1
    new_freqs[new_freqs < 0] = 0
    
    Pairwise= return_fsts2(new_freqs)
    
    Distances= []
    for train in it.combinations([x for x in range(new_freqs.shape[0])],2):
        Distances.append(np.sqrt((coords[train[0]][0] - coords[train[1]][0])**2 + (coords[train[0]][1] - coords[train[1]][1])**2) + (coords[train[0]][2] - coords[train[1]][2])**2)
    Distances_crawl.extend(Distances)
    
    labels.extend([tuple([x + 1 for x in y]) for y in Pairwise.pops])
    
    Fsts_crawl.extend(Pairwise.fst)
    angle_list.extend([angle] * Pairwise.shape[0])


Control= np.array([angle_list,Fsts_crawl]).T

In [23]:

fig_data= [go.Scatter(
x= Distances_crawl,
y= Fsts_crawl,
mode= 'markers',
name= 'fst vs distances'
)
]

layout = go.Layout(
    title= 'Fst vs. distance in vector feature space',
    yaxis=dict(
        title='fsts'),
    xaxis=dict(
        title='eucledian distance in feature space')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

We can see that distances in feature space aren't linearly correlated to among-population fsts. This is to be expected. However the relationship is globally positive, i'll take that as a win and procede.

Simulation of many data sets¶

We will now simulate many data sets of haploid samples. Correlation between simulated populations will vary across data sets. To allow for this algorithm to run indefinitely, i have set correlation to vary according to a sinusoid function.

We first look at the predicted pattern, according to the allele frequency vectors generated this way.

Then we will save the data generated in vcf format.

In [24]:

coords= {z:[x for x in range(len(labels)) if labels[x] == z] for z in list(set(labels))}


fig_data= [go.Scatter(
x= list(range(len(Control[coords[i],1]))),
y= Control[coords[i],1],
mode= 'markers',
name= 'fst pops {} / {}'.format(i[0],i[1])
) for i in coords.keys()
]


layout = go.Layout(
    title= 'pairwise Fsts as a function of variable X',
    yaxis=dict(
        title='between population Fst'),
    xaxis=dict(
        title='X')
)


fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

Globally, we learn that distances in feature space are indeed related to the amount of differentiation among our populations.

Now we will will perform the same exercice, but at each step we will draw haplotypes from our vectors. We will then reduce the dimensionality of our haplotype data set, and estimate the kernel density of each population in that space.

The values of each KDE at the coordinates of our drawn observations (haplotypes) will be extracted.

Generating samples.¶

In [25]:

from Lab_modules.StructE_tools import extract_profiles
from IPython.display import clear_output

labels= [0,1,2,3]

label_to_vector= {
    0:0,
    1:1,
    2:0,
    3:1
}

target= [0,1]
Sizes= [100,150,50,50]
window_length= 5000

label_vector= np.repeat(np.array([x for x in labels]),Sizes)
label_indicies= {x:[y for y in range(len(label_vector)) if label_vector[y] == x] for x in labels}

Windows= recursively_default_dict()
Haplotypes= recursively_default_dict()
Blocks_truth= recursively_default_dict()

target_indx= {z:[x for x in range(len(label_vector)) if label_vector[x] == z] for z in target}
threshold= .005
P= 30
n_comps= 3

Fst_labels= []

Fst_crawl= []

Fst_windows= []

var_comp_store= []

pc_density= []
pc_coords= []
d= 0

for angle in np.arange(0,10,.05):
    print(angle)
    
    coords= features[Pops,:]
    vector2= coords[target[0]] - coords[target[1]]

    coords[target[0]] = coords[target[0]] + [sin(angle * 3) * x for x in vector2]
    #coords[target[0]] = coords[target[0]] + [angle / 10 * x for x in vector2]
    
    #coords= np.vstack([coords,coords])
    
    new_freqs= pca.inverse_transform(coords)
    
    #new_freqs[target[0]]= pca.inverse_transform(coords[target[0]])
    
    N_pops= len(Sizes)
    
    labels= [label_to_vector[x] for x in label_vector]
    
    data= []

    for k in range(len(Sizes)):

        probs= new_freqs[label_to_vector[k],:]
        
        probs[(probs > 1)]= 1
        probs[(probs < 0)]= 0
        m= Sizes[k]
        Haps= [[np.random.choice([0,1],p= [1-probs[x],probs[x]]) for x in range(L)] for acc in range(m)]

        data.extend(Haps)
    
    data= np.array(data)
    
    Haplotypes[1][int(d*window_length)]= data
    
    pca2 = PCA(n_components=n_comps, whiten=False,svd_solver='randomized').fit(data)

    data= pca2.transform(data)
    
    local_pcvar= list(pca2.explained_variance_ratio_)

    local_pcvar= [angle,*local_pcvar]

    var_comp_store.append(local_pcvar)
    
    profiles= extract_profiles(data,target_indx)
    
    ##### PC density
    PC= 0
    
    pc_places= data[:,PC]
    
    X_plot = np.linspace(-4, 4, 100)

    kde = KernelDensity(kernel='gaussian', bandwidth=0.01).fit(np.array(pc_places).reshape(-1,1))

    log_dens= kde.score_samples(X_plot.reshape(-1,1))

    pc_density.append(np.exp(log_dens))
    pc_coords.append(pc_places)
    
    ##### FSTs
    Pairwise= return_fsts2(new_freqs)
    Fst_labels.extend(Pairwise.pops)
    
    Fst_crawl.extend(Pairwise.fst)
    
    Fst_windows.extend([angle] * Pairwise.shape[0])

    ### store stuff.
    
    Blocks_truth[int(d*window_length)]= labels
    Windows[int(d*window_length)]= profiles
    d += 1
    
    clear_output()


var_comp_store= np.array(var_comp_store)
var_comp_store= pd.DataFrame(var_comp_store,columns=['set',*['PC' + str(x + 1) for x in range(n_comps)]])

Blocks_truth= {1:Blocks_truth}

Windows= {1:Windows}

Out= {
    1: {bl: bl+window_length-1 for bl in Windows[1].keys()}
     }

In [26]:

Out[26]:

(350, 20000)

Write vcf file.¶

In [ ]:

def write_vcf(Haplotypes,Out,L,Chr,filename= 'data.vcf', file_info= {},header_info= []):
    import time

    Snps= [sorted(np.random.choice(list(range(x,Out[1][x])),L,replace= False)) for x in Out[1].keys()]
    Snps= [int(x) for x in it.chain(*Snps)]

    Gen_data= Haplotypes.T

    Output= open(filename,'w')

    for intel in file_info.keys():
        Output.write('='.join([intel,file_info[intel]]) + '\n')

    Output.write('\t'.join(header_info) + '\t')
    Output.write('\t'.join(['sample' + str(x+1) for x in range(Gen_data.shape[1])]))
    Output.write('\n')

    for marker in range(Gen_data.shape[0]):
        info_mark= ['chr' + str(Chr), str(Snps[marker]), str(marker + 1), 'A','T','.','PASS','.','GT:AD:DP']
        Output.write('\t'.join(info_mark) + '\t')

        marker_line= ['{}|{}'.format(x,x) + ':20,0:20' for x in Gen_data[marker,:]]
        Output.write('\t'.join(marker_line) + '\n')


    Output.close()

In [34]:

## form array of haplotype data
Gen_data= np.concatenate(tuple([Haplotypes[1][x] for x in Haplotypes[1].keys()]), axis=1)
Gen_data.shape

## vcf specifications.
filename= 'data_beta.vcf'

file_info= {
    '##fileformat': 'Struct_manip',
    '##fileDate': time.strftime("%x"),
    '##source': 'https://github.com/SantosJGND/Tools_and_toys',
    '##phasing': 'yes',
    '##contig': '<ID={}, length= {}>'.format(Chr,Haplotypes.shape[1])
}

header_info= ['#CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO','FORMAT']

### deploy
write_vcf(Gen_data,Out,L,Chr= 1,filename= filename)

Proceed from data written in vcf¶

In [4]:

from Lab_modules.vcf_geno_tools import simple_read_vcf

vcf_file= 'Complementary_data/data_beta.vcf'

genotype, summary, info_save= simple_read_vcf(vcf_file,row_info= 5,header_info= 9,phased= False)

print('Number of markers: {}'.format(genotype.shape[1]))
print('Number of individuals: {}'.format(genotype.shape[0]))

Number of markers: 20000
Number of individuals: 350

Split into new windows¶

Set window size and number of overlapping SNPs between windows (window_size - steps).

In [6]:

from Lab_modules.vcf_geno_tools import geno_window_split
##### 
window_size= 130
Steps= 25

Windows, Out= geno_window_split(genotype,
                                summary,
                                Steps= Steps,
                                window_size=window_size)

print('number of chromosomes: {}'.format(len(Windows)))
print('number of windows: {}'.format(sum([len(Windows[x].keys()) for x in Windows.keys()])))

number of chromosomes: 1
number of windows: 796

Diversity analysis across local genomic windows.¶

We will now proceed to an analysis of the information produced in the first section.

We first calculate the correlation index Fst between the two reference populations that we defined globally, across every data set.

In [10]:

from Lab_modules.Modules_tools import return_fsts2
from Lab_modules.vcf_geno_tools import window_analysis, window_fst_sup

ref_labels= [0,1]

Windows_range= [0,len(Windows[1].keys())]

Fst_window= window_fst_sup(Windows,
                         ref_labels,
                         label_vector,
                         Chr= 1,
                         ncomp= 4,
                        range_sample= Windows_range)

D:\GitHub\Genetic-data-analysis\Notebooks\Lab_modules\Modules_tools.py:136: RuntimeWarning:

invalid value encountered in double_scalars

In [12]:

from Lab_modules.Tutorial_subplots import fst_window_plot

fst_window_plot(Fst_window,ref_labels,sort= False,window_range= Windows_range)

Source classification.¶

We will now perform classification of samples at individual data sets, using at each time the distribution of samples from known, globally defined groups at each data set.

We will use Kernel Density Estimation in PCA feature space. We will then estimate the accuracy of this method of classification across data sets, and relate it to the estimated correlation between samples.

Our 'truth' vectors are a bit cheated. We will take advantage of the fact that reference samples are pure to simply label individual samples according to the frequency vector they were sampled from.

In [18]:

n_comps= 4

In [15]:

Blocks_truth= {
    1: {bl: [label_to_vector[x] for x in label_vector] for bl in Windows[1].keys()}
}

In [23]:

target_indx= {z:[x for x in range(len(label_vector)) if label_vector[x] == z] for z in ref_labels}

Windows_profiles= recursively_default_dict()

var_comp_store= []

pca2 = PCA(n_components=n_comps, whiten=False,svd_solver='randomized')

for bl in Windows[1].keys():

    data= Windows[1][bl]
    data= pca2.fit_transform(data)

    local_pcvar= list(pca2.explained_variance_ratio_)

    local_pcvar= [bl,*local_pcvar]

    var_comp_store.append(local_pcvar)

    profiles= extract_profiles(data,target_indx)

    ### store stuff.
    Windows_profiles[1][bl]= profiles
    clear_output()

To wrap up we will classify each observation according to the KDE values it received at every step.

We will allow for pure, intermediate and outlier classes.

We will plot these classifications in the form of an ideogram and resort to the same color sceme as in the previous post.

pure classes:

pop1: red
pop2: yellow
pop3: blue

Intermediate classes

pop1-pop2: orange
pop2-pop3: green
pop1-pop3: purple
po1-pop2-pop3: silver

Assignments at varying steps of X will be plotted as a global ideogram.

Comparison_threshold= 5
Outlier_threshold= 1e-4

Functions Merge_class() (classification of collection of analyses), compress_Ideo() (reduce classification data set) and return_Ideogram() (plot classification across collection, uses matplotlib) can be found in Ideogram_tools.py

In [62]:

## set classification parameters, colors and organise input.
from Lab_modules.Lab_ideogram_tools import Merge_class

focus_indexes= [x for x in range(len(label_vector))]

Comparison_threshold= 1
Outlier_threshold= 1e-5
color_ref= ['red','yellow','black','orange','orange','purple','green','silver','red3','deepskyeblue','navy','chartreuse','darkorchid3','goldenrod2']

Blocks = Merge_class(Windows_profiles,focus_indexes,Out,Comparison_threshold,Outlier_threshold)

In [63]:

# prepare final classification data set

chromosome_list = []

Ideo = []

chromosomes= Blocks.keys()

for here in range(len(label_vector)):
    Subject = 'sample' + str(here)
    
    chromosome_list.extend(['Region_chr'+str(Chr)+ '_' + Subject for Chr in chromosomes])
    
    Stock = [[['Region_chr'+str(Chr)+ '_' + Subject,bl,Out[Chr][bl],color_ref[Blocks[Chr][bl][here] - 1]] for bl in sorted(Blocks[Chr].keys())] for Chr in chromosomes]
    Stock = [y for y in it.chain(*[z for z in it.chain(*[Stock])])]
    
    Ideo.extend(Stock)

In [64]:

#### begin by compressing assignments by individuals. Lightens the load of the following plot.
from Lab_modules.Lab_ideogram_tools_II import compress_ideo

import re
ideo = pd.DataFrame(Ideo,columns = ['chrom', 'start', 'end', 'gieStain'])

# Filter out chromosomes not in our list
ideo = ideo[ideo.chrom.apply(lambda x: x in chromosome_list)]

ideo = compress_ideo(ideo,chromosome_list,Out)

In [65]:

from Lab_modules.Lab_ideogram_tools_II import return_ideogram

ID= 'truth'

fig= return_ideogram(ideo, chromosome_list,ID,height= 10,width= 15)

adding ideograms...

Max. likelihood classification - accuracy across data sets.¶

In [67]:

wind= []
Proportions= []
labs= []
for w in Windows_profiles[1].keys():
    for combo in it.combinations(Windows_profiles[1][w].keys(),2):
        diffs= [abs(Windows_profiles[1][w][combo[0]][x] - Windows_profiles[1][w][combo[1]][x]) for x in range(len(Windows_profiles[1][w][combo[1]]))]
        
        Proportions.append(np.mean(diffs))
        labs.append(tuple([x+1 for x in combo]))
        wind.append(w)

In [78]:

coords= {z:[x for x in range(len(labs)) if labs[x] == z] for z in list(set(labs))}
Accuracy= []
where= []
Outlier_rate= []

Like_ratio= []
Misshaps= []

Fst_crawl= [x[0] for x in Fst_window]

targets= [1,2]
labels= [0,1,2,3]

for bl in Blocks[1].keys():
    outliers= [x for x in range(len(Blocks[1][bl])) if Blocks[1][bl][x] == len(targets) + 1]
    non_outlier= [x for x in range(len(Blocks[1][bl])) if Blocks[1][bl][x] != len(targets) + 1]
    misshaps= [x for x in range(len(Blocks[1][bl])) if Blocks[1][bl][x] != (Blocks_truth[1][bl][x]+1) and Blocks[1][bl][x] in targets]
    
    
    if len(labels) == 2:
        light_ration= {
            x: [Windows_profiles[1][bl][z-1][0][x] for z in targets] for x in non_outlier
        }
        
        light_ration= {
            x:(max(light_ration[x]) - min(light_ration[x])) / max(light_ration[x]) for x in light_ration.keys()
        }
        
        Like_ratio.extend([light_ration[x] for x in sorted(light_ration.keys())])
        Misshaps.extend([int(x in misshaps) for x in sorted(light_ration.keys())])
    
    
    gp_non_out= {
        z: [x for x in non_outlier if label_vector[x] == z] for z in labels
    }
    
    gp_out= {
        z: [x for x in outliers if label_vector[x] == z] for z in labels
    }
    
    gp_non_out= {
        z: gp_non_out[z] for z in gp_non_out.keys() if gp_non_out[z]
    }
    
    gp_out= {
        z: gp_out[z] for z in gp_out.keys() if gp_out[z]
    }
    
    if not gp_out:
        out_rate= 0
    else:
        out_rate= np.mean([len(gp_out[z]) / len(label_indicies[z]) for z in gp_out.keys()])
    
    rate= np.mean([1 - len([x for x in gp_non_out[z] if x in misshaps]) / len(gp_non_out[z]) for z in gp_non_out.keys()])
    #rate= 1- len(misshaps) / float(len(non_outlier))
    
    Outlier_rate.append(out_rate)
    Accuracy.append(rate)
    where.append(bl)


fig_data= [go.Scatter(
y= np.array(Accuracy),
x= list(range(len(Fst_crawl))),
mode= 'markers'
)
]

layout = go.Layout(
    title= 'Accuracy by Fst between references',
    yaxis=dict(
        title='Accuracy',
        range= [0,1.1]),
    xaxis=dict(
        title='Fst',
        range= [0,len(Fst_crawl)])
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

PCA and KDE - classifying structure.¶

Since we are using normalized density estimates to classify individual haplotypes, it occured to me that it would be interesting to look at the behaviour of the proportion of such estimates for correctly and incorrectly classified samples.

In [22]:

from sklearn.neighbors import KernelDensity

accurate= [Like_ratio[x] for x in range(len(Like_ratio)) if Misshaps[x] == 0]
errors= [Like_ratio[x] for x in range(len(Like_ratio)) if Misshaps[x] == 1]

X_plot = np.linspace(-.1, 1.1, 100)

kde = KernelDensity(kernel='gaussian', bandwidth=0.02).fit(np.array(accurate).reshape(-1,1))

log_dens = kde.score_samples(X_plot.reshape(-1,1))

fig_roost_dens= [go.Scatter(x=X_plot, y=np.exp(log_dens), 
                            mode='lines', fill='tozeroy', name= 'Prop_correct',
                            line=dict(color='blue', width=2))]
##


X_plot = np.linspace(-.1, 1.1, 100)

kde = KernelDensity(kernel='gaussian', bandwidth=0.02).fit(np.array(errors).reshape(-1,1))

log_dens = kde.score_samples(X_plot.reshape(-1,1))

fig_roost_dens.append(go.Scatter(x=X_plot, y=np.exp(log_dens), 
                            mode='lines', fill='tozeroy', name= 'Prop_miss',
                            line=dict(color='red', width=2)))

##

layout= go.Layout(
    title= 'Ref likelihood proportion by classification accuracy',
    xaxis= dict(
        title= 'proporiton',
        range= [-.1,1.1]
    ),
    yaxis= dict(
        title= 'density'
    )
)

fig = go.Figure(data=fig_roost_dens, layout= layout)
iplot(fig)

Variance captured¶

In [23]:

var_comp_store.head()

Out[23]:

	set	PC1	PC2	PC3
0	0.00	0.145041	0.023869	0.023220
1	0.05	0.163878	0.023935	0.022600
2	0.10	0.182116	0.020911	0.020442
3	0.15	0.205751	0.021363	0.019538
4	0.20	0.220147	0.022069	0.018647

In [24]:

n_comp= var_comp_store.shape[1] - 1
PC_list= ['PC' + str(x + 1) for x in range(n_comp)]
total_var= var_comp_store[PC_list].sum(axis= 1)


X_plot = np.linspace(-0.1, 1, 100)

kde = KernelDensity(kernel='gaussian', bandwidth=0.05).fit(np.array(total_var).reshape(-1,1))

log_dens = kde.score_samples(X_plot.reshape(-1,1))

fig_roost_dens= [go.Scatter(x=X_plot, y=np.exp(log_dens), 
                            mode='lines', fill='tozeroy', name= 'variance captured',
                            line=dict(color='blue', width=2))]
##


layout= go.Layout(
    title= 'ncomp: {}'.format(n_comp),
    yaxis= dict(
        title= 'density'
    ),
    xaxis= dict(
        title= 'variance explained'
    )
)

fig = go.Figure(data=fig_roost_dens, layout= layout)
iplot(fig)

In [25]:

how_much= .8

how_many= sum(total_var >= how_much) / float(len(total_var))
print('% data sets with above {} var explained: {}'.format(how_much*100,how_many))
print('mean var: {}'.format(np.round(np.mean(total_var)*100,2)))
print('sd var: {}'.format(np.round(np.std(total_var)*100,2)))

% data sets with above 80.0 var explained: 0.0
mean var: 19.36
sd var: 8.64

In [26]:

trace1 = [go.Box(
    y=total_var
)]

layout= go.Layout(
    title= 'ncomp: {}'.format(n_comp),
    yaxis= dict(
        title= 'density'
    ),
    xaxis= dict(
        title= 'variance explained'
    )
)

fig= go.Figure(data= trace1,layout= layout)
iplot(fig)

In [27]:

fig_data= [go.Scatter(
y= [Accuracy[x] for x in coords[i]],
x= [total_var[x] for x in coords[i]],
mode= 'markers',
name= ' pops {} / {}'.format(i[0],i[1])
) for i in coords.keys()
]

layout = go.Layout(
    title= 'average KDE score differences',
    yaxis=dict(
        title='Accuracy'),
    xaxis=dict(
        title='total var')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

Classifying Structure.¶

We will make use of a threshold on the proportion of density estimates derived for two reference populations to classify samples into an additional class when this ratio is high. This should effectively prevent miss-classification while at the same time producing an indicator of relative data structure.

In [74]:

Comparison_threshold= 4
Outlier_threshold= 1e-4
color_ref= ['red','yellow','black','orange','orange','purple','green','silver','red3','deepskyeblue','navy','chartreuse','darkorchid3','goldenrod2']

Blocks2 = Merge_class(Windows_profiles,focus_indexes,Out,Comparison_threshold,Outlier_threshold)

chromosome_list = []

Ideo = []

chromosomes= Blocks2.keys()

for here in range(len(label_vector)):
    Subject = 'sample' + str(here)
    
    chromosome_list.extend(['Region_chr'+str(Chr)+ '_' + Subject for Chr in chromosomes])
    
    Stock = [[['Region_chr'+str(Chr)+ '_' + Subject,bl,Out[Chr][bl],color_ref[Blocks2[Chr][bl][here] - 1]] for bl in sorted(Blocks2[Chr].keys())] for Chr in chromosomes]
    Stock = [y for y in it.chain(*[z for z in it.chain(*[Stock])])]
    
    Ideo.extend(Stock)

#### begin by compressing assignments by individuals. Lightens the load of the following plot.
import re
ideo = pd.DataFrame(Ideo,columns = ['chrom', 'start', 'end', 'gieStain'])

# Filter out chromosomes not in our list
ideo = ideo[ideo.chrom.apply(lambda x: x in chromosome_list)]

ideo = compress_ideo(ideo,chromosome_list,Out)
ID= 'th' + str(Comparison_threshold)

fig= return_ideogram(ideo, chromosome_list,ID,height= 10,width= 15)

adding ideograms...

In [76]:

coords= {z:[x for x in range(len(labs)) if labs[x] == z] for z in list(set(labs))}
Buffer= []


targets= [1,2]
labels= [2,3]

for bl in Blocks2[1].keys():
    outliers= [x for x in range(len(Blocks[1][bl])) if Blocks[1][bl][x] == len(targets) + 1]
    non_outlier= [x for x in range(len(Blocks[1][bl])) if Blocks[1][bl][x] != len(targets) + 1]
    inter= [x for x in range(len(Blocks2[1][bl])) if Blocks2[1][bl][x] > (len(targets) + 1)]
    
    gp_non_out= {
        z: [x for x in non_outlier if label_vector[x] == z] for z in labels
    }
    
    gp_non_out= {
        z: gp_non_out[z] for z in gp_non_out.keys() if gp_non_out[z]
    }
    
    
    rate= np.mean([len([x for x in gp_non_out[z] if x in inter]) / len(gp_non_out[z]) for z in gp_non_out.keys()])
    
    Buffer.append(rate)
    where.append(bl)


fig_data= [go.Scatter(
y= Buffer,
x= list(range(len(Fst_crawl))),
mode= 'markers'
)
]

layout = go.Layout(
    title= 'Intermediate classes across data sets',
    yaxis=dict(
        title='Intermediate proportion'),
    xaxis=dict(
        title='data sets')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

Data Visualisation¶

Other ways of looking at the classificaiton output and the genetic structure that underlies it.

In [163]:

# pc_density
# pc_coords
Chr= 1

x_coords= [z for z in it.chain(*[[x] * np.sum(Sizes) for x in range(len(pc_coords))])]
y_coords= [z for z in it.chain(*[list(np.linspace(0,100,np.sum(Sizes))) for x in range(len(pc_coords))])]
z_coords= [z for z in it.chain(*pc_coords)]

class_colors= [z for z in it.chain(*[Blocks2[Chr][bl] for bl in Blocks[Chr].keys()])]
class_colors= [color_ref[x-1] for x in class_colors]

fig_data= [go.Scatter(
x= x_coords,
y= z_coords,
mode= 'markers',
marker= dict(
    color= class_colors,
    size= 4,
    opacity= .8
)
)
]

layout = go.Layout(
    title= 'PC1 coordinates',
    yaxis=dict(
        title='Individual positions along PC1 across data sets'),
    xaxis=dict(
        title='Ideogram')
)

fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

In [ ]:

In [155]:

x_coords= [z for z in it.chain(*[[x] * 100 for x in range(200)])]
y_coords= [z for z in it.chain(*[list(np.linspace(-4,4,100)) for x in range(len(pc_density))])]
z_coords= [z for z in it.chain(*pc_density)]


fig_data= [go.Scatter(
x= x_coords,
y= y_coords,
mode= 'markers',
marker= {
    'color': z_coords,
    'colorscale':'Viridis',
    'line': {'width': 0},
    'size': 7,
    'symbol': 'circle',
    "opacity": .6
})
]

layout = go.Layout(
    title= 'PC1 density',
    yaxis=dict(
        title='PC1 density of projections across data sets'),
    xaxis=dict(
        title='Ideogram')
)


fig= go.Figure(data=fig_data, layout=layout)
iplot(fig)

In [ ]: