Notebook

Generating haplotypes for simulated populations.¶

In [1]:

import scipy
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import invgamma
from scipy.stats import beta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *

import Lab_modules.Modules_tools

init_notebook_mode(connected=True)

This is a very rough approach at generating haplotypes for two or more different populations.

For each population, alternative allele frequencies will be drawn randomly from the chosen distribution.

Each individual will then have his haplotype drawn using the allele frequency vector pertaining to his population.

In [2]:

## the distribution i chose here was the beta distribution.
# doc: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html#scipy.stats.beta

# This is because with the right parameters a and b, the probability density produced 
# resembles that often seen in allele frequency densities.
# let's see what that looks like:

a, b = 1.8, .2
r= beta.rvs(a, b, size=1000)
r= [1-x for x in r] 

fig= go.Histogram(
    x=r,
    histnorm='',
    name='control',
    opacity=0.75
)

layout = go.Layout(
    title='frequency distribution. a= {}; b= {}'.format(a,b),
    xaxis=dict(
        title='Value'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.2,
    bargroupgap=0.1
)
fig= [fig]

fig = go.Figure(data=fig, layout=layout)
iplot(fig, filename='styled histogram')

In [3]:

# a= 1.8 and b= .1 seem appropriate, lets proceed to create our populations

# We must first define the number of populations, the length of the haplotypes desired, and their respective population sizes:
N_pops= 3
L= 200
Sizes= [250,100,300]
labels= np.repeat(np.array([x for x in range(N_pops)]),Sizes)

data= []

for k in range(N_pops):
    
    probs= beta.rvs(a, b, size=L)
    probs[(probs > 1)]= 1
    
    m= Sizes[k]
    Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(L)] for acc in range(m)]
    
    data.extend(Haps)

data= np.array(data)
print(data.shape)

(650, 200)

In [ ]:

In [4]:

n_comp = 100

pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized')
features = pca.fit_transform(data)

var_comps= pca.explained_variance_ratio_
print("; ".join(['PC{0}: {1}'.format(x+1,round(var_comps[x],3)) for x in range(n_comp)]))
print(features.shape)

PC1: 0.191; PC2: 0.068; PC3: 0.016; PC4: 0.015; PC5: 0.014; PC6: 0.014; PC7: 0.013; PC8: 0.013; PC9: 0.012; PC10: 0.012; PC11: 0.012; PC12: 0.011; PC13: 0.011; PC14: 0.011; PC15: 0.011; PC16: 0.01; PC17: 0.01; PC18: 0.01; PC19: 0.01; PC20: 0.01; PC21: 0.01; PC22: 0.009; PC23: 0.009; PC24: 0.009; PC25: 0.009; PC26: 0.009; PC27: 0.009; PC28: 0.008; PC29: 0.008; PC30: 0.008; PC31: 0.008; PC32: 0.008; PC33: 0.008; PC34: 0.008; PC35: 0.007; PC36: 0.007; PC37: 0.007; PC38: 0.007; PC39: 0.007; PC40: 0.007; PC41: 0.007; PC42: 0.007; PC43: 0.007; PC44: 0.007; PC45: 0.006; PC46: 0.006; PC47: 0.006; PC48: 0.006; PC49: 0.006; PC50: 0.006; PC51: 0.006; PC52: 0.006; PC53: 0.006; PC54: 0.006; PC55: 0.005; PC56: 0.005; PC57: 0.005; PC58: 0.005; PC59: 0.005; PC60: 0.005; PC61: 0.005; PC62: 0.005; PC63: 0.005; PC64: 0.005; PC65: 0.005; PC66: 0.005; PC67: 0.005; PC68: 0.004; PC69: 0.004; PC70: 0.004; PC71: 0.004; PC72: 0.004; PC73: 0.004; PC74: 0.004; PC75: 0.004; PC76: 0.004; PC77: 0.004; PC78: 0.004; PC79: 0.004; PC80: 0.004; PC81: 0.004; PC82: 0.004; PC83: 0.004; PC84: 0.004; PC85: 0.003; PC86: 0.003; PC87: 0.003; PC88: 0.003; PC89: 0.003; PC90: 0.003; PC91: 0.003; PC92: 0.003; PC93: 0.003; PC94: 0.003; PC95: 0.003; PC96: 0.003; PC97: 0.003; PC98: 0.003; PC99: 0.003; PC100: 0.003
(650, 100)

In [5]:

## lets visualize the result now:
colors_pres= ['red','black','yellow']

fig_data= [go.Scatter(
        x= features[[x for x in range(sum(Sizes)) if labels[x] == i],0],
        y= features[[x for x in range(sum(Sizes)) if labels[x] == i],1],
        mode= "markers",
        marker= {
        'color': colors_pres[i],
        'line': {'width': 0},
        'size': 8,
        'symbol': 'circle',
      "opacity": .8
      },
      name= str(i)
    ) for i in range(N_pops)]


layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    ),
    yaxis=dict(
        title='PC2: {}'.format(round(var_comps[1],3))),
    xaxis=dict(
    title= 'PC1: {}'.format(round(var_comps[0],3)))
)

fig = go.Figure(data=fig_data, layout=layout)
iplot(fig)

This looks very nice. Frequencies drawn at random from a predetermined beta distribution produce very different vectors, which translate into neatly separated populations.

Here, the degree of differentiation between our populations will be proportional to the correlation between the vectors we created them with. However, by drawing at random from a single beta distribution, we really have no hand over the relative distances between these vectors.

How can we simulate populations, or vectors, in a way that allows us to have at least some control over their relative differences?

Population genetics models could provide us with an answer, by assuming mutation rates, growth rates, recombination and standard age gaps to coalesce our four populations and extract ancestral vectors and sizes at nodes of our choosing.

If we don't need the information generated in that process however, a more pragmatic approach is possible.

Since what we are looking to do is identify a degree of correlation between these vectors before we generate their associated haplotypes, it stands that a PCA should do the job.

In the first part, we drew four vectors only for stable values of a and b, the beta parameters. Let's make those parameters vary, and draw 10 vectors at each change.

In [7]:

# Simulate frequency vectors. 
# We must first define the number of populations, the length of the haplotypes desired, and their respective population sizes
L= 300

import itertools as it
n= 200

# Vary a (beta distribution parameter).
a_range= np.linspace(1,2,20)
a_set= [i for i in a_range for _ in range(n)]

# vary b.
b_range= np.linspace(0.1,.4,20)
b_set= [i for i in b_range for _ in range(n)]

## length of haplotypes to extract.
L_set= [L] * n * 20


background_1= np.array([a_set,b_set,L_set]).T

vector_lib= []
for k in range(background_1.shape[0]):
    
    probs= beta.rvs(background_1[k,0], background_1[k,1], size=int(background_1[k,2]))
    probs= 1-probs
    probs[(probs > 1)]= 1
    
    
    vector_lib.append(probs)

vector_lib= np.array(vector_lib)

And reduce their dimension through PCA.

In [9]:

n_comp = 100

pca_vectors = PCA(n_components=n_comp, whiten=False,svd_solver='randomized')
features_vectors = pca_vectors.fit_transform(vector_lib)
var_comps= pca_vectors.explained_variance_ratio_

print("; ".join(['PC{0}: {1}'.format(x+1,round(pca_vectors.explained_variance_ratio_[x],3)) for x in range(n_comp)]))
print('features shape: {}'.format(features.shape))

PC1: 0.015; PC2: 0.005; PC3: 0.005; PC4: 0.005; PC5: 0.005; PC6: 0.005; PC7: 0.005; PC8: 0.005; PC9: 0.005; PC10: 0.005; PC11: 0.005; PC12: 0.005; PC13: 0.005; PC14: 0.005; PC15: 0.005; PC16: 0.005; PC17: 0.005; PC18: 0.005; PC19: 0.005; PC20: 0.005; PC21: 0.005; PC22: 0.005; PC23: 0.005; PC24: 0.005; PC25: 0.005; PC26: 0.005; PC27: 0.005; PC28: 0.005; PC29: 0.005; PC30: 0.005; PC31: 0.005; PC32: 0.005; PC33: 0.005; PC34: 0.004; PC35: 0.004; PC36: 0.004; PC37: 0.004; PC38: 0.004; PC39: 0.004; PC40: 0.004; PC41: 0.004; PC42: 0.004; PC43: 0.004; PC44: 0.004; PC45: 0.004; PC46: 0.004; PC47: 0.004; PC48: 0.004; PC49: 0.004; PC50: 0.004; PC51: 0.004; PC52: 0.004; PC53: 0.004; PC54: 0.004; PC55: 0.004; PC56: 0.004; PC57: 0.004; PC58: 0.004; PC59: 0.004; PC60: 0.004; PC61: 0.004; PC62: 0.004; PC63: 0.004; PC64: 0.004; PC65: 0.004; PC66: 0.004; PC67: 0.004; PC68: 0.004; PC69: 0.004; PC70: 0.004; PC71: 0.004; PC72: 0.004; PC73: 0.004; PC74: 0.004; PC75: 0.004; PC76: 0.004; PC77: 0.004; PC78: 0.004; PC79: 0.004; PC80: 0.004; PC81: 0.004; PC82: 0.004; PC83: 0.004; PC84: 0.004; PC85: 0.004; PC86: 0.004; PC87: 0.004; PC88: 0.004; PC89: 0.004; PC90: 0.004; PC91: 0.004; PC92: 0.004; PC93: 0.004; PC94: 0.003; PC95: 0.003; PC96: 0.003; PC97: 0.003; PC98: 0.003; PC99: 0.003; PC100: 0.003
features shape: (650, 100)

In [10]:

fig_data= [go.Scatter(
        x = features_vectors[:,0],
        y = features_vectors[:,1],
        mode= "markers",
        text= ['a: {}; b: {}, L: {}; index = {}'.format(background_1[k,0],background_1[k,1],background_1[k,2], k) for k in range(background_1.shape[0])],
        marker= {
        'line': {'width': 0},
        'size': 4,
        'symbol': 'circle',
      "opacity": .8
      }
    )]


layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    ),
    yaxis=dict(
        title='PC2: {}'.format(round(var_comps[1],3))),
    xaxis=dict(
    title= 'PC1: {}'.format(round(var_comps[0],3)))
)



fig = go.Figure(data=fig_data, layout=layout)
iplot(fig)

Principal component analysis of the vectors produced. Their index along the vector_lib array can be examined by hovering above the points,

Their distances in this space are our indicators of correlation among these vectors. Chose them according to the scenarios intended and generate the haplotypes. Below i chose four, and plotted the result of a PCA on the resulting data set. I won't comment on them, the process is random, and so will be the relationships betweem the populations.

We'll also calculate the pairwise Fsts between these populations based on their frequency vectors. These can then be compared to their distances in feature space.

In [11]:

Pops= [3,15,1,30,75]
N_pops= len(Pops)
L= 200
Sizes= [50,50,50,50,50]
labels= np.repeat(np.array([x for x in range(N_pops)]),Sizes)

data= []

for k in range(N_pops):
    
    probs= vector_lib[Pops[k],:]
    
    m= Sizes[k]
    Haps= [[np.random.choice([1,0],p= [1-probs[x],probs[x]]) for x in range(L)] for acc in range(m)]
    
    data.extend(Haps)

data= np.array(data)

### Calculate pairwise Fst based on frequency vectors selected.
from Lab_modules.StructE_tools import return_fsts, return_fsts2
freqs_selected= vector_lib[Pops,:]
Pairwise= return_fsts2(freqs_selected)

print(Pairwise)

     pops       fst
0  (0, 1)  0.093189
1  (0, 2)  0.108052
2  (0, 3)  0.101388
3  (0, 4)  0.109989
4  (1, 2)  0.096115
5  (1, 3)  0.090412
6  (1, 4)  0.089475
7  (2, 3)  0.109192
8  (2, 4)  0.093444
9  (3, 4)  0.105544

D:\GitHub\Genetic-data-analysis\Notebooks\Lab_modules\StructE_tools.py:419: RuntimeWarning:

invalid value encountered in double_scalars

In [12]:

n_comp = 4

pca = PCA(n_components=n_comp, whiten=False,svd_solver='randomized')

features= pca.fit_transform(data)

print("; ".join(['PC{0}: {1}'.format(x+1,round(pca.explained_variance_ratio_[x],3)) for x in range(n_comp)]))
print(features.shape)

PC1: 0.159; PC2: 0.122; PC3: 0.101; PC4: 0.089
(250, 4)

In [14]:

## lets visualize the result now:

fig_data= [go.Scatter3d(
        x = features[[x for x in range(sum(Sizes)) if labels[x] == i],0],
        y = features[[x for x in range(sum(Sizes)) if labels[x] == i],1],
        z = features[[x for x in range(sum(Sizes)) if labels[x] == i],2],
        mode= "markers",
        marker= {
        'line': {'width': 1},
        'size': 5,
        'symbol': 'circle',
      "opacity": 1
      },
      name= 'pop: {}'.format(Pops[i])
    ) for i in range(N_pops)]


layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)

fig = go.Figure(data=fig_data, layout=layout)
iplot(fig)

Expanding range of genetic distances.¶

So far we saw how to generate proxy populations along and explored a manner to select them based on correlation.

One questioon remains: how differentiated are the pops we generated? If we select randomly from the vector_lib data set, what differentiation can we expect between the selected populations?

To answer this question, we will sample randomly from the vector_lib data set, calculate pairwise Fsts, and look at their distribution.

In [ ]:

### distribution of genetic distances generated on first layer:

N_sample= 500

Pops_test= np.random.choice(vector_lib.shape[0],N_sample)

Fsts_test, Total_fst= return_fsts(vector_lib,Pops_test)

In [13]:

### Distribution of feature space distances between control populations for even and biased scenarios
from sklearn.neighbors import KernelDensity
Fsts_den= Fsts_test.fst

Fsts_den= np.nan_to_num(Fsts_den)

X_plot = np.linspace(0, .5, 1000)

kde = KernelDensity(kernel='gaussian', bandwidth=0.005).fit(np.array(Fsts_den).reshape(-1,1))

log_dens = kde.score_samples(X_plot.reshape(-1,1))

fig_roost_dens= [go.Scatter(x=X_plot, y=np.exp(log_dens), 
                            mode='lines', fill='tozeroy', name= 'Biased senarios',
                            line=dict(color='blue', width=2))]

##

layout= go.Layout(
    title= 'Distribution of genetic distances generated in layer I.'
)

fig = go.Figure(data=fig_roost_dens, layout= layout)
iplot(fig)

Distribution of pairwise Fsts between frequency vectors randomly selected from the vector_lib data set.

The result is not optimal. We learn that our approach generates populations along a limited range of genetic distances, roughly 0.09 to 0.15.

We will try to expand that range, resorting to the PCA transformation of these vectors.

We will extract the inverse transformation of new coordinates, drawn at small distance intervals between one another.

MRCA - Most Recent Common Ancestor.¶

The following block serves to tie all the populations in the vector data set together.

The random generation of frequency vectors creates vectors distinct along, assymptotically, all possible directions.

Here, we limit the number of possible directions, by creating a data set made entirely of vectors generated as described for the manipulation of genetic distances, i.e. from equally distant coordinates between two initial projections. We continue to rely on pairs of initial projections. However, here, only one projection is made to vary, while the other is chosen beforehand and remains the same.

The result is the starshaped distribution observed in the next graph.

In [84]:

Iter= 50
target= [0,1]
stairs= 4

MRCA= np.random.choice(range(vector_lib.shape[0]),1)
calypso= []
feat= []

background= []

for inter in range(stairs):
    Pair= np.random.choice(range(vector_lib.shape[0]),2,replace= False)
    Pair[1]= MRCA
    print(Pair)
    
    coords= features_vectors[Pair,:]
    
    vector2= coords[target[1]] - coords[target[0]]
    
    for angle in np.linspace(-20,20,Iter):
                
        new_guy = coords[target[0]] + [angle / 10 * x for x in vector2]
        
        feat.append(new_guy)
        
        new_guy= pca_vectors.inverse_transform(new_guy)
        
        
        new_guy[new_guy < 0]= 0
        new_guy[new_guy > 1]= 1
        
        background.append([inter, angle])
        calypso.append(new_guy)

features= np.array(feat)
vector_lib_2= np.array(calypso)
background= np.array(background)

[340  99]
[502  99]
[3847   99]
[1517   99]

In [85]:

## Plot vector PCA
fig_data= [go.Scatter(
        x = features[:,0],
        y = features[:,1],
        mode= "markers",
        text= ['a: {}; b: {}; index = {}'.format(background[k,0],background[k,1], k) for k in range(background.shape[0])],
        marker= {
        'line': {'width': 0},
        'size': 6,
        'symbol': 'circle',
      "opacity": .6
      }
    )]


layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)

fig = go.Figure(data=fig_data)
iplot(fig)

In [86]:

N_sample= 100

Pops_test= np.random.choice(vector_lib_2.shape[0],N_sample)

freqs_selected= vector_lib_2[Pops_test,:]
Pairwise= return_fsts2(freqs_selected)

Fsts_test_2 = Pairwise.fst

C:\Users\jgarcia\Desktop\Jupyter_stuff\Genetic-data-analysis\Notebooks\Lab_modules\StructE_tools.py:419: RuntimeWarning:

invalid value encountered in double_scalars

In [87]:

### Distribution of feature space distances between control populations for even and biased scenarios
from sklearn.neighbors import KernelDensity


Fsts_den= np.nan_to_num(Fsts_test_2)

X_plot = np.linspace(0, .5, 1000)

kde = KernelDensity(kernel='gaussian', bandwidth=0.005).fit(np.array(Fsts_den).reshape(-1,1))

log_dens = kde.score_samples(X_plot.reshape(-1,1))

fig_roost_dens= [go.Scatter(x=X_plot, y=np.exp(log_dens), 
                            mode='lines', fill='tozeroy', name= 'Biased senarios',
                            line=dict(color='blue', width=2))]

##

layout= go.Layout(
    title= 'Distribution of genetic distances generated in layer II.'
)

fig = go.Figure(data=fig_roost_dens, layout= layout)
iplot(fig)

Distribution of pairwise Fsts between frequency vectors randomly selected from MRCA data set.

It is definitely not pretty, but the range of genetic distances produced is succesfully expanded.

Inverse transformation impact on allele frequencty distribution¶

What do inverse transformed allele frequency vectors present in the vector_lib_2 data set look like?

We analyse the average likelihood of frequency values across vectors and look at its variance along this range.

In [113]:

def check_densities(vector_lib_2,N):
    
    who= np.random.choice(list(range(vector_lib_2.shape[0])),N,replace= False)
    
    freqs= []
    
    for pop in who:
        
        freq_vector= vector_lib_2[pop,:]

        X_plot = np.linspace(0, 1, 100)

        kde = KernelDensity(kernel='gaussian', bandwidth=0.01).fit(np.array(freq_vector).reshape(-1,1))

        log_dens= kde.score_samples(X_plot.reshape(-1,1))
                        
        freqs.append(np.exp(log_dens))
    
    freqs= np.array(freqs)
    
    return freqs


dist_freq= check_densities(vector_lib_2,200)

fig_dist= [go.Scatter(
    x= np.linspace(0, 1, dist_freq.shape[1]),
    y= np.mean(dist_freq,axis= 0),
    mode= 'markers+lines',
    name= 'mean'
)    
]

fig_dist.append(go.Scatter(
    x= np.linspace(0, 1, dist_freq.shape[1]),
    y= np.std(dist_freq,axis= 0),
    mode= 'markers+lines',
    name= 'sd'
))

layout= go.Layout(
    title= 'Branch vectors dist',
    yaxis= dict(
        title= 'Likelihood'
    ),
    xaxis= dict(
        title= 'frequency'
    )
)

fig = go.Figure(data=fig_dist, layout= layout)
iplot(fig)

Individual vector example.¶

In [117]:

Rand_select=115

fig_roost_dens= [go.Scatter(x=np.linspace(0, 1, dist_freq.shape[1]), y=dist_freq[Rand_select,:], 
                            mode='lines', fill='tozeroy', name= 'Biased senarios',
                            line=dict(color='blue', width=2))]

##

layout= go.Layout(
    title= 'Distribution of genetic distances generated in layer II.'
)

fig = go.Figure(data=fig_roost_dens, layout= layout)
iplot(fig)

In [ ]: