
Human/Population Genetics Assignment¶

Purpose¶

This assignment will first require you to gain familiarity with two of the most common web-based tools for working with human genomics data: the UCSC Genome Browser and the 1000 Genomes database. All of the questions in this assignment are based on actual data from those two sites. Once you are familiar with these tools, you will use the data to complete some simple population genetics exercises.

Goals¶

Become familiar with resources in the field of human genetics. Develop an understanding of how the principles of population genetics can be used to deduce physical characteristics of a population from its genetic makeup.

IMPORTANT¶

You should turn in two files: a .pdf file with all of your written answers and a .ipynb file with any code that you ran.

Below are a few functions that you will need to complete the assignment.¶

Remember, to learn what a function does you can run:

help(name_of_function)

Try this with the functions below to see what they do.
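As a quick illustration of what help() reports, here is a minimal sketch using a made-up function. The name allele_frequency and its docstring are hypothetical and are not part of assignment_5_util; the point is only that help() prints the signature and docstring of whatever function you pass it.

```python
def allele_frequency(allele_count, total_alleles):
    """Return the fraction of alleles in a sample that are the given allele."""
    # float() keeps the division exact even without "from __future__ import division".
    return float(allele_count) / total_alleles

# help() prints the function signature followed by its docstring.
help(allele_frequency)

# The same text is also available directly as the __doc__ attribute.
print(allele_frequency.__doc__)
```

The same call works on the imported helpers, for example help(get_genotype_counts).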

In [ ]:

from __future__ import division
from assignment_5_util import get_genotype_counts, create_x2_distribution_plot

Question 1¶

Go to the UCSC genome browser. Search for BRCA1 (use the top hit for this gene).
What are the genomic coordinates of this gene?
What is the function of this gene?
What is the size of the open reading frame (ORF)?
How many exons does this gene have?
Does it have any orthologs?

Question 2¶

Use the BLAT tool in the UCSC Genome Browser to search for this sequence in the human genome: TTTCTTGATCACATAGACTTCCATTTTCTACTTTTTCTGA
What are the chromosomes listed in the results?
What is the score of the 3rd hit?
Why might there be a match that is not the full length of the reference sequence? Hint: you may need to resear
Why might an identical sequence be in two different locations in the genome?
Search for the same sequence, but this time in gorilla, orangutan, and rhesus monkey. Is the gene present on two chromosomes in each of these species? One? None?
Does your answer to the previous part (the cross-species search) suggest anything about when the duplication event occurred?

Question 3¶

There are three VCF files available, generated from the 1000 Genomes dataset. Complete the following questions for the SNP at location 89721094, first for the British (GBR) and Finnish (FIN) populations individually and then again for the two populations together. You are welcome to calculate these values by hand or with Python. You will need to show your work.

The three VCF files, which can be opened by running the cell below, contain information about the Finnish population (FIN), the British population (GBR), and the two populations together (GBR_and_FIN).
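To make the file layout concrete, here is a minimal sketch of how genotypes at one position could be tallied from VCF text. It assumes a biallelic site, diploid samples, and that GT is the first entry of each sample field, as the VCF specification requires when GT is present. The function count_genotypes and the three-sample snippet are hypothetical, made-up illustrations; the provided get_genotype_counts helper may work differently, so check it with help().

```python
from collections import Counter

# Hypothetical VCF snippet with made-up data (not taken from the assignment files).
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "10\t12345\t.\tA\tG\t100\tPASS\t.\tGT\t0|0\t0|1\t1|1",
])

def count_genotypes(vcf_text, position):
    """Count hom_ref, het, and hom_alt genotypes at one position (hypothetical helper)."""
    counts = Counter()
    for line in vcf_text.splitlines():
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split("\t")
        if int(fields[1]) != position:
            continue  # column 2 (POS) holds the coordinate
        for sample in fields[9:]:  # sample columns start after FORMAT
            genotype = sample.split(":")[0]  # GT comes first in each sample field
            alleles = genotype.replace("|", "/").split("/")
            if len(alleles) != 2 or "." in alleles:
                continue  # skip missing or non-diploid calls
            n_alt = sum(1 for allele in alleles if allele != "0")
            counts[("hom_ref", "het", "hom_alt")[n_alt]] += 1
    return counts

print(count_genotypes(EXAMPLE_VCF, 12345))
```

Phased (0|1) and unphased (0/1) calls are treated the same here, since only the number of alternate alleles matters for genotype counts.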

What information is contained in the VCF file? Hint: explore this on the 1000 Genomes website.
What are the allele frequencies for this SNP?
What is the predicted heterozygosity?
What is the actual heterozygosity?
Create a null and an alternative hypothesis to test whether this SNP is in Hardy-Weinberg equilibrium (HWE).
What is the chi-squared test statistic for this SNP?
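The quantities asked for above can be sketched in a few lines of Python. The function hwe_summary and the genotype counts passed to it (30, 50, 20) are hypothetical and chosen only for illustration; they are not the counts for the assignment SNP. The sketch assumes a biallelic SNP with allele frequencies p and q, Hardy-Weinberg expected genotype proportions p², 2pq, and q², and a chi-squared statistic summed over the three genotype classes.

```python
def hwe_summary(n_hom_ref, n_het, n_hom_alt):
    """Allele frequencies, heterozygosity, and HWE chi-squared for one biallelic SNP."""
    n = n_hom_ref + n_het + n_hom_alt           # number of individuals
    p = (2 * n_hom_ref + n_het) / float(2 * n)  # reference allele frequency
    q = 1.0 - p                                 # alternate allele frequency
    observed = [n_hom_ref, n_het, n_hom_alt]
    expected = [p * p * n, 2 * p * q * n, q * q * n]  # HWE expected genotype counts
    # Sum (O - E)^2 / E over genotype classes, skipping classes with zero expectation.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return {
        "p": p,
        "q": q,
        "expected_het": 2 * p * q,
        "observed_het": n_het / float(n),
        "chi2": chi2,
    }

# Hypothetical counts, for illustration only.
print(hwe_summary(30, 50, 20))
```

If you compute these by hand instead, the same intermediate values (p, q, expected counts, and each chi-squared term) are what you should show as your work.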

In [ ]:

# Use this cell to calculate the values in question 3.
Great_Britain = 'GBR.vcf'
Finland = 'FIN.vcf'
GBR_and_FIN = 'GBR_and_FIN.vcf'

In [ ]:

#Use this cell to help answer question 3

Question 4¶

Use the create_x2_distribution_plot function to estimate the p-value for each of the test statistics generated above. This does not have to be an exact number, though you should indicate whether it is significant or not based on an alpha of 0.05.
Based on the p-values you estimated, is the SNP in question in HWE in each of the three populations?
If the SNP is not in HWE in a population, discuss what could cause the deviation.
Based on the expected and observed heterozygosity of each population, do you believe there is population structure between Finland and Great Britain? Discuss issues with the approach we took to answer this problem.
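As a cross-check on a p-value read off the plot, the upper-tail probability of a chi-squared distribution with one degree of freedom can be computed with the standard library alone. This sketch assumes one degree of freedom (three genotype classes, minus one, minus one allele frequency estimated from the data); the function name chi2_p_value_1df is hypothetical, and this is an alternative to create_x2_distribution_plot, not a description of how that function works.

```python
import math

def chi2_p_value_1df(chi2_statistic):
    """Upper-tail p-value for a chi-squared statistic with 1 degree of freedom."""
    # With 1 df, chi-squared is the square of a standard normal variable,
    # so P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2_statistic / 2.0))

ALPHA = 0.05

# Hypothetical test statistics, for illustration only.
for statistic in (0.01, 3.841, 10.0):
    p_value = chi2_p_value_1df(statistic)
    verdict = "reject HWE" if p_value < ALPHA else "fail to reject HWE"
    print("chi2 = {0:.3f}, p = {1:.4f}: {2}".format(statistic, p_value, verdict))
```

A statistic of about 3.841 corresponds to the 0.05 significance threshold for one degree of freedom, so larger statistics are significant at that alpha.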

In [ ]:

#Use this cell to answer question 4