The FASTQ file format is commonly used to store deep sequencing reads data. It is similar to the FASTA format, but includes additional information. Each record is represented by four lines:
1. The first line begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
2. The second line contains the raw sequence letters.
3. The third line begins with a '+' character, optionally followed by the same sequence identifier.
4. The fourth line encodes the quality of each base in the sequence of the second line.
Sequence quality is encoded with characters from:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
where '!' represents the lowest quality, equivalent to a score of 1, and '~' is the highest quality, with a score of 94.
The function given below translates the characters into their corresponding scores and returns a dictionary which you can use later. Make sure you understand how to work with this dictionary before proceeding. Also note the use of triple quotes ("""), which make it possible to include both ' and " characters in the string.
def creates_scores_dict():
    # Map each quality character ('!' to '~') to its score (1 to 94).
    # The raw string (r"""...""") keeps the backslash before ']' literal.
    scores_string = r"""!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~"""
    scores_dict = {}
    for i, symbol in enumerate(scores_string):
        scores_dict[symbol] = i + 1
    return scores_dict

scores_dict = creates_scores_dict()
print(scores_dict)
{'W': 55, 'Y': 57, '7': 23, 'e': 69, 'n': 78, '>': 30, 'r': 82, ')': 9, 'l': 76, '@': 32, 's': 83, '{': 91, '1': 17, 'g': 71, 'j': 74, 'u': 85, 'X': 56, '$': 4, 'a': 65, 'F': 38, '9': 25, ';': 27, 'B': 34, '#': 3, '^': 62, '*': 10, '[': 59, '|': 92, '4': 20, 'd': 68, ']': 61, '}': 93, 'U': 53, '%': 5, 'L': 44, 'T': 52, 'y': 89, '3': 19, 'c': 67, 'o': 79, 'f': 70, 'A': 33, 'Q': 49, '-': 13, 'R': 50, 'E': 37, '!': 1, '`': 64, 'D': 36, '_': 63, ':': 26, ',': 12, 't': 84, 'I': 41, 'O': 47, 'z': 90, 'G': 39, '5': 21, '(': 8, '\\': 60, 'N': 46, '~': 94, 'Z': 58, 'i': 73, '6': 22, 'q': 81, '8': 24, 'v': 86, 'k': 75, 'S': 51, '/': 15, 'b': 66, 'K': 43, '=': 29, "'": 7, 'C': 35, '0': 16, 'M': 45, 'w': 87, '?': 31, 'J': 42, 'm': 77, 'x': 88, '&': 6, 'P': 48, '+': 11, 'h': 72, '.': 14, 'V': 54, '2': 18, '<': 28, '"': 2, 'p': 80, 'H': 40}
The file files_for_hw4/lambda_reads.fq contains 10,000 reads from the sequencing of $\lambda$ phage. We would like to discard low quality reads. A low quality read is defined as one with a mean score lower than some predefined threshold.
a) Write a function mean_score that receives a read quality string and returns the mean score (float) of the read.
For example, the quality string '!!!!!' is equivalent to the scores 1,1,1,1,1, and thus the mean is 1.0. However, the string '49@5<*>=E' is equivalent to the scores 20,25,32,21,28,10,30,29,37 and has a mean of about 25.78.
def mean_score(read_quality_string):
    # your code here. remove the pass statement.
    pass
assert(mean_score('!!!!!') == 1.0)
assert(round(mean_score('49@5<*>=E'),2) == 25.78)
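One possible implementation is sketched below; it rebuilds the scores dictionary from the function above so the example is self-contained:

```python
def creates_scores_dict():
    # Map each quality character ('!' to '~') to its score (1 to 94).
    # The raw string keeps the backslash before ']' literal.
    scores_string = r"""!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~"""
    return {symbol: i + 1 for i, symbol in enumerate(scores_string)}

scores_dict = creates_scores_dict()

def mean_score(read_quality_string):
    # Sum the score of every character and divide by the read length.
    return sum(scores_dict[char] for char in read_quality_string) / len(read_quality_string)
```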
b) Write a function parse_FASTQ that parses a FASTQ file. It receives a path to a FASTQ file on the local filesystem and returns a dictionary where keys are sequences and values are the mean scores of the sequences.
Use the function on the provided file files_for_hw4/lambda_reads.fq.
It is recommended to use the readline() method of file objects (although other solutions are also possible).
def parse_FASTQ(file):
    # your code here. remove the pass statement.
    pass
# parse lambda reads file
lambda_reads_file = "files_for_hw4/lambda_reads.fq"
lambda_seqs_dict = parse_FASTQ(lambda_reads_file)
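A possible sketch of the parser, using readline() as recommended. It assumes mean_score() from part (a); an equivalent ord()-based one-liner is inlined here so the example is self-contained ('!' is ASCII 33 and scores 1, so score = ord(c) - 32):

```python
def mean_score(read_quality_string):
    # Equivalent to the dictionary lookup: '!' (ASCII 33) -> 1 ... '~' (ASCII 126) -> 94.
    return sum(ord(char) - 32 for char in read_quality_string) / len(read_quality_string)

def parse_FASTQ(file):
    # Read records four lines at a time: @identifier, sequence, '+', quality.
    seqs_dict = {}
    with open(file) as f:
        while True:
            title = f.readline()
            if not title:          # readline() returns '' at end of file
                break
            seq = f.readline().strip()
            f.readline()           # skip the '+' separator line
            quality = f.readline().strip()
            seqs_dict[seq] = mean_score(quality)
    return seqs_dict
```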
c) Write a function filter_reads that takes the output from section b and a score cutoff (integer/float) and writes the sequences with scores higher than the cutoff to an output file.
Each sequence will be written in a separate line (no need to keep the FASTQ format). Try different cutoffs (5, 10, 20) on the $\lambda$ phage reads.
def filter_reads(seqs_dict, cutoff, out_file):
    # your code here. remove the pass statement.
    pass
# run on Lambda reads
lambda_filtered_file = "files_for_hw4/lambda_filtered_reads.txt"
filter_reads(lambda_seqs_dict, 10, lambda_filtered_file)
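One way to sketch filter_reads, given the {sequence: mean score} dictionary from part (b):

```python
def filter_reads(seqs_dict, cutoff, out_file):
    # Write every sequence whose mean score is above the cutoff, one per line.
    with open(out_file, 'w') as fo:
        for seq, score in seqs_dict.items():
            if score > cutoff:
                fo.write(seq + '\n')
```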
In this question, we will work on data from the Global Biodiversity Information Facility (GBIF). This server has lots of data about occurrences of organisms around the world.
Our goal will be to get the coordinates of observations of endangered turtle species in the Southern hemisphere.
The CSV files under files_for_hw4/gbif_files, named GBIF1.csv, GBIF2.csv, etc., contain observation data for various turtle genera. Each record is an observation, and contains information such as the species name and the coordinates of the observation.
a) The first step in the analysis will be to concatenate all the CSV files into one CSV file to make processing easier.
Write a function that receives a list of file names and concatenates them (i.e., inserts all records from all files into one file). Remember to include the header line only once, at the beginning.
Use the function to create a unified csv file.
Note: The function has no return value!
def concat_csvs(filenames, output_filename):
    # your code here. remove the pass statement.
    pass
# create unified file
import glob
filenames = glob.glob("files_for_hw4/gbif_files/GBIF*.csv")
print(filenames)
output_filename = "files_for_hw4/gbif_files/concat_GBIF.csv"
concat_csvs(______________, ______________)
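A sketch of concat_csvs: the header line is copied from the first file only, and the data rows of every file are appended in order.

```python
def concat_csvs(filenames, output_filename):
    with open(output_filename, 'w') as fo:
        for i, filename in enumerate(filenames):
            with open(filename) as fi:
                header = fi.readline()
                if i == 0:          # keep the header of the first file only
                    fo.write(header)
                for line in fi:     # the remaining lines are data records
                    fo.write(line)
```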
b) The file files_for_hw4/gbif_files/endangered_turtles.txt contains a list of endangered turtle species from The IUCN Red List of Threatened Species.
Write a function get_species that reads the file and returns a list of the species it includes.
def get_species(filename):
    # your code here. remove the pass statement.
    pass
turtles_filename = "files_for_hw4/gbif_files/endangered_turtles.txt"
endangered_turtles = get_species(turtles_filename)
print(endangered_turtles)
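A minimal sketch, assuming the file holds one species name per line:

```python
def get_species(filename):
    # Strip newlines and skip blank lines.
    with open(filename) as f:
        return [line.strip() for line in f if line.strip()]
```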
c) We will now use the unified CSV file and the endangered species list to create a filtered CSV file, containing only the records we are interested in: those of endangered species observed in the Southern hemisphere (i.e., at latitude < 0).
Write a function that receives the unified filename, the species list, a maximum latitude and an output filename, and writes to the output file the records which satisfy these conditions (no return value is needed).
def filter_records(filename, species, max_lat, out_csv_file):
    # your code here. remove the pass statement.
    pass
filtered_file = "files_for_hw4/gbif_files/endangered.csv"
filter_records(_______, _______, _______, _______)
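A sketch using the csv module. The column names 'species' and 'decimalLatitude' are assumptions (decimalLatitude is the standard GBIF latitude field); adjust them to match the actual header of the unified file.

```python
import csv

def filter_records(filename, species, max_lat, out_csv_file):
    # Keep only rows whose species is in the list and whose latitude is below max_lat.
    with open(filename, newline='') as fi, open(out_csv_file, 'w', newline='') as fo:
        reader = csv.DictReader(fi)
        writer = csv.DictWriter(fo, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # 'species' and 'decimalLatitude' are assumed column names.
            if row['species'] in species and float(row['decimalLatitude']) < max_lat:
                writer.writerow(row)
```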
In this question, you don't have to write real code, just write the regular expression you'd use within the quotation marks.
a) Write a regex that will match strings containing any kind of number: positive/negative, integer/float etc. For example, all of the following should be matched: 7 , -3 , 6.14 , -0.00054
import re
re.compile(r'')
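One pattern that covers these examples (a sketch): an optional minus sign, one or more digits, and an optional decimal part.

```python
import re

# Optional '-', digits, then an optional '.' followed by digits.
number_re = re.compile(r'-?\d+(\.\d+)?')
```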
b) Write a regex that will match strings that end with a number between 100 and 199, followed by a '.' or a '' character.
re.compile(r'')
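A sketch, assuming the second allowed character is ',' (the intended character is unclear here): '1' followed by two digits gives 100-199, and '$' anchors the match at the end of the string.

```python
import re

# '1\d\d' -> a number 100-199; '[.,]' assumes '.' or ',' (the ',' is an assumption);
# '$' requires it at the end of the string.
pattern_b = re.compile(r'1\d\d[.,]$')
```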
c) Write a regex that will match whole strings of prices in dollars, such as '100$', '2.99$', '500.90$', but not '7.656$', '80.0001$' or 'price is: 56.80$'.
re.compile(r'')
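One pattern that satisfies the examples (a sketch): the whole string must be digits, an optional decimal part of at most two digits, then a literal '$'.

```python
import re

# '^...$' forces a whole-string match; '\$' is a literal dollar sign.
price_re = re.compile(r'^\d+(\.\d{1,2})?\$$')
```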
d) Write a regex that will match strings beginning with 3 to 8 uppercase letters, followed by at least 4 characters, which can be anything but '%' or '!', and end with 'XY' or 'QW'.
re.compile(r'')
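A sketch that follows the description directly: 3 to 8 uppercase letters, then at least 4 characters that are anything but '%' or '!', ending with 'XY' or 'QW'.

```python
import re

# [A-Z]{3,8} -> leading uppercase letters; [^%!]{4,} -> middle characters; (XY|QW)$ -> ending.
pattern_d = re.compile(r'^[A-Z]{3,8}[^%!]{4,}(XY|QW)$')
```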
A full scientific name of a plant species consists of a genus name, a species name and an authority (usually an abbreviation of the name of the person who first described the species). For example, in Arabidopsis thaliana (L.) Heynh., Arabidopsis is the genus, thaliana is the species and (L.) Heynh. is the authority.
In addition, a name may (or may not) include an infraspecific rank. This is added after the species name, and consists of an epithet which is either subsp. (for subspecies) or var. (for variety). The epithet is followed by the name of the infraspecific rank.
For example, in Fraxinus americana var. acuminata (Lam.) K.Koch, the genus is Fraxinus, the species is americana, the infraspecific rank is var. acuminata and the authority is (Lam.) K.Koch.
The file files_for_hw4/plant_names.txt contains a list of plant names. The goal is to break these names into their components.
Write a program that reads the names from the file and writes a new CSV file with the following column names: Genus, Species, Infraspecific and Authority.
Each plant name in plant_names.txt should then be processed (use regular expressions) and its components inserted into the CSV file.
### your code here
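One way to sketch this program: the genus is a capitalized word, the species a lowercase word, the optional infraspecific rank is 'subsp.' or 'var.' plus an epithet, and everything that remains is the authority. The function name parse_plant_names is arbitrary; the column layout follows the question text.

```python
import csv
import re

# Named groups keep the extraction readable; the infraspecific group is optional.
name_re = re.compile(
    r'^(?P<genus>[A-Z][a-z]+) '
    r'(?P<species>[a-z-]+)'
    r'(?: (?P<infra>(?:subsp\.|var\.) [a-z-]+))?'
    r' (?P<authority>.+)$'
)

def parse_plant_names(in_file, out_csv_file):
    with open(in_file) as fi, open(out_csv_file, 'w', newline='') as fo:
        writer = csv.writer(fo)
        writer.writerow(['Genus', 'Species', 'Infraspecific', 'Authority'])
        for line in fi:
            match = name_re.match(line.strip())
            if match:   # skip lines that do not look like a plant name
                writer.writerow([match.group('genus'), match.group('species'),
                                 match.group('infra') or '', match.group('authority')])
```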