Choose one of the following exercises.
Here is the beginning of a fairy tale:
A boy called Peter lived with his 2 parents in a village on the hillside. His
parents, like most of the other people in the village, were sheep farmers. There were 430 sheep in the village, 33 sheep dogs and 21 humans. Close to the village, there were also 5 wolves. Everybody in the village took turns to look after the sheep, and when Peter was 10 years old, he was considered old enough to take his turn at shepherding.
Copy it and use Python and regular expressions to find:
one word consisting of exactly two letters.
all words that contain 'o'
all numbers written with digits (i.e. "2" and "43" but not "eleven")
Finally, rename "Peter" to "Petter".
Hint: Use finditer to find all instances.
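To see how finditer and re.sub behave before you apply them to the fairy tale, here is a minimal sketch on a short made-up sentence. The sentence, the variable names and the exact patterns are assumptions for illustration only; other patterns work as well.

```python
import re

# A short made-up sentence, not the fairy tale from the exercise.
text = "Anna has 3 cats, 12 fish and no dog. Anna is 9."

# finditer yields one match object per hit, in order of appearance.
two_letter = [m.group() for m in re.finditer(r"\b[A-Za-z]{2}\b", text)]
with_o = [m.group() for m in re.finditer(r"\b\w*o\w*\b", text)]
numbers = [m.group() for m in re.finditer(r"\d+", text)]

# re.sub replaces every occurrence of the pattern.
renamed = re.sub(r"\bAnna\b", "Hanna", text)

print(two_letter)
print(with_o)
print(numbers)
print(renamed)
```

The \b markers are word boundaries, so the two-letter pattern does not match two letters inside a longer word.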
The given FASTA file (proteins.fasta) contains data on amino acid sequences from 1000 different people. Each line represents one individual.
We are interested in finding an amino acid substitution in the sequence "TPLTVETLAKT", where "VE" is changed to "Lx" (x here means anything).
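One possible way to express the substituted motif as a pattern is sketched below. The three sequences are made up for illustration and do not come from proteins.fasta; reading the file is left out. The capturing group around the dot is an assumption about what you want to report, namely which amino acid replaced the "E".

```python
import re

# Hypothetical sequences; the real ones come from proteins.fasta.
sequences = [
    "MKTPLTVETLAKTGG",  # unchanged motif
    "MKTPLTLQTLAKTGG",  # VE replaced by LQ
    "AATPLTLETLAKTCC",  # VE replaced by LE
]

# "L(.)" is an L followed by any single character; the group captures that character.
pattern = re.compile(r"TPLTL(.)TLAKT")

hits = []
for index, seq in enumerate(sequences):
    match = pattern.search(seq)
    if match:
        hits.append((index, match.group(1)))

print(hits)
```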
x take)?
We're back with IMDB. Again, feel free to reuse your previous code!
Use regular expressions to solve the problems and print the results using formatting.
a) Change titles
A lot of movie titles contain the word "and". Some people find the ampersand "&" a lot prettier. Pretend you're one of them and replace all "and" with "&" in the titles. Print a table with the changed titles, approximately like this:
-- Old title ---- | -- New title ----
Me and you | Me & you
Cat and dog | Cat & dog
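A minimal sketch of the replacement and the table formatting, assuming the titles are already available as a list of strings. The titles here are made up, and matching "and" only as a whole word is an assumption about what the task intends.

```python
import re

titles = ["Me and you", "Cat and dog", "Sandy and the band"]  # made-up titles

# \b keeps the match to the whole word, so "Sandy" and "band" are left alone.
and_word = re.compile(r"\band\b", re.IGNORECASE)

# Keep only titles that actually change.
rows = [(t, and_word.sub("&", t)) for t in titles if and_word.search(t)]

# Pad the left column to the longest old title.
width = max(len(old) for old, _ in rows)
print(f"{'Old title':<{width}} | New title")
for old, new in rows:
    print(f"{old:<{width}} | {new}")
```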
b) Find common titles
Titles like "the grave", "the crown", "the road" are popular for movies. The pattern is "the" + one word (a noun). Find the three most popular such words.
The output should look like this:
* Noun Count
- Crown 3
- Cat 1
- Dog 1
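The counting step can be sketched as follows. The titles are made up, and the sketch assumes that the whole title must be "the" followed by exactly one word, which is only one possible reading of the task.

```python
import re
from collections import Counter

# Made-up titles for illustration.
titles = ["The Crown", "the crown", "The Cat", "The Crown",
          "The Dog Barks", "Into the Wild"]

# fullmatch requires the whole title to be "the" plus exactly one word.
the_noun = re.compile(r"the (\w+)", re.IGNORECASE)

counts = Counter()
for title in titles:
    match = the_noun.fullmatch(title)
    if match:
        # Normalise the case so "Crown" and "crown" are counted together.
        counts[match.group(1).capitalize()] += 1

print(f"* {'Noun':<10}Count")
for noun, count in counts.most_common(3):
    print(f"- {noun:<10}{count}")
```

most_common(3) returns at most three (word, count) pairs, sorted by falling count.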
c) Find alliterations
Alliteration is a fancy word for repetitions of a sound (or letter) in the beginning of each word, like in "Seven sisters slept soundly" or "Veni, vidi, vici". It is used in poetry as well as in rhetoric. Surely there are some movie titles using alliteration as well.
Find all titles that are alliterations!
We will consider a title as an alliteration if it contains the same letter in the beginning of every word.
To make life a bit easier, we only look for alliterations with two words. Case is not important.
Print the output like this:
Title | Letter
My milkshake | M
Toxic toy | T
For each task
Counter() from the module collections for task b).
If you find it tricky to get started with the regular expressions, have a look at the documentation:
Try listing a few examples of strings you want to match on a piece of paper. For c), you have
My milkshake
Toxic toy
Step through each line and abstract as much as you can. What characters are important? Which ones do we need to keep track of?
For c), and possibly b), you will need to use capturing groups:
import re
p = re.compile(r'a ([a-z]*)') # the parentheses create a group
match = p.search('I see a cat over there.')
print('the group:', match.group(1)) # what's in the captured group
print('the whole match:', match.group()) # what's in the whole match
the group: cat
the whole match: a cat
import re
# look for a word X, the word 'and' and then the word X again
p = re.compile(r'([a-z]+) and \1')
match = p.search('I eat and drink, I sleep and sleep.')
print('the group:', match.group(1)) # what's in the captured group
print('the whole match:', match.group()) # what's in the whole match
the group: sleep
the whole match: sleep and sleep
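The same backreference idea carries over to task c): capture the first letter and demand it again at the start of the second word. A minimal sketch, assuming made-up titles and titles that consist of plain letters separated by a single space:

```python
import re

titles = ["My milkshake", "Toxic toy", "Cat and dog", "Big bad bear"]  # made-up

# Group 1 captures the first letter; \1 demands the same letter
# at the start of the second word. IGNORECASE also applies to \1.
two_word = re.compile(r"([a-z])[a-z]* \1[a-z]*", re.IGNORECASE)

found = []
for title in titles:
    # fullmatch rejects titles with more or fewer than two words.
    match = two_word.fullmatch(title)
    if match:
        found.append((title, match.group(1).upper()))

width = max(len(title) for title, _ in found)
print(f"{'Title':<{width}} | Letter")
for title, letter in found:
    print(f"{title:<{width}} | {letter}")
```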
d) Update your code for finding alliterations to also allow and find titles with more than two words. Print the output like this:
Title | Letter | Repetitions
Mary maid me milkshake | M | 4
The toxic toy | T | 3
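One way to extend a two-word pattern is to repeat the "space, same letter, rest of the word" part with a non-capturing group. This is a sketch under the same assumptions as before (made-up titles, words made of plain letters, single spaces), not the only possible solution.

```python
import re

titles = ["The toxic toy", "My milkshake", "Cat and dog", "Tiny"]  # made-up

# (?: \1[a-z]*)+ repeats "space, same first letter, rest of word"
# one or more times, so any title with two or more words can match.
multi_word = re.compile(r"([a-z])[a-z]*(?: \1[a-z]*)+", re.IGNORECASE)

results = []
for title in titles:
    match = multi_word.fullmatch(title)
    if match:
        # Every word starts with the letter, so the word count is the repetition count.
        results.append((title, match.group(1).upper(), len(title.split())))

width = max(len(title) for title, _, _ in results)
print(f"{'Title':<{width}} | Letter | Repetitions")
for title, letter, reps in results:
    print(f"{title:<{width}} | {letter:<6} | {reps}")
```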
e) So far we have required alliterations to start with the same letter. Usually, the definition is more complex as it is based on sound rather than spelling.
Include the following criteria in your code: