An Introduction to Solving Biological Problems with Perl

Graham Ritchie, John Davey & Feyruz Yalcin
Cambridge Graduate School of Life Sciences Bioinformatics Training

from Sofia Robb's original course: `http://sofiarobb.com/learning-perl-toc/`

This page can be found at bit.ly/cambridgeperl. Solutions to all the exercises can be found on the course GitHub repository

Lesson 1: Perl script skeleton

Perl is widely used by biologists to solve problems we encounter when handling DNA and protein sequences, file format conversions, as well as many any other tasks that are computationally repetitive. Perl is not the only language for these tasks but one of many.

Perl comes preinstalled on Macs and Linux distributions. It can be downloaded for PCs (see perl.org for details; Strawberry Perl is a good choice).

Perl is designed to work like a natural language. Just as there are many different ways to say roughly the same thing in a natural language, so there are many different ways to solve a problem using Perl. Perl's motto is "There's More Than One Way To Do It". This has advantages and disadvantages. On the one hand, Perl can be easily adopted by many different communities and applied to many different problems; it is heavily used by organisations like the BBC and Amazon and was instrumental in the completion of the Human Genome Project. On the other hand, it can be horribly abused, just like natural languages.

In this course, we will teach the basic vocabulary and grammar of the Perl language, using some simple bioinformatics examples. However, as bioinformatics covers a multitude of different topics, we recommend that you consider the problems you are trying to solve in your research and try to apply Perl to these problems throughout the course.

As some people on the course have not programmed before, we assume no previous experience throughout. The pace will therefore be slow for people who have used other programming languages. Please feel free to read ahead.

Perl programs (also known as scripts) are series of instructions that we save in plain text files. As computers do not understand Perl directly, our scripts must be processed by a program called the Perl interpreter, which translates the scripts into machine language. This program is run from the command line. To get started with Perl, we must get acquainted with the command line.

  1. Open the text editor gedit (icon in the menu on the left side of the screen) and save a file called "central_dogma.pl" containing the following line:

    • print "Once information has passed into protein it cannot get out again\n";

      Note the directory that you save the file into. Now open up the Terminal (also in the left hand menu). We want to move to the directory containing the file you just saved, and run the Perl interpreter on that file.

  1. Enter the command pwd (on Linux or Mac) or cd (on Windows) to see which directory you are in. If it's not the directory that contains your script, use the command ls (on Linux or Mac) or dir (on Windows) to see what the directory contains. If you see your directory in the list - let's say it's called mydir - you can move into that directory by entering the command cd mydir. Check that mydir contains your central_dogma.pl script using ls or dir.

  2. Run the Perl interpreter by typing:

    • perl central_dogma.pl

The script you just ran contains one command, which prints the string "Once information has passed into protein it cannot get out again" to the screen, and moves the cursor to a new line (\n). The semi-colon is used to punctuate the end of a command.

You can work through the rest of the course using the interpreter in this way. However, you may find it useful to add the following features to your script. If you find them confusing, however, feel free to skip to Lesson 2 for now.

  1. If there is an error in your code that the interpreter can't make sense of, it will fail to run. But it can also warn you about problems in the code that it can interpret, but which are probably not going to work correctly. You can turn on these warnings by putting this at the top of your script:

    • use warnings;

      You can also ask the Perl interpreter to turn warnings on for you:

    • perl -w central_dogma.pl

      use warnings and -w do the same thing, so you only need to use one of them.

  2. It's useful to annotate your scripts with human-readable comments. Perl ignores anything on a line that follows a hash symbol. So we could have written:

    • # central_dogma.pl
    • # Print the Central Dogma of Molecular Biology
    • print "Once information has passed into protein it cannot get out again\n" # See Crick 1970

      In this case, the comments are not very useful, but when we get to writing more complicated code it is very useful to make notes in comments, to remind ourselves how the code works when we come to read it months or years later.

  3. Many Perl guides will tell you to add use strict; to every script you write (like use warnings; above). This tells the Perl interpreter to complain if you use some of the more dangerous Perl idioms in your code. It's not too important for the course, as we won't be using any of those idioms, but it's not a bad idea to use it in scripts you write and actually use.

Putting it all together, the complete script:

In [1]:
%%perl
# central_dogma.pl
# Print the Central Dogma of Molecular Biology
use strict;
use warnings;
print "Once information has passed into protein it cannot get out again\n"; # See Crick 1970
Once information has passed into protein it cannot get out again

Lesson 2: Values and variables

Typically, we want to process some data using Perl and produce some output. This means we have a set of values and we want to transform these into some other, more useful information. In Perl, we create data structures to hold our values, which we can then manipulate. We call these variables, because we can change their values as we progress through our analysis.

There are three major types of data structures we typically use in Perl: scalars, arrays and hashes. Scalars are single values, such as numbers or characters. Strings of characters also count as single values in Perl. Arrays are ordered lists of values, whereas hashes are unordered collections of values. Variable names of each type begin with a symbol that identifies the variable as a scalar, array or hash. The symbols are given in the table below.

Scalar$single values
Array@ordered lists of scalars
Hash%unordered collections of scalars

Values can be assigned to variables using the assignment operator =. For example (all code needs to have the complete script skeleton from Lesson 1 but we will leave it out of the notes in further examples):

In [2]:
%%perl
# use 'my' the FIRST time you use a variable
my $serine = "TCA";
my $codonLength = 3;
print "Codon $serine has $codonLength bases\n";
Codon TCA has 3 bases

We can change the value of a variable by using the same assignment operator:

In [3]:
%%perl
my $serine = "TCA";
print "Serine is coded by $serine\n";
$serine = "TCC";
print "Serine is also coded by $serine\n";
Serine is coded by TCA
Serine is also coded by TCC

We can even use the value of a variable when we change its value, for example if we wanted to count something:

In [4]:
%%perl
my $age = 18;
print "I am $age years old\n";

# two years later
$age = $age + 2;
print "I will be $age years old two years later\n";
I am 18 years old
I will be 20 years old two years later

Another example of scalar reassignment:

In [5]:
%%perl
my $dogma = "Once information gets into protein";
print "1: $dogma\n";
 
# '.' text concatenation operator
$dogma = $dogma . " it cannot get out again";
print "2: $dogma\n";
 
$dogma = "DNA makes RNA makes Protein";
print "3: $dogma\n";
1: Once information gets into protein
2: Once information gets into protein it cannot get out again
3: DNA makes RNA makes Protein

Scalars

type of scalarexamples
number: integers1 2 -3 128
number: floating-points-0.2 0.1 3.14159
string: letters"a" "A" "G" "c"
string: words"Protein" "DNA"
string: words and spaces and punctuation"DNA makes RNA makes Protein."
reference(see later lessons)

Single and Double Quotes

If you put a variable such as “$codon” inside of double quotes in a print statement, the value of $codon will be printed, but if you put ‘$codon’ inside of single quotes you will literally get $codon. The same is true for special characters like the newline “\n”.

In [6]:
%%perl
my $codon = "ACG";
print "$codon\n";
print '$codon\n';
ACG
$codon\n

Exercises

  1. Look up the genetic code. Create a number of scalar variables that store possible DNA encodings of serine (S), leucine (L), tyrosine (Y) and cysteine (C). Where multiple codings are available, just pick one for now.
  2. Create a variable containing a possible DNA sequence for the protein sequence SYLYC. Use the concatenation operator.

Lesson 3: Arrays

An array is an ordered list of scalars. The scalars do not all need to be of the same kind.

In [7]:
%%perl
my @numbers = (22,103,1,0);
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");
my $serine = "TCA";
my @mixed = (23,$serine,0.3,"Histone");

There are a number of different ways to print arrays. For example, we can dump the elements of an array one after the other with no delimiters just by directly printing the array:

In [8]:
%%perl
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");

print "Amino acids: " , @aminoAcids , "\n";
Amino acids: SerineTyrosineLeucineTyrosineCysteine

But this isn't very readable. We can do better by enclosing the array variable in double quotes, which inserts a space after each element in the array:

In [9]:
%%perl
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");
print "Amino acids, spaced: @aminoAcids\n";
Amino acids, spaced: Serine Tyrosine Leucine Tyrosine Cysteine

We will see a third way later on when we look at the join function.

We can access individual elements of an array by using their index. Each element is given a unique number, an index, corresponding to their place in the array. In Perl, counting starts from 0, so the first element of the array has index 0, the second has index 1 and so on. We access elements using square brackets:

In [10]:
%%perl -w
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");
print $aminoAcids[2], "\n";
print $aminoAcids[0], "\n";
print $aminoAcids[-1], "\n";
Leucine
Serine
Cysteine

You can assign new values to array elements by using their indices:

In [11]:
%%perl -w
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");
$aminoAcids[2] = "Cysteine";
print "@aminoAcids\n";
$aminoAcids[5] = "STOP";
print "@aminoAcids\n";
Serine Tyrosine Cysteine Tyrosine Cysteine
Serine Tyrosine Cysteine Tyrosine Cysteine STOP

Arrays can be included in other arrays, but the result is just one flat array. For example, if two arrays of three elements each are merged together, the result is an array containing the six elements, not an array containing two arrays.

In [12]:
%%perl -w

my @serine = ('T','C','A');
print "@serine\n";

my @cysteine = ('T', 'G', 'T');
print "@cysteine\n";

my @sc = (@serine, @cysteine);
print "@sc\n";

print "$sc[4]\n";
T C A
T G T
T C A T G T
G

We can get the length of an array like this:

In [13]:
%%perl -w
my @serine = ('T','C','A');
my $length = @serine;
print "$length";
3

This is called accessing an array in scalar context. $length is a scalar, so when we set the array @serine as the value of the scalar $length, Perl interprets this to mean we want the length of the array. This is just a convention - Perl could, for example, return the first element of the array instead - but it is a useful convention that we will use often.

Exercises

  1. Create an array for the protein sequence CLYSY where the elements of the array are DNA codon sequences. Use the codon variables you defined in the last lesson.
  2. Print the DNA sequence of the protein to the screen, with and without spaces between each codon.
  3. Print the DNA sequence of the last amino acid in the protein by accessing this element from the array.
  4. Create an additional stop codon variable and add the DNA for stop to your DNA sequence.

Lesson 4: Hashes

Arrays are very useful but not sufficient for many purposes. Often, we have a data set with no inherent order but where each entry has a particular value. For example, we may have a set of genes with expression levels, where we want to look up expression levels by gene name. There is no biological order to the gene names and we want to rapidly access particular genes without trawling through the whole data collection looking for it. Perl has a very efficient data structure for dealing with this situation - the hash.

In [14]:
%%perl -w
# hash_assignment.pl
 
my %geneticCode = ('TAT' => 'Tyrosine',
                   'TAC' => 'Tyrosine',
                   'CAA' => 'Glutamine',
                   'CAG' => 'Glutamine',
                  );

Hashes have keys and values, and work like a dictionary - we look up the keys to find the matching values. The keys are like indices of an array, but they can be any scalar, not just a number, as long as they are unique in the hash. In the hash above, the DNA codons are the keys (TAT, TAC and so on) and the amino acids names are the values. As with arrays, the values can be any type and do not need to all be the same type, as long as they are scalars.

We can access values from a hash by looking up their keys. We use curly brackets to do this, rather than the square brackets we used for array indices. As for array elements, a hash value is a scalar, not a hash, so we say $geneticCode{'TAT'}, not %geneticCode{'TAT'}.

In [15]:
%%perl -w
# hash_assignment.pl
 
my %geneticCode = ('TAT' => 'Tyrosine',
                   'TAC' => 'Tyrosine',
                   'CAA' => 'Glutamine',
                   'CAG' => 'Glutamine',
                  );

print $geneticCode{'TAT'};
print "\n";
print "$geneticCode{'CAG'}\n";
Tyrosine
Glutamine

We can extract the keys or values from a hash using the keys and values functions. These will return an unsorted array of the keys or values of the hash, which we can use like any other array (we will see how to sort these arrays later in the course).

In [16]:
%%perl -w
# hash_assignment.pl
 
my %geneticCode = ('TAT' => 'Tyrosine',
                   'TAC' => 'Tyrosine',
                   'CAA' => 'Glutamine',
                   'CAG' => 'Glutamine',
                  );

my @codons = keys %geneticCode;
my @aas = values %geneticCode;
print "@codons\n";
print "@aas\n";
CAG CAA TAC TAT
Glutamine Glutamine Tyrosine Tyrosine

We can add or update key-value pairs to the hash just as we could add new array elements, or delete them with the delete function. If we try to look up a non-existent key, Perl will complain, as long as we are using warnings (-w).

In [17]:
%%perl -w
# hash_assignment.pl
 
my %geneticCode = ('TAT' => 'Tyrosine',
                   'TAC' => 'Tyrosine',
                   'CAA' => 'Glutamine',
                   'CAG' => 'Glutamine',
                  );

$geneticCode{'TGA'} = 'Cysteine';
print "$geneticCode{'TGA'}\n";
$geneticCode{'TGA'} = 'STOP'; # Whoops! Got the assignment wrong the first time
print "$geneticCode{'TGA'}\n";
delete $geneticCode{'TGA'};
print "$geneticCode{'TGA'}\n";
Cysteine
STOP

Use of uninitialized value $geneticCode{"TGA"} in concatenation (.) or string at - line 14.

Exercise

  1. Print out the names of the amino acids that would be produced by the DNA sequence "GTT GCA CCA CAA CCG GTT GTG". Use a hash to do this, similar to the %geneticCode hash above.
  2. Why couldn't we build a hash where the keys are names of amino acids and the values are the DNA codons?

Lesson 5: Loops

In the exercise for Lesson 4, perhaps you ended up writing something like this:

In [18]:
%%perl -w
# hash_assignment.pl
 
my %geneticCode = ('TAT' => 'Tyrosine',
                   'TAC' => 'Tyrosine',
                   'CAA' => 'Glutamine',
                   'CAG' => 'Glutamine',
                  );

my @sequence = ('CAA','CAG','TAC');
print $geneticCode{$sequence[0]}, "\n";
print $geneticCode{$sequence[1]}, "\n";
print $geneticCode{$sequence[2]}, "\n";
Glutamine
Glutamine
Tyrosine

Repeating the last line of code several times is not ideal. We have to type, or copy and paste, the same thing multiple times and if we wanted to change what we were doing with each amino acid name, we would have to change every line (for example, we forgot to print a new line or a space after each name). It would be much easier to just write the line once and then execute that line for every item in the array. We can do this with a for loop:

In [19]:
%%perl
my %geneticCode = ('TAT' => 'Tyrosine',
                   'TAC' => 'Tyrosine',
                   'CAA' => 'Glutamine',
                   'CAG' => 'Glutamine',
                  );

my @sequence = ('CAA','CAG','TAC');

foreach my $codon (@sequence) {
   print "$geneticCode{$codon}\n";
}
Glutamine
Glutamine
Tyrosine

Because 'keys %hash' is an array, we can also loop over the entries in a hash in the same way:

In [20]:
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
                      'Beta-Actin' => 1.7,
                      'Pax6' => 0,
                      'HoxA2' => -3.2,
                     );

foreach my $gene (keys %geneExpression) {
    print "$gene: $geneExpression{$gene}\n";
}
Beta-Catenin: 2.5
HoxA2: -3.2
Beta-Actin: 1.7
Pax6: 0

Perl has another type of loop which can also be used to iterate over a hash, among other things - the while loop:

In [21]:
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
                   'Beta-Actin' => 1.7,
                   'Pax6' => 0,
                   'HoxA2' => -3.2,
                  );

while (my $gene = each %geneExpression) {
    print "$gene: $geneExpression{$gene}\n";
}
Beta-Catenin: 2.5
HoxA2: -3.2
Beta-Actin: 1.7
Pax6: 0

Like the for loop, the while loop executes the block of code within curly brackets. The difference is that, whereas the for loop takes an array and does something with each element of the array, the while loop takes a logical test and runs until that test is false. So here it runs until each of the genes in %geneExpression have been processed. Let's say we want to write out the first five codons of a gene's DNA sequence:

In [22]:
%%perl -w
my @sequence = ('CAA','CAG','TAG','CAT','GGT','GAG','GGC','CAG');

my $i = 0;
while ($i<5) {
    print "$sequence[$i] ";
    $i = $i + 1;
}
CAA CAG TAG CAT GGT 

Exercises

  1. Create an array where each element is an individual base of DNA. Make the array 15 bases long.
  2. Create a for loop to output every base of the sequence on a new line.
  3. Create a while loop similar to the one above that starts at the third base in the sequence and outputs every third base until the 12th.

Lesson 6: If statements and loop control

In the while loop above, we saw an example of a logical test: ($i<5). While this statement was true, the while loop continued; when it became false, the while loop ended. Another powerful control structure that uses logical tests is if..elsif..else.

In [23]:
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
                      'Beta-Actin' => 1.7,
                      'Pax6' => 0,
                      'HoxA2' => -3.2,
                     );

foreach my $gene (keys %geneExpression) {
    if ($geneExpression{$gene} < 0) {
        print "$gene is downregulated\n";
    }
    elsif ($geneExpression{$gene} > 0) {
        print "$gene is upregulated\n";
    }
    else {
        print "No change in expression of $gene\n";
    }
}
Beta-Catenin is upregulated
HoxA2 is downregulated
Beta-Actin is upregulated
No change in expression of Pax6

The elsif block is optional, and we can use as many as we like. The else block is also optional, but it is good practice to include it in most cases, so that you think to do something explicitly with whatever doesn't pass the test in the if block. Otherwise, some cases might not pass the condition, and Perl would pass over them silently, which may not be what you want. There are many different comparison operators we can use in our logical tests.

Strings and numbers have different operators.

OperatorDescriptionExample
==numeric equality1 == 2 # False
eqstring equality'a' eq 'b' # False
!=numeric non equality1 != 2 # True
nestring non equality'a' ne 'b' # True
<numeric less than1 < 2 # True
<=numeric equal or less than2 <= 2 # True
ltstring less than'a' lt 'b' # True by ASCII order
>numeric greater then1 > 2 # False
>=numeric equal or greater than1 >= 1 # True
gtstring greater than'a' gt 'b' # False

For example, we might want to output only the nonsynonymous bases of a codon (the first and second bases) and ignore the third, synonymous base (because most amino acids are determined by the first and second bases alone).

In [24]:
%%perl

my @sequence = ('T','A','C','G','G','C','A','T','C','T','A','G');

my $i = 0;
foreach my $base (@sequence) {
    $i = $i + 1;
    if ($i == 3) {
      $i = 0;
    }
    else {
      print "$base";
    }
}
TAGGATTA

Perl has a useful keyword that can simplify loops like this: next, which moves on to the next iteration of the loop if a condition is true:

In [25]:
%%perl

my @sequence = ('T','A','C','G','G','C','A','T','C','T','A','G');

my $i = 0;
foreach my $base (@sequence) {
    $i = $i + 1;
    if ($i == 3) {
      $i = 0;
      next;
    }
    print "$base";
}
TAGGATTA

Here we have omitted the else statement, against our advice above, and in this simple example the first version was probably better. But imagine if we had complex processing of the base to do, with perhaps ten or twenty lines of code following the if statement. If we used an else, we would have to enclose all of these lines in the else block, which would have made our code less readable than just placing them after the if statement.

Perl also has a last keyword, which exits a loop if a condition is met:

In [26]:
%%perl

my %geneticCode = ('TAT' => 'Tyrosine',
                   'TAC' => 'Tyrosine',
                   'CAA' => 'Glutamine',
                   'CAG' => 'Glutamine',
                   'TAG' => 'STOP',
                  );

my @sequence = ('CAG','TAC','CAA','TAG','TAC','CAG','CAA');

foreach my $codon (@sequence) {
   if ($geneticCode{$codon} eq 'STOP') {
       last;
   }
   else {
       print "$geneticCode{$codon}\n";
   }
}
Glutamine
Tyrosine
Glutamine

Here the amino acids in the sequence are printed until the STOP codon (TAG) is reached. We used the else here, but actually Perl provides a more readable syntax for if statements when there is only one command in the block (as here, the last command). We can just put the command before the if statement, and forget about the blocks:

In [27]:
%%perl

my %geneticCode = ('TAT' => 'Tyrosine',
                   'TAC' => 'Tyrosine',
                   'CAA' => 'Glutamine',
                   'CAG' => 'Glutamine',
                   'TAG' => 'STOP',
                  );

my @sequence = ('CAG','TAC','CAA','TAG','TAC','CAG','CAA');

foreach my $codon (@sequence) {
   last if ($geneticCode{$codon} eq 'STOP');
   print "$geneticCode{$codon}\n";
}
Glutamine
Tyrosine
Glutamine

Finally for this lesson, we can chain multiple logical tests with and and or operators. We can either use the words and and or, or the symbols && and ||.

Operator (symbolic)Operator (written)ExampleTo be true
&& and i < 50 && i > 5Both conditions must be true
|| or str eq "clock" || str ne "black"Only one condition needs to be true
In [28]:
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
                   'Beta-Actin' => 1.7,
                   'Pax6' => 0,
                   'HoxA2' => -3.2,
                  );

foreach my $gene (keys %geneExpression) {
    if ($geneExpression{$gene} < 0 or $geneExpression{$gene} > 0) {
        print "$gene has changed expression\n";
    }
    else {
        print "$gene has not changed expression\n";
    }

    if ($geneExpression{$gene} > 0 and $gene ne 'Beta-Actin') {
        print "$gene is upregulated and not a housekeeping gene\n";
    }
}
Beta-Catenin has changed expression
Beta-Catenin is upregulated and not a housekeeping gene
HoxA2 has changed expression
Beta-Actin has changed expression
Pax6 has not changed expression

Exercises

  1. Let's calculate GC content of a DNA sequence. Use the 15-base array you created for the exercises in the lesson on loops. Create a variable, $gc, which we will use to count the number of Gs or Cs in our sequence.
  2. Create a loop to iterate over the bases in your sequence. If the base is a G or the base is a C, add one to your $gc variable.
  3. When the loop is done, divide the number of GC bases by the length of the sequence and multiply by 100 to get the GC percentage. Recall the array lesson to see how to get the length of the sequence. Just as we have been using '+' to add in Perl, to multiply, we use '*' and to divide, we use '/'.

Lesson 7: Introduction to functions: basic numeric operators

In the next few lessons, we will introduce many of Perl's basic functions and operators for processing numbers, strings, arrays and hashes. We will see how to load functions from Perl modules and finally we will show how you can construct your own functions.

A function is something that takes some input and produces some output. We have already seen several Perl functions: print, keys, values and each are all examples of functions. For example, keys takes a hash as input and returns the keys of the hash as output.

Perl has many core operators and functions built into the language. For example, the following operators all take two numbers as input and produce one number as output:

OperatorSummaryExample
+addition2 + 2
-subtraction5 - 3
*multiplication4 * 4
/division8 / 2
%modulus, returns remainder10 % 2
**exponent4**2
In [29]:
%%perl
# addition
print "sum of 2 and 3 = ";
my $num = 2 + 3;
print "$num\n";

# subtraction
print "$num minus 2 = ";
$num = $num - 2;
print "$num\n";

# multiplication
print "product of $num and 3 = ";
$num = $num * 3;
print "$num\n";

# division
print "$num divided by 2 = ";
$num = $num / 2;
print "$num\n";

# modulus
print "the remainder of $num divided by 2 = ";
my $remainder = $num % 2;
print "$remainder\n";

# exponentiation
print "$remainder raised to the power 2 = ";
$num = $remainder ** 2;
print "$num\n";
sum of 2 and 3 = 5
5 minus 2 = 3
product of 3 and 3 = 9
9 divided by 2 = 4.5
the remainder of 4.5 divided by 2 = 0
0 raised to the power 2 = 0

Perl has a couple of handy shortcuts for this kind of operation. Any of the above operators can be placed before the equals sign, so that instead of writing $num = $num + 2 we can just write $num += 2. Also, to add or subtract one to or from a variable, we can use ++ or --:

In [30]:
%%perl
my $num = 1;

print "Number was $num,";
$num *= 5;
$num += 3;

print " now $num\n";

# increment in place
print "$num + 1 = ";
$num++;
print "$num\n";

# decrement in place
print "$num - 1 = ";
$num--;
print "$num\n";
Number was 1, now 8
8 + 1 = 9
9 - 1 = 8
++increment by 1 (same as $i = $i+1)$i++
--decrement by 1, (same as $i = $i-1)$i--
+=in place addition$i += 2
-=in place subtraction$i -= 3

Finally, here are a couple of useful functions for calculating the absolute value of some number, and the integer portion of a floating-point number:

FunctionSummaryExample
absabsolute valueabs(-32)
intget the integer portion of a numberint(5.45)
In [31]:
%%perl
# absolute value
print "the absolute value of 3 - 4.2 = ";
my $num = abs 3 - 4.2;
print "$num\n";

# conversion to integer
print "int portion of $num = ";
print int $num, "\n";
the absolute value of 3 - 4.2 = 1.2
int portion of 1.2 = 1

A word about brackets here. If you are used to other languages, you may be surprised that we don't need brackets around function arguments. Perl is generally fairly lax about requiring brackets, in the hope that code can be read something like a natural language. However, it doesn't mind if you put the brackets in, so if you find it easier to read, you can write things like this:

In [32]:
%%perl
# absolute value
print "the absolute value of 3 - 4.2 = ";
my $num = abs(3 - 4.2);
print "$num\n";

# conversion to integer
print "int portion of $num = ";
print(int($num), "\n");
the absolute value of 3 - 4.2 = 1.2
int portion of 1.2 = 1

The rule of thumb is to do what you find most readable and clear. For example, the brackets around the abs are probably a good idea; it saves us from thinking Perl will calculate abs 3 and then subtract 4.2 from that. But the brackets on the int line arguably clutter up the line rather than making it clearer. Sometimes the brackets will be necessary, however, to tell Perl which arguments belong to which function. As int only takes one argument, there is no ambiguity here, but we will shortly see some examples where the brackets are required. This is something you will figure out through experience.

Exercises

  1. Can you simplify your GC calculation from the previous lesson using one of the shorthands here?
  2. Did you get an ugly floating point number for your GC percentage? Turn it into an integer.

Lesson 8: String functions and operators

Perl also has a set of useful string functions. For example, we can add strings together using the concatenation operator, which has a prefix notation like the numerical operators:

OperatorSummaryExample
.concatenation$str . " " . "!!"
.=concatenation & assignment$str .= "add this"
In [33]:
%%perl
my $str = "ACG" . "AGA" . "GCG";
print "$str\n";
$str .= "TGT";
print "$str\n";
ACGAGAGCG
ACGAGAGCGTGT

Here are some more basic functions, which are hopefully fairly self-explanatory:

FunctionSummaryExample
lengthlength of stringprint length $str
ucuppercaseprint uc $str
lclowercaseprint lc $str
ucfirstuppercase first characterprint ucfirst $str
lcfirstlowercase first characterprint lcfirst $str
chopremove last characterchop $str
chompremove new line characterschomp $str
In [34]:
%%perl
# change case
my $str = "ACGAGAGCGTGT";
print length $str, "\n";
print "$str\n";
$str = lc $str;
print "$str\n";
print uc $str, "\n";
print "ucfirst: " , ucfirst(lc($str)),"\n";
print "lcfirst: " , lcfirst uc $str,"\n";
chop $str;
print "Chopped: $str\n";
$str .= "\n";
print "Str now has a new line: $str$str$str";
chomp $str;
print "Now it doesn't: $str$str$str";
12
ACGAGAGCGTGT
acgagagcgtgt
ACGAGAGCGTGT
ucfirst: Acgagagcgtgt
lcfirst: aCGAGAGCGTGT
Chopped: acgagagcgtg
Str now has a new line: acgagagcgtg
acgagagcgtg
acgagagcgtg
Now it doesn't: acgagagcgtgacgagagcgtgacgagagcgtg

chomp is often used when processing files or user input, because lines from files or lines entered by users will usually have a new line character on the end. More on this later.

Here are some more sophisticated string processing functions.

substrsub stringprint substr($str, 3, 5)
splitsplits up a string, returns an arraymy @array = split ' ', $str

split takes a string and breaks up the string at every instance of a particular character, returning an array containing each part of the string:

In [35]:
%%perl
my $seq = "ACG AGA GCG TGT";
my @codons = split ' ', $seq;
foreach my $codon (@codons) {
    print "$codon\n";
}
foreach my $base (split '', $seq) {
    next if $base eq " ";
    print "$base\n";
}
ACG
AGA
GCG
TGT
A
C
G
A
G
A
G
C
G
T
G
T

substr takes a string, a starting position and (optionally) a length, and returns a substring of the string of the desired length that starts at the given position. As with array indexing, substr starts counting from 0. It has a few odd variants, which are worth exploring. If the length argument is omitted, substr returns everything from the start position to the end of the string. If a negative number is used for the start position, substr counts from the end of the string. If a negative number is used for the length, substr again subtracts that number of characters from the end of the string. Make sure you understand the following examples.

In [36]:
%%perl
my $str = "ACGAGAGCGTGT";
print substr($str, 3, 3),"\n";
print substr($str, 3), "\n";
print substr($str, -3), "\n";
print substr($str, 3, -3), "\n";
AGA
AGAGCGTGT
TGT
AGAGCG

Exercise

  1. Write a routine that takes a DNA sequence and uses substr and a while loop to print out each codon in the sequence on a new line. Before the while loop starts, check that the length of the sequence is divisible by three and only run the while loop if it is.

Lesson 9: Array and Hash Functions

We have seen how to create a new array containing a set of elements, and one way to add new elements to an array (by setting a particular array element to a chosen value). We need a bit more flexibility, however. Here are some functions to help us manipulate arrays we have already created.

FunctionDescription
pop @arrayremoves last element from array and returns it
push @array, $new_element, ...appends one or more elements to the end of an array
shift @arrayremoves the first element from the array and returns it
unshift @array, $new_element, ...add elements at the front of the array
spliceadds or removes elements anywhere in the array
join($str, @array)joins the elements of the array together with the specified string, and returns the whole string

pop, push, shift and unshift all remove or add one element from or to an array. pop and push work from the end of the array; shift and unshift work from the beginning. pop and shift return the element they have removed.

In [37]:
%%perl

my @array = ('ACT','GCT','GAG','CAG');
print "Array is @array\n";
my $last = pop @array;
print "Popped the last element off the array: @array. Here's the last element: $last\n";
push @array, 'CGT';
print "Added CGT to the end of the array: @array\n";
my $first = shift @array;
print "Shifted the first element off the array: @array. Here's the first element: $first\n";
unshift @array, 'AAC';
print "Added AAC to the beginning of the array: @array\n";
Array is ACT GCT GAG CAG
Popped the last element off the array: ACT GCT GAG. Here's the last element: CAG
Added CGT to the end of the array: ACT GCT GAG CGT
Shifted the first element off the array: GCT GAG CGT. Here's the first element: ACT
Added AAC to the beginning of the array: AAC GCT GAG CGT

With splice, we can remove chunks of the array and replace them with something else if we want to. Like substr, splice takes an array as input, a starting position and an optional length, plus an optional replacement array. It removes the elements from the start position up to the given length, and replaces them with the replacement array. It returns the array elements it has removed.

In [38]:
%%perl

my @array = ('ACT','GCT','GAG','CAG');
print "Array starts as @array\n";
my @spliced = splice @array, 1, 2;
print "Removed @spliced to leave @array\n";
my @replacement = ('CGG','CCC','AGT','GAC');
@spliced = splice @array, 1, 1, @replacement;
print "Removed @spliced and replaced with @replacement to leave @array\n";
Array starts as ACT GCT GAG CAG
Removed GCT GAG to leave ACT CAG
Removed CAG and replaced with CGG CCC AGT GAC to leave ACT CGG CCC AGT GAC

Finally, join does the opposite of split; it takes an array and joins the elements together into a string, using a separator if given:

In [39]:
%%perl
my @bases = ('A','C','G','T','A','C','G','G','T','T','G','A');
my $seq = join '', @bases;
print "@bases is now $seq\n";
$seq = join ':', @bases;
print "We can use separators too: $seq\n";
A C G T A C G G T T G A is now ACGTACGGTTGA
We can use separators too: A:C:G:T:A:C:G:G:T:T:G:A

Perl also provides some convenience hash functions. We've already met three of these.

FunctionDescription
keys %hashretrieve an array of all keys in a hash
values %hashretrieve an array of all values in a hash
each %hashretrieve next key-value pair from a hash
exists $hash{'key'}test whether a key is found in the hash
delete $hash{'key'}delete a key, and any corresponding value, from a hash

Here is a simple example of using exists and delete. We wouldn't normally write such code, with the hash and array defined next to each other, but suppose that the hash has been loaded from one file and the array from another, and we don't know whether the array contains the same genes as the hash.

In [40]:
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
                      'Beta-Actin' => 1.7,
                      'Pax6' => 0,
                      'HoxA2' => -3.2,
                     );

my @genes = ('Pax6','EphB2','HoxA2');

for my $gene (@genes) {
    if (exists $geneExpression{$gene}) {
        print "Found $gene in hash, has expression $geneExpression{$gene}\n";
        delete $geneExpression{$gene};
    }
    else {
        print "Can't find $gene\n";
    }
}

print "Hash now contains the following genes: ", 
    join(' ', keys %geneExpression), " with the following values: ", 
    join(' ', values %geneExpression), "\n";
Found Pax6 in hash, has expression 0
Can't find EphB2
Found HoxA2 in hash, has expression -3.2
Hash now contains the following genes: Beta-Catenin Beta-Actin with the following values: 2.5 1.7

Exercise

Let's put all of this together. You are given the following data:

In [41]:
%%perl

my %geneticCode = ('TCT' => 'Serine',
                   'TCC' => 'Serine',
                   'CTA' => 'Leucine',
                   'CTG' => 'Leucine',
                   'TGT' => 'Cysteine',
                   'TAG' => 'STOP',
                  );

my $sequence = 'TCCTGTCTACCCCTGAAGTCTTAG';

Write a routine that will convert the given DNA sequence into an amino acid sequence, using the first letter of the amino acid name for each amino acid (eg output S for Serine, L for Leucine and so on). Where no codon is present in the hash, output '?'. Use '*' for the stop codon.

You have all the tools to figure out how to do this, but if you need some hints:

  1. You're going to need to iterate through the sequence and split it into codons. You can do this with a while loop or a for loop.
  2. There are many ways to get the codons. For example, you can either shift bases off the sequence into a codon variable until the codon is three bases long, or you could split the sequence up into an array and use splice to chop off three bases at a time. You may be able to think of another solution.
  3. You'll need an if statement to check whether the codon exists in the hash. If it does, you'll want to add the first character of the amino acid value in the hash to an amino acid sequence (which you'll have to set up before the loop begins). You can use substr to get the first character. If it doesn't, you want to add '?' to the amino acid sequence.
  4. You will need another if statement to check whether the codon is a STOP codon or not. If it is, you want to add '*' to the amino acid sequence, not the first character of STOP.

Lesson 10: Core functions, modules and subroutines

We have now made use of several core Perl functions. There are many more that we haven't covered here. See Perl functions for a full list. Perl also has a large number of core modules, which are packages containing additional functions with common goals. These are all documented on Perldoc too.

The core functions are always available, but to use a module we need to import it. For example, one commonly used module is List::Util. This contains some commonly used list (array) functions. Here's how we can use it:

In [42]:
%%perl

use List::Util ('max', 'min');

my @genome_mb = (100.3, 2700, 4.6, 3200, 12.1);

print max(@genome_mb), "\n";
print min(@genome_mb), "\n";
3200
4.6

We import the functions max and min from List::Util, and then we can use these functions just like any of Perl's core functions. We can find out what functions are contained in List::Util by looking up the `List::Util` Perldoc page.

We can also create our own functions, or subroutines. If there is a block of code you use more than once in your script, or there are some fiddly details that you would rather hide away from the important parts of your algorithm, you should write a subroutine. Using subroutines reduces the chance of introducing errors by copying and pasting code, and if you have to change something you only change it in one place, not in many.

Just like a normal function, subroutines can take arguments as input and return values as output. Here's an example:

In [43]:
%%perl

my @genome_mb = (100.3, 2700);

sub maximum {
    my ($a, $b)= @_;
    
    if ($a > $b) {
        return $a;
    }
    else {
        return $b;
    }
}

my $max = maximum(@genome_mb);

print "Maximum genome size in our list is $max Mb\n";
Maximum genome size in our list is 2700 Mb

Here we create a subroutine called maximum which takes two arguments, $a and $b. Perl has an odd syntax for subroutine arguments. It creates a special array, @_, to hold any arguments passed into a subroutine. To get at the arguments, we have to unpack @_. This is what we are doing in the first line of maximum.

To return a value from the subroutine, we use the return keyword. Perl's subroutines can only return a single value or an array, but not separate multiple values, and not a hash.

For interest, here's another take on this subroutine:

In [44]:
%%perl

my @genome_mb = (100.3, 2700);

sub maximum2 {
    my $a = shift;
    my $b = shift;
    return ($a > $b) ? $a : $b;
}

my $max = maximum2 @genome_mb;

print "Maximum genome size in our list is $max Mb\n";
Maximum genome size in our list is 2700 Mb

Three things have changed here. Firstly, we are now calling the subroutine without brackets - this works for our own subroutines as well as core functions, although it is not recommended. Secondly, we are using shift to take arguments off the special @_ array; as this array is always created for a subroutine, Perl doesn't require us to call it by name, and just assumes we mean this array when we don't give an argument for shift.

Thirdly, we are using a nice shorthand conditional syntax: (cond) ? A : B. If cond is true, do whatever is after the ?; if false, do whatever is after the :. This is very handy when we are trying to decide what value a variable should have based on some condition, as here, or when the processing we need to do as a result of the computation is very simple.

Exercises

  1. Write a subroutine based on maximum that takes two numerical arguments and returns their mean.
  2. Our mean and maximum subroutines assume we only have two arguments, but often we have more or less. Write another subroutine that takes an array that could contain any number of elements and returns the mean of all the elements. Use the first @genome_mb array (with five elements) to test your subroutine.
  3. Write a subroutine that takes a DNA sequence as input and returns the GC content of the sequence as a percentage. Round the percentage to the nearest integer.

Lesson 11: User/Command-line input

So far we have just been operating on data that we have defined ourselves within our scripts. In the real world we will want to get input data from users and files etc., operate on it, and frequently output some results to other files. Perl has lots of support for dealing with input and output (I/O in computer jargon), and we'll go through some of these in the next few lessons.

The first form of user input we'll look at is reading in arguments passed to our scripts on the command line, these may include specific parameters for some analysis (e.g. gene names, or numerical cutoffs) or often the names of files that we will operate on later. Command line arguments are similar in some ways to arguments that we pass to subroutines.

So far we have been running our scripts with command lines like:

perl my_script.pl

If we want to supply some arguments to the script we can do so by writing them after the name of the script:

perl my_script.pl 23 BRCA2 0.5

Perl will split these up according to whitespace (so if you want to pass in an argument that contains whitespace you'll need to put the string in quotation marks), and then put them all in another 'magic' array variable called @ARGV.

We can then access arguments as values in this array just like any other, so @ARGV is equivalent to the @_ array Perl uses to pass arguments to subroutines.

In [45]:
%%file lesson11_example1.pl

my @user_input = @ARGV;
print join("\n" , @user_input) , "\n";
Writing lesson11_example1.pl
In [46]:
%%bash
perl lesson11_example1.pl 23 BRCA2 0.5
23
BRCA2
0.5

shift and @ARGV

Remember the array function shift removes and returns the first element of an array.

In [47]:
%%file lesson11_example2.pl

my $first_input = shift @ARGV;
print "first input: $first_input\n";
Writing lesson11_example2.pl
In [48]:
%%bash
perl lesson11_example2.pl one two three
first input: one

Just like in subroutines where the argument array @_ is implicitly used when calling array functions like shift so you don't have to explicitly name it, in the main body of a script @ARGV will be used implicitly, so a common idiom for reading in a command line argument is:

In [49]:
%%file lesson11_example3.pl

my $first = shift;
print "First: $first";
Writing lesson11_example3.pl
In [50]:
%%bash
perl lesson11_example3.pl one two three
First: one

Using array indices

Remember with arrays that each element can be accessed using indices.

In [51]:
%%file lesson11_example4.pl
my $first_input = $ARGV[0];
my $second_input = $ARGV[1];
print "$first_input and $second_input\n";
Writing lesson11_example4.pl
In [52]:
%%bash
perl lesson11_example4.pl one two three
one and two

Assigning an array to a list of variables

In [53]:
%%file lesson11_example5.pl
my ($in_1, $in_2 , $in_3) = @ARGV;
print "$in_1 and $in_2 and $in_3\n";
Writing lesson11_example5.pl
In [54]:
%%bash
perl lesson11_example5.pl one two three
one and two and three

Exercises

  1. Write a script that takes 2 numbers from the command line, adds the 2 numbers and prints out the result.
  2. Write a script that reads in a DNA sequence from the command line, and then prints out its length and GC percentage (using the subroutine you wrote earlier on to compute the GC content)

Lesson 12: Sanity checks and testing

Now we are able to read in some input from users, it is important for your programs to check that the input matches any assumptions you have made about this input, and to deal with things gracefully if they do not. For example, say you want to compute the average of some numbers the user supplies on the command line, what should you do if the user gives you strings instead of numbers?

An even more basic check to perform is that an expected argument has indeed been supplied.

Another common check is that if the user supplies the name of a file to operate on (as we will see later), we should check that the file exists, and perhaps that it has some data in it.

Perl provides a few useful functions to help with these sort of checks:

TestDescription
definedchecks if a variable is defined
-echecks if a file exists
-schecks the size of a file (returning the result as a number)

If we find some problem with user input then Perl also has a couple of useful functions that allow us to print warning messages to the user, without interfering with the normal output of the program:

ActionDescription
die $messageprints a message to STDERR and kills the program
warn $messageprints a message to STDERR, but the program will continue

The example below checks to see if an argument was provided, and if it is then the argument will be assigned to $arg_1. Otherwise the else block will be executed and the script will die and an error message will be printed.

In [55]:
%%file lesson12_example1.pl

my $arg_1;
 
if (defined $ARGV[0]){
    $arg_1 = $ARGV[0];
    print "Here's your number: $arg_1\n";
}
else {
    die "Please supply a number\n";
}
Writing lesson12_example1.pl
In [56]:
%%bash
perl lesson12_example1.pl 3.14
perl lesson12_example1.pl
Here's your number: 3.14
Please supply a number

File tests

The example below does the same check, but also uses -e to check that a file with the supplied name exists. Here we are using negative conditions (NOT defined and so on) to check the basic sanity tests, so we can get them out of the way before we get on to the rest of the code. This way, we don't have to embed all of our code in if blocks.

In [57]:
%%file lesson12_example2.pl

if (!defined $ARGV[0]) {
    die "Please supply a file\n";
}

if (!-e $ARGV[0]) {
    die "$ARGV[0] does not exist. Please check the filename and/or location.\n";
}

my $filename = $ARGV[0];

print "$filename is a file that exists!\n";
Writing lesson12_example2.pl
In [58]:
%%bash
perl lesson12_example2.pl lesson12_example1.pl
perl lesson12_example2.pl
perl lesson12_example2.pl lesson12_example42.pl
lesson12_example1.pl is a file that exists!
Please supply a file
lesson12_example42.pl does not exist. Please check the filename and/or location.

Exercises

  1. Modify the script you wrote for the last lesson that takes two numbers from the command line and adds them, add a check to make sure that two arguments are supplied. Die if there are too few or too many arguments (Hint: remember that @ARGV is just a normal array)
  2. Write a script that takes a single filename as an argument and prints out the size of the named file. The script should check that an argument has been provided and that the file exists. Use any file that you can see in the same directory as your script (on the command line, use ls on Linux or Mac or dir on Windows to check for one)
  3. Optional: what else could go wrong with your script that adds two numbers? For example, what happens if one of the arguments is not a number? See if you can find a function in a core module that will deal with this problem, load it into your script and use it.

Lesson 13: Files and File handles

Frequently the data we want to operate on or analyse will be stored in files, so in our programs we need to be able to open files, read through them (perhaps all at once, perhaps not), and then close them. We will also frequently want to be able to print output to files rather than always printing out results to the terminal.

Perl supports all of these modes of operations on files, and provides a number of useful functions and syntax to make dealing with files fairly straightforward.

The first thing we need to be able to do is open a file, for this Perl has the conveniently named open function.

The syntax of this function is: open(file handle, mode, file name)

A file handle is a special scalar variable that perl uses to represent files in your program, and is distinct from the file name which is just a string which contains the location of the file on your machine's hard disk.

The mode tells Perl if you want to open your file for reading or writing, or appending (another form of writing), and Perl uses special characters to represent each of these modes (note these are the same as the UNIX file redirection symbols):

ModeSymbol
reading<
writing>
appending>>

If the call to open is successful it will return a true value, if it is not, perhaps because the file doesn't exist, or you don't have the correct permissions to open it, it will return a false value.

Another very common idiom that relies on this is to include an or die statement directly after the call to open which will be executed if there was some problem opening the file.

In [59]:
%%file lesson13_example1.pl

my $file = shift @ARGV; 

open (my $fh , '<' , $file) or die "Can't open $file, $!\n";

print "opened file: $file\n";
Writing lesson13_example1.pl
In [60]:
%%bash
perl lesson13_example1.pl lesson13_example1.pl
opened file: lesson13_example1.pl

Iterating through a file

Once we have successfully opened up our file we will probably want to start reading the contents of a file, which we typically do line by line. A common pattern, especially if we don't know in advance how many lines our file contains, is to use a while loop to loop over all the lines in the file operating on each line in turn - perhaps doing some analysis on each line as we proceed, or perhaps storing the data in some data structure to analyse later on in the script.

Perl provides some special syntax to read files line by line, to access lines in a file we wrap our file handle in <> characters, e.g. <$FILEHANDLE>. This is another operator that is aware of the context in which it is called. In scalar context this will return the next line in the file (Perl keeps track of where in a file it is up to for you), but if used in array context it will return all lines from the file in an array (watch out for big files!)

  • scalar context: my $line = <$FILEHANDLE>
  • array context: my @lines = <$FILEHANDLE>

Each line in the file with have an invisible newline character \n at the end, and generally we will want to remove this before operating on the line, and so we can use the chomp string function to remove it.

The example file used below is in Course_Materials/lesson_files but any file you have will do. You can either change directory to lesson_files, or you can just specify the name of the directory when writing out the file. Here we are using the Unix command cat to print out the contents of the file.

In [61]:
%%bash
cat lesson_files/lesson13_file1.txt
cd lesson_files
cat lesson13_file1.txt
This is a file
with
lines of words.
This is a file
with
lines of words.
In [62]:
%%perl

my $filename = 'lesson_files/lesson13_file1.txt';

open(my $inputfh, '<' , $filename) or die "can't open $filename\n";
 
while (my $line = <$inputfh>) {
    chomp $line;
    print uc $line , "\n";
}
THIS IS A FILE
WITH
LINES OF WORDS.

Open and close a file for writing

If we open a file for writing we to tell Perl to print output to that file handle, otherwise it will keep printing out to the terminal. We do this by putting the file handle between the print statement and the stuff we want to print to that file. It is good practice to then close your file when you are finished with it.

In [63]:
%%perl
my $newfilename = 'lesson13_file2.txt'; 
open(my $newfh , '>' , $newfilename) or die "Can't open $newfilename\n";
 
print $newfh lc "Print this TO a FILE in LOWER case\n";
close $newfh;
In [64]:
%%bash
cat lesson13_file2.txt
print this to a file in lower case

Tab delimited files

In biology we very often end up dealing with tab (or comma) delimited files. We can use some of the string functions we covered earlier such as split, in combination with the various file operations Perl supports, to deal with the sorts of files in a very straightforward way.

Here is an example splitting the columns into elements of an array. The file we will use contains the chromosomal coordinates of a few genes downloaded from Ensembl, and has 4 columns: gene, chromosome, start, end. We want to read this file and then compute the length of each gene.

In [65]:
%%bash
cat lesson_files/lesson13_file3.txt
BRCA2	13	32889611	32973805
TNFAIP3	6	138188351	138204449
TCF7	5	133450402	133487556
In [66]:
%%perl

my $genefilename = 'lesson_files/lesson13_file3.txt';

open(my $genefh, '<' , $genefilename) or die "Can't open $genefilename\n";

while (my $line = <$genefh>) {
    chomp $line;
    
    my ($gene, $chr, $start, $end) = split "\t", $line;
    
    ## for csv (comma separated files) split on the comma
    # split ',' , $line;
    
    my $length = $end - $start + 1;
    
    print "$gene is on chromosome $chr and is $length bps long\n";
}
close $genefh;
BRCA2 is on chromosome 13 and is 84195 bps long
TNFAIP3 is on chromosome 6 and is 16099 bps long
TCF7 is on chromosome 5 and is 37155 bps long

Exercises

  1. Create a script that takes the name of a file containing many lines of nucleotide sequence as a command line argument, checks that it exists with -e, and opens the file for reading. For each line in the file, print out the line number and the length of the corresponding line. You can use lesson13_file4.txt as an example input file, or you can create your own.
  2. Write a script that reads in a FASTA file of DNA sequence and computes the GC content of each line, using the GC content subroutine from earlier lessons (see, for example, lesson11_exercise2.pl). The script should also open an output file for writing and then print the GC content for each input line on a separate line in the output file. You should check that you don't try to compute GC content on the header lines, which start with a '>' character. You can use lesson13_file5.fasta as an example input file, which contains BRCA2 sequences for human and mouse, or you can create your own.
In [67]:
%%bash 
cat lesson_files/lesson13_file4.txt
CGGCTAGATCCAGAT
CGTGTAA
GTACACCCA
GTCAACACTTA
In [68]:
%%bash 
cat lesson_files/lesson13_file5.fasta
>13 dna:chromosome chromosome:GRCh37:13:32889611:32973805:1
GGGCTTGTGGCGCGAGCTTCTGAAACTAGGCGGCAGAGGCGGAGCCGCTGTGGCACTGCT
GCGCCTCTGCTGCGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGCCGCCGGGAGAAGCGTG
AGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCAAAAAAGAACTG
CACCTCTGGAGCGGGTTAGTGGTGGTGGTAGTGGGTTGGGACGAGCGCGTCTTCCGCAGT
CCCAGTCCAGCGTGGCGGGGGAGCGCCTCACGCCCCGGGTCGCTGCCGCGGCTTCTTGCC
CTTTTGTCTCTGCCAACCCCCACCCATGCCTGAGAGAAAGGTCCTTGCCCGAAGGCAGAT
TTTCGCCAAGCAAATTCGAGCCCCGCCCCTTCCCTGGGTCTCCATTTCCCGCCTCCGGCC
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGA
CGCCATCTGAAATTTCTTGGAAACACGATCACTTTAACGGAATATTGCTGTTTTGGGGAA
>gi|1905987|gb|U89503.1|MMU89503 Mus musculus Brca2 (BRCA2) gene, partial cds
GATATCCACTCTAAAAAATAAAAAAAGAAAGTTTATTTACTCTGTAAGTGATGATGCATCTCTTCAAGGA
AAAAAACTACAGACACACAGACAGTTAGAGCTTACTAATCTTTCAGCCCAGCTTGAAGCAAGTGCTTTTG
AAGTACCATTGACCTTTACAAATGTAAATTCAGGTATTATTCTTTTTTTCTTCTAATGTATATAGTCTTT
AGGATGAGATTAAGTTTTCCTATTTGAATATAAGTTTTTTGTTTTTTGAATTTTAAGAGAAACACTTGCT
TACTTTCTGTAGTTTTTCAGATCTCTTTTTGGTTTGGTTTGGTTTGGTTTGGTTTGGTTTGGTTTGGTGA
GAACCCTGTGTAGCCCTGGCTGTCTTAGAACTTAGGCTGGCCTGGAACTTAGATATC

Lesson 13b: Perl one-liners

Very often in bioinformatics we will deal with line-oriented text files with multiple columns of information, and sometimes we want to convert between formats or compute simple statistics over a complete data file without having to write a full script (though that is a good idea if the computation we want to perform is complex, or we will be running the same analysis multiple times).

There are several unix tools that can be used to do this kind of work, such as cut to extract particular columns, awk to filter data and perform simple calculations, and sed to substitute text. Perl was originally designed as a superset of all these tools, and supports working with data files in a very similar way to these tools in 'one-liner' mode.

Various flags can be supplied to the perl interpreter to make it useful in this context:

  • -e execute the string passed on the command line as a perl program (the program should be supplied within quotes)
  • -l automatically chomps each line, and prints a trailing newline after each print command
  • -a automatically splits each line from the standard input into a special array called @F, the delimiter defaults to whitespace, but can be controlled by specifying a delimiter with -F, e.g. -F"," will split on commas, or -F"\t" to split on tabs.
  • -n which tells perl to assume the following loop around your program: while (<>) { # your program goes here }

These can all be combined to produce a powerful mechanism for processing files and other data piped into your program from the command line. If you want to keep track of some data throughout the run and then produce some summary results at the end, you can use an END{} block at the end of your code which will only be run once the whole file has been processed. You can also supply a BEGIN{} block which will be run before any lines are processed and can be used, for example, to print out a header line for your output.

For example, we could solve the exercise from above with the following command line:

In [69]:
%%bash
perl -lane 'print $. , ": ", length $F[0]' < lesson_files/lesson13_file4.txt
1: 15
2: 7
3: 9
4: 11

Note that the special perl variable $. contains the current line number being processed. As another example we could print out the gene names and corresponding chromosome from the example file of gene coordinates with:

In [70]:
%%bash
perl -lane 'print "$F[0] is on chromosome $F[1]"' < lesson_files/lesson13_file3.txt
BRCA2 is on chromosome 13
TNFAIP3 is on chromosome 6
TCF7 is on chromosome 5

Exercises

  1. Write a perl one liner to compute the length of each gene from lesson13_file3.txt and print the gene name in lower case and it's length separated by a comma, you could print a header line as well with a BEGIN{} block.

  2. Advanced: write a perl one liner to compute the counts of all nucleotides occuring in lesson13_file4.txt

Lesson 13c: System calls

Another common use of Perl, particularly on UNIX systems, is to automate running external programs or scripts.

Perl makes this very easy to do, and there are two ways in which you can run an external command.

  • using system function
  • using backticks ``

system

system is a function that you can supply the command line statement to be executed as the first argument and the exit status is returned. Output cannot be captured in a variable using system, but see backticks `` below for this feature.

system returns:

  • 0 for no errors
  • -1 for errors
  • if you get an error, you can print the variable $! to find out the reason
In [71]:
%%perl

my $sys = system ("date");
print "sys returned: $sys\n";

if ($sys == -1) {
    print "Error: $!\n";
}
Thu  1 May 2014 20:38:40 BST
sys returned: 0

Backticks ``

You can also wrap the command you want to run (which may include variables) in backticks``. The advantage of this approach is that you can capture the output of the external command into a variable in your script, and then operate on this data. In array context backticks will give you all the output lines in an array, while in scalar context you will get all the output as a single string value.

In [72]:
%%perl
my $dir = `pwd`;

print "I am currently in $dir\n";
I am currently in /Users/grsr/git/perl-course

Exercises

  1. Write a script that runs the UNIX command ls to get a list the files in the current directory, then prints out all upper case versions of all the file names.
  2. Modify your script to only print out Perl files, i.e. those ending with '.pl'

Lesson 14: Sorting with arrays and hashes

Sorting data in one way or another is a very common operation. Perl provides us with a very flexible function called sort that we can use to sort any kind of array, and because we can access the keys and values of hashes as arrays, it allows us to control the order in which we output values stored in hashes etc.

sort is very customisable, but its default behaviour is to sort in alphabetical order (or ASCII order to be precise).

sort is a function that takes an array as an argument and returns a copy of the array sorted according to our criteria.

In [73]:
%%perl

my @array = ('Valine', 'Leucine','Cysteine','Tyrosine');
my @sorted_array = sort @array;
 
foreach my $element (@sorted_array){
    print "$element\n";
}
 
my @numbers = (3,400,1,100,4,20);
my @sorted_numbers = sort @numbers;
 
foreach my $element (@sorted_numbers){
    print "$element\n";
}
Cysteine
Leucine
Tyrosine
Valine
1
100
20
3
4
400

Sorting numerically

When we have numerical data we generally don't want this behaviour, we'd rather that the values were sorted by their values, possibly either ascending or descending.

To do this we have to customise the behaviour of sort with a small block of code placed in between sort and the array we are sorting.

The only thing sort needs to know to be able to sort any array in any complicated way you can imagine is whether one value is "less than", "greater than" or "equal" to another, and that what the block of code you supply to sort tells it. Inside this block of code the two values sort is asking you to compare are called $a and $b.

So, to sort a list of numbers we need some function that tells sort how to compare two numbers. You could supply your own function, but this is a common requirement so Perl gives us the "spaceship" operator <=>, and the block of code we will use is: {$a <=> $b}.

This block will sort numbers in ascending order, if we want to reverse the sort order we can just swap the order of $a and $b.

In [74]:
%%perl

my @numbers = (3,400,1,100,4,20);

my @sorted_numbers = sort {$a <=> $b}  @numbers;
 
foreach my $element (@sorted_numbers){
    print "$element\n";
}
1
3
4
20
100
400

Sorting by length

As well as sorting arrays according to the values of the elements, we can sort by any property of the values as well.

A useful example is to sort an array of strings by the lenght of each string, instead of their alphabetical ordering. We can do this just by slightly modifying the sort block.

In [75]:
%%perl

my @fragments = ( 'TCTCGAATC' , 'TTA' , 'A' , 'GCGTGATGTCGA' , 'GATC' );

my @sorted_by_length = sort { length($b) <=> length($a) } @fragments;

foreach my $element( @sorted_by_length ) {
    print "$element\t", length ($element), "\n";
}
GCGTGATGTCGA	12
TCTCGAATC	9
GATC	4
TTA	3
A	1

Sorting a hash by values

Because hashes are inherently unordered people often want to access hashes in more controlled way by sorting either the keys or the values.

Sorting by key is easy, we just use keys to get an array of the keys, sort this array, and then use this sorted array to access all of out values. Sorting by value is a little trickier.

Recall that to access the value corresponding to some key in a hash we use code like: $value = $hash{$key}

So to access the values of the hash in a sorted order, we will in fact have to sort the keys.

In our sort block $a and $b will hold the keys of the hash, but just as in the length example above we will be sorting by some property of the keys, namely their corresponding values. Here's an example:

Order of operations:

  1. keys are retrieved with the keys function
  2. the keys are used to retrieve the values
  3. the keys are sorted and returned based on the value ( $value = $hash{$key} )
  4. the list of newly sorted keys (based on the value) are passed 1 by 1 to $nt by the foreach loop
In [76]:
%%perl

my %nt_count = (
                 'G' => 2,
                 'C' => 1,
                 'T' => 4,
                 'A' => 3,
                );

@sorted_by_values = sort {$nt_count{$a} <=> $nt_count{$b}} keys %nt_count;

print join ' - ', @sorted_by_values;

print "\n";

foreach my $nt( @sorted_by_values ) {
    print "$nt\t$nt_count{$nt}\n";
}
C - G - A - T
C	1
G	2
A	3
T	4

Exercises

  1. Write a script that reads in some numbers from the command line and prints them in descending numerical order.
  2. Write a program that will accept a DNA sequence from the command line and outputs the number of As, Cs, Gs and Ts in the sequence, sorted by number of occurrences of each nucleotide, smallest first. Hint: try using a hash to store the nucleotide counts, where the keys are the nucleotides and the values are the occurrences.

Lesson 15a: Regular Expressions I

In all sorts of programming contexts we will want to manipulate and analyse text. We have seen a number of useful string functions built into Perl that let us do things like look at substrings of strings, chop them up in various ways, and we can already combine these with Perl's logical operators to manipulate strings in fairly complex ways.

We don't need very complicated examples though to realise the limitations of manipulating text using what we've learned so far though. Consider the simple example of checking that a variable supplied by the user is a true number, and doesn't contain any non-numerical characters. How could you check for this?

This is just a very simple example, there are all sorts of patterns we might want to identify in data - in biology we might want to check that a string contains only valid DNA bases, or amino acid codes, or even try to identify specific motifs in a DNA sequence where proteins bind or restriction enzymes cut.

Perl provides us with a very powerful, but somewhat complicated, system to let us answer these kinds of questions called "Regular Expressions" ("regexes", "REs"). This is almost another language built inside Perl and we will only introduce some of the basics here.

When we write a regular expression we are essentially writing a template against which we will check values to see if they match the RE. We typically write these templates inside forward slashes //, and we ask if a string matches the template with the =~ operator which is a logical operator like ==. There is also an equivalent to != to check if the string does not match: !~.

The simplest RE, that is still often very useful, just checks if a specific bit of text is found in the string, here we check if there is a stop codon anywhere in the DNA.

In [77]:
%%perl

my $dna = 'ACTTGATAG';

if ($dna =~ /TAG/) {
    print "$dna contains a stop codon\n";
}
else {
    print "No stop codons found";
}
ACTTGATAG contains a stop codon

As this example shows, inside the pattern, normal characters just match themselves, but we can also match more general classes of things:

  • Any normal character matches itself
  • The . character matches anything except a newline
  • A specific set of characters specified in square brackets, e.g. [ACGT] matches any DNA base, [1234] matches any number less than 5

By default, the string we are testing is allowed match the pattern anywhere, so in our example above we could have a stop codon as the first 3 bases and the expression would still match. We can however control where in the string the pattern can match with the following special characters:

  • ^ Matches the beginning of the string
  • $ Matches the end of the string

We can use all of these features to look for common biological signals in DNA sequences:

In [78]:
%%perl

my $dna = "ACAGGAGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATAGGCAGACGAGAACGTAG";

if ($dna =~ /CTGCAG/) {
    print "Found PstI recognition site\n";
}
if ($dna =~ /GA.TC/) {
    print "Found HinfI recognition site\n";
}
if ($dna =~ /[AG]GATC[TC]/) {
    print "Found XhoII recognition site\n";
}
if ($dna =~ /TAG$/) {
    print "Found STOP codon at the end of the sequence\n";
}
Found PstI recognition site
Found HinfI recognition site
Found XhoII recognition site
Found STOP codon at the end of the sequence
  • Perl provides various predefined character classes:
    • \d matches any digit ([0-9])
    • \w matches any word character ([a-zA-Z0-9_])
    • \s matches any whitespace ([ \t\n\r])

Each of the atoms we have described so far will match a single character, but we can quantify how many times any of them will match with some special characters:

  • * matches atom zero or more times
  • + matches atom one or more times
In [79]:
%%perl
foreach my $genename ("BRCA1","BRCA11","BRCA 1","BRCA1 gene") {
    if ($genename =~ /^\w+\d$/) {
        print "$genename contains one or more word characters, one digit, and nothing else\n";
    }
    else {
        print "$genename contains something other than one or more word characters, one digit and nothing else\n";
    }
    if ($genename =~ /\w+\s*\d+/) {
        print "$genename contains one or more word characters, optional whitespace and one or more digits\n";
    }
    else {
        print "$genename contains something other than one or more word characters, optional whitespace and one or more digits\n";
    }
}
BRCA1 contains one or more word characters, one digit, and nothing else
BRCA1 contains one or more word characters, optional whitespace and one or more digits
BRCA11 contains one or more word characters, one digit, and nothing else
BRCA11 contains one or more word characters, optional whitespace and one or more digits
BRCA 1 contains something other than one or more word characters, one digit and nothing else
BRCA 1 contains one or more word characters, optional whitespace and one or more digits
BRCA1 gene contains something other than one or more word characters, one digit and nothing else
BRCA1 gene contains one or more word characters, optional whitespace and one or more digits

Exercises

  1. Use a regular expression to check if a DNA sequence contains any CG dinucleotides.
  2. Write a regular expression that will check that a string only contains DNA bases, but allow for either upper or lower case characters.
  3. DNA can be methylated at specific sites where there is a purine base (either a G or an A), then a C, then any base, then a G. Write a regular expression to check if a DNA string contains such a methylation site.

Lesson 15b: Regular Expressions II

Capturing matches

So far we have only been checking that our values match the patterns we have been concocting. You can also use regular expressions to capture matching chunks of the string.

To tell Perl you want to capture some bit of the match, just surround the bit you're after in parentheses (). You can do this multiple times if you want. Once you have checked that the pattern matches, Perl will then create some (more) "magic" variables that will contain the matched portion of the string. The first captured match will be in a variable called $1, the second in $2 and so on.

We can also look for alternate variations using the | operator. So, for example, we could check for any of the three possible STOP codons and capture which one we have found:

In [80]:
%%perl

my $dna = "ACGTGTAG";

if ($dna =~ /(TAG|TAA|TGA)/) {
    print "Found this stop codon: $1\n";
}
Found this stop codon: TAG

Subpatterns and Greediness

By default, regular expressions are "greedy". They try to match as much as they can. For example, suppose we wanted to look for gene sequences that start with a Methionine (start codon, ATG) and end with a STOP codon. We might try this, to pull out the two gene sequences in the following sequence (genes in uppercase just so you can see where they are, we turn the whole sequence to upper case straight away before matching):

In [81]:
%%perl

my $dna = 'caccATGAGACAGAACAGTAGgggacagttgcacATGCCAGGACGACCCATATAATAGaca';
$dna = uc $dna;
$dna =~ /(ATG[ACGT]+(TAG|TAA|TGA))/;
print "$1\n";
ATGAGACAGAACAGTAGGGGACAGTTGCACATGCCAGGACGACCCATATAATAG

Because the match is greedy, both of the genes have been extracted. To pull out just the first gene, we need to match the minimum number of times, not the maximum. We can do this using a ?, which modifies the preceding quantifier and makes it non-greedy. Note that (unfortunately) the ? is also a quantifier itself that specifies that the atom should match 0 or 1 times (i.e. make it optional).

In [82]:
%%perl

my $dna = 'caccATGAGACAGAACAGTAGgggacagttgcacATGCCAGGACGACCCATATAATAGaca';
$dna = uc $dna;
$dna =~ /(ATG[ACGT]+?(TAG|TAA|TGA))/;
print "$1\n";
ATGAGACAGAACAGTAG

What if we wanted to get both of the genes? We could do this by asking for the regular expression to do a global search, with the g modifier. We can then use a while loop to get each match one at a time:

In [83]:
%%perl

my $dna = 'caccATGAGACAGAACAGTAGgggacagttgcacATGCCAGGACGACCCATATAATAGaca';
my $dna = uc $dna;
while ($dna =~ /(ATG[ACGT]+?(TAG|TAA|TGA))/g) {
    print "$1\n";
}
ATGAGACAGAACAGTAG
ATGCCAGGACGACCCATATAA

String Substitution

So far we have only matched parts of strings, but we can also replace patterns with the s/// function.

s/// has two parts: the regular expression and the string to replace it with: s/expression/replacement/. By default, s/// will only substitute one instance of the match, if you want to substitute all instances then add a g at the end: s/expression/replacement/g.

For example, we can simulate the degradation of methylated CpG sites in the genome to TpG as follows:

In [84]:
%%perl 

my $dna = "CGAGGAGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATAGGCAGACGAGAACGTCG";

print "$dna\n";
$dna =~ s/CG/TG/;
print "$dna\n"; # Whoops! only mutates the first CG
$dna =~ s/CG/TG/g;
print "$dna\n";
CGAGGAGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATAGGCAGACGAGAACGTCG
TGAGGAGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATAGGCAGACGAGAACGTCG
TGAGGAGGATCTACTGCAGGCCAGTGTGAAGAGACTCATATAGGCAGATGAGAATGTTG

Regular expression can also be used in several Perl built in functions that we have already seen, for example you can use a regular expression as the delimiter in a split command. Here's how we could simulate a set of PstI restriction fragments, for example:

In [85]:
%%perl
my $psti = 'CTGCAG';
my $dna = "CGAGGACTGCAGGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATACTGCAGCGTCG";
my @psti_frags = split /$psti/, $dna;
print "Original sequence\n$dna\n";
print "PstI($psti) fragments:\n", join("\n", @psti_frags), "\n"
Original sequence
CGAGGACTGCAGGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATACTGCAGCGTCG
PstI(CTGCAG) fragments:
CGAGGA
GGATCTA
GCCAGCGCGAAGAGACTCATATA
CGTCG

Exercises

  1. Use the s/// operator to transcribe a DNA sequence into an RNA sequence.
  2. The consensus transcription start site sequence is TATTAT. Read in the FASTA file lesson13_file5.fasta and check each of the sequences in this file for transcription start sites. If a start site is found, print out the sequence downstream of the start site, including any lines after the one containing the start site.

Lesson 16: References and multidimensional data structures

In the lesson on subroutines, we saw that we can only pass a single array into a subroutine, and only return a single scalar or array. But we might want to pass in several data structures as input and return more than one value, or possibly more than one complex data structure. We know from looking at arrays earlier that if we put an array inside of an array, Perl will flatten the array and lose the internal structure. This will also happen if we pass in an array or a hash to a subroutine, which is no good.

Also, although arrays and hashes are very useful, our real world data is often more complicated than these structures. We may want to store arrays of arrays, hashes of arrays, arrays of hashes, hashes of hashes. But arrays and hashes can only store scalar values.

To take care of these problems, we can use references. All data and data structures have an address in memory. A reference is a Perl variable that refers to the memory address of some other variable. We never actually need to look at this address, we can just use the reference without thinking about what it points to explicitly. References are scalars, so we can use them to pass complex data structures into subroutines, by passing in references to these data structures. Similarly, to create an array of arrays, we create an array of references to other arrays.

Just so you know what a reference looks like, let's print one out. Ideally, you will never see this in real program output, but in practice, it's not uncommon to accidentally print out some references because of some bug in your code, so it's worth just seeing what they look like.

In [86]:
%%perl
my @codons = ('ACG','GAG','GCG','CCC','AGT','GAA');
my $codon_ref = \@codons;
print "$codon_ref\n";

my %geneticCode = ('TCT' => 'Serine',
                   'TCC' => 'Serine',
                   'TAG' => 'STOP',
                  );

my $geneticCode_ref = \%geneticCode;
print "$geneticCode_ref\n";
ARRAY(0x7ff95202dca0)
HASH(0x7ff952031e18)

Let's see how we can use this to store an array reference as a hash value. Note that we are not copying the array; there is only one copy of the array, and when we change it, the hash will also see the change. We will use a module called Data::Dumper to print more complex data structures.

In [87]:
%%perl

use Data::Dumper;
        
my @serine = ('TCA','TCC','TCG','TCT');
my @proline =('CCA','CCC','CCG','CCT');

my %aas;
$aas{'serine'} = \@serine;
$aas{'proline'} = \@proline;

print Dumper \%aas;

pop @serine;
shift @proline;

print "After array changes, the hash also changes\n";
print Dumper \%aas;
$VAR1 = {
          'serine' => [
                        'TCA',
                        'TCC',
                        'TCG',
                        'TCT'
                      ],
          'proline' => [
                         'CCA',
                         'CCC',
                         'CCG',
                         'CCT'
                       ]
        };
After array changes, the hash also changes
$VAR1 = {
          'serine' => [
                        'TCA',
                        'TCC',
                        'TCG'
                      ],
          'proline' => [
                         'CCC',
                         'CCG',
                         'CCT'
                       ]
        };

In the Dumper output, you can see that array elements are wrapped in square brackets [] and hash elements are wrapped in curly brackets {}. We can use these brackets directly to create anonymous arrays and anonymous hashes, which are references to arrays and hashes where the arrays and hashes themselves do not have names. This makes it easier to create multi-dimensional data structures.

In [88]:
%%perl -w
use Data::Dumper;

my %aas = ('serine' => ['TCA','TCC','TCG','TCT'],
                 'proline' => ['CCA','CCC','CCG','CCT'],
                );

my %codes;
$codes{'earth'} = \%aas;
$codes{'mars'} = {'serine' => ['QWZ','QWX','QWQ','QWW'],
                  'proline' => ['ZXZ','ZXX','ZXQ','ZXW'],
                 };

print Dumper \%codes;

print "$codes{'earth'}{'serine'}[0]\n";
print "$codes{'mars'}{'proline'}[1]\n";
$VAR1 = {
          'earth' => {
                       'serine' => [
                                     'TCA',
                                     'TCC',
                                     'TCG',
                                     'TCT'
                                   ],
                       'proline' => [
                                      'CCA',
                                      'CCC',
                                      'CCG',
                                      'CCT'
                                    ]
                     },
          'mars' => {
                      'serine' => [
                                    'QWZ',
                                    'QWX',
                                    'QWQ',
                                    'QWW'
                                  ],
                      'proline' => [
                                     'ZXZ',
                                     'ZXX',
                                     'ZXQ',
                                     'ZXW'
                                   ]
                    }
        };
TCA
ZXX

Note the subtle difference between creating the hash %aa, for which we use the normal brackets () as we have before, and creating the hash reference $codes{'mars'}, for which we use the curly brackets {}. Once we have created our multi-dimensional structure, we can access elements of it by chaining the hash and array indices. You will sometimes see people using arrows to make it clear we are chaining references (eg $codes->{'earth'}->{'serine'}->[0]), but this is not strictly necessary.

We have seen how to create a reference for a data structure, but not how to get the data structure from a reference - how to dereference. This can be done by wrapping the reference in curly brackets.

In [89]:
%%perl

my $mars_ref = {'serine' => ['QWZ','QWX','QWQ','QWW'],
                  'proline' => ['ZXZ','ZXX','ZXQ','ZXW'],
               };

my @aas = keys %{$mars_ref};

print "@aas\n";
serine proline

Finally, here's how we can use references to pass complex data structures in to a subroutine and get multiple values out.

In [90]:
%%perl
use Data::Dumper;
        
sub means {
    my @means;
    for my $array_ref (@_) {
        my $length = @{$array_ref};
        my $total = 0;
        for my $num (@{$array_ref}) {
            $total += $num;
        }
        push @means, $total / $length;
    }
    return @means;
}

my @mean_out = means([1,2,3],[4,5,6],[7,8,9]);

print Dumper \@mean_out;
$VAR1 = [
          '2',
          '5',
          '8'
        ];

The means subroutine takes a list of array references as input, with the references pointing to anonymous arrays, defined with square brackets. The subroutine itself loops through the array references and calculates the mean of each array. It then returns the @means array, which we then copy into our @mean_out output array.

Exercises

  1. Write a subroutine that takes the full %codes hash defined above as input, identifies the unique bases of each code from the given codons, and returns two arrays, one containing the four bases of the earth genetic code, and one containing the four bases of the mars genetic code.

HINTS: use multiple for loops to iterate through the hash of hashes of arrays and split up each codon by base. You will need to do multiple dereferences to get at the inner hashes and arrays. To find the unique bases, see the answer to Lesson 14 Exercise II.