Perl is widely used by biologists to solve problems we encounter when handling DNA and protein sequences, file format conversions, as well as many any other tasks that are computationally repetitive. Perl is not the only language for these tasks but one of many.
Perl comes preinstalled on Macs and Linux distributions. It can be downloaded for PCs (see perl.org for details; Strawberry Perl is a good choice).
Perl is designed to work like a natural language. Just as there are many different ways to say roughly the same thing in a natural language, so there are many different ways to solve a problem using Perl. Perl's motto is "There's More Than One Way To Do It". This has advantages and disadvantages. On the one hand, Perl can be easily adopted by many different communities and applied to many different problems; it is heavily used by organisations like the BBC and Amazon and was instrumental in the completion of the Human Genome Project. On the other hand, it can be horribly abused, just like natural languages.
In this course, we will teach the basic vocabulary and grammar of the Perl language, using some simple bioinformatics examples. However, as bioinformatics covers a multitude of different topics, we recommend that you consider the problems you are trying to solve in your research and try to apply Perl to these problems throughout the course.
As some people on the course have not programmed before, we assume no previous experience throughout. The pace will therefore be slow for people who have used other programming languages. Please feel free to read ahead.
Perl programs (also known as scripts) are series of instructions that we save in plain text files. As computers do not understand Perl directly, our scripts must be processed by a program called the Perl interpreter, which translates the scripts into machine language. This program is run from the command line. To get started with Perl, we must get acquainted with the command line.
Open the text editor gedit (icon in the menu on the left side of the screen) and save a file called "central_dogma.pl
" containing the following line:
print "Once information has passed into protein it cannot get out again\n";
Note the directory that you save the file into. Now open up the Terminal (also in the left hand menu). We want to move to the directory containing the file you just saved, and run the Perl interpreter on that file.
Enter the command pwd
(on Linux or Mac) or cd
(on Windows) to see which directory you are in. If it's not the directory that contains your script, use the command ls
(on Linux or Mac) or dir
(on Windows) to see what the directory contains. If you see your directory in the list - let's say it's called mydir
- you can move into that directory by entering the command cd mydir
. Check that mydir
contains your central_dogma.pl
script using ls
or dir
.
Run the Perl interpreter by typing:
perl central_dogma.pl
The script you just ran contains one command, which print
s the string "Once information has passed into protein it cannot get out again" to the screen, and moves the cursor to a new line (\n
). The semi-colon is used to punctuate the end of a command.
You can work through the rest of the course using the interpreter in this way. However, you may find it useful to add the following features to your script. If you find them confusing, however, feel free to skip to Lesson 2 for now.
If there is an error in your code that the interpreter can't make sense of, it will fail to run. But it can also warn you about problems in the code that it can interpret, but which are probably not going to work correctly. You can turn on these warnings by putting this at the top of your script:
use warnings;
You can also ask the Perl interpreter to turn warnings on for you:
perl -w central_dogma.pl
use warnings
and -w
do the same thing, so you only need to use one of them.
It's useful to annotate your scripts with human-readable comments. Perl ignores anything on a line that follows a hash symbol. So we could have written:
# central_dogma.pl
# Print the Central Dogma of Molecular Biology
print "Once information has passed into protein it cannot get out again\n" # See Crick 1970
In this case, the comments are not very useful, but when we get to writing more complicated code it is very useful to make notes in comments, to remind ourselves how the code works when we come to read it months or years later.
Many Perl guides will tell you to add use strict;
to every script you write (like use warnings;
above). This tells the Perl interpreter to complain if you use some of the more dangerous Perl idioms in your code. It's not too important for the course, as we won't be using any of those idioms, but it's not a bad idea to use it in scripts you write and actually use.
Putting it all together, the complete script:
%%perl
# central_dogma.pl
# Print the Central Dogma of Molecular Biology
use strict;
use warnings;
print "Once information has passed into protein it cannot get out again\n"; # See Crick 1970
Once information has passed into protein it cannot get out again
Typically, we want to process some data using Perl and produce some output. This means we have a set of values and we want to transform these into some other, more useful information. In Perl, we create data structures to hold our values, which we can then manipulate. We call these variables, because we can change their values as we progress through our analysis.
There are three major types of data structures we typically use in Perl: scalars, arrays and hashes. Scalars are single values, such as numbers or characters. Strings of characters also count as single values in Perl. Arrays are ordered lists of values, whereas hashes are unordered collections of values. Variable names of each type begin with a symbol that identifies the variable as a scalar, array or hash. The symbols are given in the table below.
Scalar | $ | single values |
Array | @ | ordered lists of scalars |
Hash | % | unordered collections of scalars |
Values can be assigned to variables using the assignment operator =
. For example (all code needs to have the complete script skeleton from Lesson 1 but we will leave it out of the notes in further examples):
%%perl
# use 'my' the FIRST time you use a variable
my $serine = "TCA";
my $codonLength = 3;
print "Codon $serine has $codonLength bases\n";
Codon TCA has 3 bases
We can change the value of a variable by using the same assignment operator:
%%perl
my $serine = "TCA";
print "Serine is coded by $serine\n";
$serine = "TCC";
print "Serine is also coded by $serine\n";
Serine is coded by TCA Serine is also coded by TCC
We can even use the value of a variable when we change its value, for example if we wanted to count something:
%%perl
my $age = 18;
print "I am $age years old\n";
# two years later
$age = $age + 2;
print "I will be $age years old two years later\n";
I am 18 years old I will be 20 years old two years later
Another example of scalar reassignment:
%%perl
my $dogma = "Once information gets into protein";
print "1: $dogma\n";
# '.' text concatenation operator
$dogma = $dogma . " it cannot get out again";
print "2: $dogma\n";
$dogma = "DNA makes RNA makes Protein";
print "3: $dogma\n";
1: Once information gets into protein 2: Once information gets into protein it cannot get out again 3: DNA makes RNA makes Protein
type of scalar | examples |
---|---|
number: integers | 1 2 -3 128 |
number: floating-points | -0.2 0.1 3.14159 |
string: letters | "a" "A" "G" "c" |
string: words | "Protein" "DNA" |
string: words and spaces and punctuation | "DNA makes RNA makes Protein." |
reference | (see later lessons) |
If you put a variable such as “$codon”
inside of double quotes in a print
statement, the value of $codon
will be printed, but if you put ‘$codon’
inside of single quotes you will literally get $codon
. The same is true for special characters like the newline “\n”
.
%%perl
my $codon = "ACG";
print "$codon\n";
print '$codon\n';
ACG $codon\n
SYLYC
. Use the concatenation operator.An array is an ordered list of scalars. The scalars do not all need to be of the same kind.
%%perl
my @numbers = (22,103,1,0);
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");
my $serine = "TCA";
my @mixed = (23,$serine,0.3,"Histone");
There are a number of different ways to print arrays. For example, we can dump the elements of an array one after the other with no delimiters just by directly printing the array:
%%perl
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");
print "Amino acids: " , @aminoAcids , "\n";
Amino acids: SerineTyrosineLeucineTyrosineCysteine
But this isn't very readable. We can do better by enclosing the array variable in double quotes, which inserts a space after each element in the array:
%%perl
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");
print "Amino acids, spaced: @aminoAcids\n";
Amino acids, spaced: Serine Tyrosine Leucine Tyrosine Cysteine
We will see a third way later on when we look at the join
function.
We can access individual elements of an array by using their index. Each element is given a unique number, an index, corresponding to their place in the array. In Perl, counting starts from 0, so the first element of the array has index 0, the second has index 1 and so on. We access elements using square brackets:
%%perl -w
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");
print $aminoAcids[2], "\n";
print $aminoAcids[0], "\n";
print $aminoAcids[-1], "\n";
Leucine Serine Cysteine
You can assign new values to array elements by using their indices:
%%perl -w
my @aminoAcids = ("Serine","Tyrosine","Leucine","Tyrosine","Cysteine");
$aminoAcids[2] = "Cysteine";
print "@aminoAcids\n";
$aminoAcids[5] = "STOP";
print "@aminoAcids\n";
Serine Tyrosine Cysteine Tyrosine Cysteine Serine Tyrosine Cysteine Tyrosine Cysteine STOP
Arrays can be included in other arrays, but the result is just one flat array. For example, if two arrays of three elements each are merged together, the result is an array containing the six elements, not an array containing two arrays.
%%perl -w
my @serine = ('T','C','A');
print "@serine\n";
my @cysteine = ('T', 'G', 'T');
print "@cysteine\n";
my @sc = (@serine, @cysteine);
print "@sc\n";
print "$sc[4]\n";
T C A T G T T C A T G T G
We can get the length of an array like this:
%%perl -w
my @serine = ('T','C','A');
my $length = @serine;
print "$length";
3
This is called accessing an array in scalar context. $length
is a scalar, so when we set the array @serine
as the value of the scalar $length
, Perl interprets this to mean we want the length of the array. This is just a convention - Perl could, for example, return the first element of the array instead - but it is a useful convention that we will use often.
CLYSY
where the elements of the array are DNA codon sequences. Use the codon variables you defined in the last lesson.Arrays are very useful but not sufficient for many purposes. Often, we have a data set with no inherent order but where each entry has a particular value. For example, we may have a set of genes with expression levels, where we want to look up expression levels by gene name. There is no biological order to the gene names and we want to rapidly access particular genes without trawling through the whole data collection looking for it. Perl has a very efficient data structure for dealing with this situation - the hash.
%%perl -w
# hash_assignment.pl
my %geneticCode = ('TAT' => 'Tyrosine',
'TAC' => 'Tyrosine',
'CAA' => 'Glutamine',
'CAG' => 'Glutamine',
);
Hashes have keys and values, and work like a dictionary - we look up the keys to find the matching values. The keys are like indices of an array, but they can be any scalar, not just a number, as long as they are unique in the hash. In the hash above, the DNA codons are the keys (TAT
, TAC
and so on) and the amino acids names are the values. As with arrays, the values can be any type and do not need to all be the same type, as long as they are scalars.
We can access values from a hash by looking up their keys. We use curly brackets to do this, rather than the square brackets we used for array indices. As for array elements, a hash value is a scalar, not a hash, so we say $geneticCode{'TAT'}
, not %geneticCode{'TAT'}
.
%%perl -w
# hash_assignment.pl
my %geneticCode = ('TAT' => 'Tyrosine',
'TAC' => 'Tyrosine',
'CAA' => 'Glutamine',
'CAG' => 'Glutamine',
);
print $geneticCode{'TAT'};
print "\n";
print "$geneticCode{'CAG'}\n";
Tyrosine Glutamine
We can extract the keys or values from a hash using the keys
and values
functions. These will return an unsorted array of the keys or values of the hash, which we can use like any other array (we will see how to sort these arrays later in the course).
%%perl -w
# hash_assignment.pl
my %geneticCode = ('TAT' => 'Tyrosine',
'TAC' => 'Tyrosine',
'CAA' => 'Glutamine',
'CAG' => 'Glutamine',
);
my @codons = keys %geneticCode;
my @aas = values %geneticCode;
print "@codons\n";
print "@aas\n";
CAG CAA TAC TAT Glutamine Glutamine Tyrosine Tyrosine
We can add or update key-value pairs to the hash just as we could add new array elements, or delete them with the delete
function. If we try to look up a non-existent key, Perl will complain, as long as we are using warnings (-w
).
%%perl -w
# hash_assignment.pl
my %geneticCode = ('TAT' => 'Tyrosine',
'TAC' => 'Tyrosine',
'CAA' => 'Glutamine',
'CAG' => 'Glutamine',
);
$geneticCode{'TGA'} = 'Cysteine';
print "$geneticCode{'TGA'}\n";
$geneticCode{'TGA'} = 'STOP'; # Whoops! Got the assignment wrong the first time
print "$geneticCode{'TGA'}\n";
delete $geneticCode{'TGA'};
print "$geneticCode{'TGA'}\n";
Cysteine STOP
Use of uninitialized value $geneticCode{"TGA"} in concatenation (.) or string at - line 14.
"GTT GCA CCA CAA CCG GTT GTG"
. Use a hash to do this, similar to the %geneticCode
hash above.In the exercise for Lesson 4, perhaps you ended up writing something like this:
%%perl -w
# hash_assignment.pl
my %geneticCode = ('TAT' => 'Tyrosine',
'TAC' => 'Tyrosine',
'CAA' => 'Glutamine',
'CAG' => 'Glutamine',
);
my @sequence = ('CAA','CAG','TAC');
print $geneticCode{$sequence[0]}, "\n";
print $geneticCode{$sequence[1]}, "\n";
print $geneticCode{$sequence[2]}, "\n";
Glutamine Glutamine Tyrosine
Repeating the last line of code several times is not ideal. We have to type, or copy and paste, the same thing multiple times and if we wanted to change what we were doing with each amino acid name, we would have to change every line (for example, we forgot to print a new line or a space after each name). It would be much easier to just write the line once and then execute that line for every item in the array. We can do this with a for loop:
%%perl
my %geneticCode = ('TAT' => 'Tyrosine',
'TAC' => 'Tyrosine',
'CAA' => 'Glutamine',
'CAG' => 'Glutamine',
);
my @sequence = ('CAA','CAG','TAC');
foreach my $codon (@sequence) {
print "$geneticCode{$codon}\n";
}
Glutamine Glutamine Tyrosine
Because 'keys %hash
' is an array, we can also loop over the entries in a hash in the same way:
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
'Beta-Actin' => 1.7,
'Pax6' => 0,
'HoxA2' => -3.2,
);
foreach my $gene (keys %geneExpression) {
print "$gene: $geneExpression{$gene}\n";
}
Beta-Catenin: 2.5 HoxA2: -3.2 Beta-Actin: 1.7 Pax6: 0
Perl has another type of loop which can also be used to iterate over a hash, among other things - the while loop:
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
'Beta-Actin' => 1.7,
'Pax6' => 0,
'HoxA2' => -3.2,
);
while (my $gene = each %geneExpression) {
print "$gene: $geneExpression{$gene}\n";
}
Beta-Catenin: 2.5 HoxA2: -3.2 Beta-Actin: 1.7 Pax6: 0
Like the for loop, the while loop executes the block of code within curly brackets. The difference is that, whereas the for loop takes an array and does something with each element of the array, the while loop takes a logical test and runs until that test is false. So here it runs until each of the genes in %geneExpression
have been processed. Let's say we want to write out the first five codons of a gene's DNA sequence:
%%perl -w
my @sequence = ('CAA','CAG','TAG','CAT','GGT','GAG','GGC','CAG');
my $i = 0;
while ($i<5) {
print "$sequence[$i] ";
$i = $i + 1;
}
CAA CAG TAG CAT GGT
In the while loop above, we saw an example of a logical test: ($i<5)
. While this statement was true, the while loop continued; when it became false, the while loop ended. Another powerful control structure that uses logical tests is if..elsif..else
.
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
'Beta-Actin' => 1.7,
'Pax6' => 0,
'HoxA2' => -3.2,
);
foreach my $gene (keys %geneExpression) {
if ($geneExpression{$gene} < 0) {
print "$gene is downregulated\n";
}
elsif ($geneExpression{$gene} > 0) {
print "$gene is upregulated\n";
}
else {
print "No change in expression of $gene\n";
}
}
Beta-Catenin is upregulated HoxA2 is downregulated Beta-Actin is upregulated No change in expression of Pax6
The elsif
block is optional, and we can use as many as we like. The else
block is also optional, but it is good practice to include it in most cases, so that you think to do something explicitly with whatever doesn't pass the test in the if
block. Otherwise, some cases might not pass the condition, and Perl would pass over them silently, which may not be what you want. There are many different comparison operators we can use in our logical tests.
Strings and numbers have different operators.
Operator | Description | Example |
---|---|---|
== | numeric equality | 1 == 2 # False |
eq | string equality | 'a' eq 'b' # False |
!= | numeric non equality | 1 != 2 # True |
ne | string non equality | 'a' ne 'b' # True |
< | numeric less than | 1 < 2 # True |
<= | numeric equal or less than | 2 <= 2 # True |
lt | string less than | 'a' lt 'b' # True by ASCII order |
> | numeric greater then | 1 > 2 # False |
>= | numeric equal or greater than | 1 >= 1 # True |
gt | string greater than | 'a' gt 'b' # False |
For example, we might want to output only the nonsynonymous bases of a codon (the first and second bases) and ignore the third, synonymous base (because most amino acids are determined by the first and second bases alone).
%%perl
my @sequence = ('T','A','C','G','G','C','A','T','C','T','A','G');
my $i = 0;
foreach my $base (@sequence) {
$i = $i + 1;
if ($i == 3) {
$i = 0;
}
else {
print "$base";
}
}
TAGGATTA
Perl has a useful keyword that can simplify loops like this: next
, which moves on to the next iteration of the loop if a condition is true:
%%perl
my @sequence = ('T','A','C','G','G','C','A','T','C','T','A','G');
my $i = 0;
foreach my $base (@sequence) {
$i = $i + 1;
if ($i == 3) {
$i = 0;
next;
}
print "$base";
}
TAGGATTA
Here we have omitted the else statement, against our advice above, and in this simple example the first version was probably better. But imagine if we had complex processing of the base to do, with perhaps ten or twenty lines of code following the if statement. If we used an else, we would have to enclose all of these lines in the else block, which would have made our code less readable than just placing them after the if statement.
Perl also has a last
keyword, which exits a loop if a condition is met:
%%perl
my %geneticCode = ('TAT' => 'Tyrosine',
'TAC' => 'Tyrosine',
'CAA' => 'Glutamine',
'CAG' => 'Glutamine',
'TAG' => 'STOP',
);
my @sequence = ('CAG','TAC','CAA','TAG','TAC','CAG','CAA');
foreach my $codon (@sequence) {
if ($geneticCode{$codon} eq 'STOP') {
last;
}
else {
print "$geneticCode{$codon}\n";
}
}
Glutamine Tyrosine Glutamine
Here the amino acids in the sequence are printed until the STOP
codon (TAG
) is reached. We used the else
here, but actually Perl provides a more readable syntax for if statements when there is only one command in the block (as here, the last
command). We can just put the command before the if
statement, and forget about the blocks:
%%perl
my %geneticCode = ('TAT' => 'Tyrosine',
'TAC' => 'Tyrosine',
'CAA' => 'Glutamine',
'CAG' => 'Glutamine',
'TAG' => 'STOP',
);
my @sequence = ('CAG','TAC','CAA','TAG','TAC','CAG','CAA');
foreach my $codon (@sequence) {
last if ($geneticCode{$codon} eq 'STOP');
print "$geneticCode{$codon}\n";
}
Glutamine Tyrosine Glutamine
Finally for this lesson, we can chain multiple logical tests with and
and or
operators. We can either use the words and
and or
, or the symbols &&
and ||
.
Operator (symbolic) | Operator (written) | Example | To be true |
---|---|---|---|
&& | and | i < 50 && i > 5 | Both conditions must be true |
|| | or | str eq "clock" || str ne "black" | Only one condition needs to be true |
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
'Beta-Actin' => 1.7,
'Pax6' => 0,
'HoxA2' => -3.2,
);
foreach my $gene (keys %geneExpression) {
if ($geneExpression{$gene} < 0 or $geneExpression{$gene} > 0) {
print "$gene has changed expression\n";
}
else {
print "$gene has not changed expression\n";
}
if ($geneExpression{$gene} > 0 and $gene ne 'Beta-Actin') {
print "$gene is upregulated and not a housekeeping gene\n";
}
}
Beta-Catenin has changed expression Beta-Catenin is upregulated and not a housekeeping gene HoxA2 has changed expression Beta-Actin has changed expression Pax6 has not changed expression
$gc
, which we will use to count the number of Gs or Cs in our sequence.$gc
variable.+
' to add in Perl, to multiply, we use '*
' and to divide, we use '/
'.In the next few lessons, we will introduce many of Perl's basic functions and operators for processing numbers, strings, arrays and hashes. We will see how to load functions from Perl modules and finally we will show how you can construct your own functions.
A function is something that takes some input and produces some output. We have already seen several Perl functions: print
, keys
, values
and each
are all examples of functions. For example, keys
takes a hash as input and returns the keys of the hash as output.
Perl has many core operators and functions built into the language. For example, the following operators all take two numbers as input and produce one number as output:
Operator | Summary | Example |
---|---|---|
+ | addition | 2 + 2 |
- | subtraction | 5 - 3 |
* | multiplication | 4 * 4 |
/ | division | 8 / 2 |
% | modulus, returns remainder | 10 % 2 |
** | exponent | 4**2 |
%%perl
# addition
print "sum of 2 and 3 = ";
my $num = 2 + 3;
print "$num\n";
# subtraction
print "$num minus 2 = ";
$num = $num - 2;
print "$num\n";
# multiplication
print "product of $num and 3 = ";
$num = $num * 3;
print "$num\n";
# division
print "$num divided by 2 = ";
$num = $num / 2;
print "$num\n";
# modulus
print "the remainder of $num divided by 2 = ";
my $remainder = $num % 2;
print "$remainder\n";
# exponentiation
print "$remainder raised to the power 2 = ";
$num = $remainder ** 2;
print "$num\n";
sum of 2 and 3 = 5 5 minus 2 = 3 product of 3 and 3 = 9 9 divided by 2 = 4.5 the remainder of 4.5 divided by 2 = 0 0 raised to the power 2 = 0
Perl has a couple of handy shortcuts for this kind of operation. Any of the above operators can be placed before the equals sign, so that instead of writing $num = $num + 2
we can just write $num += 2
. Also, to add or subtract one to or from a variable, we can use ++
or --
:
%%perl
my $num = 1;
print "Number was $num,";
$num *= 5;
$num += 3;
print " now $num\n";
# increment in place
print "$num + 1 = ";
$num++;
print "$num\n";
# decrement in place
print "$num - 1 = ";
$num--;
print "$num\n";
Number was 1, now 8 8 + 1 = 9 9 - 1 = 8
++ | increment by 1 (same as $i = $i+1) | $i++ |
-- | decrement by 1, (same as $i = $i-1) | $i-- |
+= | in place addition | $i += 2 |
-= | in place subtraction | $i -= 3 |
Finally, here are a couple of useful functions for calculating the absolute value of some number, and the integer portion of a floating-point number:
Function | Summary | Example |
---|---|---|
abs | absolute value | abs(-32) |
int | get the integer portion of a number | int(5.45) |
%%perl
# absolute value
print "the absolute value of 3 - 4.2 = ";
my $num = abs 3 - 4.2;
print "$num\n";
# conversion to integer
print "int portion of $num = ";
print int $num, "\n";
the absolute value of 3 - 4.2 = 1.2 int portion of 1.2 = 1
A word about brackets here. If you are used to other languages, you may be surprised that we don't need brackets around function arguments. Perl is generally fairly lax about requiring brackets, in the hope that code can be read something like a natural language. However, it doesn't mind if you put the brackets in, so if you find it easier to read, you can write things like this:
%%perl
# absolute value
print "the absolute value of 3 - 4.2 = ";
my $num = abs(3 - 4.2);
print "$num\n";
# conversion to integer
print "int portion of $num = ";
print(int($num), "\n");
the absolute value of 3 - 4.2 = 1.2 int portion of 1.2 = 1
The rule of thumb is to do what you find most readable and clear. For example, the brackets around the abs
are probably a good idea; it saves us from thinking Perl will calculate abs 3
and then subtract 4.2
from that. But the brackets on the int
line arguably clutter up the line rather than making it clearer. Sometimes the brackets will be necessary, however, to tell Perl which arguments belong to which function. As int
only takes one argument, there is no ambiguity here, but we will shortly see some examples where the brackets are required. This is something you will figure out through experience.
Perl also has a set of useful string functions. For example, we can add strings together using the concatenation operator, which has a prefix notation like the numerical operators:
Operator | Summary | Example |
---|---|---|
. | concatenation | $str . " " . "!!" |
.= | concatenation & assignment | $str .= "add this" |
%%perl
my $str = "ACG" . "AGA" . "GCG";
print "$str\n";
$str .= "TGT";
print "$str\n";
ACGAGAGCG ACGAGAGCGTGT
Here are some more basic functions, which are hopefully fairly self-explanatory:
Function | Summary | Example |
---|---|---|
length | length of string | print length $str |
uc | uppercase | print uc $str |
lc | lowercase | print lc $str |
ucfirst | uppercase first character | print ucfirst $str |
lcfirst | lowercase first character | print lcfirst $str |
chop | remove last character | chop $str |
chomp | remove new line characters | chomp $str |
%%perl
# change case
my $str = "ACGAGAGCGTGT";
print length $str, "\n";
print "$str\n";
$str = lc $str;
print "$str\n";
print uc $str, "\n";
print "ucfirst: " , ucfirst(lc($str)),"\n";
print "lcfirst: " , lcfirst uc $str,"\n";
chop $str;
print "Chopped: $str\n";
$str .= "\n";
print "Str now has a new line: $str$str$str";
chomp $str;
print "Now it doesn't: $str$str$str";
12 ACGAGAGCGTGT acgagagcgtgt ACGAGAGCGTGT ucfirst: Acgagagcgtgt lcfirst: aCGAGAGCGTGT Chopped: acgagagcgtg Str now has a new line: acgagagcgtg acgagagcgtg acgagagcgtg Now it doesn't: acgagagcgtgacgagagcgtgacgagagcgtg
chomp
is often used when processing files or user input, because lines from files or lines entered by users will usually have a new line character on the end. More on this later.
Here are some more sophisticated string processing functions.
substr | sub string | print substr($str, 3, 5) |
split | splits up a string, returns an array | my @array = split ' ', $str |
split
takes a string and breaks up the string at every instance of a particular character, returning an array containing each part of the string:
%%perl
my $seq = "ACG AGA GCG TGT";
my @codons = split ' ', $seq;
foreach my $codon (@codons) {
print "$codon\n";
}
foreach my $base (split '', $seq) {
next if $base eq " ";
print "$base\n";
}
ACG AGA GCG TGT A C G A G A G C G T G T
substr
takes a string, a starting position and (optionally) a length, and returns a substring of the string of the desired length that starts at the given position. As with array indexing, substr
starts counting from 0. It has a few odd variants, which are worth exploring. If the length argument is omitted, substr
returns everything from the start position to the end of the string. If a negative number is used for the start position, substr
counts from the end of the string. If a negative number is used for the length, substr
again subtracts that number of characters from the end of the string. Make sure you understand the following examples.
%%perl
my $str = "ACGAGAGCGTGT";
print substr($str, 3, 3),"\n";
print substr($str, 3), "\n";
print substr($str, -3), "\n";
print substr($str, 3, -3), "\n";
AGA AGAGCGTGT TGT AGAGCG
substr
and a while
loop to print out each codon in the sequence on a new line. Before the while
loop starts, check that the length of the sequence is divisible by three and only run the while
loop if it is.We have seen how to create a new array containing a set of elements, and one way to add new elements to an array (by setting a particular array element to a chosen value). We need a bit more flexibility, however. Here are some functions to help us manipulate arrays we have already created.
Function | Description |
---|---|
pop @array | removes last element from array and returns it |
push @array, $new_element, ... | appends one or more elements to the end of an array |
shift @array | removes the first element from the array and returns it |
unshift @array, $new_element, ... | add elements at the front of the array |
splice | adds or removes elements anywhere in the array |
join($str, @array) | joins the elements of the array together with the specified string, and returns the whole string |
pop
, push
, shift
and unshift
all remove or add one element from or to an array. pop
and push
work from the end of the array; shift
and unshift
work from the beginning. pop
and shift
return the element they have removed.
%%perl
my @array = ('ACT','GCT','GAG','CAG');
print "Array is @array\n";
my $last = pop @array;
print "Popped the last element off the array: @array. Here's the last element: $last\n";
push @array, 'CGT';
print "Added CGT to the end of the array: @array\n";
my $first = shift @array;
print "Shifted the first element off the array: @array. Here's the first element: $first\n";
unshift @array, 'AAC';
print "Added AAC to the beginning of the array: @array\n";
Array is ACT GCT GAG CAG Popped the last element off the array: ACT GCT GAG. Here's the last element: CAG Added CGT to the end of the array: ACT GCT GAG CGT Shifted the first element off the array: GCT GAG CGT. Here's the first element: ACT Added AAC to the beginning of the array: AAC GCT GAG CGT
With splice
, we can remove chunks of the array and replace them with something else if we want to. Like substr
, splice
takes an array as input, a starting position and an optional length, plus an optional replacement array. It removes the elements from the start position up to the given length, and replaces them with the replacement array. It returns the array elements it has removed.
%%perl
my @array = ('ACT','GCT','GAG','CAG');
print "Array starts as @array\n";
my @spliced = splice @array, 1, 2;
print "Removed @spliced to leave @array\n";
my @replacement = ('CGG','CCC','AGT','GAC');
@spliced = splice @array, 1, 1, @replacement;
print "Removed @spliced and replaced with @replacement to leave @array\n";
Array starts as ACT GCT GAG CAG Removed GCT GAG to leave ACT CAG Removed CAG and replaced with CGG CCC AGT GAC to leave ACT CGG CCC AGT GAC
Finally, join
does the opposite of split
; it takes an array and joins the elements together into a string, using a separator if given:
%%perl
my @bases = ('A','C','G','T','A','C','G','G','T','T','G','A');
my $seq = join '', @bases;
print "@bases is now $seq\n";
$seq = join ':', @bases;
print "We can use separators too: $seq\n";
A C G T A C G G T T G A is now ACGTACGGTTGA We can use separators too: A:C:G:T:A:C:G:G:T:T:G:A
Perl also provides some convenience hash functions. We've already met three of these.
Function | Description |
---|---|
keys %hash | retrieve an array of all keys in a hash |
values %hash | retrieve an array of all values in a hash |
each %hash | retrieve next key-value pair from a hash |
exists $hash{'key'} | test whether a key is found in the hash |
delete $hash{'key'} | delete a key, and any corresponding value, from a hash |
Here is a simple example of using exists
and delete
. We wouldn't normally write such code, with the hash and array defined next to each other, but suppose that the hash has been loaded from one file and the array from another, and we don't know whether the array contains the same genes as the hash.
%%perl
my %geneExpression = ('Beta-Catenin' => 2.5,
'Beta-Actin' => 1.7,
'Pax6' => 0,
'HoxA2' => -3.2,
);
my @genes = ('Pax6','EphB2','HoxA2');
for my $gene (@genes) {
if (exists $geneExpression{$gene}) {
print "Found $gene in hash, has expression $geneExpression{$gene}\n";
delete $geneExpression{$gene};
}
else {
print "Can't find $gene\n";
}
}
print "Hash now contains the following genes: ",
join(' ', keys %geneExpression), " with the following values: ",
join(' ', values %geneExpression), "\n";
Found Pax6 in hash, has expression 0 Can't find EphB2 Found HoxA2 in hash, has expression -3.2 Hash now contains the following genes: Beta-Catenin Beta-Actin with the following values: 2.5 1.7
Let's put all of this together. You are given the following data:
%%perl
my %geneticCode = ('TCT' => 'Serine',
'TCC' => 'Serine',
'CTA' => 'Leucine',
'CTG' => 'Leucine',
'TGT' => 'Cysteine',
'TAG' => 'STOP',
);
my $sequence = 'TCCTGTCTACCCCTGAAGTCTTAG';
Write a routine that will convert the given DNA sequence into an amino acid sequence, using the first letter of the amino acid name for each amino acid (eg output S
for Serine, L
for Leucine and so on). Where no codon is present in the hash, output '?
'. Use '*'
for the stop codon.
You have all the tools to figure out how to do this, but if you need some hints:
codon
variable until the codon is three bases long, or you could split the sequence up into an array and use splice to chop off three bases at a time. You may be able to think of another solution.if
statement to check whether the codon exists in the hash. If it does, you'll want to add the first character of the amino acid value in the hash to an amino acid sequence (which you'll have to set up before the loop begins). You can use substr
to get the first character. If it doesn't, you want to add '?
' to the amino acid sequence.if
statement to check whether the codon is a STOP
codon or not. If it is, you want to add '*
' to the amino acid sequence, not the first character of STOP
.We have now made use of several core Perl functions. There are many more that we haven't covered here. See Perl functions for a full list. Perl also has a large number of core modules, which are packages containing additional functions with common goals. These are all documented on Perldoc too.
The core functions are always available, but to use a module we need to import it. For example, one commonly used module is List::Util
. This contains some commonly used list (array) functions. Here's how we can use it:
%%perl
use List::Util ('max', 'min');
my @genome_mb = (100.3, 2700, 4.6, 3200, 12.1);
print max(@genome_mb), "\n";
print min(@genome_mb), "\n";
3200 4.6
We import the functions max
and min
from List::Util
, and then we can use these functions just like any of Perl's core functions. We can find out what functions are contained in List::Util
by looking up the List::Util
Perldoc page.
We can also create our own functions, or subroutines. If there is a block of code you use more than once in your script, or there are some fiddly details that you would rather hide away from the important parts of your algorithm, you should write a subroutine. Using subroutines reduces the chance of introducing errors by copying and pasting code, and if you have to change something you only change it in one place, not in many.
Just like a normal function, subroutines can take arguments as input and return values as output. Here's an example:
%%perl
my @genome_mb = (100.3, 2700);
sub maximum {
my ($a, $b)= @_;
if ($a > $b) {
return $a;
}
else {
return $b;
}
}
my $max = maximum(@genome_mb);
print "Maximum genome size in our list is $max Mb\n";
Maximum genome size in our list is 2700 Mb
Here we create a subroutine called maximum
which takes two arguments, $a
and $b
. Perl has an odd syntax for subroutine arguments. It creates a special array, @_
, to hold any arguments passed into a subroutine. To get at the arguments, we have to unpack @_
. This is what we are doing in the first line of maximum
.
To return a value from the subroutine, we use the return
keyword. Perl's subroutines can only return a single value or an array, but not separate multiple values, and not a hash.
For interest, here's another take on this subroutine:
%%perl
my @genome_mb = (100.3, 2700);
sub maximum2 {
my $a = shift;
my $b = shift;
return ($a > $b) ? $a : $b;
}
my $max = maximum2 @genome_mb;
print "Maximum genome size in our list is $max Mb\n";
Maximum genome size in our list is 2700 Mb
Three things have changed here. Firstly, we are now calling the subroutine without brackets - this works for our own subroutines as well as core functions, although it is not recommended. Secondly, we are using shift
to take arguments off the special @_
array; as this array is always created for a subroutine, Perl doesn't require us to call it by name, and just assumes we mean this array when we don't give an argument for shift.
Thirdly, we are using a nice shorthand conditional syntax: (cond) ? A : B
. If cond
is true, do whatever is after the ?
; if false, do whatever is after the :
. This is very handy when we are trying to decide what value a variable should have based on some condition, as here, or when the processing we need to do as a result of the computation is very simple.
Exercises
maximum
that takes two numerical arguments and returns their mean.mean
and maximum
subroutines assume we only have two arguments, but often we have more or less. Write another subroutine that takes an array that could contain any number of elements and returns the mean of all the elements. Use the first @genome_mb
array (with five elements) to test your subroutine.So far we have just been operating on data that we have defined ourselves within our scripts. In the real world we will want to get input data from users and files etc., operate on it, and frequently output some results to other files. Perl has lots of support for dealing with input and output (I/O in computer jargon), and we'll go through some of these in the next few lessons.
The first form of user input we'll look at is reading in arguments passed to our scripts on the command line, these may include specific parameters for some analysis (e.g. gene names, or numerical cutoffs) or often the names of files that we will operate on later. Command line arguments are similar in some ways to arguments that we pass to subroutines.
So far we have been running our scripts with command lines like:
perl my_script.pl
If we want to supply some arguments to the script we can do so by writing them after the name of the script:
perl my_script.pl 23 BRCA2 0.5
Perl will split these up according to whitespace (so if you want to pass in an argument that contains whitespace you'll need to put the string in quotation marks), and then put them all in another 'magic' array variable called @ARGV
.
We can then access arguments as values in this array just like any other, so @ARGV
is equivalent to the @_
array Perl uses to pass arguments to subroutines.
%%file lesson11_example1.pl
my @user_input = @ARGV;
print join("\n" , @user_input) , "\n";
Writing lesson11_example1.pl
%%bash
perl lesson11_example1.pl 23 BRCA2 0.5
23 BRCA2 0.5
shift
and @ARGV
¶Remember the array function shift
removes and returns the first element of an array.
%%file lesson11_example2.pl
my $first_input = shift @ARGV;
print "first input: $first_input\n";
Writing lesson11_example2.pl
%%bash
perl lesson11_example2.pl one two three
first input: one
Just like in subroutines where the argument array @_
is implicitly used when calling array functions like shift
so you don't have to explicitly name it, in the main body of a script @ARGV
will be used implicitly, so a common idiom for reading in a command line argument is:
%%file lesson11_example3.pl
my $first = shift;
print "First: $first";
Writing lesson11_example3.pl
%%bash
perl lesson11_example3.pl one two three
First: one
Remember with arrays that each element can be accessed using indices.
%%file lesson11_example4.pl
my $first_input = $ARGV[0];
my $second_input = $ARGV[1];
print "$first_input and $second_input\n";
Writing lesson11_example4.pl
%%bash
perl lesson11_example4.pl one two three
one and two
%%file lesson11_example5.pl
my ($in_1, $in_2 , $in_3) = @ARGV;
print "$in_1 and $in_2 and $in_3\n";
Writing lesson11_example5.pl
%%bash
perl lesson11_example5.pl one two three
one and two and three
Now we are able to read in some input from users, it is important for your programs to check that the input matches any assumptions you have made about this input, and to deal with things gracefully if they do not. For example, say you want to compute the average of some numbers the user supplies on the command line, what should you do if the user gives you strings instead of numbers?
An even more basic check to perform is that an expected argument has indeed been supplied.
Another common check is that if the user supplies the name of a file to operate on (as we will see later), we should check that the file exists, and perhaps that it has some data in it.
Perl provides a few useful functions to help with these sort of checks:
Test | Description |
---|---|
defined | checks if a variable is defined |
-e | checks if a file exists |
-s | checks the size of a file (returning the result as a number) |
If we find some problem with user input then Perl also has a couple of useful functions that allow us to print warning messages to the user, without interfering with the normal output of the program:
Action | Description |
---|---|
die $message | prints a message to STDERR and kills the program |
warn $message | prints a message to STDERR, but the program will continue |
The example below checks to see if an argument was provided, and if it is then the argument will be assigned to $arg_1
. Otherwise the else
block will be executed and the script will die and an error message will be printed.
%%file lesson12_example1.pl
my $arg_1;
if (defined $ARGV[0]){
$arg_1 = $ARGV[0];
print "Here's your number: $arg_1\n";
}
else {
die "Please supply a number\n";
}
Writing lesson12_example1.pl
%%bash
perl lesson12_example1.pl 3.14
perl lesson12_example1.pl
Here's your number: 3.14
Please supply a number
The example below does the same check, but also uses -e
to check that a file with the supplied name exists. Here we are using negative conditions (NOT defined and so on) to check the basic sanity tests, so we can get them out of the way before we get on to the rest of the code. This way, we don't have to embed all of our code in if blocks.
%%file lesson12_example2.pl
if (!defined $ARGV[0]) {
die "Please supply a file\n";
}
if (!-e $ARGV[0]) {
die "$ARGV[0] does not exist. Please check the filename and/or location.\n";
}
my $filename = $ARGV[0];
print "$filename is a file that exists!\n";
Writing lesson12_example2.pl
%%bash
perl lesson12_example2.pl lesson12_example1.pl
perl lesson12_example2.pl
perl lesson12_example2.pl lesson12_example42.pl
lesson12_example1.pl is a file that exists!
Please supply a file lesson12_example42.pl does not exist. Please check the filename and/or location.
@ARGV
is just a normal array)ls
on Linux or Mac or dir
on Windows to check for one)Frequently the data we want to operate on or analyse will be stored in files, so in our programs we need to be able to open files, read through them (perhaps all at once, perhaps not), and then close them. We will also frequently want to be able to print output to files rather than always printing out results to the terminal.
Perl supports all of these modes of operations on files, and provides a number of useful functions and syntax to make dealing with files fairly straightforward.
The first thing we need to be able to do is open a file, for this Perl has the conveniently named open
function.
The syntax of this function is: open(file handle, mode, file name)
A file handle is a special scalar variable that perl uses to represent files in your program, and is distinct from the file name which is just a string which contains the location of the file on your machine's hard disk.
The mode tells Perl if you want to open your file for reading or writing, or appending (another form of writing), and Perl uses special characters to represent each of these modes (note these are the same as the UNIX file redirection symbols):
Mode | Symbol |
---|---|
reading | < |
writing | > |
appending | >> |
If the call to open
is successful it will return a true value, if it is not, perhaps because the file doesn't exist, or you don't have the correct permissions to open it, it will return a false value.
Another very common idiom that relies on this is to include an or die
statement directly after the call to open which will be executed if there was some problem opening the file.
%%file lesson13_example1.pl
my $file = shift @ARGV;
open (my $fh , '<' , $file) or die "Can't open $file, $!\n";
print "opened file: $file\n";
Writing lesson13_example1.pl
%%bash
perl lesson13_example1.pl lesson13_example1.pl
opened file: lesson13_example1.pl
Once we have successfully opened up our file we will probably want to start reading the contents of a file, which we typically do line by line. A common pattern, especially if we don't know in advance how many lines our file contains, is to use a while
loop to loop over all the lines in the file operating on each line in turn - perhaps doing some analysis on each line as we proceed, or perhaps storing the data in some data structure to analyse later on in the script.
Perl provides some special syntax to read files line by line, to access lines in a file we wrap our file handle in <>
characters, e.g. <$FILEHANDLE>
. This is another operator that is aware of the context in which it is called. In scalar context this will return the next line in the file (Perl keeps track of where in a file it is up to for you), but if used in array context it will return all lines from the file in an array (watch out for big files!)
my $line = <$FILEHANDLE>
my @lines = <$FILEHANDLE>
Each line in the file with have an invisible newline character \n
at the end, and generally we will want to remove this before operating on the line, and so we can use the chomp
string function to remove it.
The example file used below is in Course_Materials/lesson_files
but any file you have will do. You can either change directory to lesson_files, or you can just specify the name of the directory when writing out the file. Here we are using the Unix command cat
to print out the contents of the file.
%%bash
cat lesson_files/lesson13_file1.txt
cd lesson_files
cat lesson13_file1.txt
This is a file with lines of words. This is a file with lines of words.
%%perl
my $filename = 'lesson_files/lesson13_file1.txt';
open(my $inputfh, '<' , $filename) or die "can't open $filename\n";
while (my $line = <$inputfh>) {
chomp $line;
print uc $line , "\n";
}
THIS IS A FILE WITH LINES OF WORDS.
If we open a file for writing we to tell Perl to print output to that file handle, otherwise it will keep printing out to the terminal. We do this by putting the file handle between the print
statement and the stuff we want to print to that file. It is good practice to then close
your file when you are finished with it.
%%perl
my $newfilename = 'lesson13_file2.txt';
open(my $newfh , '>' , $newfilename) or die "Can't open $newfilename\n";
print $newfh lc "Print this TO a FILE in LOWER case\n";
close $newfh;
%%bash
cat lesson13_file2.txt
print this to a file in lower case
In biology we very often end up dealing with tab (or comma) delimited files. We can use some of the string functions we covered earlier such as split
, in combination with the various file operations Perl supports, to deal with the sorts of files in a very straightforward way.
Here is an example splitting the columns into elements of an array. The file we will use contains the chromosomal coordinates of a few genes downloaded from Ensembl, and has 4 columns: gene, chromosome, start, end. We want to read this file and then compute the length of each gene.
%%bash
cat lesson_files/lesson13_file3.txt
BRCA2 13 32889611 32973805 TNFAIP3 6 138188351 138204449 TCF7 5 133450402 133487556
%%perl
my $genefilename = 'lesson_files/lesson13_file3.txt';
open(my $genefh, '<' , $genefilename) or die "Can't open $genefilename\n";
while (my $line = <$genefh>) {
chomp $line;
my ($gene, $chr, $start, $end) = split "\t", $line;
## for csv (comma separated files) split on the comma
# split ',' , $line;
my $length = $end - $start + 1;
print "$gene is on chromosome $chr and is $length bps long\n";
}
close $genefh;
BRCA2 is on chromosome 13 and is 84195 bps long TNFAIP3 is on chromosome 6 and is 16099 bps long TCF7 is on chromosome 5 and is 37155 bps long
-e
, and opens the file for reading. For each line in the file, print out the line number and the length of the corresponding line. You can use lesson13_file4.txt
as an example input file, or you can create your own.lesson11_exercise2.pl
). The script should also open an output file for writing and then print the GC content for each input line on a separate line in the output file. You should check that you don't try to compute GC content on the header lines, which start with a '>' character. You can use lesson13_file5.fasta
as an example input file, which contains BRCA2 sequences for human and mouse, or you can create your own.%%bash
cat lesson_files/lesson13_file4.txt
CGGCTAGATCCAGAT CGTGTAA GTACACCCA GTCAACACTTA
%%bash
cat lesson_files/lesson13_file5.fasta
>13 dna:chromosome chromosome:GRCh37:13:32889611:32973805:1 GGGCTTGTGGCGCGAGCTTCTGAAACTAGGCGGCAGAGGCGGAGCCGCTGTGGCACTGCT GCGCCTCTGCTGCGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGCCGCCGGGAGAAGCGTG AGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCAAAAAAGAACTG CACCTCTGGAGCGGGTTAGTGGTGGTGGTAGTGGGTTGGGACGAGCGCGTCTTCCGCAGT CCCAGTCCAGCGTGGCGGGGGAGCGCCTCACGCCCCGGGTCGCTGCCGCGGCTTCTTGCC CTTTTGTCTCTGCCAACCCCCACCCATGCCTGAGAGAAAGGTCCTTGCCCGAAGGCAGAT TTTCGCCAAGCAAATTCGAGCCCCGCCCCTTCCCTGGGTCTCCATTTCCCGCCTCCGGCC CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGA CGCCATCTGAAATTTCTTGGAAACACGATCACTTTAACGGAATATTGCTGTTTTGGGGAA >gi|1905987|gb|U89503.1|MMU89503 Mus musculus Brca2 (BRCA2) gene, partial cds GATATCCACTCTAAAAAATAAAAAAAGAAAGTTTATTTACTCTGTAAGTGATGATGCATCTCTTCAAGGA AAAAAACTACAGACACACAGACAGTTAGAGCTTACTAATCTTTCAGCCCAGCTTGAAGCAAGTGCTTTTG AAGTACCATTGACCTTTACAAATGTAAATTCAGGTATTATTCTTTTTTTCTTCTAATGTATATAGTCTTT AGGATGAGATTAAGTTTTCCTATTTGAATATAAGTTTTTTGTTTTTTGAATTTTAAGAGAAACACTTGCT TACTTTCTGTAGTTTTTCAGATCTCTTTTTGGTTTGGTTTGGTTTGGTTTGGTTTGGTTTGGTTTGGTGA GAACCCTGTGTAGCCCTGGCTGTCTTAGAACTTAGGCTGGCCTGGAACTTAGATATC
Very often in bioinformatics we will deal with line-oriented text files with multiple columns of information, and sometimes we want to convert between formats or compute simple statistics over a complete data file without having to write a full script (though that is a good idea if the computation we want to perform is complex, or we will be running the same analysis multiple times).
There are several unix tools that can be used to do this kind of work, such as cut
to extract particular columns, awk
to filter data and perform simple calculations, and sed
to substitute text. Perl was originally designed as a superset of all these tools, and supports working with data files in a very similar way to these tools in 'one-liner' mode.
Various flags can be supplied to the perl interpreter to make it useful in this context:
-e
execute the string passed on the command line as a perl program (the program should be supplied within quotes)-l
automatically chomp
s each line, and prints a trailing newline after each print
command-a
automatically split
s each line from the standard input into a special array called @F
, the delimiter defaults to whitespace, but can be controlled by specifying a delimiter with -F
, e.g. -F","
will split on commas, or -F"\t"
to split on tabs.-n
which tells perl to assume the following loop around your program: while (<>) { # your program goes here }
These can all be combined to produce a powerful mechanism for processing files and other data piped into your program from the command line. If you want to keep track of some data throughout the run and then produce some summary results at the end, you can use an END{}
block at the end of your code which will only be run once the whole file has been processed. You can also supply a BEGIN{}
block which will be run before any lines are processed and can be used, for example, to print out a header line for your output.
For example, we could solve the exercise from above with the following command line:
%%bash
perl -lane 'print $. , ": ", length $F[0]' < lesson_files/lesson13_file4.txt
1: 15 2: 7 3: 9 4: 11
Note that the special perl variable $.
contains the current line number being processed. As another example we could print out the gene names and corresponding chromosome from the example file of gene coordinates with:
%%bash
perl -lane 'print "$F[0] is on chromosome $F[1]"' < lesson_files/lesson13_file3.txt
BRCA2 is on chromosome 13 TNFAIP3 is on chromosome 6 TCF7 is on chromosome 5
Write a perl one liner to compute the length of each gene from lesson13_file3.txt
and print the gene name in lower case and it's length separated by a comma, you could print a header line as well with a BEGIN{}
block.
Advanced: write a perl one liner to compute the counts of all nucleotides occuring in lesson13_file4.txt
Another common use of Perl, particularly on UNIX systems, is to automate running external programs or scripts.
Perl makes this very easy to do, and there are two ways in which you can run an external command.
system
functionsystem
¶system
is a function that you can supply the command line statement to be executed as the first argument and the exit status is returned. Output cannot be captured in a variable using system
, but see backticks `` below for this feature.
system
returns:
$!
to find out the reason%%perl
my $sys = system ("date");
print "sys returned: $sys\n";
if ($sys == -1) {
print "Error: $!\n";
}
Thu 1 May 2014 20:38:40 BST sys returned: 0
You can also wrap the command you want to run (which may include variables) in backticks``. The advantage of this approach is that you can capture the output of the external command into a variable in your script, and then operate on this data. In array context backticks will give you all the output lines in an array, while in scalar context you will get all the output as a single string value.
%%perl
my $dir = `pwd`;
print "I am currently in $dir\n";
I am currently in /Users/grsr/git/perl-course
ls
to get a list the files in the current directory, then prints out all upper case versions of all the file names.Sorting data in one way or another is a very common operation. Perl provides us with a very flexible function called sort
that we can use to sort any kind of array, and because we can access the keys and values of hashes as arrays, it allows us to control the order in which we output values stored in hashes etc.
sort
is very customisable, but its default behaviour is to sort in alphabetical order (or ASCII order to be precise).
sort
is a function that takes an array as an argument and returns a copy of the array sorted according to our criteria.
%%perl
my @array = ('Valine', 'Leucine','Cysteine','Tyrosine');
my @sorted_array = sort @array;
foreach my $element (@sorted_array){
print "$element\n";
}
my @numbers = (3,400,1,100,4,20);
my @sorted_numbers = sort @numbers;
foreach my $element (@sorted_numbers){
print "$element\n";
}
Cysteine Leucine Tyrosine Valine 1 100 20 3 4 400
When we have numerical data we generally don't want this behaviour, we'd rather that the values were sorted by their values, possibly either ascending or descending.
To do this we have to customise the behaviour of sort
with a small block of code placed in between sort
and the array we are sorting.
The only thing sort
needs to know to be able to sort any array in any complicated way you can imagine is whether one value is "less than", "greater than" or "equal" to another, and that what the block of code you supply to sort tells it. Inside this block of code the two values sort
is asking you to compare are called $a
and $b
.
So, to sort a list of numbers we need some function that tells sort
how to compare two numbers. You could supply your own function, but this is a common requirement so Perl gives us the "spaceship" operator <=>
, and the block of code we will use is: {$a <=> $b}
.
This block will sort numbers in ascending order, if we want to reverse the sort order we can just swap the order of $a
and $b
.
%%perl
my @numbers = (3,400,1,100,4,20);
my @sorted_numbers = sort {$a <=> $b} @numbers;
foreach my $element (@sorted_numbers){
print "$element\n";
}
1 3 4 20 100 400
As well as sorting arrays according to the values of the elements, we can sort by any property of the values as well.
A useful example is to sort an array of strings by the lenght of each string, instead of their alphabetical ordering. We can do this just by slightly modifying the sort block.
%%perl
my @fragments = ( 'TCTCGAATC' , 'TTA' , 'A' , 'GCGTGATGTCGA' , 'GATC' );
my @sorted_by_length = sort { length($b) <=> length($a) } @fragments;
foreach my $element( @sorted_by_length ) {
print "$element\t", length ($element), "\n";
}
GCGTGATGTCGA 12 TCTCGAATC 9 GATC 4 TTA 3 A 1
Because hashes are inherently unordered people often want to access hashes in more controlled way by sorting either the keys or the values.
Sorting by key is easy, we just use keys
to get an array of the keys, sort this array, and then use this sorted array to access all of out values. Sorting by value is a little trickier.
Recall that to access the value corresponding to some key in a hash we use code like: $value = $hash{$key}
So to access the values of the hash in a sorted order, we will in fact have to sort the keys.
In our sort block $a
and $b
will hold the keys of the hash, but just as in the length example above we will be sorting by some property of the keys, namely their corresponding values. Here's an example:
Order of operations:
keys
function$value = $hash{$key}
)$nt
by the foreach loop%%perl
my %nt_count = (
'G' => 2,
'C' => 1,
'T' => 4,
'A' => 3,
);
@sorted_by_values = sort {$nt_count{$a} <=> $nt_count{$b}} keys %nt_count;
print join ' - ', @sorted_by_values;
print "\n";
foreach my $nt( @sorted_by_values ) {
print "$nt\t$nt_count{$nt}\n";
}
C - G - A - T C 1 G 2 A 3 T 4
In all sorts of programming contexts we will want to manipulate and analyse text. We have seen a number of useful string functions built into Perl that let us do things like look at substrings of strings, chop them up in various ways, and we can already combine these with Perl's logical operators to manipulate strings in fairly complex ways.
We don't need very complicated examples though to realise the limitations of manipulating text using what we've learned so far though. Consider the simple example of checking that a variable supplied by the user is a true number, and doesn't contain any non-numerical characters. How could you check for this?
This is just a very simple example, there are all sorts of patterns we might want to identify in data - in biology we might want to check that a string contains only valid DNA bases, or amino acid codes, or even try to identify specific motifs in a DNA sequence where proteins bind or restriction enzymes cut.
Perl provides us with a very powerful, but somewhat complicated, system to let us answer these kinds of questions called "Regular Expressions" ("regexes", "REs"). This is almost another language built inside Perl and we will only introduce some of the basics here.
When we write a regular expression we are essentially writing a template against which we will check values to see if they match the RE. We typically write these templates inside forward slashes //
, and we ask if a string matches the template with the =~
operator which is a logical operator like ==
. There is also an equivalent to !=
to check if the string does not match: !~
.
The simplest RE, that is still often very useful, just checks if a specific bit of text is found in the string, here we check if there is a stop codon anywhere in the DNA.
%%perl
my $dna = 'ACTTGATAG';
if ($dna =~ /TAG/) {
print "$dna contains a stop codon\n";
}
else {
print "No stop codons found";
}
ACTTGATAG contains a stop codon
As this example shows, inside the pattern, normal characters just match themselves, but we can also match more general classes of things:
.
character matches anything except a newline[ACGT]
matches any DNA base, [1234]
matches any number less than 5By default, the string we are testing is allowed match the pattern anywhere, so in our example above we could have a stop codon as the first 3 bases and the expression would still match. We can however control where in the string the pattern can match with the following special characters:
^
Matches the beginning of the string$
Matches the end of the stringWe can use all of these features to look for common biological signals in DNA sequences:
%%perl
my $dna = "ACAGGAGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATAGGCAGACGAGAACGTAG";
if ($dna =~ /CTGCAG/) {
print "Found PstI recognition site\n";
}
if ($dna =~ /GA.TC/) {
print "Found HinfI recognition site\n";
}
if ($dna =~ /[AG]GATC[TC]/) {
print "Found XhoII recognition site\n";
}
if ($dna =~ /TAG$/) {
print "Found STOP codon at the end of the sequence\n";
}
Found PstI recognition site Found HinfI recognition site Found XhoII recognition site Found STOP codon at the end of the sequence
\d
matches any digit ([0-9]
)\w
matches any word character ([a-zA-Z0-9_]
)\s
matches any whitespace ([ \t\n\r]
)Each of the atoms we have described so far will match a single character, but we can quantify how many times any of them will match with some special characters:
*
matches atom zero or more times+
matches atom one or more times%%perl
foreach my $genename ("BRCA1","BRCA11","BRCA 1","BRCA1 gene") {
if ($genename =~ /^\w+\d$/) {
print "$genename contains one or more word characters, one digit, and nothing else\n";
}
else {
print "$genename contains something other than one or more word characters, one digit and nothing else\n";
}
if ($genename =~ /\w+\s*\d+/) {
print "$genename contains one or more word characters, optional whitespace and one or more digits\n";
}
else {
print "$genename contains something other than one or more word characters, optional whitespace and one or more digits\n";
}
}
BRCA1 contains one or more word characters, one digit, and nothing else BRCA1 contains one or more word characters, optional whitespace and one or more digits BRCA11 contains one or more word characters, one digit, and nothing else BRCA11 contains one or more word characters, optional whitespace and one or more digits BRCA 1 contains something other than one or more word characters, one digit and nothing else BRCA 1 contains one or more word characters, optional whitespace and one or more digits BRCA1 gene contains something other than one or more word characters, one digit and nothing else BRCA1 gene contains one or more word characters, optional whitespace and one or more digits
CG
dinucleotides.So far we have only been checking that our values match the patterns we have been concocting. You can also use regular expressions to capture matching chunks of the string.
To tell Perl you want to capture some bit of the match, just surround the bit you're after in parentheses ()
. You can do this multiple times if you want. Once you have checked that the pattern matches, Perl will then create some (more) "magic" variables that will contain the matched portion of the string. The first captured match will be in a variable called $1
, the second in $2
and so on.
We can also look for alternate variations using the |
operator. So, for example, we could check for any of the three possible STOP codons and capture which one we have found:
%%perl
my $dna = "ACGTGTAG";
if ($dna =~ /(TAG|TAA|TGA)/) {
print "Found this stop codon: $1\n";
}
Found this stop codon: TAG
By default, regular expressions are "greedy". They try to match as much as they can. For example, suppose we wanted to look for gene sequences that start with a Methionine (start codon, ATG) and end with a STOP codon. We might try this, to pull out the two gene sequences in the following sequence (genes in uppercase just so you can see where they are, we turn the whole sequence to upper case straight away before matching):
%%perl
my $dna = 'caccATGAGACAGAACAGTAGgggacagttgcacATGCCAGGACGACCCATATAATAGaca';
$dna = uc $dna;
$dna =~ /(ATG[ACGT]+(TAG|TAA|TGA))/;
print "$1\n";
ATGAGACAGAACAGTAGGGGACAGTTGCACATGCCAGGACGACCCATATAATAG
Because the match is greedy, both of the genes have been extracted. To pull out just the first gene, we need to match the minimum number of times, not the maximum. We can do this using a ?
, which modifies the preceding quantifier and makes it non-greedy. Note that (unfortunately) the ?
is also a quantifier itself that specifies that the atom should match 0 or 1 times (i.e. make it optional).
%%perl
my $dna = 'caccATGAGACAGAACAGTAGgggacagttgcacATGCCAGGACGACCCATATAATAGaca';
$dna = uc $dna;
$dna =~ /(ATG[ACGT]+?(TAG|TAA|TGA))/;
print "$1\n";
ATGAGACAGAACAGTAG
What if we wanted to get both of the genes? We could do this by asking for the regular expression to do a global search, with the g
modifier. We can then use a while loop to get each match one at a time:
%%perl
my $dna = 'caccATGAGACAGAACAGTAGgggacagttgcacATGCCAGGACGACCCATATAATAGaca';
my $dna = uc $dna;
while ($dna =~ /(ATG[ACGT]+?(TAG|TAA|TGA))/g) {
print "$1\n";
}
ATGAGACAGAACAGTAG ATGCCAGGACGACCCATATAA
So far we have only matched parts of strings, but we can also replace patterns with the s///
function.
s///
has two parts: the regular expression and the string to replace it with: s/expression/replacement/
. By default, s///
will only substitute one instance of the match, if you want to substitute all instances then add a g
at the end: s/expression/replacement/g
.
For example, we can simulate the degradation of methylated CpG sites in the genome to TpG as follows:
%%perl
my $dna = "CGAGGAGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATAGGCAGACGAGAACGTCG";
print "$dna\n";
$dna =~ s/CG/TG/;
print "$dna\n"; # Whoops! only mutates the first CG
$dna =~ s/CG/TG/g;
print "$dna\n";
CGAGGAGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATAGGCAGACGAGAACGTCG TGAGGAGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATAGGCAGACGAGAACGTCG TGAGGAGGATCTACTGCAGGCCAGTGTGAAGAGACTCATATAGGCAGATGAGAATGTTG
Regular expression can also be used in several Perl built in functions that we have already seen, for example you can use a regular expression as the delimiter in a split
command. Here's how we could simulate a set of PstI restriction fragments, for example:
%%perl
my $psti = 'CTGCAG';
my $dna = "CGAGGACTGCAGGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATACTGCAGCGTCG";
my @psti_frags = split /$psti/, $dna;
print "Original sequence\n$dna\n";
print "PstI($psti) fragments:\n", join("\n", @psti_frags), "\n"
Original sequence CGAGGACTGCAGGGATCTACTGCAGGCCAGCGCGAAGAGACTCATATACTGCAGCGTCG PstI(CTGCAG) fragments: CGAGGA GGATCTA GCCAGCGCGAAGAGACTCATATA CGTCG
s///
operator to transcribe a DNA sequence into an RNA sequence.TATTAT
. Read in the FASTA file lesson13_file5.fasta
and check each of the sequences in this file for transcription start sites. If a start site is found, print out the sequence downstream of the start site, including any lines after the one containing the start site.In the lesson on subroutines, we saw that we can only pass a single array into a subroutine, and only return a single scalar or array. But we might want to pass in several data structures as input and return more than one value, or possibly more than one complex data structure. We know from looking at arrays earlier that if we put an array inside of an array, Perl will flatten the array and lose the internal structure. This will also happen if we pass in an array or a hash to a subroutine, which is no good.
Also, although arrays and hashes are very useful, our real world data is often more complicated than these structures. We may want to store arrays of arrays, hashes of arrays, arrays of hashes, hashes of hashes. But arrays and hashes can only store scalar values.
To take care of these problems, we can use references. All data and data structures have an address in memory. A reference is a Perl variable that refers to the memory address of some other variable. We never actually need to look at this address, we can just use the reference without thinking about what it points to explicitly. References are scalars, so we can use them to pass complex data structures into subroutines, by passing in references to these data structures. Similarly, to create an array of arrays, we create an array of references to other arrays.
Just so you know what a reference looks like, let's print one out. Ideally, you will never see this in real program output, but in practice, it's not uncommon to accidentally print out some references because of some bug in your code, so it's worth just seeing what they look like.
%%perl
my @codons = ('ACG','GAG','GCG','CCC','AGT','GAA');
my $codon_ref = \@codons;
print "$codon_ref\n";
my %geneticCode = ('TCT' => 'Serine',
'TCC' => 'Serine',
'TAG' => 'STOP',
);
my $geneticCode_ref = \%geneticCode;
print "$geneticCode_ref\n";
ARRAY(0x7ff95202dca0) HASH(0x7ff952031e18)
Let's see how we can use this to store an array reference as a hash value. Note that we are not copying the array; there is only one copy of the array, and when we change it, the hash will also see the change. We will use a module called Data::Dumper
to print more complex data structures.
%%perl
use Data::Dumper;
my @serine = ('TCA','TCC','TCG','TCT');
my @proline =('CCA','CCC','CCG','CCT');
my %aas;
$aas{'serine'} = \@serine;
$aas{'proline'} = \@proline;
print Dumper \%aas;
pop @serine;
shift @proline;
print "After array changes, the hash also changes\n";
print Dumper \%aas;
$VAR1 = { 'serine' => [ 'TCA', 'TCC', 'TCG', 'TCT' ], 'proline' => [ 'CCA', 'CCC', 'CCG', 'CCT' ] }; After array changes, the hash also changes $VAR1 = { 'serine' => [ 'TCA', 'TCC', 'TCG' ], 'proline' => [ 'CCC', 'CCG', 'CCT' ] };
In the Dumper output, you can see that array elements are wrapped in square brackets []
and hash elements are wrapped in curly brackets {}
. We can use these brackets directly to create anonymous arrays and anonymous hashes, which are references to arrays and hashes where the arrays and hashes themselves do not have names. This makes it easier to create multi-dimensional data structures.
%%perl -w
use Data::Dumper;
my %aas = ('serine' => ['TCA','TCC','TCG','TCT'],
'proline' => ['CCA','CCC','CCG','CCT'],
);
my %codes;
$codes{'earth'} = \%aas;
$codes{'mars'} = {'serine' => ['QWZ','QWX','QWQ','QWW'],
'proline' => ['ZXZ','ZXX','ZXQ','ZXW'],
};
print Dumper \%codes;
print "$codes{'earth'}{'serine'}[0]\n";
print "$codes{'mars'}{'proline'}[1]\n";
$VAR1 = { 'earth' => { 'serine' => [ 'TCA', 'TCC', 'TCG', 'TCT' ], 'proline' => [ 'CCA', 'CCC', 'CCG', 'CCT' ] }, 'mars' => { 'serine' => [ 'QWZ', 'QWX', 'QWQ', 'QWW' ], 'proline' => [ 'ZXZ', 'ZXX', 'ZXQ', 'ZXW' ] } }; TCA ZXX
Note the subtle difference between creating the hash %aa
, for which we use the normal brackets ()
as we have before, and creating the hash reference $codes{'mars'}
, for which we use the curly brackets {}
. Once we have created our multi-dimensional structure, we can access elements of it by chaining the hash and array indices. You will sometimes see people using arrows to make it clear we are chaining references (eg $codes->{'earth'}->{'serine'}->[0]
), but this is not strictly necessary.
We have seen how to create a reference for a data structure, but not how to get the data structure from a reference - how to dereference. This can be done by wrapping the reference in curly brackets.
%%perl
my $mars_ref = {'serine' => ['QWZ','QWX','QWQ','QWW'],
'proline' => ['ZXZ','ZXX','ZXQ','ZXW'],
};
my @aas = keys %{$mars_ref};
print "@aas\n";
serine proline
Finally, here's how we can use references to pass complex data structures in to a subroutine and get multiple values out.
%%perl
use Data::Dumper;
sub means {
my @means;
for my $array_ref (@_) {
my $length = @{$array_ref};
my $total = 0;
for my $num (@{$array_ref}) {
$total += $num;
}
push @means, $total / $length;
}
return @means;
}
my @mean_out = means([1,2,3],[4,5,6],[7,8,9]);
print Dumper \@mean_out;
$VAR1 = [ '2', '5', '8' ];
The means
subroutine takes a list of array references as input, with the references pointing to anonymous arrays, defined with square brackets. The subroutine itself loops through the array references and calculates the mean of each array. It then returns the @means
array, which we then copy into our @mean_out
output array.
%codes
hash defined above as input, identifies the unique bases of each code from the given codons, and returns two arrays, one containing the four bases of the earth genetic code, and one containing the four bases of the mars genetic code.HINTS: use multiple for loops to iterate through the hash of hashes of arrays and split up each codon by base. You will need to do multiple dereferences to get at the inner hashes and arrays. To find the unique bases, see the answer to Lesson 14 Exercise II.