Strings

We use text all the time in science and computing to store information like:

  • Species names
  • Site names
  • Genetic sequences
  • Information about methods

In Python we store this kind of data in strings.

Types

Strings can have one of two major types in Python:

  • str: which handles all of the characters in the Latin alphabet (basically anything you'll find on a keyboard in North America)
  • unicode: which handles basically anything that would be on a keyboard anywhere

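We'll work with str here, but everything is basically the same using unicode. This distinction applies to Python 2; in Python 3, str itself holds Unicode text and there is no separate unicode type.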

Creating strings

Strings are created using either single or double quotes. It doesn't typically matter which kind of quotes you use, but they do need to match.

In [1]:
genus = 'Dipodomys'
species = "spectabilis"
print(genus)
print(species)
Dipodomys
spectabilis

If we want to create a string that has multiple lines we can do this using triple quotes.

In [2]:
ds_description = """Dipodomys spectabilis is the
scientific name for the
Banner-tailed Kangaroo Rat."""
print(ds_description)
Dipodomys spectabilis is the
scientific name for the
Banner-tailed Kangaroo Rat.

Determining the length of a string

Python uses a single function, len(), to determine the length of most things, including strings.

In [3]:
latin_binomial = "Dipodomys ordii"
len(latin_binomial)
Out[3]:
15
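
As a small sketch of the "most things" part, the same len() call also works on other collections such as lists. The names site_names and empty_label, and the values in them, are made up for illustration.

```python
# Hypothetical example values, not taken from the lesson's data
latin_binomial = "Dipodomys ordii"
site_names = ["Portal", "Rucker", "Sevilleta"]
empty_label = ""

# For a string, len() counts characters (the space is included)
binomial_length = len(latin_binomial)
# For a list, len() counts items
site_count = len(site_names)
# An empty string has length zero
empty_length = len(empty_label)

print(binomial_length)
print(site_count)
print(empty_length)
```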

Concatenating strings

We can combine strings using the + operator.

In [4]:
genus + species + 'weighs about 125 grams.'
Out[4]:
'Dipodomysspectabilisweighs about 125 grams.'

If we want spaces between words we need to add them explicitly.

In [5]:
genus + ' ' + species + ' weighs about 125 grams.'
Out[5]:
'Dipodomys spectabilis weighs about 125 grams.'
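
One thing to watch for is that + only joins strings to other strings. The following is a minimal sketch, assuming the mass is stored as a number in a made-up variable named mass, showing that the number has to be converted with str() first.

```python
genus = 'Dipodomys'
species = 'spectabilis'
mass = 125  # hypothetical variable holding the mass as an integer

# Writing "... + mass + ..." directly would raise a TypeError, because a
# string and an integer cannot be concatenated. str() converts the number.
sentence = genus + ' ' + species + ' weighs about ' + str(mass) + ' grams.'
print(sentence)
```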

Formatted strings

A better way to build this kind of output in Python is with formatted strings. Everywhere we want to place a variable or a value in the string, we put a % followed by a letter that says how the information should be formatted (s for a string, d for an integer, f for a float, etc.). After the string we add a % and then, in parentheses, a comma-separated list of the values or variables to insert.

In [6]:
output = "%s %s weighs about %d grams." % (genus, species, 125)
print(output)
Dipodomys spectabilis weighs about 125 grams.
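
To illustrate how the letter after the % changes the result, here is a short sketch that formats the same number as an integer and as a float. The variable mean_mass and its value are invented for this example.

```python
genus = 'Dipodomys'
species = 'spectabilis'
mean_mass = 122.46  # made-up value for illustration

# %d formats the number as an integer, dropping the fractional part
as_integer = "%s %s weighs about %d grams." % (genus, species, mean_mass)
# %.1f formats the number as a float rounded to one decimal place
as_float = "%s %s weighs about %.1f grams." % (genus, species, mean_mass)

print(as_integer)
print(as_float)
```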

Escape Characters

Sometimes in programming we need to change the way a character works, or add a special character to a string. To do this we use escape characters. For example, what if we want to include an apostrophe in a string? If we just add it then things go wrong:

In [7]:
print('The individual's mass is 122 grams.')
  File "<ipython-input-7-cd1ab404344a>", line 1
    print('The individual's mass is 122 grams.')
                          ^
SyntaxError: invalid syntax

This happens because when Python encounters the apostrophe it thinks we're telling it to end the string, and it doesn't understand what all of the stuff coming after the string is.

To tell Python that we actually want an apostrophe we use an escape character, the \ in this case, so instead of typing ' we type \'

In [8]:
print('The individual\'s mass is 122 grams.')
The individual's mass is 122 grams.

Other escape characters include:

  • \" - Double quotation mark
  • \t - Tab
  • \n - New line
  • \\ - Backslash

To get a literal backslash we double up the escape character: the first backslash escapes the second, so \\ produces a single \ in the string.
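
The escape characters listed above can be tried out with a short sketch like the following. The file path is a made-up example.

```python
# \t inserts a tab and \n starts a new line
table = "genus\tspecies\nDipodomys\tspectabilis"
print(table)

# \\ puts a single backslash in the resulting string
windows_path = "data\\surveys.csv"  # hypothetical file path
print(windows_path)

# Each escape sequence counts as one character in the string
path_length = len(windows_path)
print(path_length)
```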

In fact, if we look at our multi-line string from above, we'll see that it is actually just a regular string, with some new lines inserted using \n.

In [9]:
ds_description
Out[9]:
'Dipodomys spectabilis is the\nscientific name for the\nBanner-tailed Kangaroo Rat.'
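
One way to check this, as a minimal sketch, is to compare the triple-quoted string with a one-line version written using \n. The variable names multi_line and single_line are chosen only for this example.

```python
multi_line = """Dipodomys spectabilis is the
scientific name for the
Banner-tailed Kangaroo Rat."""
single_line = 'Dipodomys spectabilis is the\nscientific name for the\nBanner-tailed Kangaroo Rat.'

# The two ways of writing the string produce identical values
same = (multi_line == single_line)
print(same)

# Splitting on the newline character gives one piece per line
line_count = len(multi_line.split('\n'))
print(line_count)
```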

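Because Python allows both single quotes and double quotes, there is also an easy way to avoid escaping characters in some cases: wrap a string that contains an apostrophe in double quotes, and a string that contains double quotes in single quotes. For example,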

In [10]:
print("The individual's mass is 122 grams.")
print('The original paper states that "The mass of Dipodomys spectabilis is approximately 125 grams."')
The individual's mass is 122 grams.
The original paper states that "The mass of Dipodomys spectabilis is approximately 125 grams."

The String Module

In Python 2 there is a string module that has a lot of useful functions for working with strings. Functions in this module can change capitalization (upper, lower, capitalize), remove excess whitespace (strip), find the location of substrings (find), split a string into pieces (split), and count the number of occurrences of particular characters (count). These module-level functions were removed in Python 3, where the equivalent string methods described in the next section are used instead.

In [11]:
import string

genus = 'Dipodomys'
species = '    spectabilis'
latin_binomial = 'Dipodomys ordii'
dna_seq = 'atgcagatcctgtgtgtctagctaag'

print("The lower case version of genus is: %s" % string.lower(genus))
print("The upper case version of species is: %s" % string.upper(species))
print("The value of species without the leading whitespace is: %s" % string.strip(species))
print("The location of the start of the first 'tcct' in dna_seq is: %s" % string.find(dna_seq, 'tcct'))
print("The number of a's in dna_seq is: %s" % string.count(dna_seq, 'a'))
genus, species = string.split(latin_binomial)
print("The genus in latin_binomial is: %s" % genus)
print("The species in latin_binomial is: %s" % species)
The lower case version of genus is: dipodomys
The upper case version of species is:     SPECTABILIS
The value of species without the leading whitespace is: spectabilis
The location of the start of the first 'tcct' in dna_seq is: 7
The number of a's in dna_seq is: 6
The genus in latin_binomial is: Dipodomys
The species in latin_binomial is: ordii

String Methods

Many kinds of objects in Python carry their own functions with them. These kinds of functions are called methods.

Unlike a regular function, which acts on whatever object is passed to it, a method acts on the object it is attached to.

To call a method for a particular object we use the object name, followed by a period (called a dot), followed by the name of the method. For example, if we want to make all of the letters in a string capitals we can use the .upper() method.

In [12]:
genus = "Dipodomys"
upper_cased_genus = genus.upper()
print(upper_cased_genus)
DIPODOMYS

All of the functions that are available in the string module are also available as methods.

In [13]:
genus = 'Dipodomys'
species = '    spectabilis'
latin_binomial = 'Dipodomys ordii'
dna_seq = 'atgcagatcctgtgtgtctagctaag'

print("The lower case version of genus is: %s" % genus.lower())
print("The upper case version of species is: %s" % species.upper())
print("The value of species without the leading whitespace is: %s" % species.strip())
print("The location of the start of the first 'tcct' in dna_seq is: %s" % dna_seq.find('tcct'))
print("The number of a's in dna_seq is: %s" % dna_seq.count('a'))
genus, species = latin_binomial.split()
print("The genus in latin_binomial is: %s" % genus)
print("The species in latin_binomial is: %s" % species)
The lower case version of genus is: dipodomys
The upper case version of species is:     SPECTABILIS
The value of species without the leading whitespace is: spectabilis
The location of the start of the first 'tcct' in dna_seq is: 7
The number of a's in dna_seq is: 6
The genus in latin_binomial is: Dipodomys
The species in latin_binomial is: ordii
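
Because each of these methods returns a new value rather than changing the original string, calls can be chained one after another. The sketch below, which reuses the variables from above, also shows what find() gives back when the substring is absent. The names clean_species and missing_position are introduced only for this example.

```python
species = '    spectabilis'
dna_seq = 'atgcagatcctgtgtgtctagctaag'

# strip() returns a new string, and capitalize() is then called on that result
clean_species = species.strip().capitalize()
print(clean_species)

# The original string is left unchanged
print(species)

# find() returns -1 when the substring does not occur in the string
missing_position = dna_seq.find('nnnn')
print(missing_position)
```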