
Intermediate Python: Programming¶

Class 4¶

So far in this course, we've learned about different programming structures in Python to assist in automating tasks, and also explored ways to document and define defaults for functions. In today's class, we'll complete the course by thinking about ways we can make our functions as robust and reusable as possible.

By the end of this class, you should be able to:

perform debugging on functions creating errors
understand principles of test-driven development
write command-line programs for Python code

Debugging¶

In our previous class, we discussed testing and validating our code. If you do identify a problem that either prevents the code from working, or that prevents the code from accomplishing the task you had in mind, you'll need to debug your code.

If you would like additional explanations for the concepts covered in this section, more detail is available in the original lessons from Software Carpentry.

Debugging code associated with research analysis is particularly challenging. We're writing code to find out an answer to a question, so validating that our answers are accurate is difficult.

Keeping in mind your overall goal for what the code should accomplish, a reasonable process for writing research software includes:

Testing with subset data: Use a few example data points for testing before moving on to the entire dataset.
Testing with simplified data: Use synthetic data, or subset to a simpler unit (e.g., only one chromosome instead of the entire genome).
Compare to known findings: Use well-established reports from previously published literature, specifically from model systems, to confirm your software is finding equivalent results.
Check conservation laws: Use summary statistics to confirm the number of samples included doesn't vary in unexpected ways. For example, if you are filtering out data, the number of data points should decrease rather than increase.
Visualize: Although it's difficult to compare figures in an automated way (e.g., with a computer), we can use data visualizations to confirm our assumptions are being met. In fact, we modeled this approach in earlier classes in this course.
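As a minimal sketch of the "check conservation laws" idea above, the snippet below filters a small list and confirms the number of data points did not grow. The function keep_valid and the sample readings are hypothetical, made up only for illustration:

```python
def keep_valid(readings):
    """Drop missing readings (None) from a list; a hypothetical filtering step."""
    return [value for value in readings if value is not None]

# Hypothetical sample data with two missing values.
readings = [0.0, 1.5, None, 3.0, None]
filtered = keep_valid(readings)

# Conservation check: filtering may remove data points but can never add them.
assert len(filtered) <= len(readings)

n_dropped = len(readings) - len(filtered)
print("dropped", n_dropped, "of", len(readings), "readings")
```

A check like this is cheap to leave in place, so it keeps guarding the analysis when the filtering step is changed later.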

With these general steps in mind, let's explore some specific approaches to assist in the debugging process.

Ensure failures are consistent: Double check that you've executed all the code that your problem code needs to run, and that you're using the data you initially intended. It's easy to blame the code for not working, when it's actually a mistake we made when trying to run it.
Fail quickly and efficiently: Minimize the time it takes to get the error to resurface, and isolate the portion of the code involved.
Change one thing at a time: Make only one alteration before testing your code again, and be rational in choosing what change to try next.
Keep track of what you've done: Don't make yourself repeat an experiment, and also be able to remember what happened when you last tried something.
Ask for help: Whether someone in your lab group, other members of your computational community, or strangers online, you might be able to save yourself a lot of time and energy by leveraging someone else's expertise. As a bonus, it's possible that framing your problem in terms someone else could understand will help you figure it out on your own!

Challenge-debug¶

The following code computes the Body Mass Index (BMI) of patients. BMI is calculated as weight in kilograms divided by the square of height in metres. The test results indicate all patients appear to have unusual and identical BMIs, despite having different physiques. Thinking about the tips for debugging described above, what suggestions do you have to improve this code?

patients = [[70, 1.8], [80, 1.9], [150, 1.7]]

def calculate_bmi(weight, height):
    return weight / (height ** 2)

for patient in patients:
    weight, height = patients[0]
    bmi = calculate_bmi(height, weight)
    print("Patient's BMI is: %f" % bmi)

Note: The syntax with percentage signs seen in the challenge above is a type of string formatting; here, %f displays bmi as a floating-point number (rounded to six decimal places by default). For more information about string formatting, see this article.
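To make the formatting note concrete, here is a small sketch comparing a few ways of formatting a float. The number used is an arbitrary example value, not output from the challenge code:

```python
value = 21.604938  # arbitrary example value

default_f = "%f" % value        # %f shows six decimal places by default
one_decimal = "%.1f" % value    # .1 asks for one decimal place
f_string = f"{value:.1f}"       # the same request written as an f-string (Python 3.6+)

print(default_f)
print(one_decimal)
print(f_string)
```

Note that these only change how the number is displayed; the value stored in the variable is not rounded.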

Challenge-pair¶

Take one of the functions you've written for this course and deliberately introduce an error. Share that error with one of the other course participants and attempt to debug each other's errors.

Test-driven development¶

Our normal tendency when writing software is to:

Write a function.
Call it interactively on two or three different inputs.
If it produces the wrong answer, fix the function and re-run that test.

A better practice is to:

Write a short function for each test.
Write a function that should pass those tests.
If the function produces any wrong answers, fix it and re-run the test functions.

This latter process is called test-driven development (TDD), and is a way of thinking about designing code.

In this section, we'll write a function to assess whether input ranges overlap. We'll apply practices of test-driven development, and also apply many of the programming methods we've developed over the duration of this class.

We're not applying TDD to our inflammation data because we've already written the functions.

We've chosen a small stand-alone task for this section to demonstrate TDD, but we'll get back to our inflammation data in the next section.

We'll begin by writing three tests for our function, which we'll call range_overlap. This first set of tests acts as "positive controls" that should work for multiple types of input:

# one input
assert range_overlap([ (0.0, 1.0) ]) == (0.0, 1.0) 
# two inputs
assert range_overlap([ (2.0, 3.0), (2.0, 4.0) ]) == (2.0, 3.0) 
# three inputs
assert range_overlap([ (0.0, 1.0), (0.0, 2.0), (-1.0, 1.0) ]) == (0.0, 1.0)

If we run these, errors are expected! We haven't written the function yet. If the tests passed, we'd know we were accidentally using someone else's function. These tests implicitly define what our input and output look like: the function takes a list of pairs as input and produces a single pair as output.
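A minimal sketch of what "errors are expected" looks like in practice, assuming range_overlap has not been defined anywhere in the current session. The helper run_first_test is hypothetical and exists only to capture the error instead of halting:

```python
def run_first_test():
    """Try the first test before range_overlap exists and report what happens."""
    try:
        assert range_overlap([(0.0, 1.0)]) == (0.0, 1.0)
    except NameError as error:
        # Python cannot find the function, so the test cannot even start.
        return "test cannot run yet: " + str(error)
    return "test passed"

outcome = run_first_test()
print(outcome)
```

Seeing a NameError at this stage is the desired result: it confirms the tests will exercise the function we are about to write rather than one that already exists.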

Next, we can write a test for when ranges do not overlap:

assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == None

Here, we've made the decision that no overlap means there is no output. Alternatively, we could create a failure with an error message, or a special value ((0.0, 0.0)) to indicate no overlap.

Next, we'll test for when ranges touch at their endpoints:

assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == None

Again, we need to decide (in the context of our research) whether this counts as overlap. Here, we've decided overlaps need to have non-zero width, so no output will be reported.

Finally, it's good practice to include a test for when the input is empty:

assert range_overlap([]) == None

Next, we'll actually write our function (spoiler: there are errors in this):

range_overlap¶

def range_overlap(ranges):
    '''Return common overlap among a set of [left, right] ranges.'''
    max_left = 0.0
    min_right = 1.0
    for (left, right) in ranges:
        max_left = max(max_left, left)
        min_right = min(min_right, right)
    return (max_left, min_right)

And define our test function:

test_range_overlap¶

def test_range_overlap():
    assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == None
    assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == None
    assert range_overlap([ (0.0, 1.0) ]) == (0.0, 1.0)
    assert range_overlap([ (2.0, 3.0), (2.0, 4.0) ]) == (2.0, 3.0)
    assert range_overlap([ (0.0, 1.0), (0.0, 2.0), (-1.0, 1.0) ]) == (0.0, 1.0)
    assert range_overlap([]) == None

Finally, we'll test our function:

test_range_overlap()

The first test was supposed to produce None, but it fails! Now we know we need to correct our function. Note that we don't know if any other tests failed, because the program halts after the first error.
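One way around the "halts after the first error" limitation is to run each case separately and collect the failures. The sketch below repeats the lesson's first attempt at range_overlap so that it runs on its own; the helper collect_failures and the cases list are hypothetical names introduced here, not part of the lesson's code:

```python
def range_overlap(ranges):
    '''Copy of the lesson's first (buggy) attempt, repeated so this sketch is self-contained.'''
    max_left = 0.0
    min_right = 1.0
    for (left, right) in ranges:
        max_left = max(max_left, left)
        min_right = min(min_right, right)
    return (max_left, min_right)

# Each case pairs an input with the result we expect for it.
cases = [
    ([(0.0, 1.0), (5.0, 6.0)], None),
    ([(0.0, 1.0), (1.0, 2.0)], None),
    ([(0.0, 1.0)], (0.0, 1.0)),
    ([(2.0, 3.0), (2.0, 4.0)], (2.0, 3.0)),
    ([(0.0, 1.0), (0.0, 2.0), (-1.0, 1.0)], (0.0, 1.0)),
    ([], None),
]

def collect_failures(function, cases):
    '''Run every case and return those whose result differs from the expectation.'''
    failures = []
    for ranges, expected in cases:
        actual = function(ranges)
        if actual != expected:
            failures.append((ranges, expected, actual))
    return failures

failures = collect_failures(range_overlap, cases)
for ranges, expected, actual in failures:
    print(ranges, "expected", expected, "but got", actual)
```

Testing frameworks such as pytest take a similar approach by running each test function independently, so one failure does not hide the others.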

In this case, our function is not working because we've initialized max_left and min_right with fixed values (0.0 and 1.0) instead of values taken from the data.

Challenge-range_overlap¶

Try to resolve the error in range_overlap. Rerun your test after each change you make. What other errors do you receive? How could you resolve them?

Command-line programs¶

In this final section, we'll start working with our Python code in a different way. Rather than including the code directly in our interpreter, we'll include our functions in Python scripts, which we can then execute on the command line.

We could open a separate program for command-line work, but we can also run command-line commands from our interpreter by wrapping them in system("").

For example, we can list the files in our project directory in two ways using ls, the Unix command for listing directory contents.

Using system():

system("ls")

Using !:

!ls

We'll be using the second method because it requires less typing. Regardless of the method you use, the output you should see includes:

['class1.ipynb',
 'class2.ipynb',
 'class3.ipynb',
 'class4.ipynb',
 'data',
 'python-novice-inflammation-data.zip']
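Both methods above rely on the notebook environment. As a hedged alternative for plain Python scripts, the standard library's subprocess module can run the same command; this sketch assumes a Unix-like system where ls is available:

```python
import subprocess

# Run the Unix `ls` command and capture what it prints (assumes a Unix-like system).
result = subprocess.run(["ls"], stdout=subprocess.PIPE, universal_newlines=True)

# Split the captured text into one entry per line, similar to the list shown above.
listing = result.stdout.splitlines()
print(listing)
```

The contents of listing depend on the directory the code is run from, so it will not necessarily match the output shown above.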

If you are comfortable on the command line and have a preferred method of executing Unix code (e.g., Terminal on Mac), you are welcome to use that interface instead.

In this section, we'll be writing a script to print summary statistics for inflammation per patient. We'll begin with a simple script, and gradually add features that expand its functionality.

At the end, our script should:

read data from standard input if no filename is given
read data from all files if more than one is given, and report statistics for each file separately
use flags (for min, mean, and max) to determine what statistic to print

If you'd like pre-written copies of the scripts we'll be using, feel free to execute the following code:

import os
import zipfile
import urllib.request

urllib.request.urlretrieve("https://swcarpentry.github.io/python-novice-inflammation/code/python-novice-inflammation-code.zip", "python-novice-inflammation-code.zip")
zipData = zipfile.ZipFile("python-novice-inflammation-code.zip")
zipData.extractall()

First, create a text file in your project directory called sys_version.py that contains the following text:

import sys
print('version is', sys.version)

If you are working in Jupyter notebooks, you can create a new file using the same button as creating a new notebook, but select "New File" under "Other."

We can execute the command using:

!python sys_version.py

The output should be something like:

version is 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]

This describes the version of Python you are running.

Now, create another script called argv_list.py containing:

import sys
print('sys.argv is', sys.argv)

argv stands for "argument vector" (sometimes glossed as "argument values"). It is a list of the items given on the command line: the name of the script, followed by any other arguments.

If you run the script with no arguments:

!python argv_list.py

You should only see the name of the script as output:

sys.argv is ['argv_list.py']

We'll be using sys.argv for the duration of this lesson.

When you begin writing your own programs with more complex command line operations, we recommend using the argparse library.

If you add some arguments:

!python argv_list.py first second third

The output should now include:

sys.argv is ['argv_list.py', 'first', 'second', 'third']
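Because sys.argv is an ordinary list, it can be sliced like any other. The sketch below separates the script name from the remaining arguments; split_command_line is a hypothetical helper, and the list is written out by hand so the example does not depend on how it is launched:

```python
def split_command_line(argv):
    """Separate the script name from the remaining command-line arguments."""
    script = argv[0]      # the first item is always the script name
    arguments = argv[1:]  # everything after it was typed by the user
    return script, arguments

# A list shaped like the sys.argv shown above, written out by hand for illustration.
example_argv = ['argv_list.py', 'first', 'second', 'third']
script, arguments = split_command_line(example_argv)

print("script:", script)
print("arguments:", arguments)
```

Every item in sys.argv is a string, so an argument such as '3' needs to be converted (e.g., with int()) before it can be used as a number.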

Now that we have a basic understanding of these features, let's begin writing our inflammation script. We'll store this script in readings.py.

If you downloaded the zipped code files, the specific script involved at each stage is listed above each relevant section.

readings_01.py¶

Our final desired code has a few different requirements. We'll start with a basic structure of an outlined function. Conventionally, this function is called main:

import sys
import numpy


def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = numpy.loadtxt(filename, delimiter=',')
    for row_mean in numpy.mean(data, axis=1):
        print(row_mean)

sys.argv[0] is always the name of the script, and sys.argv[1] is the name of the file to process. Let's test it:

!python readings.py

readings_02.py¶

That run printed nothing: we've defined a function, but we still need to actually call it in the script:

import sys
import numpy

def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = numpy.loadtxt(filename, delimiter=',')
    for row_mean in numpy.mean(data, axis=1):
        print(row_mean)

if __name__ == '__main__':
    main()

The extra lines we've added at the bottom call main only when the script is run as a stand-alone program (e.g., python readings.py), and not when it is imported as a module (e.g., import readings).
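To see the difference between running and importing, the following sketch writes a tiny script to a temporary folder and then both runs it and imports it. The file name guard_demo.py and the helper run_python are hypothetical, used only for this demonstration:

```python
import os
import subprocess
import sys
import tempfile

# Contents of a small script that uses the same guard as readings.py.
SCRIPT = '''
def main():
    print("main() ran")

if __name__ == "__main__":
    main()
'''

def run_python(arguments, working_directory):
    """Run the current Python interpreter with the given arguments and return its output."""
    result = subprocess.run([sys.executable] + arguments, cwd=working_directory,
                            stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout

with tempfile.TemporaryDirectory() as folder:
    # guard_demo.py is a hypothetical file name used only for this sketch.
    with open(os.path.join(folder, "guard_demo.py"), "w") as handle:
        handle.write(SCRIPT)
    # Run as a script: __name__ is "__main__", so main() is called.
    as_script = run_python(["guard_demo.py"], folder)
    # Import as a module: __name__ is "guard_demo", so main() is skipped.
    as_module = run_python(["-c", "import guard_demo"], folder)

print("run as script:", repr(as_script))
print("imported:", repr(as_module))
```

This is what makes it safe to import functions from a script into a notebook or another program without triggering the script's command-line behavior.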

This general structure is the conventional way of writing command line programs:

def main():
    # code goes here
    pass


if __name__ == "__main__":
    main()

We're now ready to run it again:

!python readings.py data/small-01.csv

We're using one of our small data files, since it makes it easier to inspect the output while we're developing code.

readings_03.py¶

Our next step will be to include a for loop to run across multiple data files:

import sys
import numpy

def main():
    script = sys.argv[0]
    for filename in sys.argv[1:]:
        data = numpy.loadtxt(filename, delimiter=',')
        for row_mean in numpy.mean(data, axis=1):
            print(row_mean)

if __name__ == '__main__':
    main()

We can now test it on two files:

!python readings.py data/small-01.csv data/small-02.csv

readings_04.py¶

In our next step, we'll include multiple options for summary statistics:

import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]

    for filename in filenames:
        data = numpy.loadtxt(filename, delimiter=',')

        if action == '--min':
            values = numpy.min(data, axis=1)
        elif action == '--mean':
            values = numpy.mean(data, axis=1)
        elif action == '--max':
            values = numpy.max(data, axis=1)

        for val in values:
            print(val)

if __name__ == '__main__':
    main()

Testing:

!python readings.py --max data/small-01.csv

While this works, there are a few problems:

main is getting large
if we only specify a file name, we can end up with a silent failure
we need to ensure the submitted flag is one of the accepted options

readings_05.py¶

This improved structure resolves the three issues above:

import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
           'Action is not one of --min, --mean, or --max: ' + action
    for filename in filenames:
        process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()

You should try this script with a few inputs to see how it works under different circumstances.
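One possible variation on the validation step, offered as a sketch rather than the lesson's approach: a dictionary can define the accepted flags and the function each one selects in a single place, so the check and the dispatch cannot drift apart. The names ACTIONS and summarize are hypothetical, and plain lists with the standard library stand in for the numpy arrays used in readings.py:

```python
import statistics

# Hypothetical lookup table: one place that defines both the accepted flags
# and the function each flag selects.
ACTIONS = {"--min": min, "--mean": statistics.mean, "--max": max}

def summarize(rows, action):
    """Apply the statistic chosen by `action` to each row (a list of numbers)."""
    if action not in ACTIONS:
        raise ValueError("Action is not one of --min, --mean, or --max: " + action)
    return [ACTIONS[action](row) for row in rows]

# Two made-up rows standing in for two patients' readings.
rows = [[0, 1, 2], [2, 4, 6]]
maxima = summarize(rows, "--max")
print(maxima)
```

Raising ValueError instead of using assert is a design choice in this sketch: assertions can be disabled when Python is run with the -O option, whereas an explicit raise always runs.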

readings_06.py¶

To add the option to accept data via standard input (stdin):

import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
           'Action is not one of --min, --mean, or --max: ' + action
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()

To try this, we need to use a different format on the command line:

!python readings.py --mean < data/small-01.csv
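The same process function can handle both cases because numpy.loadtxt accepts either a file name or an open file-like object such as sys.stdin. The sketch below illustrates the idea of reading from a file-like object; to stay free of third-party packages it swaps in the standard library's csv and statistics modules for numpy, and uses io.StringIO as a stand-in for sys.stdin. The function row_means and the sample data are hypothetical:

```python
import csv
import io
import statistics

def row_means(lines):
    """Return the mean of each comma-separated row in an iterable of text lines."""
    reader = csv.reader(lines)
    # Skip blank lines; convert each field from text to a number before averaging.
    return [statistics.mean(float(value) for value in row) for row in reader if row]

# io.StringIO behaves like an open text file, so it can stand in for sys.stdin here.
fake_stdin = io.StringIO("0,1,2\n2,4,6\n")
means = row_means(fake_stdin)
print(means)
```

In a real script, passing sys.stdin in place of fake_stdin would read whatever is redirected into the program with the < symbol.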

Challenge-readings_07.py¶

Rewrite readings.py so that it uses short flags (-n, -m, and -x) instead of --min, --mean, and --max, respectively, so you'll be able to use it as !python readings.py -x data/small-01.csv. Is the code easier to read? Is the program easier to understand?

readings_08.py¶

Finally, we can print a help message when the script is run with no arguments:

import sys
import numpy

def main():
    script = sys.argv[0]
    if len(sys.argv) == 1: # no arguments, so print help message
        print('''Usage: python readings_08.py action filenames
              action must be one of --min --mean --max
              if filenames is blank, input is taken from stdin;
              otherwise, each filename in the list of arguments 
              is processed in turn''')
        return

    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
           'Action is not one of --min, --mean, or --max: ' + action
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()

To see the help message, run the script with no arguments:

!python readings.py

readings_09.py¶

We can set mean as the default action if none is specified:

import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    if action not in ['--min', '--mean', '--max']: # if no action given
        action = '--mean'    # set a default action, that being mean
        filenames = sys.argv[1:] # start the filenames one place earlier in the argv list
    else:
        filenames = sys.argv[2:]

    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()

To try it:

!python readings.py data/small-01.csv
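The flag handling built up by hand in this section (validation, a default action, and a help message) is the kind of work the argparse library recommended earlier is designed to take over. The following is a minimal sketch of how these options might be declared, not a replacement for the lesson's scripts; build_parser is a hypothetical name, and the argument list is passed in by hand so the example does not depend on sys.argv:

```python
import argparse

def build_parser():
    """Describe the command-line interface: one optional action flag, then file names."""
    parser = argparse.ArgumentParser(description="Print per-patient summary statistics.")
    # Only one of the three flags may be given at a time.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--min", dest="action", action="store_const", const="min")
    group.add_argument("--mean", dest="action", action="store_const", const="mean")
    group.add_argument("--max", dest="action", action="store_const", const="max")
    parser.set_defaults(action="mean")  # used when no flag is given
    # Zero or more file names; an empty list could signal "read from stdin".
    parser.add_argument("filenames", nargs="*")
    return parser

parser = build_parser()
with_flag = parser.parse_args(["--max", "data/small-01.csv"])
without_flag = parser.parse_args(["data/small-01.csv"])

print(with_flag.action, with_flag.filenames)
print(without_flag.action, without_flag.filenames)
```

Calling parse_args() with no list makes argparse read sys.argv itself, and it generates a -h/--help message automatically from the declarations above.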

Challenge-check_arguments¶

Write a program called check_arguments.py that prints usage then exits the program if no arguments are provided. (Hint: You can use sys.exit() to exit the program.)

Example usage:

!python check_arguments.py

Output:

usage: python check_arguments.py filename.txt

Another usage:

!python check_arguments.py filename.txt

Output:

Thanks for specifying arguments!

Wrapping up¶

In this class, we discussed the process of debugging, test-driven development, and command line programs. This rounds out our course content covering basic Python programming structures and how to apply them in creating robust, reusable code.

Please view the links on your class website for more information about expanding your Python skills.
