So far in this course, we've learned about different programming structures in Python to assist in automating tasks, and also explored ways to document and define defaults for functions. In today's class, we'll complete the course by thinking about ways we can make our functions as robust and reusable as possible.
By the end of this class, you should be able to:
In our previous class, we discussed testing and validating our code. If you do identify a problem that either prevents the code from working, or that prevents the code from accomplishing the task you had in mind, you'll need to debug your code.
If you would like additional explanations for the concepts covered in this section, more detail is available in the original lessons from Software Carpentry.
Debugging code associated with research analysis is particularly challenging. We're writing code to find out an answer to a question, so validating that our answers are accurate is difficult.
Keeping in mind your overall goal for what the code should accomplish, a reasonable process for writing research software includes:
With these general steps in mind, let's explore some specific approaches to assist in the debugging process.
Challenge-debug
The following code computes the Body Mass Index (BMI) of patients. BMI is calculated as weight in kilograms divided by the square of height in metres. The test results indicate that all patients appear to have unusual and identical BMIs, despite having different physiques. Thinking about the tips for debugging described above, what suggestions do you have to improve this code?
patients = [[70, 1.8], [80, 1.9], [150, 1.7]]

def calculate_bmi(weight, height):
    return weight / (height ** 2)

for patient in patients:
    weight, height = patients[0]
    bmi = calculate_bmi(height, weight)
    print("Patient's BMI is: %f" % bmi)
Note: The syntax with percentage signs seen in the challenge above is a type of string formatting. Here, %f displays bmi as a decimal number, rounded to six decimal places by default.
For more information about string formatting, see this article.
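As a brief illustration of the note above, the following sketch compares the default `%f` output with a version that limits the number of decimal places, and with the equivalent f-string. The weight and height are example values taken from the first patient in the challenge.

```python
bmi = 70 / (1.8 ** 2)  # example values: 70 kg, 1.8 m

# %f shows six digits after the decimal point by default
default_text = "Patient's BMI is: %f" % bmi

# %.1f rounds the displayed value to one decimal place
rounded_text = "Patient's BMI is: %.1f" % bmi

# an f-string (Python 3.6 or later) produces the same text as %.1f
fstring_text = f"Patient's BMI is: {bmi:.1f}"

print(default_text)
print(rounded_text)
print(fstring_text)
```

Only the displayed text is rounded; the value stored in `bmi` is unchanged.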
Challenge-pair
Take one of the functions you've written for this course and deliberately introduce an error. Share that error with one of the other course participants and attempt to debug each other's errors.
Our normal tendency when writing software is to:
A better practice is to:
This latter process is called test-driven development (TDD), and is a way of thinking about designing code.
In this section, we'll write a function to assess whether input ranges overlap. We'll apply practices of test-driven development, and also apply many of the programming methods we've developed over the duration of this class.
We're not applying TDD to our inflammation data because we've already written the functions.
We've chosen a small stand-alone task for this section to demonstrate TDD, but we'll get back to our inflammation data in the next section.
We'll begin by writing three tests for our function, which we'll call range_overlap.
These first tests are "positive controls" that should work for multiple types of input:
# one input
assert range_overlap([ (0.0, 1.0) ]) == (0.0, 1.0)
# two inputs
assert range_overlap([ (2.0, 3.0), (2.0, 4.0) ]) == (2.0, 3.0)
# three inputs
assert range_overlap([ (0.0, 1.0), (0.0, 2.0), (-1.0, 1.0) ]) == (0.0, 1.0)
If we run these, errors are expected! We haven't written the function yet. If the tests passed, we'd know we were accidentally using someone else's function. These tests implicitly define what our input and output look like: the function takes a list of pairs as input and produces a single pair as output.
Next, we can write a test for when ranges do not overlap:
assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == None
Here, we've made the decision that no overlap means there is no output.
Alternatively, we could create a failure with an error message, or return a special value such as (0.0, 0.0) to indicate no overlap.
Next, we'll test for when ranges touch at their endpoints:
assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == None
Again, we need to decide (in the context of our research) whether this counts as overlap. Here, we've decided overlaps need to have non-zero width, so no output will be reported.
Finally, it's good practice to include a test for when the input is completely missing:
assert range_overlap([]) == None
Next, we'll actually write our function (spoiler: there are errors in this):
def range_overlap(ranges):
    '''Return common overlap among a set of [left, right] ranges.'''
    max_left = 0.0
    min_right = 1.0
    for (left, right) in ranges:
        max_left = max(max_left, left)
        min_right = min(min_right, right)
    return (max_left, min_right)
And define our test function:
def test_range_overlap():
    assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == None
    assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == None
    assert range_overlap([ (0.0, 1.0) ]) == (0.0, 1.0)
    assert range_overlap([ (2.0, 3.0), (2.0, 4.0) ]) == (2.0, 3.0)
    assert range_overlap([ (0.0, 1.0), (0.0, 2.0), (-1.0, 1.0) ]) == (0.0, 1.0)
    assert range_overlap([]) == None
Finally, we'll test our function:
test_range_overlap()
The first test was supposed to produce None, but it fails!
Now we know we need to correct our function.
Note that we don't know if any other tests failed, because the program halts after the first error.
In this case, our function is not working because we've initialized max_left and min_right with fixed values (0.0 and 1.0), instead of values taken from the actual data.
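Because a bare assert halts at the first failure, one way to see every failing case at once is to loop over (input, expected) pairs and record the mismatches. The sketch below is one possible approach, not part of the lesson's own code: collect_failures is a hypothetical helper name, and range_overlap is the draft version from above, errors included.

```python
def range_overlap(ranges):
    '''Draft version from the lesson (known to contain errors).'''
    max_left = 0.0
    min_right = 1.0
    for (left, right) in ranges:
        max_left = max(max_left, left)
        min_right = min(min_right, right)
    return (max_left, min_right)

# (input, expected output) pairs taken from the tests above
cases = [
    ([(0.0, 1.0), (5.0, 6.0)], None),
    ([(0.0, 1.0), (1.0, 2.0)], None),
    ([(0.0, 1.0)], (0.0, 1.0)),
    ([(2.0, 3.0), (2.0, 4.0)], (2.0, 3.0)),
    ([(0.0, 1.0), (0.0, 2.0), (-1.0, 1.0)], (0.0, 1.0)),
    ([], None),
]

def collect_failures(function, cases):
    '''Hypothetical helper: run every case and return the mismatches.'''
    failures = []
    for ranges, expected in cases:
        actual = function(ranges)
        if actual != expected:
            failures.append((ranges, expected, actual))
    return failures

failures = collect_failures(range_overlap, cases)
for ranges, expected, actual in failures:
    print("FAILED:", ranges, "expected", expected, "but got", actual)
print(len(failures), "of", len(cases), "cases failed")
```

Seeing all of the failures together can make it easier to spot a shared cause than fixing them one at a time.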
Challenge-range_overlap
Try to resolve the error in range_overlap.
Rerun your test after each change you make.
What other errors do you receive?
How could you resolve them?
In this final section, we'll start working with our Python code in a different way. Rather than running the code directly in our interpreter, we'll save our functions in Python scripts, which we can then execute on the command line.
We could open a separate program for command line work, but we can execute command line functions in our interpreter by encasing the commands inside system("").
For example, we can list the files in our project directory in two ways using ls, which is the Unix command for list.
Using system():
system("ls")
Using !:
!ls
We'll be using the second method because it requires less typing. Regardless of the method you use, the output you should see includes:
['class1.ipynb',
'class2.ipynb',
'class3.ipynb',
'class4.ipynb',
'data',
'python-novice-inflammation-data.zip']
If you are comfortable on the command line and have a preferred method of executing Unix commands (e.g., Terminal on Mac), you are welcome to use that interface instead.
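Outside a notebook, Python's standard subprocess module offers a comparable way to run a command from within Python code. The sketch below is an illustration only and is not used elsewhere in this lesson. It assumes nothing about your operating system, because the command it runs is the Python interpreter itself (sys.executable) rather than ls.

```python
import subprocess
import sys

# Run a tiny Python program as a separate process and capture what it prints.
# sys.executable is the interpreter running this code, so the command
# is available on any platform.
output = subprocess.check_output(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    universal_newlines=True,  # return text rather than bytes
)
print(output.strip())
```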
In this section, we'll be writing a script to print summary statistics for inflammation per patient. We'll begin with a simple script, and gradually add features that expand its functionality.
At the end, our script should:
If you'd like pre-written copies of the scripts we'll be using, feel free to execute the following code:
import os
import zipfile
import urllib.request
urllib.request.urlretrieve("https://swcarpentry.github.io/python-novice-inflammation/code/python-novice-inflammation-code.zip", "python-novice-inflammation-code.zip")
zipData = zipfile.ZipFile("python-novice-inflammation-code.zip")
zipData.extractall()
First, create a text file in your project directory called sys_version.py that contains the following text:
import sys
print('version is', sys.version)
If you are working in Jupyter notebooks, you can create a new file using the same button as creating a new notebook, but select "New File" under "Other."
We can execute the script using:
!python sys_version.py
The output should be something like:
version is 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
This describes the version of Python you are running.
Now, create another script called argv_list.py containing:
import sys
print('sys.argv is', sys.argv)
argv stands for "argument values."
These are the things listed on the command line, including the script/function and other arguments.
If you run the script with no arguments:
!python argv_list.py
You should only see the name of the script as output:
sys.argv is ['argv_list.py']
We'll be using sys.argv for the duration of this lesson.
When you begin writing your own programs with more complex command line operations, we recommend using the argparse library.
If you add some arguments:
!python argv_list.py first second third
The output should now include:
sys.argv is ['argv_list.py', 'first', 'second', 'third']
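For comparison with sys.argv, here is a minimal sketch of the argparse library recommended above. The interface is hypothetical and differs from the flags used later in this lesson: it assumes an optional --action argument and any number of filenames. Passing an explicit list to parse_args stands in for the arguments typed on the command line.

```python
import argparse

def build_parser():
    '''Build a parser for a hypothetical summary-statistics script.'''
    parser = argparse.ArgumentParser(
        description="Summarize readings per patient (illustrative only).")
    parser.add_argument("--action", choices=["min", "mean", "max"],
                        default="mean",
                        help="summary statistic to report")
    parser.add_argument("filenames", nargs="*",
                        help="zero or more data files to process")
    return parser

# The list plays the role of sys.argv[1:]
args = build_parser().parse_args(["--action", "max", "data/small-01.csv"])
print(args.action)
print(args.filenames)
```

argparse also generates a usage message and rejects unrecognized choices, which would otherwise need to be written by hand.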
Now that we have a basic understanding of these features, let's begin writing our inflammation script.
We'll store this script in readings.py.
If you downloaded the zipped code files, the specific script involved at each stage is listed above each relevant section.
Our final desired code has a few different requirements.
We'll start with a basic structure of an outlined function.
Conventionally, this function is called main:
import sys
import numpy

def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = numpy.loadtxt(filename, delimiter=',')
    for row_mean in numpy.mean(data, axis=1):
        print(row_mean)
sys.argv[0] is always the name of the script, and sys.argv[1] is the name of the file to process.
Let's test it:
!python readings.py
Running this produces no output. We've defined a function, but we still need to call it in the script:
import sys
import numpy

def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = numpy.loadtxt(filename, delimiter=',')
    for row_mean in numpy.mean(data, axis=1):
        print(row_mean)

if __name__ == '__main__':
    main()
The extra lines we've added at the bottom call main only when the script is run as a stand-alone program (e.g., python readings.py), and not when it is imported as a module (e.g., import readings).
This general structure is the conventional way of writing command line programs:
def main():
    # code goes here

if __name__ == "__main__":
    main()
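To see what the `__name__` check does, the sketch below writes a small temporary file and loads it twice with the standard runpy module: once as if it were run as a script, and once under a module-style name. The file name demo_module.py and the RESULT variable are hypothetical, chosen only for this illustration.

```python
import os
import runpy
import tempfile

# A tiny script that follows the conventional structure
source = """
def main():
    return 'main was called'

RESULT = None
if __name__ == '__main__':
    RESULT = main()
"""

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo_module.py")
    with open(path, "w") as handle:
        handle.write(source)
    # run_name sets the value the file sees as __name__
    as_script = runpy.run_path(path, run_name="__main__")
    as_module = runpy.run_path(path, run_name="demo_module")

print("run as a script:", as_script["RESULT"])
print("loaded under a module name:", as_module["RESULT"])
```

main runs only in the first case, which is why the same file can serve as both a program and an importable module.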
We're now ready to run it again:
!python readings.py data/small-01.csv
We're using one of our small data files, since it makes it easier to inspect the output while we're developing code.
Our next step will be to include a for loop to run across multiple data files:
import sys
import numpy

def main():
    script = sys.argv[0]
    for filename in sys.argv[1:]:
        data = numpy.loadtxt(filename, delimiter=',')
        for row_mean in numpy.mean(data, axis=1):
            print(row_mean)

if __name__ == '__main__':
    main()
We can now test it on two files:
!python readings.py data/small-01.csv data/small-02.csv
In our next step, we'll include multiple options for summary statistics:
import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]

    for filename in filenames:
        data = numpy.loadtxt(filename, delimiter=',')

        if action == '--min':
            values = numpy.min(data, axis=1)
        elif action == '--mean':
            values = numpy.mean(data, axis=1)
        elif action == '--max':
            values = numpy.max(data, axis=1)

        for val in values:
            print(val)

if __name__ == '__main__':
    main()
Testing:
!python readings.py --max data/small-01.csv
While this works, there are a few problems:
main is getting large
This improved structure resolves the three issues above:
import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
        'Action is not one of --min, --mean, or --max: ' + action
    for filename in filenames:
        process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()
You should try this function with a few inputs to see how it works under different circumstances.
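As an alternative to the if/elif chain in process, the choice of statistic can be looked up in a dictionary. The sketch below illustrates that idea under stated assumptions, and is not the lesson's method: it uses plain lists and the standard statistics module rather than numpy so that it runs on its own, and the names ACTIONS and summarize are hypothetical.

```python
import statistics

# Map each flag to the function that summarizes one row of readings
ACTIONS = {
    '--min': min,
    '--mean': statistics.mean,
    '--max': max,
}

def summarize(rows, action):
    '''Apply the chosen summary to each row (one row per patient).'''
    if action not in ACTIONS:
        raise ValueError(
            'Action is not one of --min, --mean, or --max: ' + action)
    summary = ACTIONS[action]
    return [summary(row) for row in rows]

rows = [[0.0, 1.0, 2.0], [3.0, 4.0, 8.0]]
print(summarize(rows, '--max'))
print(summarize(rows, '--mean'))
```

With this layout, adding a new statistic means adding one dictionary entry instead of another elif branch.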
To add the option to accept files via standard input (stdin):
import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
        'Action is not one of --min, --mean, or --max: ' + action
    if len(filenames) == 0:
        # no filenames given, so read the data from standard input
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()
To try this, we need to use a different format on the command line:
!python readings.py --mean < data/small-01.csv
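Passing sys.stdin to process works because numpy.loadtxt accepts either a filename or an open file object. The sketch below shows the same idea using only the standard library, with io.StringIO standing in for standard input so that it can run without redirecting a file. read_rows is a hypothetical helper and is not part of readings.py.

```python
import csv
import io

def read_rows(source):
    '''Read comma-separated numbers from a filename or an open text stream.'''
    if isinstance(source, str):
        # a string is treated as a path to open
        with open(source, newline='') as handle:
            return [[float(value) for value in row]
                    for row in csv.reader(handle)]
    # anything else is assumed to be an already-open stream, such as sys.stdin
    return [[float(value) for value in row] for row in csv.reader(source)]

fake_stdin = io.StringIO("0,1,2\n3,4,5\n")
rows = read_rows(fake_stdin)
print(rows)
```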
Challenge-readings_07.py
Rewrite readings.py so that it uses short flags (-n, -m, and -x) instead of --min, --mean, and --max, respectively, so you'll be able to use it as !python readings.py -x data/small-01.csv.
Is the code easier to read?
Is the program easier to understand?
readings_08.py
Finally, we can include a help message that is printed when no arguments are given:
import sys
import numpy

def main():
    script = sys.argv[0]
    if len(sys.argv) == 1:  # no arguments, so print help message
        print('''Usage: python readings_08.py action filenames
              action must be one of --min --mean --max
              if filenames is blank, input is taken from stdin;
              otherwise, each filename in the list of arguments
              is processed in turn''')
        return

    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
        'Action is not one of --min, --mean, or --max: ' + action
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()
We can run it with no arguments to see the help message:
!python readings.py
We can set mean as the default action if none is specified:
import sys
import numpy

def main():
    script = sys.argv[0]
    action = sys.argv[1]
    if action not in ['--min', '--mean', '--max']:  # if no action given
        action = '--mean'  # set a default action, that being mean
        filenames = sys.argv[1:]  # start the filenames one place earlier in the argv list
    else:
        filenames = sys.argv[2:]

    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)

def process(filename, action):
    data = numpy.loadtxt(filename, delimiter=',')

    if action == '--min':
        values = numpy.min(data, axis=1)
    elif action == '--mean':
        values = numpy.mean(data, axis=1)
    elif action == '--max':
        values = numpy.max(data, axis=1)

    for val in values:
        print(val)

if __name__ == '__main__':
    main()
To try it:
!python readings.py data/small-01.csv
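The default-handling logic can also be separated into a small function that takes the argument list as a parameter, which makes it possible to check the behaviour without running the script. The sketch below is one possible arrangement. split_arguments is a hypothetical name and, unlike the script above, it also handles the case where no arguments are given at all.

```python
ACTIONS = ['--min', '--mean', '--max']

def split_arguments(argv, default='--mean'):
    '''Return (action, filenames) from a list shaped like sys.argv.'''
    arguments = argv[1:]  # drop the script name
    if arguments and arguments[0] in ACTIONS:
        return arguments[0], arguments[1:]
    # no recognized action given, so fall back to the default
    return default, arguments

print(split_arguments(['readings.py', 'data/small-01.csv']))
print(split_arguments(['readings.py', '--max', 'data/small-01.csv']))
print(split_arguments(['readings.py']))
```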
Challenge-check_arguments
Write a program called check_arguments.py that prints usage then exits the program if no arguments are provided.
(Hint: You can use sys.exit() to exit the program.)
Example usage:
!python check_arguments.py
Output:
usage: python check_arguments.py filename.txt
Another usage:
!python check_arguments.py filename.txt
Output:
Thanks for specifying arguments!
In this class, we discussed the process of debugging, test-driven development, and command line programs. This rounds out our course content covering basic Python programming structures and how to apply them in creating robust, reusable code.
Please view the links on your class website for more information about expanding your Python skills.