Welcome to Introduction to Python from fredhutch.io! This course introduces you to Python by working through common tasks in data science: importing, manipulating, and visualizing data.
Python is a computer programming language widely used for a variety of applications. For more information about Python and ways to use it at Fred Hutch, please see the Python entry for the Fred Hutch Biomedical Data Science Wiki.
Before proceeding with these training materials, please ensure you have installed Python and Jupyter notebooks via Anaconda as described here. Please note you'll also need to install plotnine separately for the last module.
By the end of this first module, you should be able to:
Python is a commonly used programming language among researchers, and has a large community and set of tools available to support its use. As a result, there are many different ways to interact with Python, the choice of which depends on your specific need for coding. In this class, we'll be using Jupyter notebooks to write, run, and maintain a record of our work.
A Jupyter notebook is an interface operated in a web browser that allows inclusion of code, output (including graphics) and explanatory text all in the same document. In fact, these lesson materials are written in a Jupyter notebook. Jupyter notebooks can also be used as a method of communicating research methods, such as this notebook associated with a published manuscript from Rasi Subramaniam's lab at Fred Hutch.
You can access Jupyter notebooks through Anaconda, which is the software you used to install Python as well. Anaconda is a Python distribution that includes conda, a package manager that helps you install and update software.
Open the Anaconda Navigator software on your computer, then click on Jupyter Notebook (note that this is different than Jupyter Lab!).
You'll see your default web browser open a new tab. On a Mac, you may also see a Terminal window open; this window needs to stay open for Python to run, but we recommend you minimize it so it stays out of the way.
In the Jupyter notebook window in your web browser, note the URL at the top: it should start with something like http://localhost:8888/tree.
In the browser window, you should see folders like "Documents" and "Desktop."
This window represents a different way to interact with the files on your computer.
Although you're viewing these files in a web browser,
you're not necessarily working with files online.
This means that you can securely use Jupyter notebooks to work with sensitive data,
as long as those data are stored in a secure location.
We're going to create a project directory for the purposes of this course. You can think of a project as a discrete unit of work, such as a chapter of a thesis/dissertation, analysis for a manuscript, or a monthly report. We recommend organizing your code, data, and other associated files as projects, which allows you to keep all parts of an analysis together for easier access.
Create a new project for this class using the Jupyter notebook file browser:
Notebook files use the .ipynb extension. Jupyter notebooks have a handy "auto-save" feature, so you don't have to save manually all the time.
You may see messages appearing at the top of your notebook referencing "checkpoints," which means the auto-save feature is functioning.
Now that we have a new project and an empty notebook set up, we can begin orienting ourselves to how notebooks work to hold our text, code, and output.
The pale gray box you see at the top of your screen with In [ ] to the left is a cell. By default, each cell is created as a code cell. Because our notebook is Python 3, our code cells are able to execute Python code.
We can test this out by entering 3 + 4 into the cell, then holding down the Shift key and pressing Enter/Return. This executes (runs) the code in the cell and prints the output below, prefaced by Out[ ]. Executing the code this way also creates a new cell below the one you executed. If a new cell doesn't appear, you can add one using the + button in the toolbar at the top of the screen.
Cells can also be used to enter text using Markdown formatting. Change the type of your new cell by going to the dropdown box in the toolbar at the top of the window and changing "Code" to "Markdown." Add a subtitle in this cell by entering ## Operators, functions, and data types, then using Shift + Enter to execute the cell, which formats the text as large and bold. The link above includes more information about Markdown formatting, but we'll generally use only plain text and subtitles for this course.
Jupyter notebooks include many other features, which you can explore in the toolbar and dropdown menus at the top of the screen. Additional keyboard shortcuts are also available under "Help -> Keyboard Shortcuts".
Now that we have a notebook created, as well as a basic understanding of how to write and execute code, we can begin learning more about Python syntax: the rules that dictate how combinations of words and symbols are interpreted in a language.
# mathematical operator
4 + 5
9
The first line in the example above is a code comment. It is not interpreted by Python, but is a human-readable explanation of the code that follows. In Python, anything to the right of one or more # symbols represents a comment.
Syntax differs among languages. So far in this lesson, we've learned that Markdown interprets # as a way of formatting titles and subtitles, while in Python the same symbol represents a code comment.
As we proceed through these lessons, we recommend trying to type the example code so it appears as similar as possible to what is presented here. From the example above, you may now be wondering if the spaces on either side of the + are required. We can test this for ourselves:
4+5
9
The code above indicates that the spaces are not required, but are convention. Code convention and style don’t make or break the ability of your code to run, but they do affect whether other people can easily understand your code. We'll try to model appropriate code convention for this course, and you can read more about Python formatting recommendations here.
We can also use comparison operators to evaluate whether a given statement is true or false:
3 > 4
False
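As a small illustrative sketch (the values are arbitrary and not part of the course data), a few other operators that compare values behave the same way, and the result of each comparison is itself a data type, bool:

```python
# a few more comparisons; each evaluates to True or False
is_equal = (3 == 3)    # equal to
not_equal = (3 != 4)   # not equal to
at_most = (4 <= 4)     # less than or equal to

# the result of a comparison is logical (Boolean) data
result_type = type(3 > 4)

print(is_equal, not_equal, at_most)
print(result_type)
```

Note that == (two equals signs) tests for equality, while a single = assigns a value to a variable.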
In addition to logical data, Python possesses a few other built-in data types:
# data types in python
number = 42
pi_value = 3.1415
text = "Fred Hutch"
In the code above, we have assigned three new variables. Like in math, a variable is a word used to represent data, which can be a single value or more complex collections.
We can use the variables we just created to explore other built-in data types using functions. Functions are pre-defined sets of code that allow you to repeat particular actions:
# use function to identify data type
type(number)
int
In the code above, type is the function and number is the variable we assigned earlier. This code is asking what type of data number represents, and the output, int, stands for integer (whole number data).
type(pi_value)
float
float data represents numbers with decimal points.
type(text)
str
str represents character data, also referred to as strings. These data include anything that can be included inside quotation marks, including letters, numbers, punctuation, and even emoji.
We can also use functions to convert data among these types:
# convert float to integer
int(pi_value) # decimals removed
3
When we assigned (created) this variable, the decimal point instructed Python to interpret it as a float value. By using the function int, we can convert the value to integer. If we again inspect the type of pi_value, though:
type(pi_value)
float
We see the data type is still float. This is because we haven't altered the data type of the original variable, only the data type of the output printed to the screen.
We can change the data type of our original variable by reassigning back to the same name:
# reassign variable
pi_value = int(pi_value)
type(pi_value)
int
Now we see the type of the variable has changed to integer.
Similarly, we can convert integers to float:
# convert integer to float
float(number)
42.0
Although the numerical value hasn't changed, the presence of the decimal in the output indicates it is a float.
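The sketch below extends these conversions with a few more cases. The variable names and values are hypothetical, chosen only for illustration; round and str are other built-in functions not otherwise covered in this lesson.

```python
# hypothetical example value, chosen only for illustration
measurement = 3.99

as_int = int(measurement)        # int() drops the decimals; it does not round
as_rounded = round(measurement)  # round() goes to the nearest whole number
as_text = str(42)                # numbers can be converted to strings
from_text = int("42")            # strings of digits can be converted to integers

print(as_int, as_rounded, as_text, from_text)
```

The difference between int and round matters whenever the decimal portion of a value is meaningful to your analysis.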
Notebooks offer a handy shortcut for viewing the contents of a variable: executing only the variable name.
text
'Fred Hutch'
This approach will work well enough for us in this class, since we'll be using notebooks the whole time. If you are using code written by other people, or begin writing code in scripts (outside notebooks), you'll often see the print function used:
# print output to screen
print(text)
Fred Hutch
The data output are the same for each of the two previous code cells, though they look slightly different. It's useful to note that the notebook will only print the result of the last command executed in a code cell, so if there is other output in the cell you'd like to see, you may need to use the print function then as well.
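As a minimal sketch of this behavior, the cell below reuses the variables assigned earlier and calls print on each value, so that every result is shown rather than only the last one:

```python
number = 42
text = "Fred Hutch"

# in a notebook cell, only the last bare expression would be displayed,
# so print each value you want to see
print(number)
print(text)

# print accepts several values at once, separated by a space by default
print(text, number)
```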
If you would like to find help on a function, there's a function for that:
# find help on a function
help(print)
Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.
The help documentation may seem difficult to decipher right now, but it includes the following relevant information:
- Help on built-in function print in module builtins: is a title for the information below
- Prints the values to a stream, or to sys.stdout by default. is the explanation for the function
print and type are built-in functions that come with your installation of Python. There are many other functions available that allow you to perform common tasks while programming. You can also write your own functions (which we'll cover at the end of these materials), as well as load additional functions contained in packages written by other people (which we'll cover in our next class).
So far we've been working with variables containing a single value. It's often the case that we would like to use a variable to reference collections of values. Sequences are data structures that hold collections of elements. Lists are one type of sequence, and are defined in Python using square brackets:
# assign a list to a variable
numbers = [1, 2, 3]
numbers
[1, 2, 3]
Now that we've created a list, we can access different portions of it:
# access first element in list
numbers[0]
1
The number in the square brackets above indicates the position, or index, of the element we are accessing. Python begins indexing (counting) at 0, so the index positions in numbers are 0, 1, and 2.
If you need to find information about your variable, you can run ?numbers in a code cell and a help window will pop up containing information about things like the variable's type and length. Similar (but more extensive) information appears in an output cell if you run help(numbers); this additional detail may be useful to you as your programming skills develop.
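To make the indexing concrete, here is a short sketch using the same list. The built-in function len and negative indexes are not covered elsewhere in this lesson, so treat them as an optional extra:

```python
numbers = [1, 2, 3]

first = numbers[0]       # index 0 is the first element
last = numbers[2]        # index 2 is the third (last) element
also_last = numbers[-1]  # negative indexes count from the end
length = len(numbers)    # len() reports how many elements there are

print(first, last, also_last, length)
```

Because counting starts at 0, the last index is always one less than the length of the list.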
We can modify lists after they are created:
# add element (number) to end of list
numbers.append(4)
Note that nothing is printed as output unless we specifically ask for it:
print(numbers)
[1, 2, 3, 4]
append() is a method, or a function associated with a particular type of data. In this case, it is a method associated with lists that allows us to modify a list directly. You can learn more about this method by typing ?numbers.append in a new code cell, which presents a help window with the following information:
Docstring: L.append(object) -> None -- append object to end
Type: builtin_function_or_method
You can view other methods available for lists by typing ?numbers. in a new code cell and hitting the tab key. This provides a drop-down list that shows all methods available for the variable.
Although we've worked so far with numerical data (integers and floats), we can also create lists using string data:
# lists of string data
organs = ["lung", "breast", "prostate"]
organs
['lung', 'breast', 'prostate']
Challenge-numbers
What happens when you execute numbers[1] = 5?
Challenge-add
What online search term could you use to determine a method for adding multiple values to a list?
Challenge-remove
How do you remove items from a list?
Now that we have a basic understanding of lists, we can take a look at another type of sequence: tuples. Like a list, a tuple is an ordered sequence of elements, but tuples are created using parentheses:
# assign a tuple variable
a_tuple = (1, 2, 3)
Challenge-tuple
What happens when you execute a_tuple[2] = 5?
The output you see when running the code above is called a traceback: a multi-line block of information about an error. It includes information about what went wrong, and where in the code it happened (this is useful when dealing with multi-line code chunks!).
If you have code in your notebook that will cause an error to occur,
we recommend commenting out the code if you would like to retain the information, but not continue executing it with the rest of your functional code.
Lists and tuples differ in their mutability, or ability to be changed once created: lists are sequences that can be modified, tuples are sequences that cannot be modified. Python recognizes the difference between these data structures based on the symbols used to create them.
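The difference in mutability can be sketched side by side. This example uses try and except, which are not covered in this course; they are included here only so the error from the tuple is caught and the rest of the cell keeps running. The variable names are illustrative.

```python
a_list = [1, 2, 3]
a_tuple = (1, 2, 3)

# lists are mutable: assigning to an index changes the list in place
a_list[0] = 10

# tuples are immutable: the same assignment raises a TypeError,
# which is caught here so the rest of the cell keeps running
try:
    a_tuple[0] = 10
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print("TypeError:", error)

print(a_list)
print(a_tuple)
```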
We've worked with sequences so far that contain a single data type, but sequences can contain more than one data type:
# create tuple containing multiple data types
mix_tuple = ("lung", 200, "chromosome 1")
mix_tuple
('lung', 200, 'chromosome 1')
We can also create lists of mixed data types, though it's more common they represent a single data type.
We've been printing the contents of lists so far to the screen, but we often would like to access each element in a structure one at a time. We can accomplish this using a programming structure called a for loop. For loops exist in many programming languages, and can be used to repeat actions across a set of things. Here, we'll access elements in mix_tuple one at a time:
# for loop to access elements in tuple one at a time
for num in mix_tuple:
    print(num)
lung
200
chromosome 1
In the code above, num represents a variable used inside the for loop. For loops follow a predictable syntax: the first line includes for, in, and a closing :, and the code to repeat is indented on the lines below. We'll work through some more examples later, and you can read about for loop structures in Python here.
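As another brief sketch, a for loop can also build up a result as it goes. Here the loop visits each element of the organs list from earlier and records the length of each name; name_lengths is a hypothetical variable introduced only for this example.

```python
organs = ["lung", "breast", "prostate"]

# the indented line runs once for each element in the list
name_lengths = []
for organ in organs:
    name_lengths.append(len(organ))

print(name_lengths)
```

Choosing a descriptive loop variable (organ, for a list of organs) makes the loop easier to read.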
Now that we have a basic understanding of lists and tuples, we can explore another data structure: dictionaries. A dictionary holds elements that are paired (key and value). We can create an example containing two such pairs:
# create a dictionary
translation = {"one": 1, "two": 2}
translation["one"]
1
In the code above, the strings (e.g., "one") represent the keys, and the numbers (e.g., 1) are the values, so a single pair would be "one" and 1. This may seem like an odd way to store data, but it can be useful if you need to reference particular matched values repeatedly (for example, when reverse-complementing nucleotide sequences).
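The reverse-complement use case might look something like the sketch below. The sequence is made up for illustration, and the list method reverse and the string method join are not covered elsewhere in this lesson; this is one possible approach, not the only one.

```python
# each nucleotide (key) is paired with its complement (value)
complement = {"A": "T", "T": "A", "C": "G", "G": "C"}

sequence = "ATGC"  # hypothetical sequence, for illustration only

# look up the complement of each nucleotide, one at a time
complemented = []
for base in sequence:
    complemented.append(complement[base])

# reverse the order and join the pieces back into one string
complemented.reverse()
reverse_complement = "".join(complemented)

print(reverse_complement)
```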
It's useful to note that the values can include lists:
# create dictionary with a list as the value
list_value = {"yes": [1, 2, 3]}
list_value
{'yes': [1, 2, 3]}
However, a key cannot be a list. You can try this by attempting to execute list_key = {[1, 2, 3]: "nope"}. You'll see an error indicating that a list is "unhashable."
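A rough sketch of the contrast: an immutable tuple is accepted as a key, while a list is not. As an assumption beyond this lesson, try and except are used to catch the error so the cell finishes running; tuple_key is a hypothetical name.

```python
# a tuple can serve as a key because it cannot be modified
tuple_key = {(1, 2, 3): "works"}
print(tuple_key[(1, 2, 3)])

# a list cannot; the error is caught here so the cell keeps running
try:
    list_key = {[1, 2, 3]: "nope"}
    list_key_worked = True
except TypeError as error:
    list_key_worked = False
    print("TypeError:", error)
```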
While our translation variable represents keys that are strings and values that are integers, we can create a dictionary with those data types reversed:
# create dictionary with integer as key and string as value
rev = {1: "one", 2: "two"}
rev
{1: 'one', 2: 'two'}
We can use this variable to demonstrate an approach to add a new pair to the dictionary:
# add items to dictionaries by assigning new value to key
rev[3] = "three"
rev
{1: 'one', 2: 'two', 3: 'three'}
We can now combine this understanding of dictionaries with our earlier exploration of for loops, and examine two different approaches for printing the key/value pairs in a dictionary.
The first way accesses each element (pair) using the method dict.keys:
# access each element using dict.keys
for key in rev.keys():
    print(key, "->", rev[key])
1 -> one
2 -> two
3 -> three
Here, rev.keys() is used to list all keys in the dictionary, with the value printed from accessing each of the respective keys.
The second way accesses each pair using the method dict.items:
# access each element using dict.items
for key, value in rev.items():
    print(key, "->", value)
1 -> one
2 -> two
3 -> three
Because rev.items() accesses both the key and value of the pair (you can confirm this by printing rev.items()), you can print each directly from the respective variable internal to the for loop.
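To confirm what rev.items() contains, a quick sketch like the following prints the items and then converts them to a list for closer inspection (pairs is a hypothetical variable name):

```python
rev = {1: "one", 2: "two", 3: "three"}

# each item is a (key, value) tuple
print(rev.items())

# converting to a list makes the pairs easy to inspect or index
pairs = list(rev.items())
print(pairs[0])
```

Each pair is a tuple, which is why the for loop can unpack it into two variables, key and value.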
Challenge-applesauce
- Print only the values of the rev dictionary to the screen
- Reassign the second value (in the key value pair) so that it no longer reads “two” but instead “apple-sauce”
- Print the values of rev to the screen again to see if the value has changed
In this last section, we'll briefly overview how to write our own custom functions:
# define a chunk of code as function
def add_function(a, b):
    result = a + b
    return result
The first line of code defines the function with the name add_function() that accepts two items as input (a and b). The second line performs the action, and the last line determines what is output.
We can test the function by evaluating its use on data with an easily predictable outcome:
z = add_function(20, 22)
print(z)
42
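Once defined, a function can be reused with different inputs. As an illustrative sketch (the inputs are arbitrary), the same function works with floats and, because + also joins strings, with text:

```python
def add_function(a, b):
    result = a + b
    return result

# + adds numbers, including floats
float_sum = add_function(1.5, 2)

# with strings, + joins (concatenates) them instead
joined = add_function("Fred ", "Hutch")

print(float_sum)
print(joined)
```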
Challenge-function
Define a new function called subtract_function that subtracts d from c and test on numbers of your choice.
This first section introduced you to Python syntax and Jupyter notebooks. We've covered general data types, a few data structures, and two basic programming structures (for loops and defining functions). We won't be relying heavily on these data and programming structures for the rest of the course, but you should now have a good idea of some basic functionality of Python.
In the next session, we'll begin working with a large clinical cancer dataset, similar to other spreadsheet-style data you're likely to encounter in your own work.
When you are done working with Python in Jupyter notebooks, you should ensure the auto-save feature has captured your work (either by checking the time stamp in your Jupyter file browser, or by using the manual "Save" option in your notebook). If you look in the Jupyter file browser window, an active (running) notebook will have a green book icon with "Running" also listed. Closing the browser windows for Jupyter notebook and the file browser (and even shutting down Anaconda Navigator) will not shut down the Python processes running in the background (these are sometimes referred to as the Python kernel). To shut everything down, go to "File -> Close and Halt" for the notebook. When you revisit the Jupyter file browser window, the book icon next to the notebook should be gray. Now, closing the browser windows and Anaconda Navigator will complete the shutdown process. For more information on how to shut down Jupyter Notebooks, please see the official documentation.
If you are running Jupyter notebooks on Mac, you can stop the kernel running by closing the Terminal window. Similarly, you can launch Jupyter notebooks without Anaconda Navigator by opening a Terminal window, typing jupyter notebook, and hitting enter. The Jupyter file browser will open in your web browser just as it does when launched through Anaconda Navigator.
If you need to reopen your project after closing Jupyter notebooks, you'll need to reopen Anaconda Navigator and re-launch Jupyter notebooks. If you completely closed down your notebooks in your last session (or restarted your computer), your file browser will show each notebook with a gray book icon. Re-launch the notebook by clicking on the file name to work with the code in that notebook again. Although both your code and output will appear in the browser window, Python won't be able to "remember" any of this work. You'll need to re-execute all cells starting from the top of the notebook to be able to continue working in the same document. Also remember to only have one window open for each Jupyter notebook at any point in time; multiple open windows for the same notebook (e.g., "class1.ipynb") will also result in errors.
Answers to all challenge exercises are available here.