The aim of this workshop is to teach you basic concepts, skills, and tools for working with data so that you can get more done in less time, and have more fun while doing it. We will show you how to use the programming language Python to replace many of the tasks you would normally do in spreadsheet software such as Excel, and also to do more advanced analysis.
This morning will be a bit of an intro to communicating with your computer via text rather than by pointing and clicking in a graphical user interface, which might be what you are used to. So just so that I get an idea, how many here have worked with a programming language before?
Before we get into practically doing things, I want to give some background to the idea of computing. Essentially, computing is about humans communicating with the computer to modulate flows of current in the hardware, in order to get the computer to carry out advanced calculations that we are unable to efficiently compute ourselves. Early examples of human-computer communication were quite primitive and included actually disconnecting a wire and connecting it again in a different spot. Luckily, we are not doing this anymore. Instead, we have graphical user interfaces with menus and buttons, which is what you are commonly using on your laptop. These graphical interfaces can be thought of as a layer or shell around the internal components of your operating system, and they exist as a middleman, making it easier for us to express our thoughts and for computers to interpret them.
An example of such a program that I think many of you are familiar with is spreadsheet software such as Microsoft Excel and LibreOffice Calc. Here, all the functionality of the program is accessible via hierarchical menus, and clicking buttons sends instructions to the computer, which then responds and sends the results back to your screen. For instance, I can click a button to send the instruction of coloring this cell yellow, and the computer interprets my instructions and then displays the results on the screen; in this case, the cell is highlighted yellow.
Spreadsheet software is great for viewing and entering small data sets and creating simple visualizations fast. However, it can be tricky to design publication-ready figures, create automatic reproducible analysis workflows, perform advanced calculations, and reliably clean data sets. Even when using a spreadsheet program to record data, it is often beneficial to have some basic programming skills to facilitate the analyses of those data.
Today, I will teach you about communicating with your computer via text, rather than graphical point and click. Typing instructions to the computer might at first seem counterintuitive: why do we need it when it is so easy to point and click with the mouse? Well, graphical user interfaces can be nice when you are new to something, but text-based interfaces are more powerful, faster, and actually also easier to use once you get comfortable with them.
We can compare it to learning a language. In the beginning, it's nice to look things up in a dictionary (or a menu in a graphical program), and slowly string together sentences one word at a time. But once we become more proficient in the language and know what we want to say, it is easier to say or type it directly, instead of having to look up every word in the dictionary first. By extension, it would be even faster to speak or just think of what you want to do and have it executed by the computer; this is what speech- and brain-computer interfaces are concerned with.
Text interfaces are also less resource intensive than their graphical counterparts and easier to develop programs for, since you don't have to code the graphical components. Very importantly, it is easy to automate and repeat any task once you have all the instructions written down. This facilitates reproducibility of analysis, not only between studies from different labs, but also between researchers in the same lab: compare being shown how to perform a certain analysis in spreadsheet software, where the instruction will essentially be "first you click here, then here, then here...", with being handed the same workflow written down in several lines of code which you can analyse and understand at your own pace.
Since text is the easiest way for people who are fluent in computer languages to interact with computers, many powerful programs are written without a graphical user interface (which makes it faster to create these programs), and to use these programs you often need to know how to use a text interface. For example, many of the best data analysis and machine learning packages are written in Python or R, and you need to know these languages to use them. Even if the program or package you want to use is not written in Python, much of the knowledge you gain from understanding one programming language can be transferred to others. In addition, most powerful computers that you can log into remotely might only give you a text interface to work with, and there is no way to launch a graphical user interface.
To communicate with the computer via Python, we first need to open the Python interpreter. This will translate our typed commands into machine language so that the computer can understand them. On Windows, open the
Anaconda Prompt, on macOS open
terminal.app, and on Linux open whichever terminal you prefer (e.g.
konsole). Then type in
python and hit Enter. You should see something like this:
Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:39:56)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
There should be a blinking cursor after the
>>>, which is prompting you to enter a command (for this reason, the interpreter can also be referred to as a "prompt"). Now let's speak Python!
While English and other spoken languages are referred to as "natural" languages, computer languages are said to be "formal" languages. You might think it is quite tricky to learn formal languages, but it is actually not! You already know one: mathematics, which is in fact written largely the same way in Python as you would write it by hand.
4 + 5
The Python interpreter returns the result directly under our input and prompts us to enter new instructions. This is another strength of using Python for data analysis: some programming languages require an additional step where the typed instructions are compiled into machine language and saved as a separate file that the computer can run. Although compiling code often results in faster execution time, Python allows us to very quickly experiment and test new code, which is where most of the time is spent when doing exploratory data analysis.
The sparseness in the input
4 + 5 is much more efficient than typing "Hello computer, could you please add 4 and 5 for me?". Formal computer languages also avoid the ambiguity present in natural languages such as English. You can think of Python as a combination of math and a formal, succinct version of English. Since it is designed to reduce ambiguity, Python lacks the edge cases and special rules that can make English so difficult to learn, and there is almost always a logical reason for how the Python language is designed, not only a historical one.
The syntax for assigning a value to a variable is also similar to how this is written in math.
a = 4
a * 2
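Building on the two lines above, here is a small sketch of a few more expressions that read much like hand-written math. The specific numbers and variable names are only illustrations and are not part of the workshop material.

```python
# Arithmetic in Python follows the usual order of operations from math.
a = 4
total = a + 5 * 2        # multiplication is evaluated before addition
grouped = (a + 5) * 2    # parentheses change the order, just as on paper
squared = a ** 2         # ** is the exponent operator
quotient = 9 / 2         # / gives a decimal (floating point) result

print(total, grouped, squared, quotient)
```

A variable keeps its value until you assign something new to it, so `a` can be reused on every line.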
In my experience, learning programming really is similar to learning a foreign language - you will often learn the most from just trying to do something and receiving feedback (from the computer or another person)! When there is something you can't wrap your head around, or if you are actively trying to find a new way of expressing a thought, then look it up, just as you would with a natural language.
Although the Python interpreter is very powerful, it is commonly bundled with other useful tools in interfaces specifically designed for exploratory data analysis. One such interface is the Jupyter Notebook, which is what we will be using today. Open it by running
jupyter lab from the terminal, or by finding it in the
Anaconda Navigator from your operating system menu. This should output some text in the terminal and open a new tab in your default browser.
Jupyter originates from a project called IPython, an effort to make Python development more interactive. Since its inception, the scope of the project has expanded to include additional programming languages, such as Julia, Python, and R, so the name was changed to "Jupyter" as a reference to these core languages. Today, Jupyter supports many more languages, but we will be using it only for Python code. Specifically, we will be using the notebook from Jupyter, which allows us to easily take notes about our analysis and view plots within the same document where we code. This facilitates sharing and reproducibility of analyses, and the notebook interface is easily accessible through any web browser as well as exportable as a PDF or HTML page.
In the new browser tab, click the plus sign to the left and select to create a new notebook in the Python language (also
File --> New --> Notebook).
Initially the notebook has no name other than "Untitled". If you click on "Untitled" you will be given the option of changing the name to whatever you want.
The notebook is divided into cells. Initially there will be a single input cell. You can type Python code directly into the cell, just as we did before. To run the code, press Shift + Enter or click the play button in the toolbar.
4 + 5
By default, the code in the current cell is interpreted and the next existing cell is selected or a new empty one is created (you can press Ctrl + Enter to stay on the current cell). You can split the code across several lines as needed.
a = 4
a * 2
The little counter on the left of each cell keeps track of the order in which the cells were executed, and changes to an
* while the computer is processing the computation (only noticeable for computations that take a longer time).
The notebook is saved automatically, but it can also be done manually from the toolbar or by hitting Ctrl + S. Both the input and the output cells are saved, so any plots that you make will be present in the notebook next time you open it up without the need to rerun any code. This allows you to create complete documents with both your code and the output of the code in a single place instead of spread across text files for your code and separate image files for each of your graphs.
You can also change the cell type from Python code to Markdown using the Cell | Cell Type option. Markdown is a simple formatting system which allows you to create documentation for your code, again all within the same notebook structure. You might already be familiar with Markdown if you have typed comments in online forums or use a chat app like Slack or WhatsApp. A short example of the syntax:
# Heading level one

- A bullet point
- *Emphasis in italics*
- **Strong emphasis in bold**

This is a [link to learn more about markdown](https://guides.github.com/features/mastering-markdown/)
The Notebook itself is stored as a JSON file with an .ipynb extension. These are specially formatted text files, which can be exported and imported into another Jupyter system. This allows you to share your code, results, and documentation with others. You can also export the notebook to HTML, PDF, and many other formats to make sharing even easier! This is done via
File --> Export Notebook As...
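Because an .ipynb file is plain JSON, it can be inspected with Python's built-in json module. The sketch below writes a hand-made, minimal notebook to a temporary file and reads it back. The file name and cell contents are made up for illustration, and real notebooks saved by Jupyter carry more metadata than this.

```python
import json
import os
import tempfile

# A hand-written, minimal notebook: one Markdown cell and one code cell.
# This is only a stand-in; Jupyter itself stores more metadata per notebook.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# My analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["4 + 5"],
        },
    ],
}

# Save it with the .ipynb extension, then read it back like any JSON file.
path = os.path.join(tempfile.mkdtemp(), "example.ipynb")
with open(path, "w") as handle:
    json.dump(notebook, handle)

with open(path) as handle:
    loaded = json.load(handle)

# Each cell records its type and its source text.
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```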
The data analysis environment provided by the Jupyter Notebook is very powerful and facilitates reproducible analysis. It is possible to write an entire paper in this environment, and it is very handy for reports, such as progress updates since you can share your comments on the analysis together with the analysis itself.
It is also possible to open up other document types in the JupyterLab interface, e.g. text documents and terminals. These can be placed side by side with the notebook through drag and drop, and all running programs can be viewed in the "Running" tab to the left. To search among all available commands for the notebook, the "Commands" tab can be used. Existing documents can be opened from the "Files" tab.
To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analysis tools in a third, and so on. Programming languages such as Python have so many tools and functions that they would not fit in a menu. Instead of clicking
File -> Open and choosing the file, you would type something similar to
file.open('<filename>') in a programming language. Don't worry if you forget the exact expression. It is often enough to just type the first few letters and then hit Tab to show the available options; more on that later.
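In Python itself, the typed counterpart of File -> Open is the built-in open() function. The following is a minimal sketch; the file name notes.txt and its content are made up, and the file is created first only so that the example runs on its own.

```python
import os
import tempfile

# Create a small text file first so the example is self-contained.
# The file name "notes.txt" is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w") as handle:
    handle.write("Hello from a text file\n")

# The typed equivalent of clicking File -> Open and choosing the file.
with open(path) as handle:
    content = handle.read()

print(content)
```

The with statement closes the file automatically once the indented block is finished.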
Since there are so many functions available in Python, it is unnecessary to include all of them with the default installation of the programming language (it would be as if your new phone came with every single app preinstalled). Instead, more advanced functionality is grouped into separate packages, which can be accessed by typing
import <package_name> in Python. You can think of this as that you are telling the program which menu items you want to activate (similar to how Excel hides the
Developer menu by default since most people rarely use it and you need to activate it in the settings if you want to access its functionality). The Anaconda Python distribution comes with many of the scientific Python packages preinstalled, but other packages need to be downloaded before they can be used, just like downloading an add-on to a browser or mobile phone.
Just like in spreadsheet software menus, there are lots of different tools within each Python package. For example, if I want to use numerical Python functions, I can import the numerical python module,
numpy. I can then access any function by writing
import numpy
numpy.mean([1, 2, 3, 4, 5])
When you start out using Python, you don't know what functions are available within each package. Luckily, in the Jupyter Notebook, you can type
numpy.Tab (that is numpy + period + tab-key) and a small menu will pop up that shows you all the available functions in that module. This is analogous to clicking a 'numpy-menu' and then going through the list of functions. As I mentioned earlier, there are plenty of available functions and it can be helpful to filter the menu by typing the initial letters of the function name.
To get more info on the function you want to use, you can type out the full name and then press Shift + Tab once to bring up a help dialogue and again to expand that dialogue. We can see that to use this function, we need to supply it with the argument
a, which should be 'array-like'. An array is essentially just a sequence of numbers. We just saw that one way of doing this was to enclose numbers in square brackets,
[], which in Python means that these numbers are in a list, something you will hear more about later. Instead of manually activating the menu every time, JupyterLab offers a tool called the "Inspector" which displays help information automatically. I find this very useful and always have it open next to my Notebook. More help is available via the "Help" menu, which links to useful online resources (for example
Help --> Numpy Reference).
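The Tab menu and the Shift + Tab dialogue also have plain Python counterparts that work outside the notebook. As a rough sketch (the filter on the letters "me" is just an example), dir() lists the names in a package and every function carries its help text in a docstring:

```python
import numpy

# dir() lists the names defined in a package, similar to the Tab menu.
# Here the list is filtered to names starting with "me".
names = [name for name in dir(numpy) if name.startswith("me")]
print(names)

# The first line of the docstring is a short summary of the function,
# similar to what the help dialogue shows.
summary = numpy.mean.__doc__.strip().splitlines()[0]
print(summary)
```

Calling help(numpy.mean) prints the full help text, including the description of the argument a.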
When you start getting familiar with typing function names, you will notice that this is often faster than looking for functions in menus. It is similar to getting fluent in a language. I know what English words I want to type right now, and it is much easier for me to type them out, than to select each one from a menu. However, sometimes I forget and it is useful to get some hints as described above.
It is common to give packages nicknames, so that it is faster to type. This is not necessary, but can save some work in long files and make code less verbose so that it is easier to read:
import numpy as np
np.mean([1, 2, 3, 4, 5])
The Python package that is most commonly used to work with spreadsheet-like data is called
pandas. The name is derived from "panel data", an econometrics term for multidimensional structured data sets. Data are easily loaded into
pandas from .csv or other spreadsheet formats. The format
pandas uses to store this data is called a dataframe.
I do not have any good data set lying around, so I will load a public dataset from the web (you can view the data by pasting the url into your browser). This sample data set describes the length and width of sepals and petals for three species of iris flowers. When you open a file in a graphical spreadsheet program, it will immediately display the content in the window. Likewise, Python will display the information of the data set when you read it in.
import pandas as pd
pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
150 rows × 5 columns
However, to do useful and interesting things to data, we need to assign this value to a variable name so that it is easy to access later. Let's save our data into an object called iris:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
150 rows × 5 columns
head() can be used to display only the first few rows of the data frame.
And a single column can be selected with the following syntax
0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64
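The two steps above can be sketched as follows. To keep the example runnable without a download, it builds a small stand-in data frame with the same column names as the iris data; the six rows are illustrative only, and in the workshop notebook you would use the iris object loaded from the web instead.

```python
import pandas as pd

# Stand-in for the iris data frame so the example runs offline.
# The rows only mimic the layout of the real data set.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

first_rows = iris.head()              # the first five rows of the data frame
sepal_length = iris["sepal_length"]   # a single column, selected by name

print(first_rows)
print(sepal_length.head())
```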
We could calculate the mean of all columns easily.
sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64
And even divide it into groups depending on which species of iris flower the observations belong to.
This technique is often referred to as "split-apply-combine". The
groupby() method split the observations into groups,
mean() applied an operation to each group, and the results were automatically combined into the table that we can see here. We will learn much more about this in a later lecture.
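As a minimal sketch of the column means and the split-apply-combine step, assuming a data frame shaped like iris: the values below are made up (two rows per species) so that the example runs without a download. The numeric_only=True argument is used here because recent pandas versions raise an error when asked for the mean of a text column such as species.

```python
import pandas as pd

# Small stand-in for the iris data frame (made-up values, two per species).
iris = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 6.0, 6.4, 6.5, 6.9],
    "petal_length": [1.4, 1.6, 4.0, 4.4, 5.5, 5.9],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

# Mean of every numeric column; numeric_only=True skips the species column.
column_means = iris.mean(numeric_only=True)

# Split by species, apply mean() to each group, combine into one table.
species_means = iris.groupby("species").mean()

print(column_means)
print(species_means)
```

The result of groupby() followed by mean() is itself a data frame, with one row per species and one column per measurement.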
Before creating any plots, we will set an option so that all the plots appear in the notebook, and not in a separate figure.
To visualize the results with plots, we will use a Python package dedicated to statistical visualization,
seaborn. To count the observations in each species, we could use
import seaborn as sns
sns.countplot(x='species', data=iris)
<matplotlib.axes._subplots.AxesSubplot at 0x7f096d92d898>
We can see that there are 50 observations recorded for each species of iris. More interesting could be to compare the sepal lengths between the plants to see if there are differences depending on species. We could use
swarmplot() for this, which plots every single observation as a dot.
sns.swarmplot(x='species', y='sepal_length', data=iris)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0962caa160>
Since the number of observations for each species was 50, there will be 50 dots for each species here. This plot corresponds with what we saw when looking at the mean values for each species earlier.
There is much more to learn about plotting, which we will do in a later lecture. One last example will illustrate the power of programmatic data analysis and how straightforward it can be to create very complex visualizations. A common exploratory visualization is to investigate the pairwise relationships between variables: are there measurements that are correlated with each other? This can be done with the pairplot() function.
<seaborn.axisgrid.PairGrid at 0x7f0962c89748>
This graph shows the histograms of the distribution of each variable on the diagonal. The same pairwise relationships between the columns in the data set are shown in scatter plots below and above the diagonal. It is easy to see that some variables certainly seem to depend on each other. For example, the petal width and petal length increase together, indicating that some flowers have bigger petals than others.
We can also make out some clusters or groups of points within each pairwise scatter plot. It would be interesting to know if this corresponded to some inherent structure of the data frame. Maybe it is observations from the different species that cluster together? To find this out, we can instruct
seaborn to adjust the color hue of the data points according to their species affiliation with a minor modification to the line of code above.
<seaborn.axisgrid.PairGrid at 0x7f09624e3860>
It certainly looks like observations from the same species are close together, and a lot of the variation in our data can be explained by which species the observation belongs to!
This has been an introduction to what data analysis looks like with Python in the Jupyter Notebook. We will get into details about all the steps of this workflow in the following lectures, and you can keep referring back to this lecture as a high level overview of the process.