Getting Started

You are expected to be very familiar with Git and to know the basics of Python.

We do not expect you to have experience with Jupyter or writing educational resources.

Note: this guide was not written for JupyterLab, which was still in beta at the time of writing.

As you run this notebook, I want you to pay close attention to state issues. Whether the code in a cell runs successfully depends on the order in which you ran the cells before it. We want to mitigate this as much as possible, so, unlike this notebook, I encourage you to avoid inter-cell dependencies in your own. We are currently working on functions to improve stability.
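
One way to reduce inter-cell dependencies is to pass a computation everything it needs instead of reading globals set by other cells. The following is only a sketch of that idea; the function name mean and the sample data are hypothetical and do not come from this notebook.

```python
# Fragile pattern (shown as a comment): relies on globals that another cell must define first.
# def report():
#     return total / count

# More robust: the function receives everything it needs as arguments.
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)

scores = [70, 85, 90]   # hypothetical data, defined in the same cell that uses it
average = mean(scores)
print(average)
```

A cell written this way gives the same result no matter which other cells ran before it.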

Basics

Hotkeys are listed under the menu item Help > Keyboard Shortcuts.

In [ ]:
# Put a ? in front of a name to show its documentation
?who
In [ ]:
!ls # Putting ! in front of a command tells Jupyter to run it as a shell command
In [ ]:
# You will probably get permission errors when installing on the hub; use --user
!pip3 install plotly --user

Cell Inputs and Outputs

In [35]:
2 + 4
Out[35]:
6
In [36]:
print(2 + 4) 
6

Notice that the print function does not produce a cell output: there is no Out[36], only printed text. As shown below, a trailing ; suppresses a cell's output, but only when there is an Out[] to suppress, so it has no effect on printed text. This may seem minor, but it can have important ramifications, for example when you try to capture the output of JavaScript visualizations.

In [37]:
print(3 + 7);
10
In [38]:
3 + 7; 
In [39]:
_ + 5 # _ holds the most recent output, which is 6 (from Out[35]), so this gives 11.
# It does not use the result of 3 + 7; because that cell produced no output.
Out[39]:
11
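
The reason print leaves no Out[] can be seen in plain Python: print writes text and returns None, so there is no value for the notebook to record. A small sketch, with variable names chosen only for illustration:

```python
value = 2 + 4            # an expression's value; as the last line of a cell it would become Out[n]
returned = print(value)  # print writes text to the display and returns None

# Because print returned None, there is nothing to store as a cell output.
print(returned is None)
```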

You can access cell outputs through the _n and Out[n] variables:

In [7]:
_4 == Out[4] 
# If you get a "name '_4' is not defined" error
# go back to the cell labelled In[4] and see if it has an output. 
# How can you fix this issue?
# What happens if you rerun a cell you already ran?
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-4ab9fa893971> in <module>()
----> 1 _4 == Out[4]
      2 # If you get a "name '_4' is not defined" error
      3 # go back to the cell labelled In[4] and see if it has an output. How can you fix this issue?
      4 # What happens if you rerun a cell you already ran?

NameError: name '_4' is not defined

Be careful when hardcoding inputs/outputs into your notebooks because the indexing depends on the order in which you execute the cells.
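
A simple way to avoid hardcoded Out[n] references is to bind each result to a name in the cell that computes it. This is a minimal sketch; the names total and doubled are hypothetical.

```python
# Instead of relying on Out[4] or _4, give the result a name where it is computed.
total = sum([2, 4, 6])   # hypothetical computation

# Later cells use the name, which does not change when cells are re-run in another order.
doubled = total * 2
print(doubled)
```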

In [ ]:
Out

There are shorthands for the three most recent outputs:

In [ ]:
print('most recent output:', _)
print('second most recent output:', __)
print('third most recent output:', ___)

Similarly you can access the cell inputs:

In [ ]:
In[1]
In [ ]:
print('previous input:', _i)
print('input two cells back:', _ii)
print('input three cells back:', _iii)
In [ ]:
# Show the history of inputs
%history

Data

You can use the %load magic to pull code from a file or URL into a cell.

Pandas is important for data manipulation in Python; check out this pandas guide.

Read remote files from a URL

In [ ]:
# This code will not run as written because foo.xlsx and bar.csv are placeholders, not valid URLs.
# You can try replacing foo.xlsx with https://education.alberta.ca/media/3680582/diploma-multiyear-auth-list-annual.xlsx
import pandas as pd

excelUrl = 'foo.xlsx'
csvUrl = 'bar.csv'

pd.read_excel(excelUrl) # pass the variable, not the literal string 'url'
pd.read_csv(csvUrl)

Advanced File Reading Example: CSV into a pandas DataFrame

This reads CSV files whose URLs differ only by year, combines them into a single pandas DataFrame, and keeps only specific columns. All of the data used here is open data.

In [ ]:
import pandas as pd

startYear = 1995
endYear   = 1997  # Exclusive: range() stops before endYear, so 1995 and 1996 are read but not 1997.

frames = []  # one DataFrame per year
for year in range(startYear, endYear):
    file = 'https://s3.ca-central-1.amazonaws.com/open-data-ro/NSERC/NSERC_GRT_FYR' + str(year) + '_AWARD.csv.gz'
    frames.append(pd.read_csv(file,
                              compression='gzip', # .gz file extension because it is compressed to speed up the transfer
                              usecols=[1, 2, 3, 4, 5, 7, 9, 11, 12, 13, 17, 28], # only read these columns
                              encoding='latin-1'
                             )
                 )
    print(year) # Print each year as it is read so you can see the progress.

# DataFrame.append was removed in pandas 2.0, so combine the yearly frames with pd.concat.
df = pd.concat(frames)
In [ ]:
df.head() # Show the contents at the head of the dataframe.
#df # show the entire dataframe, commented out to keep things tidy

If you are unfamiliar with data science in general, you face a steep learning curve that a short tutorial cannot cover. Feel free to reach out on the developer Slack channel if you need help. Google is your friend.

Tricky Python

Equality vs Identity

"==" for equality, "is" for identity!

In [13]:
a = 'hello world'
b = 'hello world'
print('a is b,', a is b)
print('a == b,', a == b)
a is b, False
a == b, True
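
The distinction matters most with mutable objects. The sketch below uses lists, where two names for the same object share every change; the variable names are illustrative only.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list with the same contents
c = a           # a second name for the same list object

print(a == b, a is b)   # equal contents, but two different objects
print(a == c, a is c)   # equal contents, and the same object

c.append(4)             # mutating through c is visible through a, but not through b
print(a, b)

missing = None
print(missing is None)  # `is` is the conventional way to test for None
```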

Identity does not imply equality.

In [14]:
a = float('nan')
print('a is a,', a is a)
print('a == a,', a == a)
a is a, True
a == a, False
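
Because NaN never compares equal to anything, including itself, an equality test cannot find it. A short sketch of the usual alternative, math.isnan from the standard library, with made-up sample values:

```python
import math

nan = float('nan')
values = [1.0, nan, 3.0]   # hypothetical data containing a NaN

print(nan == nan)          # NaN is not equal to itself
print(math.isnan(nan))     # math.isnan is the reliable test

# Filter NaN values out with math.isnan rather than with ==.
clean = [v for v in values if not math.isnan(v)]
print(clean)
```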

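CPython keeps a cache of small integer objects (integers from -5 to 256; see the doc), so identity checks on them happen to succeed. This is an implementation detail, and results like the ones below can differ between Python versions, so compare numbers with ==, not is.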

In [11]:
print('256 is 257-1', 256 is 257-1)
print('257 is 258-1', 257 is 258 - 1)
print('-5 is -6+1', -5 is -6+1)
print('-7 is -6-1', -7 is -6-1)
256 is 257-1 True
257 is 258-1 False
-5 is -6+1 True
-7 is -6-1 False

Logical Operators

a or b is equivalent to: a if a else b

a and b is equivalent to: b if a else a

In [18]:
result = (3 or 5) * (7 and 9)
print('3 * 9 =', result)
3 * 9 = 27
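
Since or and and return one of their operands rather than a strict True or False, they are often used for default values. The sketch below shows that idiom and its main pitfall; the function names are hypothetical.

```python
def display_name(name):
    # `or` returns its first truthy operand, so an empty string falls back to the default.
    return name or 'anonymous'

def first_item(items):
    # `and` returns `items` itself when it is empty (falsy); otherwise it evaluates items[0].
    return items and items[0]

print(display_name('Ada'))
print(display_name(''))
print(first_item([7, 8]))
print(first_item([]))

# Pitfall: 0 is falsy, so `or` replaces a legitimate zero with the default.
count = 0
print(count or 10)                          # the zero is lost
print(count if count is not None else 10)   # the zero is kept
```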

bool is a subclass of int

In [19]:
print('isinstance(True, int):', isinstance(True, int))
print('True + True:', True + True)
print('3*True + True:', 3*True + True)
print('3*True - False:', 3*True - False)
isinstance(True, int): True
True + True: 2
3*True + True: 4
3*True - False: 3
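
One practical use of this property is counting how many items satisfy a condition. A minimal sketch with made-up data:

```python
readings = [3, 8, 1, 9, 4]  # hypothetical data

# Each comparison yields True or False; sum() adds them as 1 and 0.
above_three = sum(r > 3 for r in readings)
print(above_three)

# The same property lets a bool index a two-item sequence (False -> 0, True -> 1).
label = ('odd', 'even')[readings[1] % 2 == 0]
print(label)
```

Indexing with a bool is compact but easy to misread, so a conditional expression is often the clearer choice.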

Modifying a List in a Loop

In [26]:
a = [1, 2, 3, 4, 5]
for i in a:
    if not i % 2:
        a.remove(i)
print(a)
[1, 3, 5]
In [27]:
b = [2, 4, 5, 6]
for i in b:
    if not i % 2:
        b.remove(i)
print(b)
[4, 5]

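Removing an item shifts every later item one position to the left while the loop's internal index keeps advancing, so the item that follows each removed one is skipped. Printing the index makes this visible: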

In [22]:
b = [2, 4, 5, 6]
for index, item in enumerate(b):
    print(index, item)
    if not item % 2:
        b.remove(item)
print(b)
0 2
1 5
2 6
[4, 5]
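
Two common ways to avoid the skipped items are to build a new list or to loop over a copy. Both are sketched here on the same data as above:

```python
b = [2, 4, 5, 6]

# Option 1: build a new list that keeps only the odd items.
odds = [item for item in b if item % 2]
print(odds)

# Option 2: iterate over a copy (c[:]) while removing from the original.
c = [2, 4, 5, 6]
for item in c[:]:
    if not item % 2:
        c.remove(item)
print(c)
```

The list comprehension is usually preferable: it makes a single pass, whereas each remove call has to search the list again.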

Troubleshooting List Slicing

Get an index error as expected:

In [28]:
my_list = [1, 2, 3, 4, 5]
print(my_list[5])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-28-2f6b582502c3> in <module>()
      1 my_list = [1, 2, 3, 4, 5]
----> 2 print(my_list[5])

IndexError: list index out of range

A slice that starts out of range raises no error and silently returns an empty list, which can be hard to troubleshoot:

In [30]:
my_list = [1, 2, 3, 4, 5]
print(my_list[5:])
[]
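
Slices clamp out-of-range bounds instead of raising. If a silent empty result would hide a bug, an explicit check can restore the error. The helper tail below is hypothetical and only a sketch of that check.

```python
my_list = [1, 2, 3, 4, 5]

print(my_list[5:])      # start past the end: empty list
print(my_list[3:100])   # stop past the end: clamped to the last item
print(my_list[-100:2])  # start before the beginning: clamped to index 0

def tail(items, start):
    """Return items[start:], raising instead of silently returning an empty list."""
    if start >= len(items):
        raise IndexError(f'start {start} is out of range for length {len(items)}')
    return items[start:]

print(tail(my_list, 3))
```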

*args vs **kwargs

Both let a function accept an arbitrary number of arguments.

The difference is in what they collect: *args gathers extra positional arguments into a tuple, and **kwargs gathers extra keyword arguments into a dict:

In [31]:
def a_func(*args):
    print('type of args:', type(args))
    print('args contents:', args)
    print('1st argument:', args[0])

a_func(0, 1, 'a', 'b', 'c')
type of args: <class 'tuple'>
args contents: (0, 1, 'a', 'b', 'c')
1st argument: 0
In [32]:
def b_func(**kwargs):
    print('type of kwargs:', type(kwargs))
    print('kwargs contents: ', kwargs)
    print('value of argument a:', kwargs['a'])
    
b_func(a=1, b=2, c=3, d=4)
type of kwargs: <class 'dict'>
kwargs contents:  {'a': 1, 'b': 2, 'c': 3, 'd': 4}
value of argument a: 1
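
The two can be combined in one signature, and the same * and ** syntax works in reverse at the call site. A short sketch, with the function name describe chosen only for illustration:

```python
def describe(first, *args, **kwargs):
    """Return the three kinds of argument so they can be inspected."""
    return first, args, kwargs

# Extra positional arguments land in args; keyword arguments land in kwargs.
print(describe(0, 1, 'a', b=2, c=3))

# At the call site, * unpacks a sequence and ** unpacks a dict into arguments.
positional = [0, 1, 'a']
keyword = {'b': 2, 'c': 3}
print(describe(*positional, **keyword))
```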

Misc

You won't necessarily need this.

In [ ]:
%%javascript
document.getElementById('<name>').contentWindow
In [ ]:
# Read in an HTML file
from IPython.display import HTML
with open('index.html', 'r') as f:
    inputForm = f.read()
HTML(inputForm)
In [ ]:
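%%javascript
// Set a Python variable (query) from JavaScript so that Python can access it.
// Assumes a JS variable named typed is already defined.
IPython.notebook.kernel.execute('query = "AU='.concat(typed, '"'));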
In [ ]:
%%javascript
// Iterate through and execute the Jupyter cells in order, by index, not necessarily starting with the top cell.
// Use a textContent call to figure out which cell has the content you want to run.
for (let i = 1; i < 15; i++) {
    IPython.notebook.get_cell(i).execute();
}