You are expected to have strong familiarity with Git and know the basics of Python.
We do not expect you to have experience with Jupyter or writing educational resources.
Note: this guide was not written for JupyterLab, which was still in beta at the time of writing.
As you run this notebook, pay close attention to state issues. Whether the code in a cell runs successfully depends on the order in which you run the cells. We want to mitigate this as much as possible, so, unlike this notebook, I encourage you to avoid inter-cell dependencies. We are currently working on functions to improve stability.
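To make the state problem concrete, here is a minimal sketch in plain Python. The names counter, bump_hidden_state, and bump are hypothetical, and each function call stands in for running a cell once.

```python
# Hypothetical illustration: each function call stands in for running a cell.
counter = 0

def bump_hidden_state():
    """Like a cell containing `counter += 1`: the result depends on how often it has run."""
    global counter
    counter += 1
    return counter

def bump(value):
    """Like a self-contained cell: the same input gives the same result on every run."""
    return value + 1

first_run = bump_hidden_state()
second_run = bump_hidden_state()  # "rerunning the cell" gives a different answer
print(first_run, second_run)      # 1 2
print(bump(0), bump(0))           # 1 1
```

Cells written in the second style can be rerun in any order without changing what they produce.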
Hotkeys are listed under the menu item Help > Keyboard Shortcuts.
?who # Put a ? in front of a word to access the Jupyter Docs
!ls # Putting ! in front of a command tells Jupyter that it is a bash command
# You will probably get permission errors when installing on the hub, use --user
!pip3 install plotly --user;
2 + 4
6
print(2 + 4)
6
Notice how the print function writes text but does not produce a cell output: there is no Out[] for that cell.
As seen below, you can use a ; to suppress output; however, this only works if there is a cell Out[] to suppress. This may seem simple, but it can have important ramifications, for example when trying to capture the output of JavaScript visualizations.
print(3 + 7);
10
3 + 7;
_ + 5 # takes the most recent cell output (6, from 2 + 4) and adds 5.
# It does not use the result of 3 + 7; because that cell produced no output.
11
You can access cell outputs through the _n and Out[n] variables:
_4 == Out[4]
# If you get a "name '_4' is not defined" error
# go back to the cell labelled In[4] and see if it has an output.
# How can you fix this issue?
# What happens if you rerun a cell you already ran?
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-4ab9fa893971> in <module>()
----> 1 _4 == Out[4]
      2 # If you get a "name '_4' is not defined" error
      3 # go back to the cell labelled In[4] and see if it has an output. How can you fix this issue?
      4 # What happens if you rerun a cell you already ran?

NameError: name '_4' is not defined
Be careful when hardcoding inputs/outputs into your notebooks because the indexing depends on the order in which you execute the cells.
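One way to sidestep the numbering problem is to bind results to names instead of reading them back through _n or Out[n]. A minimal sketch in plain Python; the variable names are hypothetical.

```python
# Bind each result to a name instead of relying on Out[n] numbering,
# which changes whenever cells are executed in a different order.
first_sum = 2 + 4
second_sum = first_sum + 5
print(second_sum)  # 11
```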
Out
There are shorthands for the previous cells' output:
print("previous cell's output:", _)
print('the one before the previous cell:', __)
print('the output three cells back:', ___)
Similarly you can access the cell inputs:
In[1]
print("previous cell's input:", _i)
print('the one before the previous cell:', _ii)
print('the input three cells back:', _iii)
%history # history of inputs
You can use the %load magic to pull code, such as a block of common imports, from a file or URL into a cell.
Pandas is important for data manipulation in Python; check out this pandas guide.
import pandas as pd
# This code will not actually run because foo.xlsx and bar.csv are placeholders, not valid URLs.
# You can try replacing foo.xlsx with https://education.alberta.ca/media/3680582/diploma-multiyear-auth-list-annual.xlsx
excelUrl = 'foo.xlsx'
csvUrl = 'bar.csv'
pd.read_excel(excelUrl) # pass the variable, not the literal string 'url'
pd.read_csv(csvUrl)
This reads CSV files whose URLs differ by year, combines them into one pandas DataFrame, and keeps only specific columns. All data here is open data.
import pandas as pd
startYear = 1995
endYear = 1997 # The last year is not included, so if it was 2017 it means we include the 2016 collection but not 2017.
frames = [] # one DataFrame per year, combined after the loop
for year in range(startYear, endYear):
    file = 'https://s3.ca-central-1.amazonaws.com/open-data-ro/NSERC/NSERC_GRT_FYR' + str(year) + '_AWARD.csv.gz'
    frames.append(pd.read_csv(file,
                              compression='gzip', # .gz file extension because it is compressed to speed up the transfer
                              usecols=[1, 2, 3, 4, 5, 7, 9, 11, 12, 13, 17, 28], # only read these columns
                              encoding='latin-1'
                              )
                  )
    print(year) # Print each year as it reads it so you can see the progress.
df = pd.concat(frames) # DataFrame.append was removed in pandas 2.0, so concatenate instead
df.head() # Show the contents at the head of the dataframe.
#df # show the entire dataframe, commented out to keep things tidy
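The exclusive end year is easy to get wrong, so it can help to build and inspect the list of URLs before downloading anything. A small sketch using only the standard library; the helper name award_urls and the constant BASE_URL are made up for illustration, and the URL pattern is copied from the loop above.

```python
# URL pattern copied from the download loop; {year} is filled in per file.
BASE_URL = 'https://s3.ca-central-1.amazonaws.com/open-data-ro/NSERC/NSERC_GRT_FYR{year}_AWARD.csv.gz'

def award_urls(start_year, end_year):
    """Return one URL per year from start_year up to, but not including, end_year."""
    return [BASE_URL.format(year=year) for year in range(start_year, end_year)]

urls = award_urls(1995, 1997)
for url in urls:
    print(url)  # prints the 1995 and 1996 URLs, not 1997
```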
If you are unfamiliar with data science in general, you face a steep learning curve that cannot be covered in a simple tutorial. Feel free to reach out on the developer Slack channel if you need help. Google is your friend.
"==" for equality, "is" for identity!
a = 'hello world'
b = 'hello world'
print('a is b,', a is b)
print('a == b,', a == b)
a is b, False
a == b, True
Identity does not imply equality.
a = float('nan')
print('a is a,', a is a)
print('a == a,', a == a)
a is a, True
a == a, False
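Because NaN never compares equal to itself, an equality test cannot detect it. A minimal sketch of the usual workaround, using math.isnan from the standard library; the variable names are illustrative.

```python
import math

value = float('nan')

equal_to_itself = (value == value)  # NaN is never equal to anything, itself included
detected = math.isnan(value)        # the reliable way to test for NaN

print(equal_to_itself)  # False
print(detected)         # True
```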
CPython keeps a cache of small integer objects (integers from -5 to 256, see the doc), which is why identity checks happen to succeed inside that range and fail outside it.
print('256 is 257-1', 256 is 257-1)
print('257 is 258-1', 257 is 258 - 1)
print('-5 is -6+1', -5 is -6+1)
print('-7 is -6-1', -7 is -6-1)
256 is 257-1 True
257 is 258-1 False
-5 is -6+1 True
-7 is -6-1 False
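The small-integer cache is an implementation detail, so the result of is on numbers should not be relied on. A short sketch of the safer habit, with illustrative variable names: use == to compare values and keep is for singletons such as None.

```python
x = 257
y = 258 - 1

values_match = (x == y)  # compares values, independent of any object caching
print(values_match)      # True

result = None
print(result is None)    # True: None is a singleton, so identity is the right test
```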
result = (3 or 5) * (7 and 9)
print('3 * 9 =', result)
3 * 9 = 27
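The result above follows from the fact that or and and return one of their operands rather than a bool: or returns the first truthy operand, and and returns the first falsy operand or, if there is none, the last one. A small sketch; default_name is a hypothetical variable showing the common "fallback value" idiom.

```python
print(3 or 5)   # 3: `or` stops at the first truthy operand
print(0 or 5)   # 5: 0 is falsy, so the second operand is returned
print(7 and 9)  # 9: both are truthy, so `and` returns the last operand
print(0 and 9)  # 0: `and` stops at the first falsy operand

# Common idiom: fall back to a default when the left side is empty.
default_name = '' or 'anonymous'
print(default_name)  # anonymous
```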
bool is a subclass of int
print('isinstance(True, int):', isinstance(True, int))
print('True + True:', True + True)
print('3*True + True:', 3*True + True)
print('3*True - False:', 3*True - False)
isinstance(True, int): True
True + True: 2
3*True + True: 4
3*True - False: 3
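One practical consequence is that booleans can be summed directly, which gives a compact way to count matches. A minimal sketch with made-up data:

```python
scores = [1, 2, 3, 4, 5]

# Each comparison yields True (1) or False (0), so sum() counts the matches.
above_two = sum(score > 2 for score in scores)
print(above_two)  # 3
```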
a = [1, 2, 3, 4, 5]
for i in a:
    if not i % 2:
        a.remove(i)
print(a)
[1, 3, 5]
b = [2, 4, 5, 6]
for i in b:
    if not i % 2:
        b.remove(i)
print(b)
[4, 5]
Hopefully this example will make it clear why this happens:
b = [2, 4, 5, 6]
for index, item in enumerate(b):
    print(index, item)
    if not item % 2:
        b.remove(item)
print(b)
0 2
1 5
2 6
[4, 5]
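The loop skips elements because removing an item shifts everything after it one position to the left while the loop index keeps advancing. Two common ways to avoid this are sketched below; the variable names odds and c are illustrative.

```python
b = [2, 4, 5, 6]

# Option 1: build a new list that keeps only the odd items.
odds = [item for item in b if item % 2]
print(odds)  # [5]

# Option 2: iterate over a copy (c[:]) while removing from the original.
c = [2, 4, 5, 6]
for item in c[:]:
    if not item % 2:
        c.remove(item)
print(c)  # [5]
```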
Indexing past the end of a list raises an IndexError, as expected:
my_list = [1, 2, 3, 4, 5]
print(my_list[5])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-28-2f6b582502c3> in <module>()
      1 my_list = [1, 2, 3, 4, 5]
----> 2 print(my_list[5])

IndexError: list index out of range
Slicing past the end raises no error and returns an empty list, which can be hard to troubleshoot:
my_list = [1, 2, 3, 4, 5]
print(my_list[5:])
[]
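Slices clamp out-of-range bounds to the length of the list instead of raising, so an explicit emptiness check is one way to catch the problem early. A minimal sketch; the name tail and the message text are illustrative.

```python
my_list = [1, 2, 3, 4, 5]

print(my_list[3:100])  # [4, 5]: the stop bound is clamped to the end of the list

tail = my_list[5:]     # start is at the end of the list, so nothing is left
if not tail:
    print('slice is empty: start index is at or past the end of the list')
```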
*args vs **kwargs
Both of these are used to allow function inputs of arbitrary length.
Here are the differences between the two of them:
def a_func(*args):
    print('type of args:', type(args))
    print('args contents:', args)
    print('1st argument:', args[0])
a_func(0, 1, 'a', 'b', 'c')
type of args: <class 'tuple'>
args contents: (0, 1, 'a', 'b', 'c')
1st argument: 0
def b_func(**kwargs):
    print('type of kwargs:', type(kwargs))
    print('kwargs contents:', kwargs)
    print('value of argument a:', kwargs['a'])
b_func(a=1, b=2, c=3, d=4)
type of kwargs: <class 'dict'>
kwargs contents: {'a': 1, 'b': 2, 'c': 3, 'd': 4}
value of argument a: 1
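The two forms can be combined in one signature, and the same * and ** syntax unpacks a sequence or dict at the call site. A short sketch; describe and forward are hypothetical functions written for this example.

```python
def describe(*args, **kwargs):
    """Return how many positional arguments arrived and the sorted keyword names."""
    return len(args), sorted(kwargs)

def forward(*args, **kwargs):
    """Pass everything through unchanged, a common pattern in wrapper functions."""
    return describe(*args, **kwargs)

numbers = [0, 1]
options = {'a': 1, 'b': 2}

# * unpacks the list into positional arguments, ** unpacks the dict into keyword arguments.
summary = forward(*numbers, **options)
print(summary)  # (2, ['a', 'b'])
```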
You won't necessarily need the JavaScript interop shown below.
%%javascript
document.getElementById('<name>').contentWindow
# Read in an HTML file and render it in the cell output
from IPython.display import HTML
with open('index.html', 'r') as f:
    inputForm = f.read()
HTML(inputForm)
// JavaScript: pass the JS variable typed to the kernel as the Python variable query
IPython.notebook.kernel.execute('query = "AU='.concat(typed, '"'));
// Iterate through the Jupyter cells by index and execute them in order.
// The loop does not necessarily start at the top cell (here it starts at index 1).
// Use a textContent call to figure out which cell has the content you want to run.
for (let i = 1; i < 15; i++) {
    IPython.notebook.get_cell(i).execute();
}