Welcome to the last chapter of Python for High School. Our chapters document a chronological sequence, in that my workflow was to prepare Notebooks ahead of each meetup.
However, the topics need follow no specific order, except that some are prerequisite to others, in a kind of directed graph. You may find yourself "running up the down escalator" (a figure of speech) if you tackle some levels before others.
Review: Number Theory
For example, appreciation for the elegance of the RSA algorithm (public key crypto) deepens with one's appreciation of Euler's Theorem, the one that generalizes Fermat's Little Theorem.
Fermat's Little Theorem:
Let $p$ be a prime number and $a$ any integer. Then $a^{p} - a$ is always divisible by $p$.
b = 5 # the base
p = 23 # try any prime here
divmod(b**p - b, p) # no remainder, true if p is prime
(518301258916440, 0)
divmod(36, 12)
(3, 0)
5**23 - 5
11920928955078120
(5**23 - 5) // p
518301258916440
The converse of Fermat's Little Theorem is not true, however. Some numbers $p$ pass the "Fermat Test" for every base $b$, and yet are not prime.
divmod(b**561 - b, 561) # 561 is a Carmichael Number, any b will do
(236161757013279006596205856527083708100196146653181706890446891437099112647726025559639680169088609951680950265194714570201302725740616911013462999989900903652916737725765458649007151093057800234250721538430156379493080234813016314639586372502409927294685286807480615872096596435637874175022915074729196884816391323389905397446092098430290978560132791452898294579400935163804447799655417920, 0)
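Computing b**561 directly produces a huge number, as the output above shows. Python's three-argument pow does modular exponentiation, keeping the intermediate numbers small. Here's a sketch (the helper names fermat_test and is_prime are our own inventions) showing that 561 passes the Fermat test for many bases despite being composite (561 = 3 × 11 × 17):

```python
def fermat_test(n, bases=range(2, 100)):
    """True if n passes the Fermat test for every base tried.
    pow(b, n, n) computes b**n mod n without huge intermediates."""
    return all(pow(b, n, n) == b % n for b in bases)

def is_prime(n):
    """Trial division -- fine for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n**0.5) + 1))

print(fermat_test(561), is_prime(561))  # True False -- a Carmichael number
print(fermat_test(23), is_prime(23))    # True True -- a genuine prime
```

Numbers like 561 that fool the Fermat test for every base are the Carmichael numbers, which is why practical primality tests (such as Miller-Rabin) go a step further.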
Skill Sets
The skill sets we have been most developing in the foreground encompass:
In the background we've been looking at:
%lsmagic
Available line magics: %alias %alias_magic %autoawait %autocall %automagic %autosave %bookmark %cat %cd %clear %colors %conda %config %connect_info %cp %debug %dhist %dirs %doctest_mode %ed %edit %env %gui %hist %history %killbgscripts %ldir %less %lf %lk %ll %load %load_ext %loadpy %logoff %logon %logstart %logstate %logstop %ls %lsmagic %lx %macro %magic %man %matplotlib %mkdir %more %mv %notebook %page %pastebin %pdb %pdef %pdoc %pfile %pinfo %pinfo2 %pip %popd %pprint %precision %prun %psearch %psource %pushd %pwd %pycat %pylab %qtconsole %quickref %recall %rehashx %reload_ext %rep %rerun %reset %reset_selective %rm %rmdir %run %save %sc %set_env %store %sx %system %tb %time %timeit %unalias %unload_ext %who %who_ls %whos %xdel %xmode

Available cell magics: %%! %%HTML %%SVG %%bash %%capture %%debug %%file %%html %%javascript %%js %%latex %%markdown %%perl %%prun %%pypy %%python %%python2 %%python3 %%ruby %%script %%sh %%svg %%sx %%system %%time %%timeit %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.
Topics
Mostly, though, we have organized our thinking around Topics.
Such as:
There's no need to be exhaustive and/or all-inclusive.
Then come the skillsets we might optionally cultivate in the background.
We might tackle one or more IDEs (vim, vscode, spyder...), remember our HTML (Requests package), and practice our Regular Expressions.
Actually, we could add Regexes to the foreground list of skills.
$\LaTeX$ has been a subskill under Jupyter. To maximize our Markdown, we want to master typesetting conventional mathematical expressions.
What are some other Topics we might reasonably suggest have a footprint in high school? I mention Machine Learning above, but we're only getting to it now, in the last chapter.
Some of these build on Topics already addressed:
In Preview mode then, let's talk about: Machine Learning.
Preview: What is ML?
Does Machine Learning belong in high school?
Since the dawn of history, a core aim of both logic and superstition has been to predict the future in some way. We have always needed to divine the future, and yet encounter limits on predictability.
The way a casino is set up, the house will win money in the long run, but a given individual may prove the exception, a fact individuals count on when risking bets against the house.
The age-old project to distill our intuitions about such concepts as "likelihood", "expectations", and "confidence" into a science is part of the heritage of data science. Statistics married Computer Science, and Data Science was their offspring.
In ordinary, everyday language we may ask: with what measure of confidence do I expect X to happen? Does my measure of confidence change with respect to some "by when?"
I might expect X to happen someday with 100% confidence, yet with no confidence at all about X happening tomorrow or next week.
These may sound like obvious truths, however in getting clear on such topics, we learn to think more logically. We may also come to invent new methods of computation.
We don't stop with some hazy notion of "average"; we break it out into mean, median, and mode, and ways of computing each. Standard deviation follows, and variance. The concept of a Bell Curve begins to emerge.
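The statistics module in the Standard Library covers all of these out of the box. A quick sketch with made-up numbers:

```python
from statistics import mean, median, mode, stdev, variance

data = [2, 3, 3, 5, 7, 10]
print(mean(data))      # 5 -- the sum divided by the count
print(median(data))    # 4.0 -- midway between the two middle values
print(mode(data))      # 3 -- the most frequent value
print(round(variance(data), 2))  # 9.2 -- mean squared deviation (sample)
print(round(stdev(data), 2))     # 3.03 -- square root of the variance
```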
We hope to intelligently anticipate (rather than wildly or blindly guess), based on the many models we have developed.
By model, we could mean a simulation. Does our simulation run as a computer program? Not necessarily. People were simulating complex systems long before they had invented silicon chip computers.
We could also mean by "model" some actual Python object developed for us by one of the model makers (KM, SVM...).
Who are the model makers?
The model makers are well-known and/or still experimental algorithms. We categorize them in various ways, such as into supervised and unsupervised. A good example of a working model would be a recommendation engine attached to a website. "Based on your choice of books so far, a next one of interest might be...".
A model need not be mysterious and opaque. A simple line through a bunch of dots well summarizes many of them. However some models attain their high powers of prediction at the expense of being able to give us a set of rules.
More concretely, in the supervised learning setting, we show the features (X) and the right answers (y) to a "learner" or "recognizer" known as a neural net, considered "deep" after N layers. The feedback of getting it wrong or right is used to "weight" the "neurons" through a process known as "gradient descent" (it uses calculus, partial derivatives, to find a downward path for an error function).
The above paragraph puts into words a lot of number crunchy math, which numpy is good at. What we get back from such a process is a Python object that has powers of prediction. But to what degree? How good is it? Should we turn it loose in the real world?
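As a toy illustration of the gradient descent idea (not a neural net, just a one-parameter line fit on made-up data), numpy is all we need:

```python
import numpy as np

# Fit y = w * x by repeatedly stepping w downhill on the
# mean-squared-error surface -- the same idea, writ small,
# that trains a neural net.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 3 * x + rng.normal(0, 0.1, 50)   # true slope is 3, plus noise

w = 0.0    # initial guess
lr = 0.1   # learning rate: how big a step to take
for _ in range(500):
    grad = -2 * np.mean((y - w * x) * x)  # derivative of MSE w.r.t. w
    w = w - lr * grad                     # step downhill

print(w)  # close to 3
```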
THE ML PROCESS
Think about how you yourself learn from experience. Having a strong sense of a right answer, or of how you want things to go, is motivational. Machine Learning sets up a similar feedback loop in the algebra, especially in a supervised setting.
The main question before us is one of reliability. Does our model do the job? Is a newer model an improvement?
One standard approach to testing reliability: hold back a portion of the data for testing, train the model on the remainder, then score the model's predictions against the held-back "right answers".
As you explore Sci-Kit Learn, you will find this standard approach is baked in, through time-saving functions that automatically divvy the data into training and testing sets.
The process is akin to shuffling cards: the same total data may be divvied into testing and training sets in multiple ways, allowing for more averaging and perhaps fine tuning. We looked at similar bootstrapping ideas in connection with the Confidence Interval (Seaborn barplot etc.).
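One of those time-saving functions is train_test_split. A minimal sketch with stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # ten rows of two features each
y = np.arange(10)                 # ten "right answers"

# Shuffle and divvy: 70% for training, 30% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))  # 7 3
```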
What's left out of the above description is the whole matter of selecting the appropriate ML engine (model maker), with fine tuned hyperparameters.
That process itself suggests a feedback loop: why not try several ML engines on the same data and find by experiment which seems to work best. Indeed, that's a thing.
You will find sklearn supports "ensembles" of model makers working together. Sometimes they each reach a conclusion and then hold a vote. Random Forests, made of Decision Trees, behave in ensemble fashion.
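A sketch of a Random Forest voting its way to a decision boundary, using sklearn's toy make_moons data (the parameter choices here are arbitrary, just for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

# Two interleaved crescent-shaped clusters, with some noise.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# 50 Decision Trees, each trained on a different random slice
# of the data; predictions come from a majority vote.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

print(forest.score(X, y))  # fraction of training points classified correctly
```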
Cluster Finding Machines
Finding clusters is sometimes an "unsupervised" form of ML, meaning we do not have a dataset with "right answers" for training purposes. We are looking for patterns in target data without knowing ourselves what they are beforehand.
Other times, we have our clusters defined for training data, in which case the process is "supervised".
For example, we might want to cluster tweets and short reports into "topics" (what they cluster around). We don't know in advance what those topics will be. Here's a tutorial.
When learning about cluster finding, we also learn about cluster making. Here's a data science teacher mixing up some clustered data, in order to discover if SVM (one of the model makers) will find them.
Citing the tutorial at General Abstract Nonsense:
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
# from mlxtend.plotting import plot_decision_regions
# create some V shaped data
np.random.seed(6)
# normalized floats, 200 rows, 2 columns
X = np.random.randn(200, 2)
X[:10,:]
array([[-0.31178367,  0.72900392],
       [ 0.21782079, -0.8990918 ],
       [-2.48678065,  0.91325152],
       [ 1.12706373, -1.51409323],
       [ 1.63929108, -0.4298936 ],
       [ 2.63128056,  0.60182225],
       [-0.33588161,  1.23773784],
       [ 0.11112817,  0.12915125],
       [ 0.07612761, -0.15512816],
       [ 0.63422534,  0.810655  ]])
The rule here is: y is True when the right column exceeds the absolute value of the left. Throw away the negative sign on the left and ask whether the resulting number is less than the number to its right; y is False if not.
# create a "predict me" vector
y = X[:, 1] > np.absolute(X[:, 0])
y[:10]
array([ True, False, False, False, False, False, True, True, False, True])
We can recast y as 1s and negative 1s using the ever-useful np.where.
y = np.where(y, 1, -1)
y[:10]
array([ 1, -1, -1, -1, -1, -1, 1, 1, -1, 1])
Based on the provided data, we find the Support Vector Machine is pretty good at teasing apart the y=1 from the y=-1, but will not get it right 100% of the time.
We're not requiring mlxtend.plotting to run this Notebook, but Google Colab has it, so here's a screen shot:
# train a Support Vector Classifier using the rbf kernel
svm = SVC(kernel='rbf', random_state=0, gamma=0.5, C=10.0)
svm.fit(X, y)
SVC(C=10.0, gamma=0.5, random_state=0)
Make up some new X points. A point is any number of features (columns).
data = np.array([[-0.99, 1.1],
[.12, .33],
[-3, 1]])
data
array([[-0.99, 1.1 ], [ 0.12, 0.33], [-3. , 1. ]])
What would the model predict?
svm.predict(data)
array([ 1, 1, -1])
What would we consider correct?
new_y = data[:, 1] > np.absolute(data[:, 0])
new_y
array([ True, True, False])
Let's study the reliability of svm in this instance a bit more.
from sklearn.metrics import accuracy_score
svm.predict(X)
array([ 1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, 1, -1, -1, -1, 1, 1, 1, -1, 1, -1, 1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, 1, -1, -1, -1, -1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, -1, 1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, 1, -1, 1, 1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1])
y
array([ 1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, 1, -1, -1, -1, 1, 1, 1, -1, 1, -1, 1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, 1, -1, -1, -1, -1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, -1, 1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, 1, 1, -1, 1, 1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1])
accuracy_score(y, svm.predict(X))
0.99
y == svm.predict(X)
array([ True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True])
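Nothing mysterious about accuracy_score, by the way: taking the mean of such a boolean comparison gives the same number, since True counts as 1 and False as 0. A tiny sketch with made-up labels:

```python
import numpy as np

y_true = np.array([1, -1, 1, 1, -1])   # the "right answers"
y_pred = np.array([1, -1, -1, 1, -1])  # a model's guesses

# Four matches out of five: accuracy 0.8, same as accuracy_score
# would report for these two arrays.
print((y_true == y_pred).mean())  # 0.8
```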
Categorizing Machines
Clickbait Versus Headlines
We will need pandas for this next example, with its amazing ability to read csv files over the web, taking a URL as input.
The file below is actually just a txt file, a csv with no headers, and it's delimited by tab. No problemo:
import pandas as pd
df_clickbait = pd.read_csv("https://raw.githubusercontent.com/sixhobbits/sklearn-intro/master/clickbait.txt", sep="\t", header=None)
df_clickbait.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       10000 non-null  object
 1   1       10000 non-null  int64
dtypes: int64(1), object(1)
memory usage: 156.4+ KB
df_clickbait.describe()
| | 1 |
|---|---|
| count | 10000.000000 |
| mean | 0.500000 |
| std | 0.500025 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.500000 |
| 75% | 1.000000 |
| max | 1.000000 |
df_clickbait
| | 0 | 1 |
|---|---|---|
| 0 | Egypt's top envoy in Iraq confirmed killed | 0 |
| 1 | Carter: Race relations in Palestine are worse ... | 0 |
| 2 | After Years Of Dutiful Service, The Shiba Who ... | 1 |
| 3 | In Books on Two Powerbrokers, Hints of the Future | 0 |
| 4 | These Horrifyingly Satisfying Photos Of "Baby ... | 1 |
| ... | ... | ... |
| 9995 | What Is Your Weirdest Fear | 1 |
| 9996 | Felipe Massa wins 2008 French Grand Prix | 0 |
| 9997 | Bottled water concerns health experts | 0 |
| 9998 | Death of Nancy Benoit rumour posted on Wikiped... | 0 |
| 9999 | US Dept. of Justice IP address blocked after '... | 0 |

10000 rows × 2 columns
df_clickbait.columns = ["Headline", "Category"]
df_clickbait.head()
| | Headline | Category |
|---|---|---|
| 0 | Egypt's top envoy in Iraq confirmed killed | 0 |
| 1 | Carter: Race relations in Palestine are worse ... | 0 |
| 2 | After Years Of Dutiful Service, The Shiba Who ... | 1 |
| 3 | In Books on Two Powerbrokers, Hints of the Future | 0 |
| 4 | These Horrifyingly Satisfying Photos Of "Baby ... | 1 |
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
headlines = df_clickbait["Headline"]
labels = df_clickbait["Category"]
headlines.head()
0           Egypt's top envoy in Iraq confirmed killed
1    Carter: Race relations in Palestine are worse ...
2    After Years Of Dutiful Service, The Shiba Who ...
3    In Books on Two Powerbrokers, Hints of the Future
4    These Horrifyingly Satisfying Photos Of "Baby ...
Name: Headline, dtype: object
labels.head()
0    0
1    0
2    1
3    0
4    1
Name: Category, dtype: int64
# Break dataset into test and train sets
train_headlines = headlines[:8000]
test_headlines = headlines[8000:]
train_labels = labels[:8000]
test_labels = labels[8000:]
Remember one-hot encoding, otherwise known as get_dummies in pandas? We were able to unpack a column, full of ungainly strings, into new columns, with 1s and 0s for values.
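As a refresher, a sketch of get_dummies at work on a made-up column of pets:

```python
import pandas as pd

pets = pd.Series(["cat", "dog", "cat", "bird"])

# One new column per distinct value; a 1 (or True) marks membership.
dummies = pd.get_dummies(pets)
print(dummies)
```

Each row now carries a single 1 in the column matching its original string, which is exactly the kind of numeric layout an ML engine can munch on.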
That's close to what we do when "vectorizing" a "bag of words". The specific algorithm we will be using, already in the can for us in sklearn, throws away what we might call "stop words" (used too frequently) and also words that occur too infrequently.
We might do some data cleaning up front, before we vectorize. Remove dates? Excise serial numbers? The end goal is to distill a document space to a vocabulary and a frequency count for each word. The ML engine (whichever one we pick) should be able to munch on this kind of numeric data.
TF-IDF in Python with Scikit Learn -- From Python Tutorials for the Digital Humanities
In addition to initializing a vectorizer, used in the next cell, we have to make the important decision of which ML engine to use. In this case: LinearSVC.
LinearSVC is a subtype of a Support Vector Machine (SVM). When separating data into clusters, you want to find a "cut" or "slice" through the data that maximizes its distance from the closest element in any cluster. Optimizing means making the cut ever more effective at delineating two groups.
In this case, we are attempting to separate the data using known cluster scores (1 or 0) for the training data.
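Before turning the vectorizer loose on 10,000 headlines, here's what TfidfVectorizer does with a three-document toy corpus of our own invention:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)  # sparse matrix, one row per document

print(sorted(vec.vocabulary_))    # ['cat', 'dog', 'ran', 'sat', 'the']
print(matrix.shape)               # (3, 5): 3 documents, 5 vocabulary words
```

Each cell of the matrix weighs a word's frequency in a document against how common it is across all documents, so ubiquitous words like "the" count for little.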
vectorizer = TfidfVectorizer()
svm = LinearSVC()
# Transform our text data into numerical vectors
train_vectors = vectorizer.fit_transform(train_headlines)
test_vectors = vectorizer.transform(test_headlines)
train_vectors[:6]
<6x11773 sparse matrix of type '<class 'numpy.float64'>' with 56 stored elements in Compressed Sparse Row format>
train_vectors[1]
<1x11773 sparse matrix of type '<class 'numpy.float64'>' with 9 stored elements in Compressed Sparse Row format>
# Train the classifier and predict on test set
svm.fit(train_vectors, train_labels)
predictions = svm.predict(test_vectors)
accuracy_score(test_labels, predictions)
0.962
new_headlines = ["10 Cities That Every Hipster Will Be Moving To Soon",
'Vice President Mike Pence Leaves NFL Game Saying Players Showed "Disrespect" Of Anthem, Flag']
new_vectors = vectorizer.transform(new_headlines)
new_predictions = svm.predict(new_vectors)
new_predictions
array([1, 0])
import seaborn as sns
tips = sns.load_dataset("tips")
tips.head()
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
sns.jointplot(kind="reg", data=tips, x="total_bill", y="tip");
! ls roller*.*
roller_coasters.db
What we're reviewing here is sqlite3, which we only touched upon briefly, in connection with airports.db -- a flatfile of some 7K world airports, categorized in a few ways, some with lat/long. That's a thumbnail data dictionary.
Here's a link to it on Kaggle.
A potentially interesting capstone project: compare the Roller Coasters dataset on Kaggle with the one provided here. Perhaps a newer table could be merged from both sources?
import sqlite3 as sql
conn = sql.connect("roller_coasters.db")
coasters_df = pd.read_sql("select * from Coasters;", con=conn)
coasters_df
| | Name | Park | State | Country | Duration | Speed | Height | VertDrop | Length | Yr_Opened | Inversions |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Top Thrill Dragster | Cedar Point | Ohio | USA | 60 | 120.0 | 420.0 | 400.00 | 2800.00 | 2003 | 0 |
| 1 | Superman The Escape | Six Flags Magic Mountain | California | USA | 28 | 100.0 | 415.0 | 328.10 | 1235.00 | 1997 | 0 |
| 2 | Millennium Force | Cedar Point | Ohio | USA | 165 | 93.0 | 310.0 | 300.00 | 6595.00 | 2000 | 0 |
| 3 | Goliath | Six Flags Magic Mountain | California | USA | 180 | 85.0 | 235.0 | 255.00 | 4500.00 | 2000 | 0 |
| 4 | Titan | Space World | Kitakyushu | Japan | 180 | 71.5 | 166.0 | 178.00 | 5019.67 | 1994 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | Oblivion | Alton Towers | Alton | England | 75 | 68.0 | 65.0 | 180.00 | 1222.00 | 1998 | 0 |
| 72 | Stunt Fall | Warner Bros. Movie World | San Martin de la Vega | Spain | 92 | 65.6 | 191.6 | 177.00 | 1204.00 | 2002 | 3 |
| 73 | Hayabusa | Tokyo SummerLand | Tokyo | Japan | 108 | 60.3 | 137.8 | 124.67 | 2559.10 | 1992 | 0 |
| 74 | Top Gun | Paramount Canada's Wonderland | Vaughan | Canada | 125 | 56.0 | 102.0 | 93.00 | 2170.00 | 1995 | 5 |
| 75 | Wild Beast | Paramount Canada's Wonderland | Vaughan | Canada | 150 | 56.0 | 415.0 | 78.00 | 3150.00 | 1981 | 0 |

76 rows × 11 columns
What Linear Regression model shall we try? The "vertical drop" and the "speed" would seem connected by the exceptionless law of gravity, whatever that is. Let's check:
sns.jointplot(kind="reg", data=coasters_df, x="VertDrop", y="Speed");
That seems like a pretty good fit if you ask me. Call this the simplest of the Machine Learning model makers, the "LRM" or "linear regression machine".
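As a sanity check on that "exceptionless law of gravity": frictionless free fall predicts a top speed of $v = \sqrt{2gh}$ for a drop of height $h$. A sketch in plain Python (the helper function is our own, working in feet and mph):

```python
from math import sqrt

g = 32.174  # acceleration due to gravity, in ft/s^2

def freefall_mph(drop_ft):
    """Top speed in mph after a frictionless fall of drop_ft feet."""
    fps = sqrt(2 * g * drop_ft)   # v = sqrt(2 g h), in feet per second
    return fps * 3600 / 5280      # convert ft/s to miles per hour

print(round(freefall_mph(400), 1))  # a 400 ft drop: roughly 109 mph
```

That's a bit short of Top Thrill Dragster's listed 120 mph, presumably because its speed comes from a launch rather than from the drop alone, but it shows why VertDrop and Speed line up so neatly in the jointplot.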
But now let's pause to exercise our secondary skills, which soon could become primary. Let's modify our SQL:
coasters_df = pd.read_sql("select name, park, state, country from Coasters ORDER BY country, state, name;", con=conn)
coasters_df
| | Name | Park | State | Country |
|---|---|---|---|---|
| 0 | Tower of Terror | Dreamworld | Coomera | Australia |
| 1 | Top Gun | Paramount Canada's Wonderland | Vaughan | Canada |
| 2 | Wild Beast | Paramount Canada's Wonderland | Vaughan | Canada |
| 3 | Oblivion | Alton Towers | Alton | England |
| 4 | Fujiyama | Fuji-Q Highlands | FujiYoshida-shi | Japan |
| ... | ... | ... | ... | ... |
| 71 | Alpengeist | Busch Gardens Williamsburg | Virginia | USA |
| 72 | Apollo's Chariot | Busch Gardens Williamsburg | Virginia | USA |
| 73 | HyperSonic XLC | Paramount's Kings Dominion | Virginia | USA |
| 74 | Volcano | Paramount's Kings Dominion | Virginia | USA |
| 75 | Coaster Thrill Ride | Puyallup Fair | Washington | USA |

76 rows × 4 columns
conn.close()
Review: Python's Context Manager Construct
We went down this rabbit hole at least once. Or call it "following a trail" (they all branch into each other, in a large dark forest).
The context manager is part of everyday Python. There's a contextlib library that leverages their power.
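For instance, contextlib's contextmanager decorator turns an ordinary generator into a context manager: code before the yield plays the __enter__ role, code after plays __exit__. A sketch (the events list is just for demonstration):

```python
from contextlib import contextmanager

events = []  # record the order of events, to see the protocol in action

@contextmanager
def cm(name):
    events.append("enter")  # runs on entering the with-block
    yield name              # the value bound by "as"
    events.append("exit")   # runs on leaving the with-block

with cm("talker") as it:
    events.append(it)

print(events)  # ['enter', 'talker', 'exit']
```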
class CM:
    def __init__(self, myname):  # initializer
        self.name = myname
    def __enter__(self):
        print("with me('') as it: do something with it")
        return self
    def __exit__(self, *oops):
        print("done doing")

with CM("talker") as it:
    print("I'm inside the context -- __enter__ has run")
    print(it.name)
    print("I will exit now...")
with me('') as it: do something with it
I'm inside the context -- __enter__ has run
talker
I will exit now...
done doing
Let's reuse that same basic skeleton to talk to databases.
import sqlite3 as sql
class CM:
    def __init__(self, myname):
        self.name = myname
    def __enter__(self):
        try:
            self.conn = sql.connect(self.name)
        except Exception:
            print("No connection")
            raise
        return self
    def __exit__(self, *oops):  # trap error data if any
        self.conn.close()
        if oops[0]:       # hoping oops is (None, None, None)
            return False  # something exceptional happened
        return True       # all OK
with CM("roller_coasters.db") as db:
    curs = db.conn.cursor()
    curs.execute("select * from Coasters ORDER BY name;")
    for rec in curs.fetchall():
        print(rec[0])
    print("--- GN")

print("--- Connection closed")
db.conn
Afterburner Alpengeist American Eagle Apollo's Chariot Batman Knight Flight Beast Blue Streak Boss Cannon Ball Canyon Blaster Chang Cheetah Coaster Thrill Ride Colossus Comet Corkscrew Deja Vu Desperado Fujiyama Goliath Great American Scream Machine Hangman Hayabusa Hercules Hurricane HyperSonic XLC Incredible Hulk Invertigo Iron Wolf Kong Kraken Magnum XL-200 Mamba Manhattan Express Mean Streak Medusa Millennium Force Mind Eraser New Mexico Rattler Nitro Oblivion Orient Express Phantom's Revenge Raging Bull Rattler Riddler's Revenge Scream! Screamin' Eagle Silver Bullet Son Of Beast Starliner Steel Dragon 2000 Steel Eel Steel Force Stunt Fall Superman - Ride Of Steel Superman The Escape T2 Tennessee Tornado Texas Giant Thunder Dolphin Thunderbolt Timber Wolf Titan Top Gun Top Thrill Dragster Tower of Terror Viper Volcano Whizzer Wild Beast Wild One Wild Thing Wildfire X Xcelerator --- GN --- Connection closed
<sqlite3.Connection at 0x7fae46b3ce40>
import os
os.chdir("/Users/kirbyurner/Documents/elite_school")
! pwd
/Users/kirbyurner/Documents/elite_school