#!/usr/bin/env python
# coding: utf-8

# ##### Python for High School (Summer 2022)
#
# * [Table of Contents](PY4HS.ipynb)
# * Open in Colab
# * [![nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/4dsolutions/elite_school/blob/master/Py4HS_August_26_2022.ipynb)
#
# OMSI Lower Right
#
# # Looking Back (and Ahead)
#
# Welcome to the last Chapter of Python for High School. Our chapters document a chronological sequence in that my workflow was to prepare Notebooks ahead of each meetup.
#
# However the Topics need follow no specific order, except some are prerequisite to the others, in a kind of directed graph. You may find yourself "running up the down escalator" (a figure of speech) if you tackle some levels before others.

# ## Review: Number Theory

# For example, appreciation for the elegance of the RSA algorithm (public key crypto) deepens with one's appreciation for [Euler's Theorem](https://brilliant.org/wiki/eulers-theorem/), the one that generalizes [Fermat's Little Theorem](https://brilliant.org/wiki/fermats-little-theorem/).
#
# Fermat's Little Theorem:
#
# Let $p$ be a prime number, and $a$ be any integer. Then $a^{p} - a$ is always divisible by $p$.

# In[54]:

b = 5   # the base
p = 23  # try any prime here
divmod(b**p - b, p)  # the remainder is 0 whenever p is prime

# In[53]:

divmod(36, 12)

# In[51]:

5**23 - 5

# In[3]:

(5**23 - 5) // p

# The converse of Fermat's Little Theorem is not true, however. Some numbers $p$ pass the "Fermat Test", no matter the base $b$, and yet are not prime.

# In[4]:

divmod(b**561 - b, 561)  # 561 is a Carmichael Number, any b will do

# ## Skill Sets

# The skill sets we have been developing most in the foreground encompass:
#
# * Jupyter -- how we keep our Notes and sometimes publish them
#     - Markdown
#     - $LaTeX$
#     - % magics
# * python -- there's always a next level
#
# In the background we've been looking at:
#
# * numpy -- the number crunching king of vectorized operations
# * matplotlib -- pyplot imitates Matlab's way of doing things
# * pandas -- sophisticated DataFrames, isomorphic to spreadsheets
# * sympy -- computer algebra, high precision numbers

# In[57]:

get_ipython().run_line_magic('lsmagic', '')

# ## Topics

# Mostly, though, we have organized our thinking around Topics.
#
# Such as:
#
# * Cryptography
#     - Number Theory
#     - Primes versus Composites
#     - Totative and Totient
# * Permutations
#     - Finite Groups
#     - Group Theory
# * Fractals
#     - ASCII and Unicode Art
#     - CropCircle Tractors
# * Logarithms
#     - exponential function
#     - bases
# * Graph Theory
#     - adjacency matrix
#     - weighted and directional graphs
#     - polyhedrons
# * Vectors
#     - XYZ ("Earthling Math"), goes with unit cube
#     - IVM ("Martian Math"), goes with unit tetrahedron
# * Ray Tracing
#     - [POV-Ray](https://www.povray.org)
#     - [Blender](https://www.blender.org/)
# * Machine Learning
#     - history
#     - future
#     - relation to AI
#
# There's no need to be exhaustive and/or all-inclusive.
#
# Then come the skillsets we might optionally cultivate in the background.
#
# We might tackle one or more IDEs (vim, vscode, spyder...), remember our HTML (Requests package), and practice our Regular Expressions.
#
# Actually, we could add Regexes to the foreground list of skills.
#
# $LaTeX$ has been a subskill under Jupyter. To maximize our Markdown, we want to master typesetting conventional mathematical expressions.

# What are some other Topics we might reasonably suggest have a footprint in high school? I mention Machine Learning above, but we're only getting to it now, in the last chapter.
#
# Some of these build on Topics already addressed:
#
# * ml (machine learning)
#     - sklearn (native Python)
#     - tensorflow (from Google)
#     - pytorch (from Facebook)
#     - other (keep scanning)
# * sql (structured query language)
# * python (there's always a next level)
# * geospatial data (in the cards for us all)
# * [jupyter books](https://jupyterbook.org/en/stable/intro.html) (extending notebook skills)
#
# In Preview mode then, let's talk about: *Machine Learning*.

# ## Preview: What is ML?

# Does Machine Learning belong in high school?
#
# Since the dawn of history, a core aim of both logic and superstition has been to predict the future in some way. We have always needed to divine the future, and yet encounter limits on predictability.
#
# The way a casino is set up, the house will win money in the long run, but given individuals may prove the exception, a fact the individuals count on when risking betting against the house.
#
# The age-old project to distill our intuitions about such concepts as "likelihood", "expectations", and "confidence" into a science is part of the heritage of data science. Statistics married Computer Science, and Data Science was their offspring.
#
# In ordinary, everyday language we may ask: with what measure of confidence do I expect X to happen? Does my measure of confidence change with respect to some "by when"?
#
# I might expect X to happen someday with 100% confidence, yet with no confidence at all about X happening tomorrow or next week.
#
# These may sound like obvious truths; however, in getting clear on such topics, we learn to think more logically. We may also come to invent new methods of computation.
#
# We don't stop with some hazy notion of "average"; we break it out into mean, median and mode, and ways of computing each. Standard deviation follows, and variance. The concept of a Bell Curve begins to emerge.
#
# We hope to intelligently anticipate (rather than wildly or blindly guess), based on the many models we have developed.
#
# By model, we could mean a simulation. Does our simulation run as a computer program? Not necessarily. People were simulating complex systems long before they had invented silicon chip computers.
#
# We could also mean, by "model", some actual Python object developed for us by one of the model makers (KM, SVM...).
#
# Who are the model makers?
#
# The model makers are well-known and/or still experimental algorithms. We categorize them in various ways, such as into supervised and unsupervised. A good example of a working model would be a recommendation engine attached to a website. "Based on your choice of books so far, a next one of interest might be...".
#
# A model need not be mysterious and opaque. A simple line through a bunch of dots well summarizes many of them. However, some models attain their high powers of prediction at the expense of being able to give us a set of rules.
#
# [A Quick List of ML Algorithms](https://howtolearnmachinelearning.com/articles/a-quick-list-of-machine-learning-algorithms/)
#
# More concretely, in the supervised learning setting, we show the features (X) and the right answers (y) to a "learner" or "recognizer" such as a neural net, considered deep after N layers. The feedback of getting it wrong or right is used to "weight" the "neurons" through a process known as "gradient descent" (uses calculus, partial derivatives, finds a downward path for an error function).
#
# The above paragraph puts into words a lot of number crunchy math, which numpy is good at. What we get back from such a process is a Python object that has powers of prediction. But to what degree? How good is it? Should we turn it loose in the real world?

# ## THE ML PROCESS

# Think about how you yourself learn from experience. Having a strong sense of a right answer, or of how you want things to go, is motivational. Machine Learning sets up a similar feedback loop in the algebra, especially in a supervised setting.
#
# The main question before us is one of reliability. Does our model do the job? Is a newer model an improvement?
#
# One standard approach, to test reliability, is as follows:
#
# * give the right answers on a percentage of the total data (training)
# * try the model against the balance (the rest), never seen
# * do more testing
#
# As you explore scikit-learn, you will find this standard approach is baked in, through time-saving functions that automatically divvy the data into training and testing sets.
#
# The process is akin to shuffling cards, as the same total data may be divvied into testing and training in multiple ways, allowing for more averaging and perhaps fine tuning. We looked at similar bootstrapping ideas in connection with the Confidence Interval (Seaborn barplot etc.).
#
# What's left out of the above description is the whole matter of selecting the appropriate ML engine (model maker), with fine tuned hyperparameters.
#
# That process itself suggests a feedback loop: why not try several ML engines on the same data and find by experiment which seems to work best? Indeed, that's a thing.
#
# You will find sklearn supports "ensembles" of model makers that work together. Sometimes they each reach a conclusion, then hold a vote. Random Forests made of Decision Trees behave in ensemble fashion.

# ## Cluster Finding Machines

# Finding clusters is sometimes an "unsupervised" form of ML, meaning we do not have a dataset with "right answers" for training purposes. We are looking for patterns in target data without knowing ourselves what they are beforehand.
#
# Other times, we have our clusters defined for training data, in which case the process is "supervised".
#
# For example, we might want to cluster tweets and short reports into "topics" (what they cluster around). We don't know in advance what those topics will be.
#
# When learning about cluster finding, we also learn about cluster making. Here's a data science teacher mixing up some clustered data, in order to discover if SVM (one of the model makers) will find them.

# Citing the tutorial at [General Abstract Nonsense](https://generalabstractnonsense.com/2017/03/A-quick-look-at-Support-Vector-Machines/):

# In[5]:

from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
# from mlxtend.plotting import plot_decision_regions

# create some V shaped data
np.random.seed(6)
# standard normal floats, 200 rows, 2 columns
X = np.random.randn(200, 2)
X[:10,:]

# The rule here is: y is True if the right column exceeds the absolute value of the left. Throw away negative signs on the left and ask if the resulting number is less than what's to its right. y is False if not.

# In[6]:

# create a "predict me" vector
y = X[:, 1] > np.absolute(X[:, 0])
y[:10]

# We can recast y as 1s and negative 1s using the ever-useful `np.where`.

# In[7]:

y = np.where(y, 1, -1)
y[:10]

# Based on the provided data, we find the Support Vector Machine is pretty good at teasing apart the y=1 from the y=-1, but will not get it right 100% of the time.
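One way to check the "pretty good, but not 100%" claim is to score a model on points it never saw during training. The following is only a sketch: it rebuilds the same V-shaped data, holds out a quarter of it, and borrows the SVC settings this Notebook uses. The 25% split, the `random_state` values and the names `model` and `test_accuracy` are my own assumptions, not part of the cited tutorial.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Rebuild the V-shaped data: y is 1 when the right column exceeds |left column|
np.random.seed(6)
X = np.random.randn(200, 2)
y = np.where(X[:, 1] > np.absolute(X[:, 0]), 1, -1)

# Hold out 25% of the rows for testing (the split size is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train only on the training rows
model = SVC(kernel='rbf', random_state=0, gamma=0.5, C=10.0)
model.fit(X_train, y_train)

# Score on the rows the model has never seen
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy on {len(y_test)} unseen points: {test_accuracy:.2f}")
```

Scoring on held-out rows is the more honest number; scoring on the training rows themselves tends to flatter the model.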
#
# We're not requiring `mlxtend.plotting` to run this Notebook, but Google Colab has it, so here's a screen shot:

# Screen Shot 2022-08-26 at 11.25.15 AM

# In[58]:

# train a Support Vector Classifier using the rbf kernel
svm = SVC(kernel='rbf', random_state=0, gamma=0.5, C=10.0)
svm.fit(X, y)

# Make up some new X points. A point is one row, with any number of features (columns).

# In[9]:

data = np.array([[-0.99, 1.1], [.12, .33], [-3, 1]])
data

# What would the model predict?

# In[10]:

svm.predict(data)

# What would we consider correct?

# In[11]:

new_y = data[:, 1] > np.absolute(data[:, 0])
new_y

# Let's do more study into the reliability of svm in this instance.

# In[12]:

from sklearn.metrics import accuracy_score

# In[13]:

svm.predict(X)

# In[14]:

y

# In[15]:

accuracy_score(y, svm.predict(X))

# In[16]:

y == svm.predict(X)

# ## Categorizing Machines

# **Clickbait Versus Headlines**
#
# We will need pandas for this next example, with its amazing ability to read csv files over the web, taking a URL as input.
#
# The file below is actually just a txt file, a csv with no headers, and it's delimited by tab. No problemo:

# In[59]:

import pandas as pd

# In[60]:

df_clickbait = pd.read_csv("https://raw.githubusercontent.com/sixhobbits/sklearn-intro/master/clickbait.txt", sep="\t", header=None)

# In[61]:

df_clickbait.info()

# In[20]:

df_clickbait.describe()

# In[21]:

df_clickbait

# In[62]:

df_clickbait.columns = ["Headline", "Category"]
df_clickbait.head()

# In[63]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

headlines = df_clickbait["Headline"]
labels = df_clickbait["Category"]

# In[64]:

headlines.head()

# In[25]:

labels.head()

# In[26]:

# Break dataset into test and train sets
train_headlines = headlines[:8000]
test_headlines = headlines[8000:]
train_labels = labels[:8000]
test_labels = labels[8000:]

# Remember one-hot encoding, otherwise known as [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) in pandas? We were able to unpack a column, full of ungainly strings, into new columns, with 1s and 0s for values.
#
# That's close to what we do when "vectorizing" a "bag of words". The specific algorithm we will be using, already in the can for us in sklearn, can be told to throw away what we might call "stop words" (used too frequently) and also words that occur too infrequently (see its `stop_words`, `max_df` and `min_df` parameters). With default settings, as used here, it keeps nearly every word.
#
# We might do some data cleaning up front, before we vectorize. Remove dates? Excise serial numbers? The end goal is to distill a document space to a vocabulary and frequency count for each word. The ML engine (whichever one we pick) should be able to munch on this kind of numeric data.
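To make "a vocabulary and frequency count for each word" concrete, here is a toy sketch in plain Python. The three headlines are invented, and the weighting is the textbook count × log(N / df) form, not the exact formula scikit-learn applies (its default adds smoothing and normalizes each row), so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Three made-up headlines standing in for a document space
docs = [
    "ten cities every hipster will love",
    "senate passes budget bill",
    "ten reasons the senate will surprise you",
]

# Split each headline into words, then collect the sorted vocabulary
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for words in tokenized for word in words})

# Document frequency: in how many headlines does each word appear?
doc_freq = Counter(word for words in tokenized for word in set(words))

def tfidf_vector(words):
    """One number per vocabulary word: count in this headline times log(N / df)."""
    counts = Counter(words)
    n_docs = len(tokenized)
    return [counts[word] * math.log(n_docs / doc_freq[word]) for word in vocabulary]

vectors = [tfidf_vector(words) for words in tokenized]

# A word unique to one headline outweighs a word shared by two
cities = vectors[0][vocabulary.index("cities")]
ten = vectors[0][vocabulary.index("ten")]
print(round(cities, 3), round(ten, 3))
```

Every headline becomes a row of the same width (one slot per vocabulary word), which is the numeric shape an ML engine can work with.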
#
# [TF-IDF in Python with Scikit Learn](https://youtu.be/i74DVqMsRWY) -- From Python Tutorials for the Digital Humanities
#
# [Topic Modeling](http://topic-modeling.pythonhumanities.com/01_01_introduction_to_topic_modeling.html)

# In addition to initializing a vectorizer, which is used in a next cell, we have to make the important decision regarding which ML engine to use. In this case: LinearSVC.
#
# LinearSVC is a subtype of a Support Vector Machine (SVM). When separating data into clusters, you want to find a "cut" or "slice" through the data that maximizes its distance from the closest element in any cluster. Optimizing means making the cut ever more effective at delineating two groups.
#
# In this case, we are attempting to separate the data using known cluster scores (1 or 0) for the training data.

# In[27]:

vectorizer = TfidfVectorizer()
svm = LinearSVC()

# In[65]:

# Transform our text data into numerical vectors
train_vectors = vectorizer.fit_transform(train_headlines)
test_vectors = vectorizer.transform(test_headlines)

# In[66]:

train_vectors[:6]

# In[67]:

train_vectors[1]

# In[68]:

# Train the classifier and predict on test set
svm.fit(train_vectors, train_labels)
predictions = svm.predict(test_vectors)
accuracy_score(test_labels, predictions)

# In[69]:

new_headlines = ["10 Cities That Every Hipster Will Be Moving To Soon", 'Vice President Mike Pence Leaves NFL Game Saying Players Showed "Disrespect" Of Anthem, Flag']
new_vectors = vectorizer.transform(new_headlines)
new_predictions = svm.predict(new_vectors)
new_predictions

# ### Preview: Linear Regression

# One of the many famous datasets, which you can find on Kaggle, is [tips](https://www.kaggle.com/datasets/jsphyg/tipping).
#
# Assuming we have seaborn, that's a dataset we can load by name (seaborn fetches it online and caches a local copy).

# In[33]:

import seaborn as sns
tips = sns.load_dataset("tips")
tips.head()

# In[34]:

sns.jointplot(kind="reg", data=tips, x="total_bill", y="tip");

# In[35]:

get_ipython().system(' ls roller*.*')

# What we're reviewing here is sqlite3, which we only touched upon briefly, in connection with `airports.db` -- a flatfile of some 7K world airports, categorized in a few ways, some with lat/long. That's a thumbnail data dictionary.
#
# Here's a link to it [on Kaggle](https://www.kaggle.com/datasets/jonatancr/airports).
#
# A potentially interesting project would be to compare the Roller Coasters dataset on Kaggle with the one provided. Another idea for a capstone project. Perhaps a newer table could be merged from both sources?

# In[36]:

import sqlite3 as sql

# In[37]:

conn = sql.connect("roller_coasters.db")

# In[38]:

coasters_df = pd.read_sql("select * from Coasters;", con=conn)

# In[39]:

coasters_df

# What Linear Regression model shall we try? The "vertical drop" and the "speed" would seem connected by the exceptionless law of gravity, whatever that is. Let's check:

# In[40]:

sns.jointplot(kind="reg", data=coasters_df, x="VertDrop", y="Speed");

# That seems like a pretty good fit if you ask me. Call this the simplest of the Machine Learning model makers, the "LRM" or "linear regression machine".
#
# But now let's pause to exercise our secondary skills, which soon could become primary. Let's modify our SQL:

# In[41]:

coasters_df = pd.read_sql("select name, park, state, country from Coasters ORDER BY country, state, name;", con=conn)

# In[42]:

coasters_df

# In[43]:

conn.close()

# ### Review: Python's Context Manager Construct

# We went down this rabbit hole at least once. Or call it "following a trail" (they all branch into each other, in a large dark forest).
#
# The context manager is part of everyday Python. There's a `contextlib` library that leverages their power.

# In[44]:

class CM:

    def __init__(self, myname):  # initializer
        self.name = myname

    def __enter__(self):
        print("with me('') as it: do something with it")
        return self

    def __exit__(self, *oops):
        print("done doing")

# In[45]:

with CM("talker") as it:
    print("I'm inside the context -- __enter__ has run")
    print(it.name)
    print("I will exit now...")

# Let's reuse that same basic skeleton to talk to databases.

# In[46]:

import sqlite3 as sql

class CM:

    def __init__(self, myname):
        self.name = myname

    def __enter__(self):
        try:
            self.conn = sql.connect(self.name)
        except sql.Error:
            print("No connection")
            raise
        return self

    def __exit__(self, *oops):  # trap error data if any
        self.conn.close()
        if oops[0]:       # hoping oops is (None, None, None)
            return False  # something exceptional happened, let it propagate
        return True       # all OK

# In[47]:

with CM("roller_coasters.db") as db:
    curs = db.conn.cursor()
    curs.execute("select * from Coasters ORDER By name;")
    for rec in curs.fetchall():
        print(rec[0])
    print("--- GN")
print("--- Connection closed")
db.conn

# In[48]:

import os

# In[49]:

os.chdir("/Users/kirbyurner/Documents/elite_school")

# In[50]:

get_ipython().system(' pwd')
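The `contextlib` library mentioned in this section offers a shorter route to the same open-then-always-close pattern. A minimal sketch, assuming an in-memory database and a made-up two-row `Coasters` table so it runs without `roller_coasters.db`; the helper name `open_db` and the coaster rows are hypothetical.

```python
import sqlite3 as sql
from contextlib import contextmanager

@contextmanager
def open_db(name):
    """Hypothetical helper: hand out a connection, close it no matter what."""
    conn = sql.connect(name)
    try:
        yield conn    # the body of the with block runs here
    finally:
        conn.close()  # runs on normal exit and on exceptions alike

with open_db(":memory:") as conn:
    # Stand-in table and rows, invented for this sketch
    conn.execute("CREATE TABLE Coasters (Name TEXT, Speed REAL)")
    conn.executemany("INSERT INTO Coasters VALUES (?, ?)",
                     [("Thunderbolt", 55.0), ("Cyclone", 60.0)])
    names = [row[0] for row in
             conn.execute("SELECT Name FROM Coasters ORDER BY Name")]

print(names)
```

Everything before the `yield` plays the role of `__enter__` and the `finally` clause plays the role of `__exit__`. Because the generator does not swallow exceptions, errors raised inside the `with` block still propagate after the connection is closed.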