
Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.

An Introduction to Python and Programming¶

This book is a thorough introduction to programming in Python.

It teaches the concepts behind and the syntax of the core Python language as defined by the Python Software Foundation in the official language reference. Furthermore, it introduces commonly used functionalities from the standard library and popular third-party libraries like numpy, pandas, matplotlib, and others.

Prerequisites¶

There are no prerequisites for reading this book.

Objective¶

The main goal of this introduction is to prepare the student for further studies in the "field" of data science.

Why data science?¶

The term data science is rather vague and does not refer to an academic discipline. Instead, the term was popularized by the tech industry, which also coined meaningless job titles such as "rockstar" or "ninja developers." Most serious definitions describe the field as multi-disciplinary, integrating scientific methods, algorithms, and systems thinking to extract knowledge from structured and unstructured data, and also emphasize the importance of domain knowledge.

Recently, this integration aspect has started to feed back into the academic world. MIT, for example, created the new Stephen A. Schwarzman College of Computing for artificial intelligence with an initial investment of 1 billion dollars, where students undergo a "bilingual" curriculum with half the classes in quantitative and method-centric fields and the other half in domains such as biology, business, chemistry, politics, (art) history, or linguistics (cf., the official Q&As or this NYT article). Their strategists see a future where programming skills are just as naturally embedded into students' curricula as subjects like calculus, statistics, or academic writing are today. Then, programming literacy is not just another "nice to have" skill but a prerequisite, or an enabler, to understanding more advanced topics in the actual domains studied.

Installation¶

To "read" this book in the most meaningful way, a working installation of Python 3.8 with JupyterLab is needed.

For a tutorial on how to install Python on your computer, follow the instructions in the README.md file in the project's GitHub repository. If you cannot install Python on your own machine, you may open the book interactively in the cloud with Binder.

Jupyter Notebooks¶

The document you are viewing is a so-called Jupyter notebook, a file format introduced by the Jupyter Project.

"Jupyter" is an acronym derived from the names of the three major programming languages Julia, Python, and R, all of which play significant roles in the world of data science. The Jupyter Project's idea is to serve as an integrating platform such that different programming languages and software packages can be used together within the same project.

Jupyter notebooks have become a de-facto standard for communicating and exchanging results in the data science community - both in academia and business - and provide an alternative to command-line interface (CLI or "terminal") based ways of running Python code. As an example of the latter case, we could start the default Python interpreter that comes with every installation by typing the python command into a CLI (or poetry run python if the project is managed with the poetry CLI tool as explained in the README.md file). Then, as the screenshot below shows, we could execute Python code like 1 + 2 or print("Hello World") line by line simply by typing it following the >>> prompt and pressing the Enter key. For an introductory course, however, this would be rather tedious and would probably scare off many beginners.

One reason for the popularity of Jupyter notebooks is that they allow mixing text with code in the same document. Text may be formatted with the Markdown language and mathematical formulas typeset with LaTeX. Moreover, we may include pictures, plots, and even videos. Because of these features, the notebooks developed for this book come in a self-contained "tutorial" style enabling students to simply read them from top to bottom while executing the code snippets.

Other ways of running Python code are to use the IPython CLI tool instead of the default interpreter or a full-fledged Integrated Development Environment (e.g., the commercial PyCharm or the free Spyder that comes with the Anaconda Distribution).

Markdown Cells vs. Code Cells¶

A Jupyter notebook consists of cells that have a type associated with them. So far, only cells of type "Markdown" have been used, which is the default way to present formatted text.

The cells below are examples of "Code" cells containing actual Python code: They calculate the sum of 1 and 2 and print out "Hello World" when executed, respectively. To edit an existing code cell, enter into it with a mouse click. You are "in" a code cell if its frame is highlighted in blue. We call that the edit mode.

There is also a command mode that you reach by hitting the Escape key. That un-highlights the frame. You are now "out" of but still "on" the cell. If you were already in command mode, hitting the Escape key does nothing.

Using the Enter and Escape keys, you can now switch between the two modes.

To execute, or "run," a code cell, hold down the Control key and press Enter. Note how you do not go to the subsequent cell if you keep re-executing the cell you are on. Alternatively, you can hold the Shift key and press Enter, which executes a cell and places your focus on the subsequent cell or creates a new one if there is none.

In [1]:

1 + 2

Out[1]:

3

In [2]:

print("Hello World")

Hello World

Similarly, a Markdown cell is also in either edit or command mode. For example, double-click on the text you are reading: This puts you into edit mode. Now, you could change the formatting (e.g., print a word in italics or bold) and "execute" the cell to render the text as specified.

To change a cell's type, choose either "Code" or "Markdown" in the navigation bar at the top. Alternatively, you can press either the Y or M key in command mode.

Sometimes, a code cell starts with an exclamation mark !. Then, the Jupyter notebook behaves as if the following command were typed directly into a terminal. The cell below asks the python CLI to show its version number and is not Python code but a command in the Shell language. The ! is useful to execute short CLI commands without leaving a Jupyter notebook.

In [3]:

!python --version

Python 3.12.2
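The same information is also available from within Python itself, without going through the shell. The following is a minimal sketch using the standard library's platform module; the variable name version is chosen here for illustration only, and the printed value depends on the installation at hand.

```python
import platform

# Ask the running interpreter for its own version as a string
# of the form "major.minor.patchlevel".
version = platform.python_version()

print(version)
```

In contrast to the ! approach, this reports the version of the interpreter that executes the notebook, which may differ from the python command found by the shell.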

Programming vs. Computer Science vs. IT¶

In this book, programming is defined as

a structured way of problem-solving
by expressing the steps of a computation or process
and thereby documenting the process in a formal way.

Programming is always concrete and based on a particular case. It exhibits elements of an art or a craft as we hear programmers call code "beautiful" or "ugly" or talk about the "expressive" power of an application.
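To make the definition above more tangible, here is a small sketch of what "expressing the steps of a computation" can look like in Python. The task (averaging the even numbers in a list) and all names are assumptions made up for this illustration.

```python
# A concrete case: a particular list of whole numbers.
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

# Step 1: Go through the numbers one by one and keep only the even ones.
evens = []
for number in numbers:
    if number % 2 == 0:
        evens.append(number)

# Step 2: Divide the sum of the even numbers by their count.
average = sum(evens) / len(evens)

print(average)  # prints 7.0
```

The code solves the problem and, at the same time, documents in a formal way how the result is obtained.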

That is different from computer science, which is

a field of study comparable to applied mathematics that
asks abstract questions (e.g., "Is something computable at all?"),
develops and analyses algorithms and data structures,
and proves the correctness of a program.

In a sense, a computer scientist does not need to know a programming language to work, and many computer scientists only know how to produce "ugly" looking code in the eyes of professional programmers.

IT or information technology is a term that has many meanings to different people. Often, it has something to do with hardware or physical devices, both of which are out of scope for programmers and computer scientists. Sometimes, it refers to a support function within a company. Many computer scientists and programmers are more than happy if their printer and internet connection work as they often do not know a lot more about that than "non-technical" people.

Why Python?¶

What is Python?¶

Here is a brief history of and some background on Python (cf., also this TechRepublic article for a more elaborate story):

Guido van Rossum (Python’s Benevolent Dictator for Life) was bored during a week around Christmas 1989 and started Python as a hobby project "that would keep [him] occupied" for some days
the idea was to create a general-purpose scripting language that would allow fast prototyping and would run on every operating system
Python grew through the 90s as van Rossum promoted it via his "Computer Programming for Everybody" initiative that had the goal to encourage a basic level of coding literacy as an equal knowledge alongside English literacy and math skills
to become more independent from its creator, the next major version Python 2 - released in 2000 and officially supported until 2020 - was open-source from the get-go, which attracted a large and global community of programmers that contributed their expertise and best practices in their free time to make Python even better
Python 3 resulted from a significant overhaul of the language in 2008 taking into account the learnings from almost two decades, streamlining the language, and getting ready for the age of big data
the language is named after the sketch comedy group Monty Python

Summary¶

Python is a general-purpose programming language that allows for fast development, is easy to read, open-source, long-established, unifies the knowledge of hundreds of thousands of experts around the world, runs on basically every machine, and can handle the complexities of applications involving big data.

Why open-source?¶

Couldn't a company like Google, Facebook, or Microsoft come up with a better programming language? The following is an argument on why this can likely not be the case.

Wouldn't it be weird if professors and scholars of English literature and language studies dictated how we'd have to speak in day-to-day casual conversations or how authors of poetry and novels should use language constructs to achieve a particular type of mood? If you agree with that premise, it makes sense to assume that even programming languages should evolve in a "natural" way as users use the language over time and in new and unpredictable contexts creating new conventions.

Loose communities are the primary building block around which open-source software projects are built. Someone - like Guido - starts a project and makes it free to use for anybody (e.g., on a code-sharing platform like GitHub). People find it useful enough to solve one of their daily problems and start using it. They see how a project could be improved and provide new use cases (e.g., via the popularized concept of a pull request). The project grows both in lines of code and people using it. After a while, people start local user groups to share their common interests and meet regularly (e.g., this is a big market for companies like Meetup or non-profits like PyData). Out of these local and usually monthly meetups grow yearly conferences on the country or even continental level (e.g., the original PyCon in the US, EuroPython, or PyCon.DE). The content presented at these conferences is made publicly available via GitHub and YouTube (e.g., PyCon 2019 or EuroPython) and serves as a reference on what people are working on and as an introduction to the endless number of specialized fields.

While these communities are somewhat loose and continuously changing, smaller in-groups, often democratically organized and elected (e.g., the Python Software Foundation), take care of, for example, the development of the "core" Python language itself.

Python itself is just a specification (i.e., a set of rules) as to what is allowed and what is not: It must first be implemented (cf., the next section below). The current version of Python can always be looked up in the Python Language Reference. To make changes to that, anyone can make a so-called Python Enhancement Proposal, or PEP for short, where it needs to be specified what exact changes are to be made and argued why that is a good thing to do. These PEPs are reviewed by the core developers and interested people and are then either accepted, modified, or rejected if, for example, the change introduces internal inconsistencies. This process is similar to the double-blind peer review established in academia, just a lot more transparent. Many of the contributors even held or hold positions in academia, one more indicator of the high quality standards in the Python community. To learn more about PEPs, check out PEP 1, which describes the entire process.

In total, no single entity can control how the language evolves, and the users' needs and ideas always feed back to the language specification via a quality-controlled and "democratic" process.

Besides being free as in "free beer," a major benefit of open-source is that one can always look up how something works in detail: That is the literal meaning of open source and a difference from commercial languages (e.g., MATLAB) as a programmer can always continue to study best practices or find out how things are implemented. Along the way, many errors are uncovered, as well. Furthermore, if one runs an open-source application, one can be reasonably sure that nobody built in a "backdoor." Free software is consequently free of charge but brings many other freedoms with it, most notably the freedom to change the code.
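Looking up how something works can even be done from within Python. The sketch below uses the standard library's inspect module to print the first lines of the source code of statistics.mean, a function chosen arbitrarily for this illustration. It assumes a regular CPython installation where the standard library's source files are present; functions implemented in C, such as the built-in len, cannot be inspected this way.

```python
import inspect
import statistics

# statistics.mean is written in Python itself, so its source code
# ships with a regular installation and can be read like any other file.
source = inspect.getsource(statistics.mean)

# Show only the first few lines to keep the output short.
for line in source.splitlines()[:5]:
    print(line)
```

The exact text printed varies between Python versions, as the implementation is refined over time.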

Isn't C a lot faster?¶

The default Python implementation is written in the C language and called CPython. This is also what the Anaconda Distribution uses.
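Which implementation is executing a given piece of code can be checked at run time. The following is a minimal sketch using the standard library's platform module; the variable name implementation is an assumption for illustration, and the output depends on the interpreter in use.

```python
import platform

# Returns the name of the implementation that runs this code,
# for example, "CPython" for the default implementation.
implementation = platform.python_implementation()

print(implementation)
```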

C and C++ (cf., this introduction) are widespread and long-established (i.e., since the 1970s) programming languages employed in many mission-critical software systems (e.g., operating systems themselves, low latency databases and web servers, nuclear reactor control systems, airplanes, ...). They are fast, mainly because the programmer not only needs to come up with the business logic but also manage the computer's memory.

In contrast, Python automatically manages the memory for the programmer. So, speed here is a trade-off between application run time and engineering/development time. Often, the program's run time is not that important: For example, what if C needs 0.001 seconds in a case where Python needs 0.1 seconds to do the same thing? When the requirements change and computing speed becomes an issue, the Python community offers many third-party libraries - usually also written in C - where specific problems can be solved in near-C time.
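The trade-off can be observed with the standard library alone. The sketch below adds up the same numbers twice: once with a loop written in Python and once with the built-in sum(), which is implemented in C in CPython. The function name python_loop_sum, the list size, and the number of repetitions are arbitrary choices for this illustration, and the measured times depend entirely on the machine, so no particular speed-up is claimed here.

```python
import timeit

numbers = list(range(100_000))


def python_loop_sum(values):
    """Add up the values one by one in a plain Python loop."""
    total = 0
    for value in values:
        total += value
    return total


# Both approaches must arrive at the same result.
loop_result = python_loop_sum(numbers)
builtin_result = sum(numbers)

# Time 20 repetitions of each approach.
loop_time = timeit.timeit(lambda: python_loop_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)

print(f"Python loop: {loop_time:.4f} seconds")
print(f"Built-in sum(): {builtin_time:.4f} seconds")
```

Third-party libraries follow the same idea on a larger scale: The time-critical inner loops run in compiled code while the programmer continues to write Python.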

Summary¶

While it is true that a language like C is a lot faster than Python when it comes to pure computation time, this does not matter in many cases as development time, which Python shortens considerably, is the more significant cost factor in a rapidly changing world.

Who uses it?¶

While arguments from authority are usually not the best kind of reasoning, we briefly look at some examples of who uses Python and leave it up to the reader to decide if this is convincing or not:

Massachusetts Institute of Technology
- teaches Python in its introductory course to computer science independent of the student's major
- replaced the infamous course on the Scheme language (cf., source)
Google
- used the strategy "Python where we can, C++ where we must" from its early days on to stay flexible in a rapidly changing environment (cf., source)
- the very first web-crawler was written in Java and so difficult to maintain that it was rewritten in Python right away (cf., source)
- Guido van Rossum was employed by Google from 2005 to 2012 to advance the language there
NASA open-sources many of its projects, often written in Python and regarding analyses with big data (cf., source)
Facebook uses Python besides C++ and its legacy PHP (a language for building websites; the "cool kid" from the early 2000s)
Instagram operates the largest installation of the popular web framework Django (cf., source)
Spotify bases its data science on Python (cf., source)
Netflix also runs its predictive models on Python (cf., source)
Dropbox "stole" Guido van Rossum from Google to help scale the platform (cf., source)
JPMorgan Chase requires new employees to learn Python as part of the onboarding process starting with the 2018 intake (cf., source)

As images tell more than words, here are two plots of popular languages' "market shares" based on the number of questions asked on Stack Overflow, the most relevant platform for answering programming-related questions: As of late 2017, Python surpassed Java, heavily used in large corporations, and JavaScript, the "language of the internet" that does everything in web browsers, in popularity. Two blog posts from "technical" people explain this in more depth to the layman: Stack Overflow and DataCamp.

As the graph below shows, neither Google's very own language Go nor R, a domain-specific language in the niche of statistics, can compete with Python's year-to-year growth.

IEEE Spectrum provides a more recent comparison of programming languages' popularity. Even news and media outlets notice the recent popularity of Python: Economist, Huffington Post, TechRepublic, and QZ.

How to learn Programming¶

ABC Rule¶

Always be coding.

Programming is more than just writing code into a text file. It means reading through parts of the documentation, blogs with best practices, and tutorials, or researching problems on Stack Overflow while trying to implement features in the application at hand. Also, it means using command-line tools to automate some part of the work or manage different versions of a program, for example, with git. In short, programming involves a lot of "muscle memory," which can only be built and kept up through near-daily usage.

Further, many aspects of software architecture and best practices can only be understood after having implemented some requirements for the very first time. Coding also means "breaking" things to find out what makes them work in the first place.

Therefore, coding is learned best by just doing it for some time on a daily or at least a regular basis and not right before some task is due, just like learning a "real" language.

The Maker's Schedule¶

Y Combinator co-founder Paul Graham wrote a very popular and often cited article in which he assigns every person to one of two groups:

Managers: People that need to organize things and command others (e.g., a "boss" or manager). Their schedule is usually organized by the hour or even 30-minute intervals.
Makers: People that create things (e.g., programmers, artists, or writers). Such people think in half days or full days.

Have you ever wondered why so many tech people work during nights and sleep at "weird" times? The reason is that many programming-related tasks require a "flow" state in one's mind that is hard to achieve when one can get interrupted, even if it is only for one short question. Graham describes that only knowing that one has an appointment in three hours can cause a programmer to not get into a flow state.

As a result, do not set aside a fixed amount of time for learning something but rather reserve an entire evening or a rainy Sunday where you can work on a problem in an open-ended setting. And do not be surprised anymore to hear "I looked at it over the weekend" from a programmer.

Phase Iteration¶

When asked how to learn programming, most programmers give an answer that falls into one of two broad groups.

1) Toy Problem, Case Study, or Prototype: Pick some problem, break it down into smaller sub-problems, and solve them with an end in mind.

2) Books, Video Tutorials, and Courses: Research the best book, blog, video, or tutorial for something and work it through from start to end.

The truth is that you need to iterate between these two phases.

Building a prototype always reveals issues no book or tutorial can think of before. Data is never as clean as it should be. An algorithm from a textbook must be adapted to a peculiar aspect of a case study. It is essential to learn to "ship a product" because only then will one have looked at all the aspects.

The major downside of this approach is that one likely learns bad "patterns" overfitted to the case at hand, and one does not get the big picture or mental concepts behind a solution. This gap can be filled in by well-written books: For example, check the Python/programming books offered by Packt or O’Reilly.

Contents¶

Part A: Expressing Logic

What is a programming language? What kind of words exist?
- Chapter 1: Elements of a Program
- Chapter 2: Functions & Modularization
What is the flow of execution? How can we form sentences from words?
- Chapter 3: Conditionals & Exceptions
- Chapter 4: Recursion & Looping

Part B: Managing Data and Memory

How is data stored in memory?
- Chapter 5: Numbers & Bits
- Chapter 6: Text & Bytes
- Chapter 7: Sequential Data
- Chapter 8: Map, Filter, & Reduce
- Chapter 9: Mappings & Sets
- Chapter 10: Arrays & Dataframes
How can we create custom data types?
- Chapter 11: Classes & Instances

xkcd Comic¶

As with every good book, there has to be an xkcd comic somewhere.

In [4]:

import antigravity