Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.
This book is a thorough introduction to programming in Python.
It teaches the concepts behind and the syntax of the core Python language as defined by the Python Software Foundation in the official language reference. Furthermore, it introduces commonly used functionalities from the standard library and popular third-party libraries like numpy, pandas, matplotlib, and others.
There are no prerequisites for reading this book.
The main goal of this introduction is to prepare the student for further studies in the "field" of data science.
The term data science is rather vague and does not refer to an academic discipline. Instead, the term was popularized by the tech industry, which also coined meaningless job titles such as "rockstar" or "ninja developers." Most serious definitions describe the field as multi-disciplinary, integrating scientific methods, algorithms, and systems thinking to extract knowledge from structured and unstructured data, and also emphasize the importance of domain knowledge.
Recently, this integration aspect has started to feed back into the academic world. MIT, for example, created the new Stephen A. Schwarzman College of Computing for artificial intelligence with a 1 billion dollar initial investment, where students undergo a "bilingual" curriculum with half the classes in quantitative and method-centric fields and the other half in domains such as biology, business, chemistry, politics, (art) history, or linguistics (cf., the official Q&As or this NYT article). Their strategists see a future where programming skills are just as naturally embedded into students' curricula as subjects like calculus, statistics, or academic writing are today. Then, programming literacy is not just another "nice to have" skill but a prerequisite, or an enabler, to understanding more advanced topics in the actual domains studied.
To "read" this book in the most meaningful way, a working installation of Python 3.8 with JupyterLab is needed.
For a tutorial on how to install Python on your computer, follow the instructions in the README.md file in the project's GitHub repository. If you cannot install Python on your own machine, you may open the book interactively in the cloud with Binder.
The document you are viewing is a so-called Jupyter notebook, a file format introduced by the Jupyter Project.
"Jupyter" is an acronym derived from the names of the three major programming languages Julia, Python, and R, all of which play significant roles in the world of data science. The Jupyter Project's idea is to serve as an integrating platform such that different programming languages and software packages can be used together within the same project.
Jupyter notebooks have become a de-facto standard for communicating and exchanging results in the data science community - both in academia and business - and provide an alternative to command-line interface (CLI or "terminal") based ways of running Python code. As an example of the latter case, we could start the default Python interpreter that comes with every installation by typing the python command into a CLI (or poetry run python if the project is managed with the poetry CLI tool as explained in the README.md file). Then, as the screenshot below shows, we could execute Python code like 1 + 2 or print("Hello World") line by line simply by typing it following the >>> prompt and pressing the Enter key. For an introductory course, however, this would be rather tedious and probably scare off many beginners.
One reason for the popularity of Jupyter notebooks is that they allow mixing text with code in the same document. Text may be formatted with the Markdown language and mathematical formulas typeset with LaTeX. Moreover, we may include pictures, plots, and even videos. Because of these features, the notebooks developed for this book come in a self-contained "tutorial" style enabling students to simply read them from top to bottom while executing the code snippets.
Other ways of running Python code are to use the IPython CLI tool instead of the default interpreter or a full-fledged Integrated Development Environment (e.g., the commercial PyCharm or the free Spyder that comes with the Anaconda Distribution).
A Jupyter notebook consists of cells that have a type associated with them. So far, only cells of type "Markdown" have been used, which is the default way to present formatted text.
The cells below are examples of "Code" cells containing actual Python code: They calculate the sum of 1 and 2 and print out "Hello World" when executed, respectively. To edit an existing code cell, enter into it with a mouse click. You are "in" a code cell if its frame is highlighted in blue. We call that the edit mode.
There is also a command mode that you reach by hitting the Escape key. That un-highlights the frame. You are now "out" of but still "on" the cell. If you were already in command mode, hitting the Escape key does nothing.
Using the Enter and Escape keys, you can now switch between the two modes.
To execute, or "run," a code cell, hold down the Control key and press Enter. Note how you do not go to the subsequent cell if you keep re-executing the cell you are on. Alternatively, you can hold the Shift key and press Enter, which executes a cell and places your focus on the subsequent cell or creates a new one if there is none.
1 + 2
3
print("Hello World")
Hello World
Similarly, a Markdown cell is also in either edit or command mode. For example, double-click on the text you are reading: This puts you into edit mode. Now, you could change the formatting (e.g., print a word in italics or bold) and "execute" the cell to render the text as specified.
To change a cell's type, choose either "Code" or "Markdown" in the navigation bar at the top. Alternatively, you can press either the Y or M key in command mode.
Sometimes, a code cell starts with an exclamation mark !. Then, the Jupyter notebook behaves as if the following command were typed directly into a terminal. The cell below asks the python CLI to show its version number and is not Python code but a command in the Shell language. The ! is useful to execute short CLI commands without leaving a Jupyter notebook.
!python --version
Python 3.12.2
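For comparison, a rough Python-side counterpart of the cell above can be sketched with the standard library's subprocess module. This is only an illustration, assuming that sys.executable points to the interpreter running the notebook; it is not meant to describe how Jupyter itself implements the ! shortcut.

```python
import subprocess
import sys

# Run "<current interpreter> --version" as a separate process, similar in
# spirit to what a cell starting with "!" hands over to the terminal.
result = subprocess.run(
    [sys.executable, "--version"],
    capture_output=True,
    text=True,
)

# The version string is normally written to standard output;
# fall back to standard error just in case.
version_output = (result.stdout or result.stderr).strip()
print(version_output)
```

The printed text depends on the installed interpreter, so it may differ from the version shown above.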
In this book, programming is defined as
Programming is always concrete and based on a particular case. It exhibits elements of an art or a craft as we hear programmers call code "beautiful" or "ugly" or talk about the "expressive" power of an application.
That is different from computer science, which is
In a sense, a computer scientist does not need to know a programming language to work, and many computer scientists only know how to produce "ugly" looking code in the eyes of professional programmers.
IT or information technology is a term that has many meanings to different people. Often, it has something to do with hardware or physical devices, both of which are out of scope for programmers and computer scientists. Sometimes, it refers to a support function within a company. Many computer scientists and programmers are more than happy if their printer and internet connection work as they often do not know a lot more about that than "non-technical" people.
Here is a brief history of and some background on Python (cf., also this TechRepublic article for a more elaborate story):
Python is a general-purpose programming language that allows for fast development, is easy to read, open-source, long-established, unifies the knowledge of hundreds of thousands of experts around the world, runs on basically every machine, and can handle the complexities of applications involving big data.
Couldn't a company like Google, Facebook, or Microsoft come up with a better programming language? The following is an argument on why this can likely not be the case.
Wouldn't it be weird if professors and scholars of English literature and language studies dictated how we'd have to speak in day-to-day casual conversations or how authors of poetry and novels should use language constructs to achieve a particular type of mood? If you agree with that premise, it makes sense to assume that even programming languages should evolve in a "natural" way as users use the language over time and in new and unpredictable contexts creating new conventions.
Loose communities are the primary building block around which open-source software projects are built. Someone - like Guido - starts a project and makes it free to use for anybody (e.g., on a code-sharing platform like GitHub). People find it useful enough to solve one of their daily problems and start using it. They see how a project could be improved and provide new use cases (e.g., via the popularized concept of a pull request). The project grows both in lines of code and people using it. After a while, people start local user groups to share their common interests and meet regularly (e.g., this is a big market for companies like Meetup or non-profits like PyData). Out of these local and usually monthly meetups grow yearly conferences on the country or even continental level (e.g., the original PyCon in the US, EuroPython, or PyCon.DE). The content presented at these conferences is made publicly available via GitHub and YouTube (e.g., PyCon 2019 or EuroPython) and serves as a reference for what people are working on and as an introduction to the endless number of specialized fields.
While these communities are somewhat loose and continuously changing, smaller in-groups, often democratically organized and elected (e.g., the Python Software Foundation), take care of, for example, the development of the "core" Python language itself.
Python itself is just a specification (i.e., a set of rules) as to what is allowed and what is not: It must first be implemented (cf., next section below). The current version of Python can always be looked up in the Python Language Reference. To make changes to that, anyone can make a so-called Python Enhancement Proposal, or PEP for short, where it needs to be specified what exact changes are to be made and argued why that is a good thing to do. These PEPs are reviewed by the core developers and interested people and are then either accepted, modified, or rejected if, for example, the change introduces internal inconsistencies. This process is similar to the double-blind peer review established in academia, just a lot more transparent. Many of the contributors even held or hold positions in academia, one more indicator of the high quality standards in the Python community. To learn more about PEPs, check out PEP 1, which describes the entire process.
In total, no one single entity can control how the language evolves, and the users' needs and ideas always feed back to the language specification via a quality controlled and "democratic" process.
Besides being free as in "free beer," a major benefit of open-source is that one can always look up how something works in detail: That is the literal meaning of open source and what sets it apart from commercial languages (e.g., MATLAB), as a programmer can always continue to study best practices or find out how things are implemented. Along the way, many errors are uncovered as well. Furthermore, if one runs an open-source application, one can be reasonably sure that no bad people built in a "backdoor." Free software is consequently free of charge but brings many other freedoms with it, most notably the freedom to change the code.
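As a small illustration of looking up how something works, the sketch below uses the standard library's inspect module to fetch the source code of statistics.mean. The choice of statistics.mean is arbitrary, and the example assumes that the standard library's .py files are installed alongside the interpreter, which is usually the case; functions implemented in C cannot be displayed this way.

```python
import inspect
import statistics

# Fetch the source code of a function that is itself written in Python.
source = inspect.getsource(statistics.mean)

# Show only the first line (the function's signature) to keep the output short.
first_line = source.splitlines()[0]
print(first_line)
```

Printing source instead of first_line shows the whole function, including its docstring.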
The default Python implementation is written in the C language and called CPython. This is also what the Anaconda Distribution uses.
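Which implementation is executing the code at hand can be checked from within Python itself. The snippet below is a minimal sketch using the standard library's platform module; the reported name depends on the installation (e.g., "CPython" for the default implementation).

```python
import platform

# Name of the implementation running this code.
implementation = platform.python_implementation()

# Version of the language that this implementation provides.
version = platform.python_version()

print(f"{implementation} {version}")
```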
C and C++ (cf., this introduction) are widespread and long-established (i.e., since the 1970s) programming languages employed in many mission-critical software systems (e.g., operating systems themselves, low latency databases and web servers, nuclear reactor control systems, airplanes, ...). They are fast, mainly because the programmer not only needs to come up with the business logic but also manage the computer's memory.
In contrast, Python automatically manages the memory for the programmer. So, speed here is a trade-off between application run time and engineering/development time. Often, the program's run time is not that important: For example, what if C needs 0.001 seconds in a case where Python needs 0.1 seconds to do the same thing? When the requirements change and computing speed becomes an issue, the Python community offers many third-party libraries - usually also written in C - where specific problems can be solved in near-C time.
While it is true that a language like C is a lot faster than Python when it comes to pure computation time, this does not matter in many cases as the significantly shorter development cycles are the more significant cost factor in a rapidly changing world.
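To get a feeling for this trade-off, the sketch below times two ways of adding up the same numbers with the standard library's timeit module: an explicit loop written in Python and the built-in sum(), which is implemented in C in CPython. The built-in stands in here for the C-based third-party libraries mentioned above so that the example needs no extra installation; the function names loop_sum and builtin_sum are made up for this illustration, and the measured times vary from machine to machine.

```python
import timeit


def loop_sum(n):
    """Add up 0, 1, ..., n - 1 with an explicit Python-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def builtin_sum(n):
    """Compute the same result, but let the built-in sum() do the looping."""
    return sum(range(n))


n = 100_000

# Run each variant 20 times and record the total time in seconds.
loop_time = timeit.timeit(lambda: loop_sum(n), number=20)
builtin_time = timeit.timeit(lambda: builtin_sum(n), number=20)

print(f"explicit loop: {loop_time:.4f} s")
print(f"built-in sum:  {builtin_time:.4f} s")
```

On CPython, the built-in variant typically finishes faster, even though both functions return the same number.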
While appeals to authority are usually not the best kind of reasoning, we briefly look at some examples of who uses Python and leave it up to the reader to decide if this is convincing or not:
As images tell more than words, here are two plots of popular languages' "market shares" based on the number of questions asked on Stack Overflow, the most relevant platform for answering programming-related questions: As of late 2017, Python surpassed Java, heavily used in big corporations, and JavaScript, the "language of the internet" that does everything in web browsers, in popularity. Two blog posts from "technical" people explain this in more depth to the layman: Stack Overflow and DataCamp.
IEEE Spectrum provides a more recent comparison of programming languages' popularity. Even news and media outlets notice the recent popularity of Python: Economist, Huffington Post, TechRepublic, and QZ.
Always be coding.
Programming is more than just writing code into a text file. It means reading through parts of the documentation, blogs with best practices, and tutorials, or researching problems on Stack Overflow while trying to implement features in the application at hand. Also, it means using command-line tools to automate some part of the work or manage different versions of a program, for example, with git. In short, programming involves a lot of "muscle memory," which can only be built and kept up through near-daily usage.
Further, many aspects of software architecture and best practices can only be understood after having implemented some requirements for the very first time. Coding also means "breaking" things to find out what makes them work in the first place.
Therefore, coding is learned best by just doing it for some time on a daily or at least a regular basis and not right before some task is due, just like learning a "real" language.
Y Combinator co-founder Paul Graham wrote a very popular and often cited article in which he sorts every person into one of two groups:
Have you ever wondered why so many tech people work during nights and sleep at "weird" times? The reason is that many programming-related tasks require a "flow" state in one's mind that is hard to achieve when one can get interrupted, even if it is only for one short question. Graham describes that only knowing that one has an appointment in three hours can cause a programmer to not get into a flow state.
As a result, do not set aside a fixed amount of time for learning something but rather plan for an entire evening or a rainy Sunday where you can work on a problem in an open-ended setting. And do not be surprised anymore to hear "I looked at it over the weekend" from a programmer.
When asked the above question, most programmers give an answer that falls into one of two broader groups.
1) Toy Problem, Case Study, or Prototype: Pick some problem, break it down into smaller sub-problems, and solve them with an end in mind.
2) Books, Video Tutorials, and Courses: Research the best book, blog, video, or tutorial for something and work it through from start to end.
The truth is that you need to iterate between these two phases.
Building a prototype always reveals issues no book or tutorial can think of before. Data is never as clean as it should be. An algorithm from a textbook must be adapted to a peculiar aspect of a case study. It is essential to learn to "ship a product" because only then will one have looked at all the aspects.
The major downside of this approach is that one likely learns bad "patterns" overfitted to the case at hand, and one does not get the big picture or mental concepts behind a solution. This gap can be filled in by well-written books: For example, check the Python/programming books offered by Packt or O’Reilly.
Part A: Expressing Logic
Part B: Managing Data and Memory
As with every good book, there has to be an xkcd comic somewhere.
import antigravity