For years I used SAS® software in my professional career. I have always been impressed with its flexibility, its ability to manage large data sets, and its ability to take input from just about any source. These characteristics help to explain its ubiquity in businesses throughout the world. In the early days of data analysis, business users had few choices of tools for ad-hoc analysis.
One of the tools available then was Base SAS® software. For clarity, we will use SAS and Base SAS to refer to the language as opposed to the company, SAS Institute. Back then, there was no concept of self-service, so users were mostly left to queue their requests for analysis and reports with a central IT group. Eventually, a few intrepid users discovered that SAS software was used for mainframe capacity planning. Mainframes were the dominant systems back then, and their costs were large enough to warrant the practice of analyzing utilization. SAS was the primary software used for this activity. CEOs needed to know when they were obliged to deliver a substantial capital expenditure check to IBM.
By taking matters into their own hands, users learned the mechanics of submitting SAS batch jobs (no interactive processing back then) and soon discovered they too could access data, munge it, and produce the sorts of reports and analysis that had meaningful impacts. As the number of computing platforms expanded throughout the 1980s and 1990s, SAS became available for them as well. All of this led to a substantial number of SAS users.
All of this sounds quaint in light of today’s ability to visit a web page and, by simply clicking a few buttons, spin up a cluster of hundreds or even thousands of machines with an enormous number of open-source (and proprietary) software components. All of this is available in a matter of minutes with just a credit card. A lot has changed since then. And that’s my motivation for writing these examples, in the spirit of learning additional ways to analyze data.
There are plenty of substantive open source software projects out there for data scientists, so why choose Python? After all, there is R. R is a robust and well-supported language written initially by statisticians for statisticians. The aim is not to promote one solution over the other. The goal is to illustrate how adding Python to a SAS user’s skill set can broaden one’s range of capabilities. And besides, Bob Muenchen has already written R for SAS and SPSS Users.
Python has its heritage in scientific and technical computing domains, and it has a compact syntax. The latter makes for a relatively easy language to learn, while the former means it scales to offer good performance with massive data volumes. This is one of the reasons why Google uses it so extensively and has developed an outstanding tutorial for programmers.
Another aspect both languages have in common is the wealth of information available on the web.
You would think that with a plethora of content available, learning a new language would be a straightforward proposition. But at times I experienced information overload. As I worked through examples, I was not sure, until after a good investment of time, whether what I was learning was applicable to my learning objectives.
Sure, there is learning for learning’s sake. Not every tutorial or text I read was fruitful, though most were. It was not until later in this endeavor that I realized I needed a specific context for ingesting new information.
Like most people, I want fast results. And like most SAS users, I have developed a mental model for data analysis focused on a series of iterable steps.
What I was lacking was someone to identify both the content to utilize as well as the order in which it should be consumed. I wanted to initially invest time in just those topics that I needed before getting on with the task of data analysis.
These chapters are meant to be read in order as they start with foundational concepts used to build up more complex ideas. The chapters are:
1. Introduction (This chapter)
2. Python Data Structures
3. Python Data Types and Formatting
4. Pandas, Part 1
   a. Read from .csv
   b. Inspection
   c. Missing Data Detection
   d. Missing Data Replacement
5. Pandas, Part 2
   a. Slicing
   b. Dicing
   c. Subsetting
6. Understanding Indexes
7. Understanding Datetime Arithmetic
8. Panda Readers for Data Input
A philosophical word (or two) about the merits of Python and SAS as languages. From my perspective, it is simply a question of finding the right tool for the job. Both languages have advantages and disadvantages. And since they are programming languages, their designers had to make certain tradeoffs which can manifest themselves as features or quirks, depending on one’s perspective.
The goal is to provide a quick start for users already familiar with the SAS language and enable them to become familiar with Python. The choice of which tool to utilize typically comes down to a combination of what you as a user are familiar with and the context of the problem being solved.
The approach taken is to introduce one or more concepts in Python with a description of how the program works, followed by a code cell containing the Python program. This is then followed by an example program in the SAS language to present a compare-and-contrast approach. Not every Python example has an analogous SAS example.
The Python code examples will always be inside a code cell within this notebook. The comparison SAS language example is contained inside a Raw NBConvert cell. To make the examples easy to follow, where reasonable, I have written the output to the SAS log. All of the SAS code examples are here: github url goes here. Their names follow the convention C#_python_comment_header.sas, where:
* # is the chapter number from this notebook
* python_comment_header is the comment block beginning the Python example
* .sas is the file extension
The SAS language programs were written and verified with Version: 188.8.131.52.15680 of the WPS Workbench for Windows. World Programming System offers a SAS language interpreter and can be reached at: https://worldprogramming.com/us/.
This approach is illustrated by the next 3 cells below. The analogous SAS program is called C1_Python_for_loop.sas.
The numbers contained inside the square brackets [ ] make up the elements of a Python list. In Python, a list is a data structure that holds an arbitrary collection of items. i is the loop variable; on each pass through the for loop it takes the value of the next item in the list. product holds the integer value from the arithmetic assignment of product * i. Finally, the print() function writes the output. The same program is written in SAS as shown below.
# Python for loop
numbers = [2, 4, 6, 8, 11]
product = 1
for i in numbers:
    product = product * i
print('The product is:', product)
The product is: 4224
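One detail of the loop above is worth seeing in isolation: the loop variable takes each item's value in turn, not its position in the list. The following is a minimal sketch, assuming you also want the position, using the built-in enumerate() function. The names position and value are illustrative only.

```python
# Iterate over a list, tracking both the position and the value of each item.
numbers = [2, 4, 6, 8, 11]

product = 1
for position, value in enumerate(numbers):
    # position is the zero-based index; value is the list item itself
    product = product * value
    print(position, value, product)

print('The product is:', product)
```

The running product printed on each pass shows how the final value is built up one item at a time.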
Python permits an object-oriented programming model. SAS is a procedural programming language. These examples use a procedural programming model for Python given the goal is to map SAS programming constructs into Python.
This object-oriented programming model provides a number of classes with objects being instances of the class. The Python program in the cell below illustrates the int class (integers). Variable x is an instance (object) of the int class. You can execute help(int) to read more.
Objects are said to belong to a class. Variables that belong to a class or an object are, strictly speaking, referred to as fields. Objects also have capabilities that belong to the class; these are called methods.
My early experience was that the types of the objects I created were not always obvious from the code context. I needed to know what type of object was being created. The type() function returns an object's type, as illustrated in the cell below.
Python has a number of built-in functions and types that are always available, which are documented in the Python Standard Library reference. Later on, we will see how Python expands its capabilities through importing packages (libraries).
# x is an instance of the int class
x = 201
print(x)
print(type(x))
201
<class 'int'>
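To make the vocabulary of classes, objects, fields, and methods a little more concrete, here is a small sketch. The Account class and its names (owner, balance, deposit) are hypothetical and exist only for illustration; they do not appear elsewhere in these chapters.

```python
# Hypothetical class, used only to illustrate the terms class, object, field, method.
class Account:
    def __init__(self, owner, balance):
        # owner and balance are fields (instance attributes)
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        # deposit() is a method: a function that belongs to the class
        self.balance = self.balance + amount
        return self.balance


acct = Account('Pat', 100)       # acct is an instance (object) of Account
new_balance = acct.deposit(50)   # call a method on the object

print(acct.owner, acct.balance)
print(type(acct))

# Built-in types behave the same way: int is a class with its own methods.
x = 201
print(type(x))
print(x.bit_length())            # number of bits needed to represent 201
```

Calling type() on either object reports the class it was created from, which is a quick way to check what kind of object a statement produced.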
Consider the program in the cell below. Don't worry about the syntax for now. a_list is a Python list object. A second name, b_list, is bound to the same list using the assignment:
b_list = a_list
It turns out that a_list and b_list are not merely equal; they both point to the same memory location. In other words, b_list is another reference to the object named a_list rather than a separate copy of it.
The program statement:
del a_list[0]
removes the first item from the a_list list object. When we print the list objects, you see that both have the first item removed. The effect is subtle, and not one you will likely encounter a great deal, but it is certainly a subtlety to be aware of.
# object reference example
a_list = ['elephant', 'onyx', 'zebra', 'money', 'lemur']
# b_list is another name pointing to the same object
b_list = a_list
# remove the first item in a_list
del a_list[0]
# print both lists
print('a_list is:', a_list)
print('b_list is:', b_list)
a_list is: ['onyx', 'zebra', 'money', 'lemur']
b_list is: ['onyx', 'zebra', 'money', 'lemur']
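If you want an independent list rather than a second name for the same object, one option is the list's copy() method. The sketch below is only an illustration; c_list is a name introduced here for the comparison. It uses the is operator to check whether two names refer to the same object.

```python
a_list = ['elephant', 'onyx', 'zebra', 'money', 'lemur']

b_list = a_list          # second name for the same list object
c_list = a_list.copy()   # new list object holding the same items (a shallow copy)

print(b_list is a_list)  # True: both names refer to one object
print(c_list is a_list)  # False: c_list is a separate object

del a_list[0]            # remove the first item through the name a_list

print('a_list is:', a_list)
print('b_list is:', b_list)
print('c_list is:', c_list)
```

Only c_list still holds 'elephant' after the del statement, because it was a separate object all along.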
To quote Eric Raymond: "A language that makes it hard to write elegant code makes it hard to write good code." This is from his essay entitled "Why Python", located at: http://www.linuxjournal.com/article/3882
The Python program in the cell below is the same as the one 6 cells above, with one exception. The line after the for statement is not indented. This results in the interpreter raising the error:
IndentationError: expected an indented block
Once you get over the shock of how Python imposes the indentation requirements, you will come to see how this is an important feature used to create legible and easy-to-understand code.
Notice also that there appear to be no symbols used to end a program statement. The end-of-line character is used to end a Python statement. This also helps to enforce legibility by keeping each statement on a separate physical line.
Coincidentally, like SAS, Python will also honor a semicolon as an end-of-statement terminator. However, you rarely see this, because placing multiple statements on the same physical line is considered an affront to program legibility.
# Python for loop_2

numbers = [2, 4, 6, 8, 11]
product = 1
for i in numbers:
product = product * i
print('The product is:', product)
  File "<ipython-input-1-f7f6f3436cff>", line 6
    product = product * i
          ^
IndentationError: expected an indented block
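The indentation error is raised when the program text is parsed, before any statement runs. As a rough sketch of that point, the built-in compile() function can parse a string of source code without executing it. The source string and the '<example>' file label below are made up for this illustration, and the exact wording of the error message varies between Python versions.

```python
# Source text with the body of the for loop left unindented.
source = (
    "numbers = [2, 4, 6, 8, 11]\n"
    "product = 1\n"
    "for i in numbers:\n"
    "product = product * i\n"
)

try:
    # compile() only parses the text; none of its statements are executed
    compile(source, '<example>', 'exec')
    caught = None
except IndentationError as err:
    caught = err
    print('IndentationError reported on line', err.lineno)
```

Because parsing fails, no part of the program runs, not even the assignments that precede the for statement.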
Should you find you have a line of code that needs to extend past the physical line (i.e., wrap), then use the backslash (\). This causes the Python interpreter to ignore the physical end-of-line terminator on the current line and continue scanning for the next end-of-line terminator.
# Line continuation
numbers = [2, 4, 6, 8, 11, 13, 21, \
           17, 31]
product = 1
for i in numbers:
    product = product * i
print('The product is:', product)
The product is: 607711104
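As a side note to the backslash example, Python also joins lines implicitly while a bracket is still open: anything inside [ ], ( ), or { } may span several physical lines with no backslash. The sketch below repeats the same program in that style.

```python
# Implicit line continuation: no backslash is needed inside [ ], ( ), or { }
numbers = [2, 4, 6, 8, 11, 13, 21,
           17, 31]

product = 1
for i in numbers:
    product = product * i

print('The product is:', product)
```

The backslash is still needed when a long statement has no enclosing brackets to keep it open.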
Of course, the incorrect spelling of keywords is a source of error. Unlike SAS, in Python, object names are case sensitive.
# Python object names are case sensitive
X = 201
print(x)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-5e466655f071> in <module>()
      1 X = 201
----> 2 print(x)

NameError: name 'x' is not defined
SAS keywords and variable names are case insensitive.
4  data _null_;
5
6  X = 201;
7  put x ;

201
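Returning to Python, the case-sensitivity rule means that X and x are two unrelated names. The short sketch below, written only for illustration, catches the NameError instead of letting it stop the program, and then shows both names coexisting. The variable message is introduced here just to hold the error text.

```python
X = 201

try:
    print(x)                 # lower-case x has not been assigned yet
    message = None
except NameError as err:
    message = str(err)
    print('NameError:', message)

x = 'a separate object'      # X and x are two distinct names
print(X, x)
```

In SAS, by contrast, X and x refer to the same variable, as the log above shows.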