#!/usr/bin/env python # coding: utf-8 # #

# *This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).* # # *The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!* # # *No changes were made to the contents of this notebook from the original.* # # < [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) | [Contents](Index.ipynb) | [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb) > # # Data Indexing and Selection # In [Chapter 2](02.00-Introduction-to-NumPy.ipynb), we looked in detail at methods and tools to access, set, and modify values in NumPy arrays. # These included indexing (e.g., ``arr[2, 1]``), slicing (e.g., ``arr[:, 1:5]``), masking (e.g., ``arr[arr > 0]``), fancy indexing (e.g., ``arr[0, [1, 5]]``), and combinations thereof (e.g., ``arr[:, [1, 5]]``). # Here we'll look at similar means of accessing and modifying values in Pandas ``Series`` and ``DataFrame`` objects. # If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of. # # We'll start with the simple case of the one-dimensional ``Series`` object, and then move on to the more complicated two-dimesnional ``DataFrame`` object. # ## Data Selection in Series # # As we saw in the previous section, a ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. # If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays. # ### Series as dictionary # # Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values: # In[1]: import pandas as pd data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']) data # In[2]: data['b'] # We can also use dictionary-like Python expressions and methods to examine the keys/indices and values: # In[3]: 'a' in data # In[4]: data.keys() # In[5]: list(data.items()) # ``Series`` objects can even be modified with a dictionary-like syntax. # Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value: # In[6]: data['e'] = 1.25 data # This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues. # ### Series as one-dimensional array # A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*. # Examples of these are as follows: # In[7]: # slicing by explicit index data['a':'c'] # In[8]: # slicing by implicit integer index data[0:2] # In[9]: # masking data[(data > 0.3) & (data < 0.8)] # In[10]: # fancy indexing data[['a', 'e']] # Among these, slicing may be the source of the most confusion. # Notice that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is *included* in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index is *excluded* from the slice. # ### Indexers: loc, iloc, and ix # # These slicing and indexing conventions can be a source of confusion. # For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index. # In[11]: data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5]) data # In[12]: # explicit index when indexing data[1] # In[13]: # implicit index when slicing data[1:3] # Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes. # These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``. # # First, the ``loc`` attribute allows indexing and slicing that always references the explicit index: # In[14]: data.loc[1] # In[15]: data.loc[1:3] # The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index: # In[16]: data.iloc[1] # In[17]: data.iloc[1:3] # A third indexing attribute, ``ix``, is a hybrid of the two, and for ``Series`` objects is equivalent to standard ``[]``-based indexing. # The purpose of the ``ix`` indexer will become more apparent in the context of ``DataFrame`` objects, which we will discuss in a moment. # # One guiding principle of Python code is that "explicit is better than implicit." # The explicit nature of ``loc`` and ``iloc`` make them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention. # ## Data Selection in DataFrame # # Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index. # These analogies can be helpful to keep in mind as we explore data selection within this structure. # ### DataFrame as a dictionary # # The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects. # Let's return to our example of areas and populations of states: # In[18]: area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}) pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}) data = pd.DataFrame({'area':area, 'pop':pop}) data # The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name: # In[19]: data['area'] # Equivalently, we can use attribute-style access with column names that are strings: # In[20]: data.area # This attribute-style column access actually accesses the exact same object as the dictionary-style access: # In[21]: data.area is data['area'] # Though this is a useful shorthand, keep in mind that it does not work for all cases! # For example, if the column names are not strings, or if the column names conflict with methods of the ``DataFrame``, this attribute-style access is not possible. # For example, the ``DataFrame`` has a ``pop()`` method, so ``data.pop`` will point to this rather than the ``"pop"`` column: # In[22]: data.pop is data['pop'] # In particular, you should avoid the temptation to try column assignment via attribute (i.e., use ``data['pop'] = z`` rather than ``data.pop = z``). # # Like with the ``Series`` objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column: # In[23]: data['density'] = data['pop'] / data['area'] data # This shows a preview of the straightforward syntax of element-by-element arithmetic between ``Series`` objects; we'll dig into this further in [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb). # ### DataFrame as two-dimensional array # # As mentioned previously, we can also view the ``DataFrame`` as an enhanced two-dimensional array. # We can examine the raw underlying data array using the ``values`` attribute: # In[24]: data.values # With this picture in mind, many familiar array-like observations can be done on the ``DataFrame`` itself. # For example, we can transpose the full ``DataFrame`` to swap rows and columns: # In[25]: data.T # When it comes to indexing of ``DataFrame`` objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. # In particular, passing a single index to an array accesses a row: # In[26]: data.values[0] # and passing a single "index" to a ``DataFrame`` accesses a column: # In[27]: data['area'] # Thus for array-style indexing, we need another convention. # Here Pandas again uses the ``loc``, ``iloc``, and ``ix`` indexers mentioned earlier. # Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result: # In[28]: data.iloc[:3, :2] # Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names: # In[29]: data.loc[:'Illinois', :'pop'] # The ``ix`` indexer allows a hybrid of these two approaches: # In[30]: data.ix[:3, :'pop'] # Keep in mind that for integer indices, the ``ix`` indexer is subject to the same potential sources of confusion as discussed for integer-indexed ``Series`` objects. # # Any of the familiar NumPy-style data access patterns can be used within these indexers. # For example, in the ``loc`` indexer we can combine masking and fancy indexing as in the following: # In[31]: data.loc[data.density > 100, ['pop', 'density']] # Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy: # In[32]: data.iloc[0, 2] = 90 data # To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple ``DataFrame`` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches. # ### Additional indexing conventions # # There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. # First, while *indexing* refers to columns, *slicing* refers to rows: # In[33]: data['Florida':'Illinois'] # Such slices can also refer to rows by number rather than by index: # In[34]: data[1:3] # Similarly, direct masking operations are also interpreted row-wise rather than column-wise: # In[35]: data[data.density > 100] # These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice. # # < [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) | [Contents](Index.ipynb) | [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb) >