Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
The primary data structures of pandas, Series (1-dimensional), DataFrame (2-dimensional), and Panel (3-dimensional) handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
import numpy as np
import pandas as pd
np.__version__, pd.__version__
Creating a Series by passing a list of values, letting pandas create a default integer index
pd.Series([1,3,5,np.nan,6,8])
Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns
dates = pd.date_range('20130101',periods=6)
Creating a DataFrame by passing a dict of objects that can be converted to series-like:
df = pd.DataFrame({"A": 1.,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(xrange(4)), dtype="float32"),
"D": np.arange(10, 14),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo"})
df.dtypes
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list("ABCD"))
df.head()
df.tail(2)
df.index
df.columns
df.values
df.describe()
df.sort(columns="B")
df["A"]
df.A
df[1:3]
The Python and NumPy indexing operators []
and attribute operator .
provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits.
.loc
is strictly label based (both endpoints are included).iloc
is strictly integer position based (half-open interval).ix
supports mixed integer and label based accessdf.loc[:, ["B", "C"]]
df.loc["20130102": "20130104", ["B", "C"]]
df.iloc[1:3]
df.iloc[:, 1:3]
df.iloc[3:5, -1:1:-1]
df.iloc[[1, 2, 4], [0, 2]]
df[df.A > 0]
df > 0
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1 # add a new column, automatically expands
df
# setting values by label
df.loc[dates[0], "A"] = 0
df
# setting values by position
df.iloc[0, 1] = 0
df
pandas primarily uses the value np.nan
to represent missing data. By default, it is ignored in computations it's involved in.
# drop any rows that have missing data
df.dropna()
# fill in missing data
df.fillna(value=42)
df.isnull()
df.mean() # same as df.mean(0)
df.mean(1)
df["F"].value_counts() # histogram
df.apply(np.cumsum) # along axis=0
df.apply(lambda x: x.max() - x.min()) # calculate the range
df = pd.DataFrame(np.random.randn(10, 4))
df
pieces = [df[:3], df[3:7], df[7:]] # break it into pieces
pd.concat(pieces) # concatenates rows
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
left
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
right
pd.merge(left, right, on="key") # this is an SQL style merge / join
df = pd.DataFrame(np.random.randn(8, 4), columns=["A", "B", "C", "D"])
s = df.iloc[3]
df.append(s, ignore_index=True) # repeat row 3 as a new row
But beware:
df = pd.DataFrame(np.random.randn(8, 4), columns=["A", "B", "C", "D"])
s = df.iloc[3]
df2 = df.append(s) # repeat row 3 "as is" - probably not what you want!
df2
df2.index[-1]
By “group by” we are referring to a process involving one or more of the following steps
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar",
"foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three",
"two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8)})
df
df.groupby("A").sum()
df.groupby(["A", "B"]).sum() # hierarchical
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "f"]})
df
df["grade"] = df["raw_grade"].astype("category") # convert the raw grades to a categorical data type
df["grade"]
df["grade"].cat.categories
df["grade"].cat.categories = ["very good", "good", "very bad"] # Rename the categories to more meaningful names
df["grade"]
try:
df.loc[0, "grade"] = "acceptable"
except:
print("This won't work...")
%matplotlib inline
from matplotlib import pyplot as plt
df = pd.DataFrame(np.random.randn(100, 4),
columns=["A", "B", "C", "D"])
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc="best")
df.to_csv("foo.csv")
df2 = pd.read_csv("foo.csv")
df2.loc[:10]
df = pd.DataFrame({"AAA": [4, 5, 6, 7],
"BBB": [10, 20, 30, 40],
"CCC": [100, 50, -30, -50]})
df
df.loc[df.AAA >= 5, "BBB"] = -1 # conditionals
df
df.loc[df.AAA <= 5] # splitting
df = pd.DataFrame({"AAA": [4, 5, 6, 7],
"BBB": [10, 20, 30, 40],
"CCC": [100, 50, -30, -50]})
Crit1 = df.AAA <= 5.5
Crit2 = df.BBB == 10.0
Crit3 = df.CCC > -40.0
AllCrit = Crit1 & Crit2 & Crit3
df.loc[AllCrit]
df.loc[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
Creating new columns
df = pd.DataFrame({"AAA": [1, 2, 1, 3],
"BBB": [1, 1, 2, 2],
"CCC": [2,1,3,1]})
df
Efficiently and dynamically creating new columns using applymap
source_cols = df.columns
new_cols = [str(x) + "_cat" for x in source_cols]
categories = {1: "Alpha", 2: "Beta", 3: "Charlie"}
df[new_cols] = df[source_cols].applymap(categories.get)
df
More grouping tricks:
df = pd.DataFrame({"animal": "cat dog cat fish dog cat cat".split(),
"size": list("SSMMMLL"),
"weight": [8, 10, 11, 1, 20, 12, 12],
"adult": [False] * 5 + [True] * 2})
df
df.groupby("animal").apply(lambda subf: subf["size"][subf["weight"].idxmax()])
gb = df.groupby(["animal"])
gb.get_group("cat") # same as df.loc[df.animal == "cat"]
def growup(x):
avg_weight = sum(x[x.size == "S"].weight * 1.5)
avg_weight += sum(x[x.size == "M"].weight * 1.25)
avg_weight += sum(x[x.size == "L"].weight)
avg_weight = avg_weight / len(x)
return pd.Series(["L", avg_weight, True], index=["size", "weight", "adult"])
gb.apply(growup)
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, -1, 1, 2]})
df
gb = df.groupby("A") # replacing negative values with mean of the rest of a group
def replace(g):
mask = g < 0
g.loc[mask] = g[~mask].mean()
return g
gb.transform(replace)