Data Analysis with Pandas

Pandas is a Python library providing high-performance, easy-to-use data structures and data analysis tools.

In [16]:
# Import the pandas package under the alias "pd"
import pandas as pd

Series and DataFrames

The primary data structures in pandas are implemented as two classes:

  • DataFrame, which you can imagine as a relational data table, with rows and named columns.
  • Series, which is a single column. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation.

In [12]:
# Create a Series object
pd.Series({'CAL':38332521, 'TEX':26448193, 'NY':19651127})
Out[12]:
CAL    38332521
NY     19651127
TEX    26448193
dtype: int64
In [24]:
# Create a DataFrame object containing two Series
pop = pd.Series({'CAL':38332521, 'TEX':26448193, 'NY':19651127})
area = pd.Series({'CAL':423967, 'TEX':695662, 'NY':141297})
pd.DataFrame({'population':pop, 'area':area})
Out[24]:
       area  population
CAL  423967    38332521
NY   141297    19651127
TEX  695662    26448193
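
To illustrate the relationship between the two classes, here is a minimal sketch reusing the values above. The names `df_states`, `population` and `density` are chosen for illustration only. It shows that selecting a column returns a Series, and that arithmetic between columns is aligned on the index labels.

```python
import pandas as pd

# Same values as in the cells above
pop = pd.Series({'CAL': 38332521, 'TEX': 26448193, 'NY': 19651127})
area = pd.Series({'CAL': 423967, 'TEX': 695662, 'NY': 141297})

# df_states is an illustrative name, not part of the original notebook
df_states = pd.DataFrame({'population': pop, 'area': area})

# Selecting a single column by name returns a Series
population = df_states['population']
print(type(population))

# Arithmetic between two columns is matched on the index labels (CAL, NY, TEX)
density = df_states['population'] / df_states['area']
print(density.round(1))
```

Because the result is indexed by the same labels, `density['NY']` gives the population per unit of area for that label only.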

Loading data

Pandas is commonly used to load and analyse datasets.

In [40]:
# Load the California housing dataset into a DataFrame
df_cal_housing = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")

# Print DataFrame shape
df_cal_housing.shape
Out[40]:
(17000, 9)
In [37]:
# Print a concise summary of the DataFrame
df_cal_housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
longitude             17000 non-null float64
latitude              17000 non-null float64
housing_median_age    17000 non-null float64
total_rooms           17000 non-null float64
total_bedrooms        17000 non-null float64
population            17000 non-null float64
households            17000 non-null float64
median_income         17000 non-null float64
median_house_value    17000 non-null float64
dtypes: float64(9)
memory usage: 1.2 MB
In [34]:
# Print summary statistics about the DataFrame columns
df_cal_housing.describe()
Out[34]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000
mean -119.562108 35.625225 28.589353 2643.664412 539.410824 1429.573941 501.221941 3.883578 207300.912353
std 2.005166 2.137340 12.586937 2179.947071 421.499452 1147.852959 384.520841 1.908157 115983.764387
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.790000 33.930000 18.000000 1462.000000 297.000000 790.000000 282.000000 2.566375 119400.000000
50% -118.490000 34.250000 29.000000 2127.000000 434.000000 1167.000000 409.000000 3.544600 180400.000000
75% -118.000000 37.720000 37.000000 3151.250000 648.250000 1721.000000 605.250000 4.767000 265000.000000
max -114.310000 41.950000 52.000000 37937.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
In [35]:
# Show the first five records of the DataFrame
df_cal_housing.head()
Out[35]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
In [26]:
# Show 10 random samples
df_cal_housing.sample(n=10)
Out[26]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
11502 -121.26 38.67 18.0 1830.0 313.0 905.0 361.0 4.2273 141800.0
11859 -121.34 37.99 11.0 4487.0 868.0 2195.0 780.0 3.9615 194600.0
10121 -119.82 36.78 36.0 1582.0 313.0 761.0 318.0 2.6055 69200.0
15356 -122.29 37.88 48.0 2365.0 490.0 1034.0 475.0 3.1065 229200.0
3098 -117.82 33.88 15.0 5392.0 895.0 2531.0 827.0 6.2185 280300.0
12034 -121.41 38.59 17.0 12355.0 3630.0 5692.0 3073.0 2.5245 99100.0
4035 -117.97 33.86 34.0 2138.0 490.0 1682.0 463.0 3.6006 161700.0
8308 -118.45 34.00 39.0 1909.0 359.0 867.0 345.0 4.7000 334700.0
4373 -118.02 33.91 35.0 1337.0 234.0 692.0 235.0 5.1155 213700.0
8817 -118.67 34.30 5.0 6123.0 825.0 2440.0 736.0 7.9013 393000.0
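
The same loading and inspection steps can be tried without network access. The sketch below assumes a tiny in-memory CSV built from three columns of the first three rows shown by `head()` above; `csv_text` and `df_sample` are illustrative names.

```python
import io

import pandas as pd

# Illustrative three-row sample: a subset of the columns and rows shown by head()
csv_text = """longitude,latitude,housing_median_age
-114.31,34.19,15.0
-114.47,34.40,19.0
-114.56,33.69,17.0
"""

# read_csv accepts any file-like object, not only a path or URL
df_sample = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns
print(df_sample.shape)  # (3, 3)

# Column types inferred while parsing
print(df_sample.dtypes)

# Summary statistics for a single column
print(df_sample['housing_median_age'].describe())
```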

Visualizing data

Pandas builds on Matplotlib to plot the contents of a DataFrame directly, which helps in exploring how the data is distributed.

In [41]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Plot the distribution of values in the "housing_median_age" column as a histogram
df_cal_housing.hist('housing_median_age')
Out[41]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1a1bd0a1d0>]],
      dtype=object)
In [33]:
# Plot a histogram of the values in each column
df_cal_housing.hist(figsize=(10, 8))
plt.tight_layout()
In [42]:
# Plot pairwise relationships between all columns
sns.pairplot(df_cal_housing)
Out[42]:
<seaborn.axisgrid.PairGrid at 0x1a1bbc24e0>
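
Outside a notebook, the histogram call can be used in a plain script. The sketch below makes several assumptions: the non-interactive "Agg" backend, five `housing_median_age` values copied from the `head()` output above, and the illustrative names `df_sample` and `buffer`. It also shows that `DataFrame.hist` returns a 2-D array of axes that can be customised afterwards.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Illustrative sample: housing_median_age of the five rows shown by head()
df_sample = pd.DataFrame({'housing_median_age': [15.0, 19.0, 17.0, 14.0, 20.0]})

# hist() returns a 2-D array of Matplotlib axes, even for a single column
axes = df_sample.hist('housing_median_age', bins=5)
ax = axes[0, 0]

# Label the axes on the returned object
ax.set_xlabel('housing_median_age')
ax.set_ylabel('count')

# Render the figure to an in-memory PNG instead of showing it inline
buffer = io.BytesIO()
ax.figure.savefig(buffer, format='png')
```

In a notebook, `%matplotlib inline` makes the backend selection and the explicit `savefig` call unnecessary.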