In this blog post, I will show the different ways to select subsets of data in Pandas using [], .loc, .iloc, .at, and .iat. I will be using the wine quality dataset hosted on the UCI website. This data record 11 chemical properties (such as the concentrations of sugar, citric acid, alcohol, pH etc.) of thousands of red and white wines from northern Portugal, as well as the quality of the wines, recorded on a scale from 1 to 10. We will only look at the data for red wine.

In [86]:

import pandas as pd 
wine_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

In [191]:

wine_df.head()

Out[191]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9978	7.4	0.70	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	5
0.9968	7.8	0.88	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
0.9970	7.8	0.76	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	5
0.9980	11.2	0.28	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	6
0.9964	7.9	0.60	0.06	1.6	0.069	15.0	59.0	3.30	0.46	9.4	5

Different ways to select columns¶

In [88]:

wine_df.columns = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide','density','pH','sulphates', 'alcohol', 'quality' ]

In [89]:

wine_df['fixed_acidity'].head()

Out[89]:

0     7.4
1     7.8
2     7.8
3    11.2
4     7.4
Name: fixed_acidity, dtype: float64

In [90]:

wine_df.fixed_acidity.head()

Out[90]:

0     7.4
1     7.8
2     7.8
3    11.2
4     7.4
Name: fixed_acidity, dtype: float64

Selecting multiple columns¶

In [197]:

wine_four = wine_df[['fixed_acidity', 'volatile_acidity','citric_acid', 'residual_sugar']]

In [198]:

cols = ['fixed_acidity', 'volatile_acidity','citric_acid', 'residual_sugar']
wine_list_four = wine_four[cols]

In [94]:

wine_list_four.head()

Out[94]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar
0	7.4	0.70	0.00	1.9
1	7.8	0.88	0.00	2.6
2	7.8	0.76	0.04	2.3
3	11.2	0.28	0.56	1.9
4	7.4	0.70	0.00	1.9

Selecting columns using "select_dtypes" and "filter" methods¶

In [95]:

wine_df.get_dtype_counts()

Out[95]:

float64    11
int64       1
dtype: int64

In [96]:

wine_df.select_dtypes(include = ['float']).head()

Out[96]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4

In [97]:

wine_df.select_dtypes(include=['number']).head()

Out[97]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5

In [98]:

wine_df.filter(like='acid').head()

Out[98]:

	fixed_acidity	volatile_acidity	citric_acid
0	7.4	0.70	0.00
1	7.8	0.88	0.00
2	7.8	0.76	0.04
3	11.2	0.28	0.56
4	7.4	0.70	0.00

In [192]:

wine_df.rename(columns={'pH':'pH_5', 'quality': 'quality_6' }, inplace=True)
wine_df.filter(regex='\d').head()

Out[192]:

	pH_5	quality_6
density
0.9978	3.51	5
0.9968	3.20	5
0.9970	3.26	5
0.9980	3.16	6
0.9964	3.30	5

In [101]:

wine_df.head()

Out[101]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH_5	sulphates	alcohol	quality_6
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5

In [199]:

wine_df.filter(items=['fixed_acidity', 'volatile_acidity']).head()

Out[199]:

	fixed_acidity	volatile_acidity
density
0.9978	7.4	0.70
0.9968	7.8	0.88
0.9970	7.8	0.76
0.9980	11.2	0.28
0.9964	7.9	0.60

In [103]:

wine_df.head()

Out[103]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH_5	sulphates	alcohol	quality_6
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5

Changing the order of your columns¶

In [104]:

wine_df.columns

Out[104]:

Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH_5', 'sulphates', 'alcohol', 'quality_6'],
      dtype='object')

In [105]:

group_1 = ['pH_5','sulphates','alcohol', 'quality_6']
group_2 =['chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density']
group_3 = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar']

In [106]:

new_cols = group_1+group_2+group_3

In [107]:

wine_df.columns 

Out[107]:

Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH_5', 'sulphates', 'alcohol', 'quality_6'],
      dtype='object')

In [109]:

set(wine_df.columns) == set(new_cols) 

Out[109]:

True

In [108]:

wine_df_2 = wine_df[new_cols]

In [110]:

wine_df_2.head()

Out[110]:

	pH_5	sulphates	alcohol	quality_6	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	fixed_acidity	volatile_acidity	citric_acid	residual_sugar
0	3.51	0.56	9.4	5	0.076	11.0	34.0	0.9978	7.4	0.70	0.00	1.9
1	3.20	0.68	9.8	5	0.098	25.0	67.0	0.9968	7.8	0.88	0.00	2.6
2	3.26	0.65	9.8	5	0.092	15.0	54.0	0.9970	7.8	0.76	0.04	2.3
3	3.16	0.58	9.8	6	0.075	17.0	60.0	0.9980	11.2	0.28	0.56	1.9
4	3.51	0.56	9.4	5	0.076	11.0	34.0	0.9978	7.4	0.70	0.00	1.9

Selecting rows using .iloc and loc¶

In [112]:

wine_df.nunique()

Out[112]:

fixed_acidity            96
volatile_acidity        143
citric_acid              80
residual_sugar           91
chlorides               153
free_sulfur_dioxide      60
total_sulfur_dioxide    144
density                 436
pH_5                     89
sulphates                96
alcohol                  65
quality_6                 6
dtype: int64

In [114]:

wine_df = wine_df.drop_duplicates(subset='density')

In [115]:

wine_df.shape #436 unique values in the density column 

Out[115]:

(436, 12)

In [116]:

wine_df.set_index('density',inplace=True)

In [117]:

wine_df.head()

Out[117]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9978	7.4	0.70	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	5
0.9968	7.8	0.88	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
0.9970	7.8	0.76	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	5
0.9980	11.2	0.28	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	6
0.9964	7.9	0.60	0.06	1.6	0.069	15.0	59.0	3.30	0.46	9.4	5

In [118]:

wine_df.iloc[2]

Out[118]:

fixed_acidity            7.800
volatile_acidity         0.760
citric_acid              0.040
residual_sugar           2.300
chlorides                0.092
free_sulfur_dioxide     15.000
total_sulfur_dioxide    54.000
pH_5                     3.260
sulphates                0.650
alcohol                  9.800
quality_6                5.000
Name: 0.997, dtype: float64

In [119]:

wine_df.loc[0.9968] 

Out[119]:

fixed_acidity            7.800
volatile_acidity         0.880
citric_acid              0.000
residual_sugar           2.600
chlorides                0.098
free_sulfur_dioxide     25.000
total_sulfur_dioxide    67.000
pH_5                     3.200
sulphates                0.680
alcohol                  9.800
quality_6                5.000
Name: 0.9968, dtype: float64

In [120]:

wine_df.iloc[[1, 4, 7]] 

Out[120]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9968	7.8	0.880	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
0.9964	7.9	0.600	0.06	1.6	0.069	15.0	59.0	3.30	0.46	9.4	5
0.9943	5.6	0.615	0.00	1.6	0.089	16.0	59.0	3.58	0.52	9.9	5

In [130]:

rows = [0.9968, 0.9964, 0.9943]
wine_df.loc[rows]

Out[130]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9968	7.8	0.880	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
0.9964	7.9	0.600	0.06	1.6	0.069	15.0	59.0	3.30	0.46	9.4	5
0.9943	5.6	0.615	0.00	1.6	0.089	16.0	59.0	3.58	0.52	9.9	5

In [126]:

wine_df.iloc[1:4]

Out[126]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9968	7.8	0.88	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
0.9970	7.8	0.76	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	5
0.9980	11.2	0.28	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	6

Slice notation also works with the .loc indexer and is inclusive of the last label:

In [133]:

first = 0.9970
last = 0.9959
wine_df.loc[first:last]

Out[133]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9970	7.8	0.76	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	5
0.9980	11.2	0.28	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	6
0.9964	7.9	0.60	0.06	1.6	0.069	15.0	59.0	3.30	0.46	9.4	5
0.9946	7.3	0.65	0.00	1.2	0.065	15.0	21.0	3.39	0.47	10.0	7
0.9959	6.7	0.58	0.08	1.8	0.097	15.0	65.0	3.28	0.54	9.2	5

Selecting rows and columns simultaneously¶

In [134]:

wine_df.head()

Out[134]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9978	7.4	0.70	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	5
0.9968	7.8	0.88	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
0.9970	7.8	0.76	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	5
0.9980	11.2	0.28	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	6
0.9964	7.9	0.60	0.06	1.6	0.069	15.0	59.0	3.30	0.46	9.4	5

In [137]:

wine_df.iloc[:, [3,4,6]].head()

Out[137]:

	residual_sugar	chlorides	total_sulfur_dioxide
density
0.9978	1.9	0.076	34.0
0.9968	2.6	0.098	67.0
0.9970	2.3	0.092	54.0
0.9980	1.9	0.075	60.0
0.9964	1.6	0.069	59.0

In [138]:

wine_df.loc[:,['residual_sugar','chlorides','total_sulfur_dioxide']].head()

Out[138]:

	residual_sugar	chlorides	total_sulfur_dioxide
density
0.9978	1.9	0.076	34.0
0.9968	2.6	0.098	67.0
0.9970	2.3	0.092	54.0
0.9980	1.9	0.075	60.0
0.9964	1.6	0.069	59.0

Selecting disjointed rows and columns¶

In [139]:

wine_df.iloc[[10,14], [7, 9]]

Out[139]:

	pH_5	alcohol
density
0.9969	3.30	10.5
0.9962	3.28	9.5

In [143]:

rows = [0.9969, 0.9962]
columns = ['pH_5', 'alcohol']
wine_df.loc[rows,columns]

Out[143]:

	pH_5	alcohol
density
0.9969	3.30	10.5
0.9962	3.28	9.5

In [193]:

wine_df.loc[0.9970, 'chlorides']

Out[193]:

0.092

In [194]:

wine_df.iloc[2, 4]

Out[194]:

0.092

In [151]:

wine_df.head(10)

Out[151]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9978	7.4	0.700	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	5
0.9968	7.8	0.880	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
0.9970	7.8	0.760	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	5
0.9980	11.2	0.280	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	6
0.9964	7.9	0.600	0.06	1.6	0.069	15.0	59.0	3.30	0.46	9.4	5
0.9946	7.3	0.650	0.00	1.2	0.065	15.0	21.0	3.39	0.47	10.0	7
0.9959	6.7	0.580	0.08	1.8	0.097	15.0	65.0	3.28	0.54	9.2	5
0.9943	5.6	0.615	0.00	1.6	0.089	16.0	59.0	3.58	0.52	9.9	5
0.9974	7.8	0.610	0.29	1.6	0.114	9.0	29.0	3.26	1.56	9.1	5
0.9986	8.9	0.620	0.18	3.8	0.176	52.0	145.0	3.16	0.88	9.2	5

In [162]:

wine_df.iloc[8:2:-2, 2]

Out[162]:

density
0.9974    0.29
0.9959    0.08
0.9964    0.06
Name: citric_acid, dtype: float64

In [163]:

first = 0.9974
second = 0.9964

wine_df.loc[first:second:-2, 'citric_acid']

Out[163]:

density
0.9974    0.29
0.9959    0.08
0.9964    0.06
Name: citric_acid, dtype: float64

In [165]:

wine_df.iloc[:5]

Out[165]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9978	7.4	0.70	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	5
0.9968	7.8	0.88	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
0.9970	7.8	0.76	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	5
0.9980	11.2	0.28	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	6
0.9964	7.9	0.60	0.06	1.6	0.069	15.0	59.0	3.30	0.46	9.4	5

In [166]:

wine_df.iloc[:5, :]

Out[166]:

	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	pH_5	sulphates	alcohol	quality_6
density
0.9978	7.4	0.70	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	5
0.9968	7.8	0.88	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
0.9970	7.8	0.76	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	5
0.9980	11.2	0.28	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	6
0.9964	7.9	0.60	0.06	1.6	0.069	15.0	59.0	3.30	0.46	9.4	5

Selecting rows and columns using get_loc and index methods¶

In [175]:

col_start = wine_df.columns.get_loc('volatile_acidity')
col_end = wine_df.columns.get_loc('volatile_acidity')+2
col_start, col_end

Out[175]:

(1, 3)

In [177]:

wine_df.iloc[:4, col_start:col_end]

Out[177]:

	volatile_acidity	citric_acid
density
0.9978	0.70	0.00
0.9968	0.88	0.00
0.9970	0.76	0.04
0.9980	0.28	0.56

In [196]:

row_start = wine_df.index[4]
row_end = wine_df.index[7]

wine_df.loc[row_start:row_end, 'volatile_acidity' : 'chlorides']

Out[196]:

	volatile_acidity	citric_acid	residual_sugar	chlorides
density
0.9964	0.600	0.06	1.6	0.069
0.9946	0.650	0.00	1.2	0.065
0.9959	0.580	0.08	1.8	0.097
0.9943	0.615	0.00	1.6	0.089

In [183]:

density_val = 0.9980
wine_df.loc[density_val, 'citric_acid']

Out[183]:

0.56

Selecting rows and columns using .iat and .at¶

In [184]:

wine_df.at[density_val, 'citric_acid']

Out[184]:

0.56

The %timeit magic command to find the difference in speed:

In [185]:

timeit wine_df.loc[density_val, 'citric_acid']  

9.77 µs ± 633 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [186]:

timeit wine_df.at[density_val, 'citric_acid']

6.2 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [200]:

row_num = wine_df.index.get_loc(density_val)
col_num = wine_df.columns.get_loc('citric_acid')
row_num,col_num

Out[200]:

(3, 2)

In [189]:

timeit wine_df.iloc[row_num, col_num]

11 µs ± 506 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [190]:

timeit wine_df.iat[row_num, col_num]

7.81 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)