In this blog post, I will show the different ways to select subsets of data in Pandas using [], .loc, .iloc, .at, and .iat.  I will be using the wine quality dataset hosted on the UCI website. This data record 11 chemical properties (such as the concentrations of sugar, citric acid, alcohol, pH etc.) of thousands of red and white wines from northern Portugal, as well as the quality of the wines, recorded on a scale from 1 to 10. We will only look at the data for red wine.

In [86]:
import pandas as pd 
wine_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
In [191]:
wine_df.head()
Out[191]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9978 7.4 0.70 0.00 1.9 0.076 11.0 34.0 3.51 0.56 9.4 5
0.9968 7.8 0.88 0.00 2.6 0.098 25.0 67.0 3.20 0.68 9.8 5
0.9970 7.8 0.76 0.04 2.3 0.092 15.0 54.0 3.26 0.65 9.8 5
0.9980 11.2 0.28 0.56 1.9 0.075 17.0 60.0 3.16 0.58 9.8 6
0.9964 7.9 0.60 0.06 1.6 0.069 15.0 59.0 3.30 0.46 9.4 5

Different ways to select columns

In [88]:
wine_df.columns = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide','density','pH','sulphates', 'alcohol', 'quality' ]
In [89]:
wine_df['fixed_acidity'].head()
Out[89]:
0     7.4
1     7.8
2     7.8
3    11.2
4     7.4
Name: fixed_acidity, dtype: float64
In [90]:
wine_df.fixed_acidity.head()
Out[90]:
0     7.4
1     7.8
2     7.8
3    11.2
4     7.4
Name: fixed_acidity, dtype: float64

Selecting multiple columns

In [197]:
wine_four = wine_df[['fixed_acidity', 'volatile_acidity','citric_acid', 'residual_sugar']]
In [198]:
cols = ['fixed_acidity', 'volatile_acidity','citric_acid', 'residual_sugar']
wine_list_four = wine_four[cols]
In [94]:
wine_list_four.head()
Out[94]:
fixed_acidity volatile_acidity citric_acid residual_sugar
0 7.4 0.70 0.00 1.9
1 7.8 0.88 0.00 2.6
2 7.8 0.76 0.04 2.3
3 11.2 0.28 0.56 1.9
4 7.4 0.70 0.00 1.9

Selecting columns using "select_dtypes" and "filter" methods

In [95]:
wine_df.get_dtype_counts()
Out[95]:
float64    11
int64       1
dtype: int64
In [96]:
wine_df.select_dtypes(include = ['float']).head()
Out[96]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
In [97]:
wine_df.select_dtypes(include=['number']).head()
Out[97]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [98]:
wine_df.filter(like='acid').head()
Out[98]:
fixed_acidity volatile_acidity citric_acid
0 7.4 0.70 0.00
1 7.8 0.88 0.00
2 7.8 0.76 0.04
3 11.2 0.28 0.56
4 7.4 0.70 0.00
In [192]:
wine_df.rename(columns={'pH':'pH_5', 'quality': 'quality_6' }, inplace=True)
wine_df.filter(regex='\d').head()
Out[192]:
pH_5 quality_6
density
0.9978 3.51 5
0.9968 3.20 5
0.9970 3.26 5
0.9980 3.16 6
0.9964 3.30 5
In [101]:
wine_df.head()
Out[101]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH_5 sulphates alcohol quality_6
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [199]:
wine_df.filter(items=['fixed_acidity', 'volatile_acidity']).head()
Out[199]:
fixed_acidity volatile_acidity
density
0.9978 7.4 0.70
0.9968 7.8 0.88
0.9970 7.8 0.76
0.9980 11.2 0.28
0.9964 7.9 0.60
In [103]:
wine_df.head()
Out[103]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH_5 sulphates alcohol quality_6
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

Changing the order of your columns

In [104]:
wine_df.columns
Out[104]:
Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH_5', 'sulphates', 'alcohol', 'quality_6'],
      dtype='object')
In [105]:
group_1 = ['pH_5','sulphates','alcohol', 'quality_6']
group_2 =['chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density']
group_3 = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar']
In [106]:
new_cols = group_1+group_2+group_3
In [107]:
wine_df.columns 
Out[107]:
Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH_5', 'sulphates', 'alcohol', 'quality_6'],
      dtype='object')
In [109]:
set(wine_df.columns) == set(new_cols) 
Out[109]:
True
In [108]:
wine_df_2 = wine_df[new_cols]
In [110]:
wine_df_2.head()
Out[110]:
pH_5 sulphates alcohol quality_6 chlorides free_sulfur_dioxide total_sulfur_dioxide density fixed_acidity volatile_acidity citric_acid residual_sugar
0 3.51 0.56 9.4 5 0.076 11.0 34.0 0.9978 7.4 0.70 0.00 1.9
1 3.20 0.68 9.8 5 0.098 25.0 67.0 0.9968 7.8 0.88 0.00 2.6
2 3.26 0.65 9.8 5 0.092 15.0 54.0 0.9970 7.8 0.76 0.04 2.3
3 3.16 0.58 9.8 6 0.075 17.0 60.0 0.9980 11.2 0.28 0.56 1.9
4 3.51 0.56 9.4 5 0.076 11.0 34.0 0.9978 7.4 0.70 0.00 1.9

Selecting rows using .iloc and loc

In [112]:
wine_df.nunique()
Out[112]:
fixed_acidity            96
volatile_acidity        143
citric_acid              80
residual_sugar           91
chlorides               153
free_sulfur_dioxide      60
total_sulfur_dioxide    144
density                 436
pH_5                     89
sulphates                96
alcohol                  65
quality_6                 6
dtype: int64
In [114]:
wine_df = wine_df.drop_duplicates(subset='density')
In [115]:
wine_df.shape #436 unique values in the density column 
Out[115]:
(436, 12)
In [116]:
wine_df.set_index('density',inplace=True)
In [117]:
wine_df.head()
Out[117]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9978 7.4 0.70 0.00 1.9 0.076 11.0 34.0 3.51 0.56 9.4 5
0.9968 7.8 0.88 0.00 2.6 0.098 25.0 67.0 3.20 0.68 9.8 5
0.9970 7.8 0.76 0.04 2.3 0.092 15.0 54.0 3.26 0.65 9.8 5
0.9980 11.2 0.28 0.56 1.9 0.075 17.0 60.0 3.16 0.58 9.8 6
0.9964 7.9 0.60 0.06 1.6 0.069 15.0 59.0 3.30 0.46 9.4 5
In [118]:
wine_df.iloc[2]
Out[118]:
fixed_acidity            7.800
volatile_acidity         0.760
citric_acid              0.040
residual_sugar           2.300
chlorides                0.092
free_sulfur_dioxide     15.000
total_sulfur_dioxide    54.000
pH_5                     3.260
sulphates                0.650
alcohol                  9.800
quality_6                5.000
Name: 0.997, dtype: float64
In [119]:
wine_df.loc[0.9968] 
Out[119]:
fixed_acidity            7.800
volatile_acidity         0.880
citric_acid              0.000
residual_sugar           2.600
chlorides                0.098
free_sulfur_dioxide     25.000
total_sulfur_dioxide    67.000
pH_5                     3.200
sulphates                0.680
alcohol                  9.800
quality_6                5.000
Name: 0.9968, dtype: float64
In [120]:
wine_df.iloc[[1, 4, 7]] 
Out[120]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9968 7.8 0.880 0.00 2.6 0.098 25.0 67.0 3.20 0.68 9.8 5
0.9964 7.9 0.600 0.06 1.6 0.069 15.0 59.0 3.30 0.46 9.4 5
0.9943 5.6 0.615 0.00 1.6 0.089 16.0 59.0 3.58 0.52 9.9 5
In [130]:
rows = [0.9968, 0.9964, 0.9943]
wine_df.loc[rows]
Out[130]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9968 7.8 0.880 0.00 2.6 0.098 25.0 67.0 3.20 0.68 9.8 5
0.9964 7.9 0.600 0.06 1.6 0.069 15.0 59.0 3.30 0.46 9.4 5
0.9943 5.6 0.615 0.00 1.6 0.089 16.0 59.0 3.58 0.52 9.9 5
In [126]:
wine_df.iloc[1:4]
Out[126]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9968 7.8 0.88 0.00 2.6 0.098 25.0 67.0 3.20 0.68 9.8 5
0.9970 7.8 0.76 0.04 2.3 0.092 15.0 54.0 3.26 0.65 9.8 5
0.9980 11.2 0.28 0.56 1.9 0.075 17.0 60.0 3.16 0.58 9.8 6

Slice notation also works with the .loc indexer and is inclusive of the last label:

In [133]:
first = 0.9970
last = 0.9959
wine_df.loc[first:last]
Out[133]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9970 7.8 0.76 0.04 2.3 0.092 15.0 54.0 3.26 0.65 9.8 5
0.9980 11.2 0.28 0.56 1.9 0.075 17.0 60.0 3.16 0.58 9.8 6
0.9964 7.9 0.60 0.06 1.6 0.069 15.0 59.0 3.30 0.46 9.4 5
0.9946 7.3 0.65 0.00 1.2 0.065 15.0 21.0 3.39 0.47 10.0 7
0.9959 6.7 0.58 0.08 1.8 0.097 15.0 65.0 3.28 0.54 9.2 5

Selecting rows and columns simultaneously

In [134]:
wine_df.head()
Out[134]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9978 7.4 0.70 0.00 1.9 0.076 11.0 34.0 3.51 0.56 9.4 5
0.9968 7.8 0.88 0.00 2.6 0.098 25.0 67.0 3.20 0.68 9.8 5
0.9970 7.8 0.76 0.04 2.3 0.092 15.0 54.0 3.26 0.65 9.8 5
0.9980 11.2 0.28 0.56 1.9 0.075 17.0 60.0 3.16 0.58 9.8 6
0.9964 7.9 0.60 0.06 1.6 0.069 15.0 59.0 3.30 0.46 9.4 5
In [137]:
wine_df.iloc[:, [3,4,6]].head()
Out[137]:
residual_sugar chlorides total_sulfur_dioxide
density
0.9978 1.9 0.076 34.0
0.9968 2.6 0.098 67.0
0.9970 2.3 0.092 54.0
0.9980 1.9 0.075 60.0
0.9964 1.6 0.069 59.0
In [138]:
wine_df.loc[:,['residual_sugar','chlorides','total_sulfur_dioxide']].head()
Out[138]:
residual_sugar chlorides total_sulfur_dioxide
density
0.9978 1.9 0.076 34.0
0.9968 2.6 0.098 67.0
0.9970 2.3 0.092 54.0
0.9980 1.9 0.075 60.0
0.9964 1.6 0.069 59.0

Selecting disjointed rows and columns 

In [139]:
wine_df.iloc[[10,14], [7, 9]]
Out[139]:
pH_5 alcohol
density
0.9969 3.30 10.5
0.9962 3.28 9.5
In [143]:
rows = [0.9969, 0.9962]
columns = ['pH_5', 'alcohol']
wine_df.loc[rows,columns]
Out[143]:
pH_5 alcohol
density
0.9969 3.30 10.5
0.9962 3.28 9.5
In [193]:
wine_df.loc[0.9970, 'chlorides']
Out[193]:
0.092
In [194]:
wine_df.iloc[2, 4]
Out[194]:
0.092
In [151]:
wine_df.head(10)
Out[151]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9978 7.4 0.700 0.00 1.9 0.076 11.0 34.0 3.51 0.56 9.4 5
0.9968 7.8 0.880 0.00 2.6 0.098 25.0 67.0 3.20 0.68 9.8 5
0.9970 7.8 0.760 0.04 2.3 0.092 15.0 54.0 3.26 0.65 9.8 5
0.9980 11.2 0.280 0.56 1.9 0.075 17.0 60.0 3.16 0.58 9.8 6
0.9964 7.9 0.600 0.06 1.6 0.069 15.0 59.0 3.30 0.46 9.4 5
0.9946 7.3 0.650 0.00 1.2 0.065 15.0 21.0 3.39 0.47 10.0 7
0.9959 6.7 0.580 0.08 1.8 0.097 15.0 65.0 3.28 0.54 9.2 5
0.9943 5.6 0.615 0.00 1.6 0.089 16.0 59.0 3.58 0.52 9.9 5
0.9974 7.8 0.610 0.29 1.6 0.114 9.0 29.0 3.26 1.56 9.1 5
0.9986 8.9 0.620 0.18 3.8 0.176 52.0 145.0 3.16 0.88 9.2 5
In [162]:
wine_df.iloc[8:2:-2, 2]
Out[162]:
density
0.9974    0.29
0.9959    0.08
0.9964    0.06
Name: citric_acid, dtype: float64
In [163]:
first = 0.9974
second = 0.9964

wine_df.loc[first:second:-2, 'citric_acid']
Out[163]:
density
0.9974    0.29
0.9959    0.08
0.9964    0.06
Name: citric_acid, dtype: float64
In [165]:
wine_df.iloc[:5]
Out[165]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9978 7.4 0.70 0.00 1.9 0.076 11.0 34.0 3.51 0.56 9.4 5
0.9968 7.8 0.88 0.00 2.6 0.098 25.0 67.0 3.20 0.68 9.8 5
0.9970 7.8 0.76 0.04 2.3 0.092 15.0 54.0 3.26 0.65 9.8 5
0.9980 11.2 0.28 0.56 1.9 0.075 17.0 60.0 3.16 0.58 9.8 6
0.9964 7.9 0.60 0.06 1.6 0.069 15.0 59.0 3.30 0.46 9.4 5
In [166]:
wine_df.iloc[:5, :]
Out[166]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide pH_5 sulphates alcohol quality_6
density
0.9978 7.4 0.70 0.00 1.9 0.076 11.0 34.0 3.51 0.56 9.4 5
0.9968 7.8 0.88 0.00 2.6 0.098 25.0 67.0 3.20 0.68 9.8 5
0.9970 7.8 0.76 0.04 2.3 0.092 15.0 54.0 3.26 0.65 9.8 5
0.9980 11.2 0.28 0.56 1.9 0.075 17.0 60.0 3.16 0.58 9.8 6
0.9964 7.9 0.60 0.06 1.6 0.069 15.0 59.0 3.30 0.46 9.4 5

Selecting rows and columns using get_loc and index methods

In [175]:
col_start = wine_df.columns.get_loc('volatile_acidity')
col_end = wine_df.columns.get_loc('volatile_acidity')+2
col_start, col_end
Out[175]:
(1, 3)
In [177]:
wine_df.iloc[:4, col_start:col_end]
Out[177]:
volatile_acidity citric_acid
density
0.9978 0.70 0.00
0.9968 0.88 0.00
0.9970 0.76 0.04
0.9980 0.28 0.56
In [196]:
row_start = wine_df.index[4]
row_end = wine_df.index[7]

wine_df.loc[row_start:row_end, 'volatile_acidity' : 'chlorides']
Out[196]:
volatile_acidity citric_acid residual_sugar chlorides
density
0.9964 0.600 0.06 1.6 0.069
0.9946 0.650 0.00 1.2 0.065
0.9959 0.580 0.08 1.8 0.097
0.9943 0.615 0.00 1.6 0.089
In [183]:
density_val = 0.9980
wine_df.loc[density_val, 'citric_acid']
Out[183]:
0.56

Selecting rows and columns using .iat and .at

In [184]:
wine_df.at[density_val, 'citric_acid']
Out[184]:
0.56

The %timeit magic command to find the difference in speed:

In [185]:
timeit wine_df.loc[density_val, 'citric_acid']  
9.77 µs ± 633 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [186]:
timeit wine_df.at[density_val, 'citric_acid']
6.2 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [200]:
row_num = wine_df.index.get_loc(density_val)
col_num = wine_df.columns.get_loc('citric_acid')
row_num,col_num
Out[200]:
(3, 2)
In [189]:
timeit wine_df.iloc[row_num, col_num]
11 µs ± 506 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [190]:
timeit wine_df.iat[row_num, col_num]
7.81 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)