Dplyr is one of the most popular r-packages and also part of tidyverse that's been developed by Hadley Wickham. The mere fact that dplyr package is very famous means, it's one of the most frequently used. Being a data scientist is not always about creating sophisticated models but Data Analysis (Manipulation) and Data Visualization play a very important role in BAU of many us - in fact, a very important part before any modeling exercise since Feature Engineering and EDA are the most important differentiating factors of your model and someone else's.

Hence, this notebook aims to bring out some well known and not-so-well-known applications of dplyr so that any data analyst could leverage its potential

In [1]:
# This R environment comes with all of CRAN preinstalled, as well as many other helpful packages
# The environment is defined by the kaggle/rstats docker image: https://github.com/kaggle/docker-rstats
# For example, here's several helpful packages to load in 

library(dplyr) # Loading Dplyr package

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

system("ls ../input")

# Any results you write to the current directory are saved as output.
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Let us start by reading the input training file using the base r function read.csv

In [2]:
train <- read.csv('../input/train.csv',stringsAsFactors = F, header = T)
train #just to see if it's been loaded
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 Q
7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 S
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 S
10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 C
11 1 3 Sandstrom, Miss. Marguerite Rut female 4 1 1 PP 9549 16.7000 G6 S
12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S
13 0 3 Saundercock, Mr. William Henry male 20 0 0 A/5. 2151 8.0500 S
14 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.2750 S
15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 0 350406 7.8542 S
16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55 0 0 248706 16.0000 S
17 0 3 Rice, Master. Eugene male 2 4 1 382652 29.1250 Q
18 1 2 Williams, Mr. Charles Eugene male NA 0 0 244373 13.0000 S
19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) female 31 1 0 345763 18.0000 S
20 1 3 Masselmani, Mrs. Fatima female NA 0 0 2649 7.2250 C
21 0 2 Fynney, Mr. Joseph J male 35 0 0 239865 26.0000 S
22 1 2 Beesley, Mr. Lawrence male 34 0 0 248698 13.0000 D56 S
23 1 3 McGowan, Miss. Anna "Annie" female 15 0 0 330923 8.0292 Q
24 1 1 Sloper, Mr. William Thompson male 28 0 0 113788 35.5000 A6 S
25 0 3 Palsson, Miss. Torborg Danira female 8 3 1 349909 21.0750 S
26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)female 38 1 5 347077 31.3875 S
27 0 3 Emir, Mr. Farred Chehab male NA 0 0 2631 7.2250 C
28 0 1 Fortune, Mr. Charles Alexander male 19 3 2 19950 263.0000 C23 C25 C27 S
29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NA 0 0 330959 7.8792 Q
30 0 3 Todoroff, Mr. Lalio male NA 0 0 349216 7.8958 S
862 0 2 Giles, Mr. Frederick Edward male 21 1 0 28134 11.5000 S
863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Barron)female 48 0 0 17466 25.9292 D17 S
864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NA 8 2 CA. 2343 69.5500 S
865 0 2 Gill, Mr. John William male 24 0 0 233866 13.0000 S
866 1 2 Bystrom, Mrs. (Karolina) female 42 0 0 236852 13.0000 S
867 1 2 Duran y More, Miss. Asuncion female 27 1 0 SC/PARIS 2149 13.8583 C
868 0 1 Roebling, Mr. Washington Augustus II male 31 0 0 PC 17590 50.4958 A24 S
869 0 3 van Melkebeke, Mr. Philemon male NA 0 0 345777 9.5000 S
870 1 3 Johnson, Master. Harold Theodor male 4 1 1 347742 11.1333 S
871 0 3 Balkic, Mr. Cerin male 26 0 0 349248 7.8958 S
872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47 1 1 11751 52.5542 D35 S
873 0 1 Carlsson, Mr. Frans Olof male 33 0 0 695 5.0000 B51 B53 B55 S
874 0 3 Vander Cruyssen, Mr. Victor male 47 0 0 345765 9.0000 S
875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28 1 0 P/PP 3381 24.0000 C
876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15 0 0 2667 7.2250 C
877 0 3 Gustafsson, Mr. Alfred Ossian male 20 0 0 7534 9.8458 S
878 0 3 Petroff, Mr. Nedelio male 19 0 0 349212 7.8958 S
879 0 3 Laleff, Mr. Kristo male NA 0 0 349217 7.8958 S
880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56 0 1 11767 83.1583 C50 C
881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 1 230433 26.0000 S
882 0 3 Markun, Mr. Johann male 33 0 0 349257 7.8958 S
883 0 3 Dahlberg, Miss. Gerda Ulrika female 22 0 0 7552 10.5167 S
884 0 2 Banfield, Mr. Frederick James male 28 0 0 C.A./SOTON 34068 10.5000 S
885 0 3 Sutehall, Mr. Henry Jr male 25 0 0 SOTON/OQ 392076 7.0500 S
886 0 3 Rice, Mrs. William (Margaret Norton) female 39 0 5 382652 29.1250 Q
887 0 2 Montvila, Rev. Juozas male 27 0 0 211536 13.0000 S
888 1 1 Graham, Miss. Margaret Edith female 19 0 0 112053 30.0000 B42 S
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NA 1 2 W./C. 6607 23.4500 S
890 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30.0000 C148 C
891 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.7500 Q

Getting the total number of rows in the given dataframe (even though it's been very straight forward with nrow() in base-r, this being a dplyr starter-kit, we'll start with that.

In [3]:
train %>% count()
n
891

The above code just gives the row count of the dataframe that's been passed with the pipe %>% operator. The pipe operator works very similar to the | (pipe) operator in Unix environment where the ouput of the current operation is fed as the input of the following operation. Similary in dplyr or any other package that supports pipe operator, the functions in it will always take only dataframe as the first arugment hence the function can be called in two ways like below:

In [4]:
count(train) #Without pipe, passing the df as the first argument
train %>% count() #with pipe, more convient and more readability
n
891
n
891

But dplyr's real flavor starts with the following 5 functions (or as most people call, verbs of dplyr):

  • select()
  • filter()
  • arrange()
  • mutate()
  • summarise()
  • group_by()

And let us see what every one of these does!

select:

select() as the name suggests selects the columns that are required from a given dataframe and if multiple columns are required or not required, then one_of() could be used within select.

In [5]:
select(train,Age) #without pipe
Age
22
38
26
35
35
NA
54
2
27
14
4
58
20
39
14
55
2
NA
31
NA
35
34
15
28
8
38
NA
19
NA
NA
21
48
NA
24
42
27
31
NA
4
26
47
33
47
28
15
20
19
NA
56
25
33
22
28
25
39
27
19
NA
26
32
In [6]:
#multicolumn selection
train %>% select(one_of('Sex','Age'))
SexAge
male 22
female38
female26
female35
male 35
male NA
male 54
male 2
female27
female14
female 4
female58
male 20
male 39
female14
female55
male 2
male NA
female31
femaleNA
male 35
male 34
female15
male 28
female 8
female38
male NA
male 19
femaleNA
male NA
male 21
female48
femaleNA
male 24
female42
female27
male 31
male NA
male 4
male 26
female47
male 33
male 47
female28
female15
male 20
male 19
male NA
female56
female25
male 33
female22
male 28
male 25
female39
male 27
female19
femaleNA
male 26
male 32
In [7]:
#multicolumn rejection

train %>% select(-one_of('Age','Sex'))
PassengerIdSurvivedPclassNameSibSpParchTicketFareCabinEmbarked
1 0 3 Braund, Mr. Owen Harris 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James 0 0 330877 8.4583 Q
7 0 1 McCarthy, Mr. Timothy J 0 0 17463 51.8625 E46 S
8 0 3 Palsson, Master. Gosta Leonard 3 1 349909 21.0750 S
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 0 2 347742 11.1333 S
10 1 2 Nasser, Mrs. Nicholas (Adele Achem) 1 0 237736 30.0708 C
11 1 3 Sandstrom, Miss. Marguerite Rut 1 1 PP 9549 16.7000 G6 S
12 1 1 Bonnell, Miss. Elizabeth 0 0 113783 26.5500 C103 S
13 0 3 Saundercock, Mr. William Henry 0 0 A/5. 2151 8.0500 S
14 0 3 Andersson, Mr. Anders Johan 1 5 347082 31.2750 S
15 0 3 Vestrom, Miss. Hulda Amanda Adolfina 0 0 350406 7.8542 S
16 1 2 Hewlett, Mrs. (Mary D Kingcome) 0 0 248706 16.0000 S
17 0 3 Rice, Master. Eugene 4 1 382652 29.1250 Q
18 1 2 Williams, Mr. Charles Eugene 0 0 244373 13.0000 S
19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) 1 0 345763 18.0000 S
20 1 3 Masselmani, Mrs. Fatima 0 0 2649 7.2250 C
21 0 2 Fynney, Mr. Joseph J 0 0 239865 26.0000 S
22 1 2 Beesley, Mr. Lawrence 0 0 248698 13.0000 D56 S
23 1 3 McGowan, Miss. Anna "Annie" 0 0 330923 8.0292 Q
24 1 1 Sloper, Mr. William Thompson 0 0 113788 35.5000 A6 S
25 0 3 Palsson, Miss. Torborg Danira 3 1 349909 21.0750 S
26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)1 5 347077 31.3875 S
27 0 3 Emir, Mr. Farred Chehab 0 0 2631 7.2250 C
28 0 1 Fortune, Mr. Charles Alexander 3 2 19950 263.0000 C23 C25 C27 S
29 1 3 O'Dwyer, Miss. Ellen "Nellie" 0 0 330959 7.8792 Q
30 0 3 Todoroff, Mr. Lalio 0 0 349216 7.8958 S
862 0 2 Giles, Mr. Frederick Edward 1 0 28134 11.5000 S
863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Barron)0 0 17466 25.9292 D17 S
864 0 3 Sage, Miss. Dorothy Edith "Dolly" 8 2 CA. 2343 69.5500 S
865 0 2 Gill, Mr. John William 0 0 233866 13.0000 S
866 1 2 Bystrom, Mrs. (Karolina) 0 0 236852 13.0000 S
867 1 2 Duran y More, Miss. Asuncion 1 0 SC/PARIS 2149 13.8583 C
868 0 1 Roebling, Mr. Washington Augustus II 0 0 PC 17590 50.4958 A24 S
869 0 3 van Melkebeke, Mr. Philemon 0 0 345777 9.5000 S
870 1 3 Johnson, Master. Harold Theodor 1 1 347742 11.1333 S
871 0 3 Balkic, Mr. Cerin 0 0 349248 7.8958 S
872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 1 1 11751 52.5542 D35 S
873 0 1 Carlsson, Mr. Frans Olof 0 0 695 5.0000 B51 B53 B55 S
874 0 3 Vander Cruyssen, Mr. Victor 0 0 345765 9.0000 S
875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) 1 0 P/PP 3381 24.0000 C
876 1 3 Najib, Miss. Adele Kiamie "Jane" 0 0 2667 7.2250 C
877 0 3 Gustafsson, Mr. Alfred Ossian 0 0 7534 9.8458 S
878 0 3 Petroff, Mr. Nedelio 0 0 349212 7.8958 S
879 0 3 Laleff, Mr. Kristo 0 0 349217 7.8958 S
880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 0 1 11767 83.1583 C50 C
881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) 0 1 230433 26.0000 S
882 0 3 Markun, Mr. Johann 0 0 349257 7.8958 S
883 0 3 Dahlberg, Miss. Gerda Ulrika 0 0 7552 10.5167 S
884 0 2 Banfield, Mr. Frederick James 0 0 C.A./SOTON 34068 10.5000 S
885 0 3 Sutehall, Mr. Henry Jr 0 0 SOTON/OQ 392076 7.0500 S
886 0 3 Rice, Mrs. William (Margaret Norton) 0 5 382652 29.1250 Q
887 0 2 Montvila, Rev. Juozas 0 0 211536 13.0000 S
888 1 1 Graham, Miss. Margaret Edith 0 0 112053 30.0000 B42 S
889 0 3 Johnston, Miss. Catherine Helen "Carrie" 1 2 W./C. 6607 23.4500 S
890 1 1 Behr, Mr. Karl Howell 0 0 111369 30.0000 C148 C
891 0 3 Dooley, Mr. Patrick 0 0 370376 7.7500 Q

Like selecting a column with entire column name (or multiple column names with one_of()), select could also be used with a few more string ops.

In [8]:
train %>% select(starts_with('P'))
PassengerIdPclassParch
13 0
21 0
33 0
41 0
53 0
63 0
71 0
83 1
93 2
102 0
113 1
121 0
133 0
143 5
153 0
162 0
173 1
182 0
193 0
203 0
212 0
222 0
233 0
241 0
253 1
263 5
273 0
281 2
293 0
303 0
8622 0
8631 0
8643 2
8652 0
8662 0
8672 0
8681 0
8693 0
8703 1
8713 0
8721 1
8731 0
8743 0
8752 0
8763 0
8773 0
8783 0
8793 0
8801 1
8812 1
8823 0
8833 0
8842 0
8853 0
8863 5
8872 0
8881 0
8893 2
8901 0
8913 0
In [9]:
train %>% select(ends_with('e'))
NameAgeFare
Braund, Mr. Owen Harris 22 7.2500
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38 71.2833
Heikkinen, Miss. Laina 26 7.9250
Futrelle, Mrs. Jacques Heath (Lily May Peel) 35 53.1000
Allen, Mr. William Henry 35 8.0500
Moran, Mr. James NA 8.4583
McCarthy, Mr. Timothy J 54 51.8625
Palsson, Master. Gosta Leonard 2 21.0750
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27 11.1333
Nasser, Mrs. Nicholas (Adele Achem) 14 30.0708
Sandstrom, Miss. Marguerite Rut 4 16.7000
Bonnell, Miss. Elizabeth 58 26.5500
Saundercock, Mr. William Henry 20 8.0500
Andersson, Mr. Anders Johan 39 31.2750
Vestrom, Miss. Hulda Amanda Adolfina 14 7.8542
Hewlett, Mrs. (Mary D Kingcome) 55 16.0000
Rice, Master. Eugene 2 29.1250
Williams, Mr. Charles Eugene NA 13.0000
Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) 31 18.0000
Masselmani, Mrs. Fatima NA 7.2250
Fynney, Mr. Joseph J 35 26.0000
Beesley, Mr. Lawrence 34 13.0000
McGowan, Miss. Anna "Annie" 15 8.0292
Sloper, Mr. William Thompson 28 35.5000
Palsson, Miss. Torborg Danira 8 21.0750
Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)38 31.3875
Emir, Mr. Farred Chehab NA 7.2250
Fortune, Mr. Charles Alexander 19 263.0000
O'Dwyer, Miss. Ellen "Nellie" NA 7.8792
Todoroff, Mr. Lalio NA 7.8958
Giles, Mr. Frederick Edward 21 11.5000
Swift, Mrs. Frederick Joel (Margaret Welles Barron)48 25.9292
Sage, Miss. Dorothy Edith "Dolly" NA 69.5500
Gill, Mr. John William 24 13.0000
Bystrom, Mrs. (Karolina) 42 13.0000
Duran y More, Miss. Asuncion 27 13.8583
Roebling, Mr. Washington Augustus II 31 50.4958
van Melkebeke, Mr. Philemon NA 9.5000
Johnson, Master. Harold Theodor 4 11.1333
Balkic, Mr. Cerin 26 7.8958
Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 47 52.5542
Carlsson, Mr. Frans Olof 33 5.0000
Vander Cruyssen, Mr. Victor 47 9.0000
Abelson, Mrs. Samuel (Hannah Wizosky) 28 24.0000
Najib, Miss. Adele Kiamie "Jane" 15 7.2250
Gustafsson, Mr. Alfred Ossian 20 9.8458
Petroff, Mr. Nedelio 19 7.8958
Laleff, Mr. Kristo NA 7.8958
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 56 83.1583
Shelley, Mrs. William (Imanita Parrish Hall) 25 26.0000
Markun, Mr. Johann 33 7.8958
Dahlberg, Miss. Gerda Ulrika 22 10.5167
Banfield, Mr. Frederick James 28 10.5000
Sutehall, Mr. Henry Jr 25 7.0500
Rice, Mrs. William (Margaret Norton) 39 29.1250
Montvila, Rev. Juozas 27 13.0000
Graham, Miss. Margaret Edith 19 30.0000
Johnston, Miss. Catherine Helen "Carrie" NA 23.4500
Behr, Mr. Karl Howell 26 30.0000
Dooley, Mr. Patrick 32 7.7500

Group_by:

Group_by is a lot similar to SQL Group by but more versatile. It is related to concept of “split-apply-combine”. Let us understand group_by with a starter example of finding out number of male and number of female - which logically could be the count of each Sex Type (once grouped by Sex).

In [10]:
train %>% group_by(Sex) %>% count()
Sexn
female314
male 577

Aha! That seems simple and now let us do a two level grouping to understand how many of survived of each gender.

In [11]:
train %>% group_by(Survived, Sex) %>% count()

train %>% group_by(Sex, Survived) %>% count()
SurvivedSexn
0 female 81
0 male 468
1 female233
1 male 109
SexSurvivedn
female0 81
female1 233
male 0 468
male 1 109

Mutate and Summarise:

That's minimally group_by, but the true power of group_by is unveiled only when it is coupled with mutate and summarise functions.

Mutate function adds a new column based on the given expression while summarise function summarises the dataset based on the given function and let us see the difference in action with the following example.

Let us get the average age of all survivors (and non-survivors): so this must be group_by -ed based on Survived while summarised by Age so that we will get a summarised mean value.for two groups.

In [12]:
train %>% group_by(Survived) %>% summarise(mean(Age))

 #Remember we have got NAs, so mean() wouldn't work and to bypass NAs, na.rm = T must be passed. 

train %>% group_by(Survived) %>% summarise(average_age = mean(Age,na.rm=T))
Survivedmean(Age)
0 NA
1 NA
Survivedaverage_age
0 30.62618
1 28.34369

That's summarise() giving us the summary of the dataframe. If we need to create a new column, values filled for all 891 datapoints, that's where mutate plays its role. Let us create a new column, Age_Bracket containing value Minor if Age is less than 18 else Major

In [13]:
train %>% mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% select(starts_with('Age'))

#In fact this can be coupled with Survivor list to see the impact of this Age_bracket

train %>% 
mutate(Age_Bracket = ifelse(Age < 18, 'Minor','Major')) %>% 
group_by(Survived,Age_Bracket) %>% 
summarise(pnt = (n()/nrow(train))*100)
AgeAge_Bracket
22 Major
38 Major
26 Major
35 Major
35 Major
NA NA
54 Major
2 Minor
27 Major
14 Minor
4 Minor
58 Major
20 Major
39 Major
14 Minor
55 Major
2 Minor
NA NA
31 Major
NA NA
35 Major
34 Major
15 Minor
28 Major
8 Minor
38 Major
NA NA
19 Major
NA NA
NA NA
21 Major
48 Major
NA NA
24 Major
42 Major
27 Major
31 Major
NA NA
4 Minor
26 Major
47 Major
33 Major
47 Major
28 Major
15 Minor
20 Major
19 Major
NA NA
56 Major
25 Major
33 Major
22 Major
28 Major
25 Major
39 Major
27 Major
19 Major
NA NA
26 Major
32 Major
SurvivedAge_Bracketpnt
0 Major 41.750842
0 Minor 5.836139
0 NA 14.029181
1 Major 25.701459
1 Minor 6.846240
1 NA 5.836139

That's how dplyr can get more powerful with group_by coupled with mutate or summarise for feautre engineering and for better data visualization. But this doesn't stop here, because one of the most important function a dataanalyst would require is sorting and that's what arrange() does.

arrange - in ascending order:

In [14]:
train %>% arrange(Fare) %>% tail(22) #Extracting last 22 results after sorting the fare in asending order
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
870319 1 1 Wick, Miss. Mary Natalie female 31 0 2 36928 164.8667 C7 S
871857 1 1 Wick, Mrs. George Dennick (Mary Hitchcock) female 45 1 1 36928 164.8667 S
872690 1 1 Madill, Miss. Georgette Alexandra female 15 0 1 24160 211.3375 B5 S
873731 1 1 Allen, Miss. Elisabeth Walton female 29 0 0 24160 211.3375 B5 S
874780 1 1 Robert, Mrs. Edward Scott (Elisabeth Walton McMillan)female 43 0 1 24160 211.3375 B3 S
875378 0 1 Widener, Mr. Harry Elkins male 27 0 2 113503 211.5000 C82 C
876528 0 1 Farthing, Mr. John male NA 0 0 PC 17483 221.7792 C95 S
877381 1 1 Bidois, Miss. Rosalie female 42 0 0 PC 17757 227.5250 C
878558 0 1 Robbins, Mr. Victor male NA 0 0 PC 17757 227.5250 C
879701 1 1 Astor, Mrs. John Jacob (Madeleine Talmadge Force) female 18 1 0 PC 17757 227.5250 C62 C64 C
880717 1 1 Endres, Miss. Caroline Louise female 38 0 0 PC 17757 227.5250 C45 C
881119 0 1 Baxter, Mr. Quigg Edmond male 24 0 1 PC 17558 247.5208 B58 B60 C
882300 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50 0 1 PC 17558 247.5208 B58 B60 C
883312 1 1 Ryerson, Miss. Emily Borie female 18 2 2 PC 17608 262.3750 B57 B59 B63 B66 C
884743 1 1 Ryerson, Miss. Susan Parker "Suzette" female 21 2 2 PC 17608 262.3750 B57 B59 B63 B66 C
885 28 0 1 Fortune, Mr. Charles Alexander male 19 3 2 19950 263.0000 C23 C25 C27 S
886 89 1 1 Fortune, Miss. Mabel Helen female 23 3 2 19950 263.0000 C23 C25 C27 S
887342 1 1 Fortune, Miss. Alice Elizabeth female 24 3 2 19950 263.0000 C23 C25 C27 S
888439 0 1 Fortune, Mr. Mark male 64 1 4 19950 263.0000 C23 C25 C27 S
889259 1 1 Ward, Miss. Anna female 35 0 0 PC 17755 512.3292 C
890680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36 0 1 PC 17755 512.3292 B51 B53 B55 C
891738 1 1 Lesurer, Mr. Gustave J male 35 0 0 PC 17755 512.3292 B101 C
In [15]:
## Arrange in descending order

train %>% arrange(desc(Age)) %>% head(10)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
631 1 1 Barkworth, Mr. Algernon Henry Wilsonmale 80.0 0 0 27042 30.0000 A23 S
852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 S
97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
494 0 1 Artagaveytia, Mr. Ramon male 71.0 0 0 PC 17609 49.5042 C
117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 Q
673 0 2 Mitchell, Mr. Henry Michael male 70.0 0 0 C.A. 24580 10.5000 S
746 0 1 Crosby, Capt. Edward Gifford male 70.0 1 1 WE/P 5735 71.0000 B22 S
34 0 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 S
55 0 1 Ostby, Mr. Engelhart Cornelius male 65.0 0 1 113509 61.9792 B30 C
281 0 3 Duane, Mr. Frank male 65.0 0 0 336439 7.7500 Q

filter:

Filter does row_wise filter ( similar to what select did with columns). filter() takes a logical expression and evaluates them and results the only_true datapoints. So to be clear, all that matters to filter() function is if the expression evaluates to TRUE.

Let us start with filtering (extracting) only male and getting their Embarked station count.

In [16]:
train %>% 
filter(Sex == 'male') %>%
group_by(Embarked) %>%
count()
Embarkedn
C 95
Q 41
S 441
In [17]:
#Getting the count of everyone whose age is lesser than 18

train %>% filter(Age < 18) %>% count()
n
113

Coupling Filter with Regex to perform simple string manipulation and detection.

In [18]:
train %>% filter(grepl('wick',train$Name))
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
344 0 2 Sedgwick, Mr. Charles Frederick Waddingtonmale 25 0 0 244361 13 S

And this is dplyr in a nut shell and hope you get a decent start with this notebook if you are a beginner. Please share your thoughts in comments and suggestions!

In [19]: