First Step: Packages

Ensure you have the necessary packages installed and mounted.

Mount dplyr and install the the hflights data to begin.

library(dplyr)
#install hflights if you haven't already, so you can mount the data set.
#install.packages("hflights")

What is dplyr?

Dplyr is part of the tidyverse - a suite of packages by Hadley Wickham that allow for more intuitive coding and data management practices.

Tidyverse includes dplyr, tidyr, ggplot2, tibble, readr, purr, stringr, and forcats

check www.tidyverse.org for more.

People tend to be either Base R purists or tidyverse fans. Understanding both is useful.

PROTIP: Check the cheatsheets for easy reference. Go to Help->Cheatsheets in Rstudio for more.

Using dplyr

dplyr streamlines your code a ton. It contains five basic verbs for data management:

filter, select, arrange, mutate, and summarise (plus group_by).

Let’s proceed by example, using the hflights data set.

library(hflights)
head(hflights)

##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0

#convert hflights from a tibble to a data frame.
flights <-data.frame(hflights)

#check the variable/column names in flights data set.
ls(flights)

##  [1] "ActualElapsedTime" "AirTime"           "ArrDelay"         
##  [4] "ArrTime"           "CancellationCode"  "Cancelled"        
##  [7] "DayofMonth"        "DayOfWeek"         "DepDelay"         
## [10] "DepTime"           "Dest"              "Distance"         
## [13] "Diverted"          "FlightNum"         "Month"            
## [16] "Origin"            "TailNum"           "TaxiIn"           
## [19] "TaxiOut"           "UniqueCarrier"     "Year"

Using the Chaining or Pipelining method

Rather than nesting functions, you can use the %>% operator (or Ctrl+Shift+M) to type commands in nested steps that are more intuitive.

#Pipelining with dplyr:

#example - select information from the flights data set on unique carriers and departure delays, but only for flights that were at least an hour behind schedule (60 mins).  
pipeline1<-flights %>% select(UniqueCarrier, DepDelay) %>% filter (DepDelay > 60)

ONE BIG ADVANTAGE OF PIPELINING

You can select part of the pipeline to test out and make sure it works. In the above example, you can see whether JUST the select function is working properly, without continuing to filter the results.

#EXAMPLE 2 - select columns indicating carrier, flight number, and departure time, filter so it only shows united airlines flights that left after 2pm.
EXpipeline <- flights %>% 
  select(UniqueCarrier, FlightNum, DepTime) %>% 
  filter(UniqueCarrier == "UA", DepTime > 1400)
head(EXpipeline)

##   UniqueCarrier FlightNum DepTime
## 1            UA       108    1536
## 2            UA       108    1538
## 3            UA       108    1528
## 4            UA       108    1540
## 5            UA       108    1543
## 6            UA       108    1529

The Primary dplyr Functions

The FILTER function

Filter works just like it does in lots of other programs.

#example filter data to only get departure times for American Airlines flights

filter1<- flights %>% 
  select(UniqueCarrier,DepTime) %>% 
  filter(UniqueCarrier =="AA")
head(filter1)

##   UniqueCarrier DepTime
## 1            AA    1400
## 2            AA    1401
## 3            AA    1352
## 4            AA    1403
## 5            AA    1405
## 6            AA    1359

The SELECT function

The select function can be used to choose only a specific set of columns/variables.

It is often used in tandem with the “contains” function, which lets you search a particular vector for the presence of a specific value or string.

#base R approach
selectBASE<-flights[,c("DepTime", "ArrTime", "FlightNum")]

#dplyr approach
select1<-select(flights, DepTime, ArrTime, FlightNum)

#use colon to select contiguous columns/variables
#use 'contains' to select columns that contain a particular string

select2<-select(flights, Year:DayofMonth, contains("Taxi"))

#so selecting all columns that containn date and time info, along with those containing info about flight cancellations could be done using:


EXselect<-select(flights, Month, DayofMonth, DayOfWeek, DepTime, ArrTime, contains("Cancel"))
head(EXselect)

##      Month DayofMonth DayOfWeek DepTime ArrTime Cancelled CancellationCode
## 5424     1          1         6    1400    1500         0                 
## 5425     1          2         7    1401    1501         0                 
## 5426     1          3         1    1352    1502         0                 
## 5427     1          4         2    1403    1513         0                 
## 5428     1          5         3    1405    1507         0                 
## 5429     1          6         4    1359    1503         0

***

The ARRANGE function

We can reorder the rows that we want to see, based on some criterion/criteria.

arrange1<-flights %>% 
  select(UniqueCarrier, DepDelay) %>% 
  arrange(DepDelay)

#use the 'desc' qualifier to switch the arrangement to descending order. It defaults to ascending.

#Example - select all AA flights that left in the afternoon on January 1st, and arrange by departure time.

EXarrange<-flights %>% 
  select(UniqueCarrier, DepTime, Month, DayofMonth) %>% 
  filter(UniqueCarrier =="AA", Month == "1", DayofMonth == "1", DepTime > 1200) %>% 
  arrange(DepTime)
head(EXarrange)

##   UniqueCarrier DepTime Month DayofMonth
## 1            AA    1211     1          1
## 2            AA    1400     1          1
## 3            AA    1631     1          1
## 4            AA    1756     1          1

The MUTATE function

You can use mutate to change variables and add new versions of them to the data frame.

#Example - we can add a variable called Speed that's a function of distance and travel time of each flight.

mutate1<-flights %>% 
  select(Distance, AirTime) %>% 
  mutate(Speed = Distance/AirTime*60)

#to store the new variable, assign it to flights using the standard <- arrow function.
flights2 <- flights %>%  mutate(Speed = Distance/AirTime*60)


#Example - calculate total delay of each flight based on departure and arrival delay, and store it as a new value in flights data.

flights3 <- flights %>% mutate(TotalDelay = DepDelay + ArrDelay)

#check the new data.
data.frame(head(flights3))

##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1          1         6    1400    1500            AA       428
## 2 2011     1          2         7    1401    1501            AA       428
## 3 2011     1          3         1    1352    1502            AA       428
## 4 2011     1          4         2    1403    1513            AA       428
## 5 2011     1          5         3    1405    1507            AA       428
## 6 2011     1          6         4    1359    1503            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N576AA                60      40      -10        0    IAH  DFW      224
## 2  N557AA                60      45       -9        1    IAH  DFW      224
## 3  N541AA                70      48       -8       -8    IAH  DFW      224
## 4  N403AA                70      39        3        3    IAH  DFW      224
## 5  N492AA                62      44       -3        5    IAH  DFW      224
## 6  N262AA                64      45       -7       -1    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted TotalDelay
## 1      7      13         0                         0        -10
## 2      6       9         0                         0         -8
## 3      5      17         0                         0        -16
## 4      9      22         0                         0          6
## 5      9       9         0                         0          2
## 6      6      13         0                         0         -8

The SUMMARISE function

Summarise lets you group results by a particular criterion/criteria. It is very common to use ‘summarise’ in tandem with the group_by function.

including na.rm-=T will remove any “NA” values prior to summarizing the data.

#summarize by creating a table grouped by destination and then summarize each group by taking the mean of arrival delay.

summarise1<-flights %>% 
  group_by(Dest) %>% 
  summarise (avg_delay = mean(ArrDelay, na.rm=T))

#calculate the mean delay in arrival times for each month of the year

EXsummarize<-flights %>% 
  group_by(Month) %>% 
  summarise (avg_delay = mean(ArrDelay, na.rm=T))
head(EXsummarize)

## # A tibble: 6 x 2
##   Month avg_delay
##   <int>     <dbl>
## 1     1      4.93
## 2     2      6.35
## 3     3      7.89
## 4     4     11.1 
## 5     5     13.1 
## 6     6     10.8

Lab 3 - Data Manipulation Using dplyr

Fred Clavel

January 29, 2019