Ensure you have the necessary packages installed and mounted.
Mount dplyr and install the the hflights data to begin.
library(dplyr)
#install hflights if you haven't already, so you can mount the data set.
#install.packages("hflights")
Dplyr is part of the tidyverse - a suite of packages by Hadley Wickham that allow for more intuitive coding and data management practices.
Tidyverse includes dplyr, tidyr, ggplot2, tibble, readr, purr, stringr, and forcats
check www.tidyverse.org for more.
People tend to be either Base R purists or tidyverse fans. Understanding both is useful.
PROTIP: Check the cheatsheets for easy reference. Go to Help->Cheatsheets in Rstudio for more.
dplyr streamlines your code a ton. It contains five basic verbs for data management:
filter, select, arrange, mutate, and summarise (plus group_by).
Let’s proceed by example, using the hflights data set.
library(hflights)
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
#convert hflights from a tibble to a data frame.
flights <-data.frame(hflights)
#check the variable/column names in flights data set.
ls(flights)
## [1] "ActualElapsedTime" "AirTime" "ArrDelay"
## [4] "ArrTime" "CancellationCode" "Cancelled"
## [7] "DayofMonth" "DayOfWeek" "DepDelay"
## [10] "DepTime" "Dest" "Distance"
## [13] "Diverted" "FlightNum" "Month"
## [16] "Origin" "TailNum" "TaxiIn"
## [19] "TaxiOut" "UniqueCarrier" "Year"
Rather than nesting functions, you can use the %>% operator (or Ctrl+Shift+M) to type commands in nested steps that are more intuitive.
#Pipelining with dplyr:
#example - select information from the flights data set on unique carriers and departure delays, but only for flights that were at least an hour behind schedule (60 mins).
pipeline1<-flights %>% select(UniqueCarrier, DepDelay) %>% filter (DepDelay > 60)
You can select part of the pipeline to test out and make sure it works. In the above example, you can see whether JUST the select function is working properly, without continuing to filter the results.
#EXAMPLE 2 - select columns indicating carrier, flight number, and departure time, filter so it only shows united airlines flights that left after 2pm.
EXpipeline <- flights %>%
select(UniqueCarrier, FlightNum, DepTime) %>%
filter(UniqueCarrier == "UA", DepTime > 1400)
head(EXpipeline)
## UniqueCarrier FlightNum DepTime
## 1 UA 108 1536
## 2 UA 108 1538
## 3 UA 108 1528
## 4 UA 108 1540
## 5 UA 108 1543
## 6 UA 108 1529
Filter works just like it does in lots of other programs.
#example filter data to only get departure times for American Airlines flights
filter1<- flights %>%
select(UniqueCarrier,DepTime) %>%
filter(UniqueCarrier =="AA")
head(filter1)
## UniqueCarrier DepTime
## 1 AA 1400
## 2 AA 1401
## 3 AA 1352
## 4 AA 1403
## 5 AA 1405
## 6 AA 1359
The select function can be used to choose only a specific set of columns/variables.
It is often used in tandem with the “contains” function, which lets you search a particular vector for the presence of a specific value or string.
#base R approach
selectBASE<-flights[,c("DepTime", "ArrTime", "FlightNum")]
#dplyr approach
select1<-select(flights, DepTime, ArrTime, FlightNum)
#use colon to select contiguous columns/variables
#use 'contains' to select columns that contain a particular string
select2<-select(flights, Year:DayofMonth, contains("Taxi"))
#so selecting all columns that containn date and time info, along with those containing info about flight cancellations could be done using:
EXselect<-select(flights, Month, DayofMonth, DayOfWeek, DepTime, ArrTime, contains("Cancel"))
head(EXselect)
## Month DayofMonth DayOfWeek DepTime ArrTime Cancelled CancellationCode
## 5424 1 1 6 1400 1500 0
## 5425 1 2 7 1401 1501 0
## 5426 1 3 1 1352 1502 0
## 5427 1 4 2 1403 1513 0
## 5428 1 5 3 1405 1507 0
## 5429 1 6 4 1359 1503 0
***
We can reorder the rows that we want to see, based on some criterion/criteria.
arrange1<-flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(DepDelay)
#use the 'desc' qualifier to switch the arrangement to descending order. It defaults to ascending.
#Example - select all AA flights that left in the afternoon on January 1st, and arrange by departure time.
EXarrange<-flights %>%
select(UniqueCarrier, DepTime, Month, DayofMonth) %>%
filter(UniqueCarrier =="AA", Month == "1", DayofMonth == "1", DepTime > 1200) %>%
arrange(DepTime)
head(EXarrange)
## UniqueCarrier DepTime Month DayofMonth
## 1 AA 1211 1 1
## 2 AA 1400 1 1
## 3 AA 1631 1 1
## 4 AA 1756 1 1
You can use mutate to change variables and add new versions of them to the data frame.
#Example - we can add a variable called Speed that's a function of distance and travel time of each flight.
mutate1<-flights %>%
select(Distance, AirTime) %>%
mutate(Speed = Distance/AirTime*60)
#to store the new variable, assign it to flights using the standard <- arrow function.
flights2 <- flights %>% mutate(Speed = Distance/AirTime*60)
#Example - calculate total delay of each flight based on departure and arrival delay, and store it as a new value in flights data.
flights3 <- flights %>% mutate(TotalDelay = DepDelay + ArrDelay)
#check the new data.
data.frame(head(flights3))
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1 N576AA 60 40 -10 0 IAH DFW 224
## 2 N557AA 60 45 -9 1 IAH DFW 224
## 3 N541AA 70 48 -8 -8 IAH DFW 224
## 4 N403AA 70 39 3 3 IAH DFW 224
## 5 N492AA 62 44 -3 5 IAH DFW 224
## 6 N262AA 64 45 -7 -1 IAH DFW 224
## TaxiIn TaxiOut Cancelled CancellationCode Diverted TotalDelay
## 1 7 13 0 0 -10
## 2 6 9 0 0 -8
## 3 5 17 0 0 -16
## 4 9 22 0 0 6
## 5 9 9 0 0 2
## 6 6 13 0 0 -8
Summarise lets you group results by a particular criterion/criteria. It is very common to use ‘summarise’ in tandem with the group_by function.
including na.rm-=T will remove any “NA” values prior to summarizing the data.
#summarize by creating a table grouped by destination and then summarize each group by taking the mean of arrival delay.
summarise1<-flights %>%
group_by(Dest) %>%
summarise (avg_delay = mean(ArrDelay, na.rm=T))
#calculate the mean delay in arrival times for each month of the year
EXsummarize<-flights %>%
group_by(Month) %>%
summarise (avg_delay = mean(ArrDelay, na.rm=T))
head(EXsummarize)
## # A tibble: 6 x 2
## Month avg_delay
## <int> <dbl>
## 1 1 4.93
## 2 2 6.35
## 3 3 7.89
## 4 4 11.1
## 5 5 13.1
## 6 6 10.8