dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R. dplyr is the next iteration of plyr, focussing on only data frames. The main advantages include:
Speed. Compared to plyr library (home of the familiar ddply
function), dplyr is anywhere between 20X - 100X faster in its calculations.
Cleaner Code. the syntax allows for function chaining, preventing any potential cluttering in the code, which in turn makes for easier code writing/reading.
Simpler Code. dplyr has a limited number of functions (5) that are focused on the most common requirements of data manipulation. the syntax is both simple and effecient.
tbl_df()
Just as with plyr library, dplyr works best with dataframes. Please note that there is a type particular to dplyr, table dataframes, which make it worthwhile to convert all of our data to. The main advantage is that table dataframes are “smarter”. Let's observe the difference via an example using aviation data from Houston, Texas.
library(hflights)
hflights <- hflights
After reading in the data, try doing a head()
of hflights.
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
Not very friendly to the Rstudio console or the human eye. Now, let's convert into a table dataframe and see the difference.
hflights_df <- tbl_df(hflights)
hflights_df
## Source: local data frame [227,496 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## 5430 2011 1 7 5 1359 1509 AA
## 5431 2011 1 8 6 1355 1454 AA
## 5432 2011 1 9 7 1443 1554 AA
## 5433 2011 1 10 1 1443 1553 AA
## .. ... ... ... ... ... ... ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
## (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
## (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
dplyr has a group of functions/verbs used for data manipulation:
select()
: focus on a subset of columnsfilter()
: focus on a subset of rowsarrange()
: re-order the rowsmutate()
: add new columnssummarise()
: produce summary statistics (aggregate)select()
We are very familiar with two methods of selecting a set of columns from a larger data frame. The first is the subset()
function, and the other is the rows, columns slicing syntax [,]
.
The select()
function is even more straightforward and powerful. Let's explore through some examples:
# select three columns by name (no quotations)
head(dplyr::select(hflights_df, Year, Month, AirTime), 2)
## Source: local data frame [2 x 3]
##
## Year Month AirTime
## 5424 2011 1 40
## 5425 2011 1 45
# all but three columns
head(dplyr::select(hflights_df, -Year, -Month, -AirTime), 2)
## Source: local data frame [2 x 18]
##
## DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
## 5424 1 6 1400 1500 AA 428 N576AA
## 5425 2 7 1401 1501 AA 428 N557AA
## Variables not shown: ActualElapsedTime (int), ArrDelay (int), DepDelay
## (int), Origin (chr), Dest (chr), Distance (int), TaxiIn (int), TaxiOut
## (int), Cancelled (int), CancellationCode (chr), Diverted (int)
# sequence of columns by name instead of index
head(dplyr::select(hflights_df, Year:AirTime), 2)
## Source: local data frame [2 x 11]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
## (int), AirTime (int)
One final note about select()
is the ability to do renaming of columns with a very straightforward syntax: new_col_name = old_name
. Let's select the first three columns we began with and rename them in Spanish.
# select and rename three columns
head(dplyr::select(hflights_df, Año = Year, Mes = Month, TiempoDelVuelo = AirTime),
2)
## Source: local data frame [2 x 3]
##
## Año Mes TiempoDelVuelo
## 5424 2011 1 40
## 5425 2011 1 45
Notice how the new dataframe only contains the three columns we've just created. To keep the remaining columns, there is a subfunction called matches()
that does the trick.
# rename three columns and select all
head(dplyr::select(hflights_df, Año = Year, Mes = Month, TiempoDelVuelo = AirTime,
matches(".")), 2)
## Source: local data frame [2 x 21]
##
## Año Mes TiempoDelVuelo DayofMonth DayOfWeek DepTime ArrTime
## 5424 2011 1 40 1 6 1400 1500
## 5425 2011 1 45 2 7 1401 1501
## Variables not shown: UniqueCarrier (chr), FlightNum (int), TailNum (chr),
## ActualElapsedTime (int), ArrDelay (int), DepDelay (int), Origin (chr),
## Dest (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled
## (int), CancellationCode (chr), Diverted (int)
filter()
Similair to subset()
or [,]
, we can filter our data by rows evaluated by a certain criteria.
To select all flights on New Years Day, we can say:
filter(hflights_df, Month == 1, DayofMonth == 1)
## Source: local data frame [552 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 1 6 728 840 AA 460
## 3 2011 1 1 6 1631 1736 AA 1121
## 4 2011 1 1 6 1756 2112 AA 1294
## 5 2011 1 1 6 1012 1347 AA 1700
## 6 2011 1 1 6 1211 1325 AA 1820
## 7 2011 1 1 6 557 906 AA 1994
## 8 2011 1 1 6 1824 2106 AS 731
## 9 2011 1 1 6 654 1124 B6 620
## 10 2011 1 1 6 1639 2110 B6 622
## .. ... ... ... ... ... ... ... ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
## (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
## Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
# using filter takes the place of the more cumbersome (and much more messily
# printed):
hflights[hflights$Month == 1 & hflights$DayofMonth == 1, ]
arrange()
This function works the same as filter()
, except that instead of selecting rows, it reorders them. The first argument is a dataframe, follwed by a set of column names to order by.
Also note the desc()
option which orders in descending (alphabetical or numerical) order. To arrange by reverse choronology i.e. which flights happened closest to present day:
arrange(hflights_df, desc(Month), desc(DayofMonth))
## Source: local data frame [227,496 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 12 31 6 556 918 AA 466
## 2 2011 12 31 6 1156 1254 AA 865
## 3 2011 12 31 6 1616 1714 AA 1033
## 4 2011 12 31 6 1729 2044 AA 1294
## 5 2011 12 31 6 1441 1542 AA 1740
## 6 2011 12 31 6 901 1013 AA 1788
## 7 2011 12 31 6 1024 1343 AA 2234
## 8 2011 12 31 6 1843 2131 AS 731
## 9 2011 12 31 6 831 1251 B6 620
## 10 2011 12 31 6 1544 2009 B6 622
## .. ... ... ... ... ... ... ... ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
## (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
## Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
mutate()
This function requires a table dataframe as a first argument, and any desired columns to follow. It requires column creation to be more neatly written in the code, and the payoff in speed is also a significant benefit!
mutate(hflights_df,
Distance_km = Distance*1.60934,
flightCode = paste(FlightNum, UniqueCarrier, sep="_"))
One cool feature of dplyr::mutate()
is the ability to refer to columns that you just created. What is the following code producing for us in terms of data?
hflights_df %.%
mutate(Distance_km = round(Distance*1.6),
AirPoints = Distance_km*2) %.%
select(Distance_km, AirPoints) %.%
arrange(desc(AirPoints)) %.%
tail(5)
## Source: local data frame [5 x 2]
##
## Distance_km AirPoints
## 227492 203 406
## 227493 203 406
## 227494 126 252
## 227495 126 252
## 227496 126 252
summarise()
Testing our understanding: Using plyr library, create a dataframe that contains the number of flights per month out of Houston.
Now, let's compare and contrast the process of aggregation in plyr vs. dplyr.
hflights_df %.% group_by(Month) %.% summarise(num_flights_month = n()) %.% arrange(desc(num_flights_month))
## Source: local data frame [12 x 2]
##
## Month num_flights_month
## 1 7 20548
## 2 8 20176
## 3 6 19600
## 4 3 19470
## 5 5 19172
## 6 12 19117
## 7 1 18910
## 8 10 18696
## 9 4 18593
## 10 9 18065
## 11 11 18021
## 12 2 17128
You should finish with two dataframes, one called cars the other newcars. Using the datasets
library (specifically the cars
dataframe) and dplyr please:
create a variable time
that divdes distance by speed
create two new variables that correspond to km instead of miles (e.g. speed_km
, dist_km
)
create a time2
variable (that should be equal to time
) with the same formula using the variables from step 2)
create a car_build
variable that is either (fast, average, slow) depending on whether its speed
is (0-10, 11-20, 21+)
remove any distance that is less than or equal to 4 miles
order the data in descending order by distance, then by increasing order by speed
keep all columns and rename speed
and dist
to contain the _miles
suffix.
create a new dataframe (newcars) that aggregates total distance covered in both miles and km by car_build
.
Resources
For help on the base functions of dplyr, run this in the R console: ?manip
Hadley Wickham's dplyr github page. The readme is quite helpful.
An introduction to dplyr from the folks at Rstudio that this tutorial is based heavily on. You can also just run this line of code in R:
vignette("introduction", package = "dplyr")
Advantages of dplyr package explored