Outline

dplyr: What? Why?
Table dataframes
Functions
Exercises

Why dplyr?

dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R. dplyr is the next iteration of plyr, focussing on only data frames. The main advantages include:

Speed. Compared to plyr library (home of the familiar ddply function), dplyr is anywhere between 20X - 100X faster in its calculations.
Cleaner Code. the syntax allows for function chaining, preventing any potential cluttering in the code, which in turn makes for easier code writing/reading.
Simpler Code. dplyr has a limited number of functions (5) that are focused on the most common requirements of data manipulation. the syntax is both simple and effecient.

Before Data Manipulation: `tbl_df()`

Just as with plyr library, dplyr works best with dataframes. Please note that there is a type particular to dplyr, table dataframes, which make it worthwhile to convert all of our data to. The main advantage is that table dataframes are “smarter”. Let's observe the difference via an example using aviation data from Houston, Texas.

library(hflights)
hflights <- hflights

After reading in the data, try doing a head() of hflights.

##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0

Not very friendly to the Rstudio console or the human eye. Now, let's convert into a table dataframe and see the difference.

hflights_df <- tbl_df(hflights)
hflights_df

## Source: local data frame [227,496 x 21]
## 
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
## 5430 2011     1          7         5    1359    1509            AA
## 5431 2011     1          8         6    1355    1454            AA
## 5432 2011     1          9         7    1443    1554            AA
## 5433 2011     1         10         1    1443    1553            AA
## ..    ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

Functions

dplyr has a group of functions/verbs used for data manipulation:

select(): focus on a subset of columns
filter(): focus on a subset of rows
arrange(): re-order the rows
mutate(): add new columns
summarise(): produce summary statistics (aggregate)

Selecting columns with `select()`

We are very familiar with two methods of selecting a set of columns from a larger data frame. The first is the subset() function, and the other is the rows, columns slicing syntax [,].

The select() function is even more straightforward and powerful. Let's explore through some examples:

# select three columns by name (no quotations)
head(dplyr::select(hflights_df, Year, Month, AirTime), 2)

## Source: local data frame [2 x 3]
## 
##      Year Month AirTime
## 5424 2011     1      40
## 5425 2011     1      45

# all but three columns
head(dplyr::select(hflights_df, -Year, -Month, -AirTime), 2)

## Source: local data frame [2 x 18]
## 
##      DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
## 5424          1         6    1400    1500            AA       428  N576AA
## 5425          2         7    1401    1501            AA       428  N557AA
## Variables not shown: ActualElapsedTime (int), ArrDelay (int), DepDelay
##   (int), Origin (chr), Dest (chr), Distance (int), TaxiIn (int), TaxiOut
##   (int), Cancelled (int), CancellationCode (chr), Diverted (int)

# sequence of columns by name instead of index
head(dplyr::select(hflights_df, Year:AirTime), 2)

## Source: local data frame [2 x 11]
## 
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int)

One final note about select() is the ability to do renaming of columns with a very straightforward syntax: new_col_name = old_name. Let's select the first three columns we began with and rename them in Spanish.

# select and rename three columns
head(dplyr::select(hflights_df, Año = Year, Mes = Month, TiempoDelVuelo = AirTime), 
    2)

## Source: local data frame [2 x 3]
## 
##       Año Mes TiempoDelVuelo
## 5424 2011   1             40
## 5425 2011   1             45

Notice how the new dataframe only contains the three columns we've just created. To keep the remaining columns, there is a subfunction called matches() that does the trick.

# rename three columns and select all
head(dplyr::select(hflights_df, Año = Year, Mes = Month, TiempoDelVuelo = AirTime, 
    matches(".")), 2)

## Source: local data frame [2 x 21]
## 
##       Año Mes TiempoDelVuelo DayofMonth DayOfWeek DepTime ArrTime
## 5424 2011   1             40          1         6    1400    1500
## 5425 2011   1             45          2         7    1401    1501
## Variables not shown: UniqueCarrier (chr), FlightNum (int), TailNum (chr),
##   ActualElapsedTime (int), ArrDelay (int), DepDelay (int), Origin (chr),
##   Dest (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled
##   (int), CancellationCode (chr), Diverted (int)

Filter data with `filter()`

Similair to subset() or [,], we can filter our data by rows evaluated by a certain criteria.

To select all flights on New Years Day, we can say:

filter(hflights_df, Month == 1, DayofMonth == 1)

## Source: local data frame [552 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          1         6     728     840            AA       460
## 3  2011     1          1         6    1631    1736            AA      1121
## 4  2011     1          1         6    1756    2112            AA      1294
## 5  2011     1          1         6    1012    1347            AA      1700
## 6  2011     1          1         6    1211    1325            AA      1820
## 7  2011     1          1         6     557     906            AA      1994
## 8  2011     1          1         6    1824    2106            AS       731
## 9  2011     1          1         6     654    1124            B6       620
## 10 2011     1          1         6    1639    2110            B6       622
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

# using filter takes the place of the more cumbersome (and much more messily
# printed):
hflights[hflights$Month == 1 & hflights$DayofMonth == 1, ]

Order data with `arrange()`

This function works the same as filter(), except that instead of selecting rows, it reorders them. The first argument is a dataframe, follwed by a set of column names to order by.

Also note the desc() option which orders in descending (alphabetical or numerical) order. To arrange by reverse choronology i.e. which flights happened closest to present day:

arrange(hflights_df, desc(Month), desc(DayofMonth))

## Source: local data frame [227,496 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011    12         31         6     556     918            AA       466
## 2  2011    12         31         6    1156    1254            AA       865
## 3  2011    12         31         6    1616    1714            AA      1033
## 4  2011    12         31         6    1729    2044            AA      1294
## 5  2011    12         31         6    1441    1542            AA      1740
## 6  2011    12         31         6     901    1013            AA      1788
## 7  2011    12         31         6    1024    1343            AA      2234
## 8  2011    12         31         6    1843    2131            AS       731
## 9  2011    12         31         6     831    1251            B6       620
## 10 2011    12         31         6    1544    2009            B6       622
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

Create new columns with `mutate()`

This function requires a table dataframe as a first argument, and any desired columns to follow. It requires column creation to be more neatly written in the code, and the payoff in speed is also a significant benefit!

  mutate(hflights_df,

         Distance_km = Distance*1.60934,

         flightCode = paste(FlightNum, UniqueCarrier, sep="_"))

One cool feature of dplyr::mutate() is the ability to refer to columns that you just created. What is the following code producing for us in terms of data?

  hflights_df %.% 
    mutate(Distance_km = round(Distance*1.6),

         AirPoints = Distance_km*2) %.%

    select(Distance_km, AirPoints) %.%

    arrange(desc(AirPoints)) %.%

  tail(5)

## Source: local data frame [5 x 2]
## 
##        Distance_km AirPoints
## 227492         203       406
## 227493         203       406
## 227494         126       252
## 227495         126       252
## 227496         126       252

Aggregate data with `summarise()`

Testing our understanding: Using plyr library, create a dataframe that contains the number of flights per month out of Houston.

Now, let's compare and contrast the process of aggregation in plyr vs. dplyr.

hflights_df %.% group_by(Month) %.% summarise(num_flights_month = n()) %.% arrange(desc(num_flights_month))

## Source: local data frame [12 x 2]
## 
##    Month num_flights_month
## 1      7             20548
## 2      8             20176
## 3      6             19600
## 4      3             19470
## 5      5             19172
## 6     12             19117
## 7      1             18910
## 8     10             18696
## 9      4             18593
## 10     9             18065
## 11    11             18021
## 12     2             17128

Exercises

You should finish with two dataframes, one called cars the other newcars. Using the datasets library (specifically the cars dataframe) and dplyr please:

create a variable time that divdes distance by speed
create two new variables that correspond to km instead of miles (e.g. speed_km, dist_km)
create a time2 variable (that should be equal to time) with the same formula using the variables from step 2)
create a car_build variable that is either (fast, average, slow) depending on whether its speed is (0-10, 11-20, 21+)
remove any distance that is less than or equal to 4 miles
order the data in descending order by distance, then by increasing order by speed
keep all columns and rename speed and dist to contain the _miles suffix.
create a new dataframe (newcars) that aggregates total distance covered in both miles and km by car_build.

Resources

For help on the base functions of dplyr, run this in the R console: ?manip
Hadley Wickham's dplyr github page. The readme is quite helpful.
An introduction to dplyr from the folks at Rstudio that this tutorial is based heavily on. You can also just run this line of code in R: vignette("introduction", package = "dplyr")
Advantages of dplyr package explored

Outline

Why dplyr?

Before Data Manipulation: tbl_df()

Functions

Selecting columns with select()

Filter data with filter()

Order data with arrange()

Create new columns with mutate()