Introduction to R packages

In R, packages are software modules providing extra functions, operators and datasets. Some packages, like base and stats, are part of the core R distribution: you don’t need to install them separately, and they are already loaded in your environment when your R session starts. In RStudio, you can see the available packages by clicking the Packages tab in the bottom right pane. The ones that are currently loaded in your environment appear with a tick mark. For those, you don’t need to prefix a function call with the package name and a double colon (e.g. you can write str_sub() instead of stringr::str_sub() if and only if the stringr package is already loaded in your environment).

In order to load a package, one calls library() with the package name as argument. Quoting is unnecessary.

Before using a package for the first time, you need to download and install it on your computer by calling install.packages() with the quoted name of the package as argument.
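
For instance, suppose you want to use the str_sub() function from the stringr package (stringr is only used as an illustration here). A minimal sketch of the full workflow would be:

> install.packages("stringr") # one-time download and installation
> library(stringr)            # attach the package for the current session
> str_sub("tidyverse", 1, 4)  # no need for the stringr:: prefix once the package is loaded
[1] "tidy"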

Introduction to the tidyverse collection

tidyverse is a collection of open-source packages for R, primarily developed by Hadley Wickham. Let us list the most common ones and broadly describe their scope:

  1. tibble to work with tibbles instead of data frames: tibbles are “smart” data containers that print more neatly, with stricter control of column types, etc.
  2. readr to import tabular data from different text files (tab-delimited, csv, fixed-width fields, etc) into tibbles
  3. readxl to import data from Excel files (both .xls and .xlsx) into tibbles
  4. tidyr to transform data with different degrees of “messiness” into tidy data. Tidy data is simple to understand, but many datasets you come across are not “tidy”. In tidy data, each variable is in its own column, and each observation (a.k.a “case” or “individual”) is in its own row. tidyr helps you achieve tidy data and reshape datasets by collapsing or splitting columns, creating combinations of cases, handling missing values, etc
  5. magrittr introduces additional pipe operators like %>% and %<>% (the latter is illustrated in a short sketch right after this list)
  6. dplyr is for me the central package of the tidyverse collection. It’s more than a pair of data pliers: it is a whole toolbox to extract information from a dataset, to group, summarize, join, apply functions to several variables, etc.
  7. ggplot2 implements a whole “grammar of graphics” and enables the user to create beautiful graphs of various types (scatterplots, histograms, jitter plots, violin plots, bar graphs, smoothed curves, 2D-density plots, etc), tweaking the way they are displayed through simple commands while maintaining a sense of graphical harmony.
  8. stringr enables you to manipulate strings efficiently and effortlessly.
  9. purrr is more advanced and deals with functional programming: programming by manipulating functions as objects
  10. forcats provides a set of tools for working with R’s classical factors (containers for categorical data).
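
As an aside, here is what the compound assignment pipe %<>% mentioned in item 5 does: it pipes a variable into a function and assigns the result back to that variable. A minimal sketch (note that %<>% requires an explicit library(magrittr), as it is not attached by library(tidyverse)):

> library(magrittr)
> vec <- c(3, 1, 2)
> vec %<>% sort() # equivalent to vec <- sort(vec)
> vec
[1] 1 2 3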

All these packages constitute a huge programming effort and a considerable contribution to the R ecosystem. I think of them as an “R 2.0”. In order to install all of them at once, one goes:

> install.packages("tidyverse")

The above can take a little bit of time. Then we go:

> library(tidyverse) # you could also load only the packages you want to use among those listed above, issuing one or several calls to library()

This opens up a whole world of possibilities that we are going to explore in the remaining sections of this tutorial.

The pipe operator

> sum(3,7,2) # sum() accepts an arbitrary number of arguments and yields their sum
[1] 12
> # can be written:
> 3 %>% sum(7,2)
[1] 12

The pipe operator %>% passes whatever is on its left-hand side as the first argument of the function call on its right-hand side.

> vec <- c(3,5,NA,7)
> vec %>% sum(na.rm = TRUE) # same as sum(vec, na.rm = TRUE)
[1] 15
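
Pipes can also be chained, so that a multi-step computation reads left to right instead of inside out. A small sketch, reusing the vector defined above:

> vec %>% sum(na.rm = TRUE) %>% sqrt() # same as sqrt(sum(vec, na.rm = TRUE))
[1] 3.872983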

Reading from files with readxl

We will import into an R object the content of the sheet called “HHcharacteristic” in the file livestock_data.xlsx. That is its second sheet. We write a call to the read_xlsx() function from the readxl package:

> read_xlsx("livestock_data.xlsx", sheet = 2) # fails because readxl is actually not loaded by default when one loads tidyverse, so we go:
Error in read_xlsx("livestock_data.xlsx", sheet = 2): could not find function "read_xlsx"
> library(readxl)
> read_xlsx("livestock_data.xlsx", sheet = 2) -> households # works fine
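
Note that you can also refer to a sheet by its name rather than its position, and readxl provides excel_sheets() to list the sheets of a workbook. A quick sketch with the same file (output not shown):

> excel_sheets("livestock_data.xlsx") # lists the sheet names of the workbook
> read_xlsx("livestock_data.xlsx", sheet = "HHcharacteristic") -> households # same import, referring to the sheet by name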

Viewing a data frame or tibble

When you want to view the structure of a dataset, you can call the fundamental str() function on it:

> str(households)
tibble [527 × 16] (S3: tbl_df/tbl/data.frame)
 $ qno             : chr [1:527] "BUN01" "BUN02" "BUN03" "BUN04" ...
 $ survey          : chr [1:527] "Western" "Western" "Western" "Western" ...
 $ anydairy        : num [1:527] 0 0 1 0 1 0 0 1 0 0 ...
 $ any_agric       : num [1:527] 1 1 1 1 1 1 1 1 1 1 ...
 $ sex             : num [1:527] 1 1 1 0 1 1 1 1 1 1 ...
 $ age             : num [1:527] 48 28 72 52 48 70 32 38 35 40 ...
 $ education       : num [1:527] 12 8 4 0 12 4 12 12 10 12 ...
 $ hh_experience   : num [1:527] 5 8 42 19 14 35 4 11 10 14 ...
 $ primary_activity: chr [1:527] "Businessman" "Farm management" NA "Farm management" ...
 $ adults          : num [1:527] 4 2 5 3 2 4 4 6 2 4 ...
 $ fem_totadult    : num [1:527] 0.75 0.5 0.6 0.667 0.5 ...
 $ depend_ratio    : num [1:527] 0.429 0.333 0.286 0.5 0.714 ...
 $ off_farm_act    : num [1:527] 2 2 1 2 1 1 2 2 0 1 ...
 $ offfarm_activity: num [1:527] 2 2 1 2 1 1 2 2 0 1 ...
 $ interview_date  : POSIXct[1:527], format: "2000-04-25" "2000-04-25" ...
 $ off_farm_percent: num [1:527] 0.5 1 0.2 0.667 0.5 ...

You can see that the data from the Excel file has been imported correctly into a tibble (first line of the str() report) with 527 rows (observations) and 16 columns (variables). Following that, you get a snapshot of each variable (introduced by the $ character for S3 objects, or the @ character for S4 objects): its type, its length (in a data frame or tibble, all columns have the same length, namely the number of observations) and its first elements (the list that ends with the three dots).
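
The tidyverse has its own counterpart to str() for this kind of quick inspection, glimpse(), which also prints one variable per line with its type and first values (output not shown here, as it is very close to that of str()):

> households %>% glimpse() # tidyverse counterpart of str()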

Displaying tibbles by simply calling their name is not cumbersome, contrary to old-school data.frames, for which long outputs are cropped by the not-so-handsome message “[ reached 'max' / getOption("max.print") -- omitted xxx rows ]” while all the columns, however many, are displayed:

> households
# A tibble: 527 × 16
   qno   survey  anydairy any_agric   sex   age education hh_experience
   <chr> <chr>      <dbl>     <dbl> <dbl> <dbl>     <dbl>         <dbl>
 1 BUN01 Western        0         1     1    48        12             5
 2 BUN02 Western        0         1     1    28         8             8
 3 BUN03 Western        1         1     1    72         4            42
 4 BUN04 Western        0         1     0    52         0            19
 5 BUN05 Western        1         1     1    48        12            14
 6 BUN06 Western        0         1     1    70         4            35
 7 BUN07 Western        0         1     1    32        12             4
 8 BUN08 Western        1         1     1    38        12            11
 9 BUN09 Western        0         1     1    35        10            10
10 BUN10 Western        0         1     1    40        12            14
# ℹ 517 more rows
# ℹ 8 more variables: primary_activity <chr>, adults <dbl>, fem_totadult <dbl>,
#   depend_ratio <dbl>, off_farm_act <dbl>, offfarm_activity <dbl>,
#   interview_date <dttm>, off_farm_percent <dbl>

You can see that the display fits the width of your screen, only the first 10 rows are displayed, and the additional variables (columns) are hidden and summarised at the bottom in grey characters.

Finally, in RStudio, you can also use the View() function (be careful with the case!) to display a live view of the table as an additional tab in the top left (source editor) pane:

> View(households)

This view is not directly editable, but it is live, meaning that if you change the dataset, the changes will be reflected in the view. Notice that you could also get the same view by clicking on the dataset within the Environment tab of the top right pane in RStudio.

Simple summaries, the old way and the new

There is a column for sex (I guess, males and females). It is encoded as numeric and not as a qualitative variable with two values, so the summary produced by summary() is rather awkward:

> summary(households$sex)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.7666  1.0000  1.0000 

Note that the equivalent to the above but using pipes amounts to:

  1. piping the whole dataset into a function that extracts as a vector its sex variable, and then
  2. piping that vector into the summary() function.

Which is done as below:

> households %>% `$`(sex) %>% summary() # we quote the dollar operator with backticks
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.7666  1.0000  1.0000 
> #or
> households %>% `[[`("sex") %>% summary() # a tibble is also a data frame, hence a list of variables (columns), and the double-bracket operator extracts as a vector its named element
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.7666  1.0000  1.0000 

So, the summary is not very informative. With dplyr, the summarize operation (you can also write summarise: dplyr accepts the British spelling as well as the American one) is meant to calculate a summary (a single value) from one or several variables. What we want here is:

  1. to group the observations according to the value of the “sex” variable
  2. to display the pairs (value, count of observations) for each of the different encodings of the “sex” variable.

> households %>% group_by(sex) %>% summarize(count = n())
# A tibble: 2 × 2
    sex count
  <dbl> <int>
1     0   123
2     1   404
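
Of course, summarize() is not limited to counts: you can compute several summaries per group in a single call, for instance the number of households and their mean age for each value of sex. A sketch (output not shown):

> households %>%
+   group_by(sex) %>%
+   summarize(count = n(), mean_age = mean(age, na.rm = TRUE))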

See the excellent “Data transformation with dplyr” cheatsheet, which explains the above approach; it is what I was used to doing until now, but recent versions of dplyr brought something simpler:

> households %>% count(sex)
# A tibble: 2 × 2
    sex     n
  <dbl> <int>
1     0   123
2     1   404
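
And because count() returns an ordinary tibble, you can keep piping, for example to add the proportion of each group. A sketch (output not shown):

> households %>% count(sex) %>% mutate(prop = n / sum(n)) # adds a column with the share of each group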

Getting such a neat table was of course possible in “the old R”:

> summary(as.factor(households$sex)) # transforming into qualitative data so that the summary() yields counts instead of the usual descriptive stats summary with quartiles
  0   1 
123 404 

Alternatively:

> table(households$sex)

  0   1 
123 404 

But notice that in both of these last two cases (with “the old R”), the result appears as a named vector, which is less tidy to work with than a true tibble.
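
Should you need to bring such a named vector of counts back into the tidy world, as_tibble() has a method for table objects. A sketch (the default column names of the result may need renaming):

> table(households$sex) %>% as_tibble() # converts the contingency table into a tibble of counts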

Filtering, slicing, grouping data: using pipes and the power of dplyr

Let’s see some of the many features offered by the dplyr package when it comes to data wrangling.

We can filter some observations, producing a sub-table (still a tibble) containing only the observations matching logical criteria that we state. For example, if we want to build a tibble with only the households with at least 10 years of farming experience:

> households %>% filter(hh_experience >= 10) # 339 observations
# A tibble: 339 × 16
   qno    survey  anydairy any_agric   sex   age education hh_experience
   <chr>  <chr>      <dbl>     <dbl> <dbl> <dbl>     <dbl>         <dbl>
 1 BUN03  Western        1         1     1    72         4            42
 2 BUN04  Western        0         1     0    52         0            19
 3 BUN05  Western        1         1     1    48        12            14
 4 BUN06  Western        0         1     1    70         4            35
 5 BUN08  Western        1         1     1    38        12            11
 6 BUN09  Western        0         1     1    35        10            10
 7 BUN10  Western        0         1     1    40        12            14
 8 BUN103 Western        1         1     0    54         4            38
 9 BUN105 Western        1         1     1    50        12            25
10 BUN106 Western        1         1     1    30         8            10
# ℹ 329 more rows
# ℹ 8 more variables: primary_activity <chr>, adults <dbl>, fem_totadult <dbl>,
#   depend_ratio <dbl>, off_farm_act <dbl>, offfarm_activity <dbl>,
#   interview_date <dttm>, off_farm_percent <dbl>

(Please note an advantage of using the pipe over the equivalent syntax filter(households, hh_experience >= 10): RStudio autocompletes the variable name for you when you are midway through typing hh_experience and you hit the Tab key.)

You can also filter the dataset based on a series of requirements, all to be fulfilled simultaneously (logical AND). Just make them several arguments to your call to filter(). For instance, to select the households with at least 10 years of farming experience and where the main person is a civil servant:

> households %>% filter(hh_experience >= 10, primary_activity == "Civil servant")
# A tibble: 21 × 16
   qno       survey  anydairy any_agric   sex   age education hh_experience
   <chr>     <chr>      <dbl>     <dbl> <dbl> <dbl>     <dbl>         <dbl>
 1 BUN08     Western        1         1     1    38        12            11
 2 BUN12     Western        1         1     1    54        12            27
 3 BUN138    Western        1         1     1    50        17            17
 4 BUN20     Western        1         1     1    43        17            16
 5 BUN32     Western        1         1     1    43        10            10
 6 BUN37     Western        1         1     1    54        12            28
 7 KIA12089  Kiambu         1         1     1    38        15            11
 8 KIA121113 Kiambu         1         1     1    37        15            10
 9 KIA121116 Kiambu         1         1     0    34        11            15
10 KIA121814 Kiambu         1         1     1    50        11            21
# ℹ 11 more rows
# ℹ 8 more variables: primary_activity <chr>, adults <dbl>, fem_totadult <dbl>,
#   depend_ratio <dbl>, off_farm_act <dbl>, offfarm_activity <dbl>,
#   interview_date <dttm>, off_farm_percent <dbl>

Only 21 observations remain.
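
Finally, remember that requirements separated by commas in filter() are combined with a logical AND; for other combinations you write the logical operators explicitly, for instance a logical OR. A sketch (output not shown):

> households %>% filter(hh_experience >= 10 | primary_activity == "Civil servant") # at least one of the two conditions holds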