In R, packages are software modules providing extra
functions, operators and datasets. Some packages, like base
and stats
, are part of the core R distribution, which means
you don’t need to install them separately, and they are already loaded
in your environment upon starting your R session. When using RStudio, in
the bottom right pane, you can see the available packages by clicking
the Packages
tab. The ones that are currently loaded in
your environment appear with a tick mark. For those, you don’t need to
prepend the package name followed by a double colon when you call a
function it implements (e.g. you can write str_sub()
instead of stringr::str_sub()
if and only if the
stringr
package is already loaded in your environment).
In order to load a package, one calls library()
with the
package name as argument. Quoting is unnecessary.
Before using a package for the first time, you need to download and
install it on your computer by calling install.packages()
with the quoted name of the package as argument.
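To make this concrete, here is a minimal sketch of the whole workflow, using the stringr package and its str_sub() function (mentioned above) as the example:
> install.packages("stringr") # download and install, once per computer; note the quotes
> stringr::str_sub("tidyverse", 1, 4) # usable without loading the package, thanks to the :: prefix
[1] "tidy"
> library(stringr) # load the package; no quotes needed
> str_sub("tidyverse", 1, 4) # now the stringr:: prefix can be dropped
[1] "tidy"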
tidyverse is a collection of open-source packages for R that were primarily developed by Hadley Wickham. Let us list the most common and broadly describe their scope:
- tibble, to work with tibbles instead of data frames: tibbles are “smart” data containers that display in a smarter way, with neater type control, etc.
- readr, to import tabular data from different text files (tab-delimited, csv, fixed-width fields, etc.) into tibbles
- readxl, to import data from Excel files (both .xls and .xlsx) into tibbles
- tidyr, to transform data with different degrees of “messiness” into tidy data. Tidy data is simple to understand, but many datasets you come across are not “tidy”. In tidy data, each variable is in its own column, and each observation (a.k.a. “case” or “individual”) is in its own row. tidyr helps you achieve tidy data and reshape datasets by collapsing or splitting columns, creating combinations of cases, handling missing values, etc.
- magrittr, which introduces additional pipe operators like %>% and %<>%
- dplyr, which is for me the central package of the tidyverse collection. It is more than a pair of data pliers: it is a whole toolbox to extract information from a dataset, to group, summarize, join, apply functions to several variables, etc.
- ggplot2, which implements a whole “grammar of graphics” and enables the user to create beautiful graphs of various types (scatterplots, histograms, jitter plots, violin plots, bar graphs, smoothed curves, 2D-density plots, etc.), tweaking the way they are displayed through simple commands while maintaining a sense of graphical harmony.
- stringr, which enables you to manipulate strings efficiently and effortlessly.
- purrr, which is more advanced and deals with functional programming: programming by manipulating functions as objects.
- forcats, a re-implementation of R’s classical factors (containers for categorical data).

All these packages constitute a huge programming effort and a considerable contribution to the R ecosystem. I call them an “R 2.0”. In order to install all of them at once, one runs:
> install.packages("tidyverse")
The above can take a little bit of time. Then we go:
> library(tidyverse) # you could also load only the packages you want to use among those listed above, issuing one or several calls to library()
This opens up a whole world of possibilities, that we are going to explore in the remaining sections of this tutorial.
> sum(3,7,2) # sum() accepts an arbitrary number of arguments and yields their sum
[1] 12
> # can be written:
> 3 %>% sum(7,2)
[1] 12
The pipe operator %>% passes whatever is on its left-hand side as the first argument of the function call on its right-hand side.
> vec <- c(3,5,NA,7)
> vec %>% sum(na.rm = TRUE) # same as sum(vec, na.rm = TRUE)
[1] 15
We will import into an R object the content of the sheet called
“HHcharacteristic” in the file livestock_data.xlsx
. That is
its second sheet. We write a call to the read_xlsx()
function from the readxl
package:
> read_xlsx("livestock_data.xlsx", sheet = 2) # fails because readxl is actually not loaded by default when one loads tidyverse, so we go:
Error in read_xlsx("livestock_data.xlsx", sheet = 2): could not find function "read_xlsx"
> library(readxl)
> read_xlsx("livestock_data.xlsx", sheet = 2) -> households # works fine
When you want to view the structure of a dataset, you can call the
fundamental str()
function on it:
> str(households)
tibble [527 × 16] (S3: tbl_df/tbl/data.frame)
$ qno : chr [1:527] "BUN01" "BUN02" "BUN03" "BUN04" ...
$ survey : chr [1:527] "Western" "Western" "Western" "Western" ...
$ anydairy : num [1:527] 0 0 1 0 1 0 0 1 0 0 ...
$ any_agric : num [1:527] 1 1 1 1 1 1 1 1 1 1 ...
$ sex : num [1:527] 1 1 1 0 1 1 1 1 1 1 ...
$ age : num [1:527] 48 28 72 52 48 70 32 38 35 40 ...
$ education : num [1:527] 12 8 4 0 12 4 12 12 10 12 ...
$ hh_experience : num [1:527] 5 8 42 19 14 35 4 11 10 14 ...
$ primary_activity: chr [1:527] "Businessman" "Farm management" NA "Farm management" ...
$ adults : num [1:527] 4 2 5 3 2 4 4 6 2 4 ...
$ fem_totadult : num [1:527] 0.75 0.5 0.6 0.667 0.5 ...
$ depend_ratio : num [1:527] 0.429 0.333 0.286 0.5 0.714 ...
$ off_farm_act : num [1:527] 2 2 1 2 1 1 2 2 0 1 ...
$ offfarm_activity: num [1:527] 2 2 1 2 1 1 2 2 0 1 ...
$ interview_date : POSIXct[1:527], format: "2000-04-25" "2000-04-25" ...
$ off_farm_percent: num [1:527] 0.5 1 0.2 0.667 0.5 ...
You can see that the data from the Excel file has been imported
correctly into a tibble (first line of the str() report)
with 527 rows (observations) and 16 columns (variables). Following that,
you get a snapshot of each variable (introduced by the $
character for S3 objects, or the @ character for S4
objects): its type, its length (in a data frame or tibble, all columns
have the same length, namely the number of observations) and its first
elements (the listing that ends with the three dots).
Displaying tibbles by simply calling their name is not cumbersome
(contrary to old-school data.frames, for which long outputs
are cropped by a not-so-handsome message “[ reached 'max' /
getOption("max.print") -- omitted xxx rows ]” while all the columns,
however many, are displayed):
> households
# A tibble: 527 × 16
qno survey anydairy any_agric sex age education hh_experience
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BUN01 Western 0 1 1 48 12 5
2 BUN02 Western 0 1 1 28 8 8
3 BUN03 Western 1 1 1 72 4 42
4 BUN04 Western 0 1 0 52 0 19
5 BUN05 Western 1 1 1 48 12 14
6 BUN06 Western 0 1 1 70 4 35
7 BUN07 Western 0 1 1 32 12 4
8 BUN08 Western 1 1 1 38 12 11
9 BUN09 Western 0 1 1 35 10 10
10 BUN10 Western 0 1 1 40 12 14
# ℹ 517 more rows
# ℹ 8 more variables: primary_activity <chr>, adults <dbl>, fem_totadult <dbl>,
# depend_ratio <dbl>, off_farm_act <dbl>, offfarm_activity <dbl>,
# interview_date <dttm>, off_farm_percent <dbl>
You can see that the display fits the width of your screen, that only the first 10 rows are displayed, and that the additional variables (columns) are hidden and summarised at the bottom in grey characters.
Finally, in RStudio, you can also use the View()
function (be careful with the case!) to display a live view of the table
as an additional tab in the top left (source editor) pane of
RStudio:
> View(households)
This view is not directly editable, but it is live, meaning that in
case you change the dataset, the changes will reflect in the view.
Notice you could also get to the same view by clicking on the said
dataset within the Environment
tab of the top right pane in
RStudio.
There is a column for sex (I guess, males and females). It is encoded as numeric and not as a qualitative variable with two values, so the summary with summary() is rather awkward:
> summary(households$sex)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 1.0000 1.0000 0.7666 1.0000 1.0000
Note that the equivalent to the above, but using pipes, amounts to piping households first into the extraction of the sex variable, and then into the summary() function. This is done as below:
> households %>% `$`(sex) %>% summary() # we quote the dollar operator with backticks
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 1.0000 1.0000 0.7666 1.0000 1.0000
> #or
> households %>% `[[`("sex") %>% summary() # a tibble is also a data frame, hence a list of variables (columns), and the double-bracket operator extracts as a vector its named element
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 1.0000 1.0000 0.7666 1.0000 1.0000
So, the summary is not very informative. With dplyr, the
summarize operation (you can also write
summarise, as dplyr accepts British English as well as American
English) is meant to calculate a summary (a single value) from one or
several variables. What we want here is:
> households %>% group_by(sex) %>% summarize(count = n())
# A tibble: 2 × 2
sex count
<dbl> <int>
1 0 123
2 1 404
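Note in passing that summarize() is not limited to counts. As a quick sketch (assuming we also wanted, say, the mean age of the respondent within each sex group, age being one of the columns listed by str() above), we could write:
> households %>% group_by(sex) %>% summarize(count = n(), mean_age = mean(age, na.rm = TRUE))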
See the excellent “Data transformation with dplyr” cheat sheet, which explains the above. That is what I was used to doing until now, but recent versions of dplyr have brought something simpler:
> households %>% count(sex)
# A tibble: 2 × 2
sex n
<dbl> <int>
1 0 123
2 1 404
To get such a neat table was possible in “the old R”, of course:
> summary(as.factor(households$sex)) # transforming into qualitative data so that the summary() yields counts instead of the usual descriptive stats summary with quartiles
0 1
123 404
Alternatively:
> table(households$sex)
0 1
123 404
But notice that in both of the last two cases (with “the old R”), the result appears as a named vector, which is less tidy to work with than a true tibble.
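(If you ever need to bring such a named table back into the tidy world, one possible route, sketched here, is to go through as.data.frame(), which turns a one-dimensional table into a data frame with the default column names Var1 and Freq, and then through as_tibble():)
> households$sex %>% table() %>% as.data.frame() %>% as_tibble()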
dplyr
Let’s see some of the many features offered by the dplyr
package when it comes to data wrangling.
We can filter some observations, producing a sub-table (still a tibble) containing only the observations matching logical criteria that we state. For example, if we want to build a tibble with only the households with at least 10 years of farming experience:
> households %>% filter(hh_experience >= 10) # 339 observations
# A tibble: 339 × 16
qno survey anydairy any_agric sex age education hh_experience
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BUN03 Western 1 1 1 72 4 42
2 BUN04 Western 0 1 0 52 0 19
3 BUN05 Western 1 1 1 48 12 14
4 BUN06 Western 0 1 1 70 4 35
5 BUN08 Western 1 1 1 38 12 11
6 BUN09 Western 0 1 1 35 10 10
7 BUN10 Western 0 1 1 40 12 14
8 BUN103 Western 1 1 0 54 4 38
9 BUN105 Western 1 1 1 50 12 25
10 BUN106 Western 1 1 1 30 8 10
# ℹ 329 more rows
# ℹ 8 more variables: primary_activity <chr>, adults <dbl>, fem_totadult <dbl>,
# depend_ratio <dbl>, off_farm_act <dbl>, offfarm_activity <dbl>,
# interview_date <dttm>, off_farm_percent <dbl>
(Please note an advantage of using the pipe versus the equivalent
syntax filter(households, hh_experience >= 10): RStudio
autocompletes the variable name for you when you are midway through typing
hh_experience and you hit the Tab key.)
You can also filter the dataset based on a series of requirements,
all to be fulfilled simultaneously (logical AND). Just make them several
arguments to your call to filter(). For instance, to select
the households with at least 10 years of farming experience
and where the main person is a civil servant:
> households %>% filter(hh_experience >= 10, primary_activity == "Civil servant")
# A tibble: 21 × 16
qno survey anydairy any_agric sex age education hh_experience
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BUN08 Western 1 1 1 38 12 11
2 BUN12 Western 1 1 1 54 12 27
3 BUN138 Western 1 1 1 50 17 17
4 BUN20 Western 1 1 1 43 17 16
5 BUN32 Western 1 1 1 43 10 10
6 BUN37 Western 1 1 1 54 12 28
7 KIA12089 Kiambu 1 1 1 38 15 11
8 KIA121113 Kiambu 1 1 1 37 15 10
9 KIA121116 Kiambu 1 1 0 34 11 15
10 KIA121814 Kiambu 1 1 1 50 11 21
# ℹ 11 more rows
# ℹ 8 more variables: primary_activity <chr>, adults <dbl>, fem_totadult <dbl>,
# depend_ratio <dbl>, off_farm_act <dbl>, offfarm_activity <dbl>,
# interview_date <dttm>, off_farm_percent <dbl>
Only 21 observations remain.