This tutorial walks you through basic interactions with the R
software package, from data manipulation to statistical tests and
regression models. You may start an interactive session with R directly
from the shell, by typing R on the command line. Pay attention that your
GNU/Linux environment is case-sensitive, so typing r will fail.
Alternatively, you can start R through the fancier Graphical User
Interface (GUI) called RStudio: type rstudio in the terminal, or
(double-)click on the corresponding icon on your desktop.
During your interactive session with R, you type commands or
expressions in the terminal (the frame called Console in RStudio), to
which R will either reply silently, display some result, or perform some
other action (e.g. drawing a graph in a separate window frame). Whenever
R is ready to listen to you, it invites you to type in a command with a
tiny “greater than” sign at the very beginning of the last line in the
Console. In the examples you will find throughout this tutorial, the
interactions with R are reproduced in typewriter font. Note that you
don’t have to type in the initial “greater than” sign when it is
displayed at the beginning of a command line: this is simply the invite
as produced by R.
Once you are done with the present tutorial, if you want to deepen your understanding of how R works, start writing some basic programs in R, etc., I suggest you read the excellent tutorial by Emmanuel Paradis entitled R for beginners. It is a bit “old-school”, presenting the base R programming style, but very instrumental in understanding how R works.
In programming languages, storing some kind of data into the “memory”
of the computer (in fact into the current symbolic workspace defined by
your R session) is called assigning a value to a
variable. When you assign a value (whatever its type: it can be a string
of characters, an integer, a boolean, a floating-point number,
etc.) to a variable, you create a new object in R. This is done through
the assignment operator ->, written with a hyphen (-) followed by a
“greater than” sign (>), so that it forms an arrow.
Taking care of swapping the two operands (variable and value), it can
also be written the other way around, as <-:
> 5 -> a # puts the value 5 into the variable a
> b <- 7 # puts the value 7 into the variable b
> b - a # asks for the evaluation (calculation) of b-a
[1] 2
> 2 # nothing prevents you from evaluating a constant!
[1] 2
The “greater than” character that appears at the beginning of the
line (in a terminal or in RStudio’s Console frame) is the
invite. It means that R is waiting for you to enter a
command. Some commands produce a silent output
(e.g. assigning a value to a variable is a silent operation), some others
display a result. For example, subtracting the content of the variable
called a from the content of the variable called b gives the single
value 2 as an output.
Notice the [1] automatically prepended to the output line: it is an
index. As R tends to see everything as a vectorized
variable (you will see other examples very soon), it tells us that this
2 is the first element of the output vector produced by the operation
b-a (an output vector containing only one element). On the last line of
the interaction above, you can see that even typing a mere value and
asking R to evaluate it by pressing the Enter key yields the exact same
output with the exact same formatting.
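To see why that bracketed index is handy, here is a small illustrative example (the variable name x is ours): print a vector long enough to wrap over several output lines, and R starts each line with the index of that line’s first element.

```r
x <- 101:130  # a vector of 30 consecutive integers
x             # when the output wraps over several lines, each new line
              # starts with the bracketed index of its first element
```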
As we just said, vectors are essential data
structures in R. One creates vectors simply by using the concatenation
operator (or function), rightfully named c:
> myvec <- c(1,3,5,7,84) # passing five integers as arguments to the function c
> myvec # myvec is now a vector containing five elements
[1] 1 3 5 7 84
> length(myvec) # calling the function length on the object myvec
[1] 5
Here we have just used our first two functions in R.
With the first command of this interaction block, we performed a
function call to the function called c,
giving it five integers as arguments. A function call in R is always
written as follows: the name of the function, followed by a pair of
parentheses enclosing a list of comma-separated arguments (possibly only
one, or even none). Hence, the last operation we performed above is a
function call to the function length, passing to it one
argument only, namely the myvec variable.
Remember that whenever you want to use a function,
you have to write these parentheses, even if the
function needs no argument. For instance, the function ls,
when invoked (another term computer people use for “called”) with no
argument, lists the content of your current “userspace”
or “environment”: it returns a vector populated with strings giving the
names of the objects present in your current R environment (i.e. the
variables you defined earlier in your interactive session).
> ls()
[1] "a" "b" "myvec"
Beware!! If you omit the parentheses, R will try and evaluate the function itself:
> ls
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
pattern, sorted = TRUE)
{
if (!missing(name)) {
pos <- tryCatch(name, error = function(e) e)
if (inherits(pos, "error")) {
name <- substitute(name)
if (!is.character(name))
name <- deparse(name)
warning(gettextf("%s converted to character string",
sQuote(name)), domain = NA)
pos <- name
}
}
all.names <- .Internal(ls(envir, all.names, sorted))
if (!missing(pattern)) {
if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
ll != length(grep("]", pattern, fixed = TRUE))) {
if (pattern == "[") {
pattern <- "\\["
warning("replaced regular expression pattern '[' by '\\\\['")
}
else if (length(grep("[^\\\\]\\[<-", pattern))) {
pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
}
}
grep(pattern, all.names, value = TRUE)
}
else all.names
}
<bytecode: 0x55e9cec20c90>
<environment: namespace:base>
To get some help on a command, just type a question mark
(?) immediately followed by the name of the command. If
using the command-line R, you will then enter a manpage-like environment,
in which the space bar moves you to the next screen (or page),
the b key goes back one page, and a keypress on
q quits the help and brings you back to the command line and
its invite. Alternatively, depending on the configuration of your
computer, you may get access to the HTML version of the help pages: that
will be the case for most of us, using RStudio, where the manpage is
displayed in a nice HTML format in the bottom right pane. By the way, in
RStudio, you can also ask for manpages by clicking on the Help tab
in the bottom right pane, and then using the search box in the upper
right corner of that pane to enter a search keyword. You don’t need to
type the question marks when using that search box.
> ?ls
> #.... viewing the help on the ls command in the help pane
>
> ?length
> # .... viewing the help on the length command in the help pane
>
> ??transpose
> # .... viewing the help pages containing the transpose keyword in the help pane
The latter type of query (with the double question mark) is to be used when you are unsure of the name of the function you are seeking help about. It provides you with the list of help files containing the keyword in question, after which you can prepend the single question mark to the name of the identified command (or function) to get the relevant help page.
These help pages (also called manpages) are all
built according to the same structure: after a short
Description of the purpose of the command follows a section
called Usage. This latter section presents the syntax you may
use to build a function call, with the possible options to specify and
their possible default values. For instance, in the help about the
c function, the Usage section reads as follows:
c(..., recursive = FALSE)
We are told here that the c function first takes several
arguments (an undefined number thereof) and then expects an
optional parameter named recursive. We know it is optional
because of the equal sign and the value specified thereafter: when
the user omits to specify a value for that parameter, R silently
sets its value to FALSE.
This is a general rule for the behaviour of functions in R: when an equal sign appears in the Usage section right after the name of an argument, it means that in case the corresponding argument is omitted, it is silently given the said value by R (i.e. the value appearing on the right of that equal sign). This makes function call writing very flexible, as one can omit some of the arguments when one is fine with the default values.
For instance, you may try the following:
> list1 <- list(1,2) # the list function builds a list from its argument
> list2 <- list(3,4)
> c(list1,list2) # don't pay too much attention to the way R writes lists
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
> c(list1,list2,recursive=FALSE) # produces the exact same output as the previous command
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
> c(list1,list2,recursive=TRUE)
[1] 1 2 3 4
In other words, the function call finally executed when the user
enters c(list1,list2) is indeed
c(list1,list2,recursive=FALSE): the default value for the
argument has been used.
Back into the manpage, the section named Arguments explains the role and valid values of the different arguments (or options) of the function, and the Details section gives further in-depth explanations about the behaviour of the function and the results it yields.
R can manipulate different types of data. Whenever you input some data into R, these data are given a mode, and this mode affects the way R deals with them.
> mode(0.1)
[1] "numeric"
> mode("allo")
[1] "character"
> mode(2 > 5)
[1] "logical"
> mode(1+2i)
[1] "complex"
> mode(TRUE)
[1] "logical"
> mode(T) # "T" is short for TRUE. FALSE may also be written "F"
[1] "logical"
Non-atomic constructs also inherit the mode of their components:
> a = c(1,1) # yes, variable assignment can also be written this most simple way... ;)
> mode(a)
[1] "numeric"
> a = c("ab","ba")
> mode(a)
[1] "character"
The two logical values in R are TRUE and
FALSE. You may use instead their shorter versions
T and F. Attention: use these symbols “as is”.
Enclosing them into quotes would not give the boolean values but mere
strings (mode character).
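A minimal check of this point, using the mode function introduced above:

```r
mode(TRUE)    # the genuine boolean value: "logical"
mode("TRUE")  # with quotes, a mere string: "character"
```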
We’ve already seen the c function to create vectors. A
vector is a special R construct used to store an array of items all
having the same mode. Vectors made of evenly spaced numeric items are
called sequences. These are created with the
seq command. The following three commands produce the very
same result:
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> seq(from=1, to=5, by=1)
[1] 1 2 3 4 5
> 1:5
[1] 1 2 3 4 5
Notice that the second command uses argument naming:
seq(from=1, to=5, by=1). This is optional but convenient in
many cases, and highly recommended in the specific case of
seq. We will see other examples of function calls with
explicit argument naming later on. In order to understand how the
seq function works, please refer to its help page (command
?seq or help("seq")). There you can see
(Usage section, a.k.a. “synopsis” of the command/function) that
the first three arguments expected by seq are named
respectively from, to and by (the
first one is the lower bound of the sequence to be generated, the second
is the upper bound and the third is the step value).
Keep in mind the 1:5 trick to quickly build a sequence
of consecutive integers. We will make use of that later. Also notice
that seq comes in handy to sample an interval on
the real line at regular steps, e.g. x = seq(from=0, to=1, by=0.01) builds a
vector of 101 points evenly spaced on the real interval \([0,1]\), both boundaries included, and
stores the result into the variable x.
Finally, we mention the rep function that helps you
build a vector with repetitive content. For instance, if we want to
build a vector of 10 values all equal to 1:
> rep(1,10)
[1] 1 1 1 1 1 1 1 1 1 1
This also works to repeat a pattern of length \(> 1\):
> rep(1:3,4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
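As its manpage (?rep) shows, rep also accepts a named argument each, which repeats every element in place instead of recycling the whole pattern; compare:

```r
rep(1:3, times=4)  # whole pattern recycled: 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each=4)   # each element repeated in place: 1 1 1 1 2 2 2 2 3 3 3 3
```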
In a vector, each element is addressable by its index. The usual square bracket notation applies. Vector indices start at 1, and vector elements are mutable: one can alter directly one of the values in a vector without reassigning the whole vector.
> a = c(10,11,12,13)
> a[2]
[1] 11
> a[2] <- 20
> a
[1] 10 20 12 13
One can also take a slice from an existing vector, extracting consecutive elements:
> a[2:4]
[1] 20 12 13
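Negative indices do the opposite of a slice: they exclude the corresponding elements. A small sketch, rebuilding the vector a used above:

```r
a <- c(10, 20, 12, 13)
a[-1]      # everything except the first element: 20 12 13
a[-(2:3)]  # drop the second and third elements: 10 13
```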
Our first attempt to build a 2x2 matrix could be as follows:
> mat1=c(c(1,2),c(3,4))
> mat1
[1] 1 2 3 4
> length(mat1)
[1] 4
So it doesn’t work like that: the c command flattens all
the arguments it is given to build only one long
vector. We have to use the function matrix to build a
matrix:
> mat1 = matrix(data=c(1,2,3,4), nrow=2, ncol=2)
> mat1
[,1] [,2]
[1,] 1 3
[2,] 2 4
Note that by default R fills in the data structure by
columns: it fills up the first column with the first data and
then proceeds to the second column with the remaining data, etc. This is
because the default value for the byrow option of the
matrix command is set to FALSE (see the
manpage of matrix, e.g. typing ?matrix after
the invite). To proceed row-wise, you must explicitly specify
byrow=T (see our examples further below).
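A minimal side-by-side sketch of the two filling orders:

```r
matrix(data=c(1,2,3,4), nrow=2, ncol=2)             # column-wise (default):
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
matrix(data=c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE) # row-wise:
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
```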
We said earlier that vectorization is often implicit in R. Here are some examples:
> mat2 = matrix(data=5, nrow=2, ncol=3)
> mat2
[,1] [,2] [,3]
[1,] 5 5 5
[2,] 5 5 5
> mat3 = matrix(data=c(1,2), nrow=3, ncol=2, byrow=T)
> mat3
[,1] [,2]
[1,] 1 2
[2,] 1 2
[3,] 1 2
R automatically replicates the input data to fill the container completely (here, a 3x2 matrix).
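The same recycling rule is at work in plain vector arithmetic: the shorter operand is replicated to match the longer one (R emits a warning when the longer length is not a multiple of the shorter):

```r
c(1, 2, 3, 4) + c(10, 20)  # the shorter vector is recycled: 11 22 13 24
```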
A matrix is in fact a special form of vector: it has a length and a
mode (and only one mode: try to mix strings and numbers in the same
matrix and see what happens), but also an additional feature, its
dimension, which one can query (and alter) through the
function dim.
> length(mat3)
[1] 6
> mode(mat3)
[1] "numeric"
> dim(mat3)
[1] 3 2
> dim(mat3) = c(1,6)
> mat3
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 2 2 2
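In the same family as dim, the function t returns the transpose of a matrix, swapping its rows and columns:

```r
mat <- matrix(data=1:6, nrow=2, ncol=3)
dim(mat)     # 2 3
dim(t(mat))  # 3 2: rows have become columns and vice versa
```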
Accessing elements of a matrix is also done with the usual square bracket notation:
> mat4=matrix(data=1:20, ncol=5, byrow=T)
> mat4
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
> mat4[2,3]
[1] 8
> mat4[,3]
[1] 3 8 13 18
> mat4[1,]
[1] 1 2 3 4 5
These last two commands are very important to understand. They are
the first and simplest examples of subsetting that we
see: here we extract the third column of mat4, and then its
first row. In both cases the results are vectors. mat4[,]
would simply give the entire mat4. All sorts of selection
patterns can be used in combination.
For instance:
> mat4[,2:4]
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 7 8 9
[3,] 12 13 14
[4,] 17 18 19
Of course, as the resulting matrix is a new object, with no memory of the data container it originates from, indexing in it starts anew from 1.
In R, the “modulo” operator (you know, that operator giving the
remainder in the Euclidean division of its two operands) is written
%%. Its two operands (on its left and on its right) are
expected to be numbers:
> 17 %% 5
[1] 2
But in fact we can also use this operator to find out all the remainders modulo 2 from a matrix:
> transformed = mat4 %% 2
> transformed
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 0 1
[2,] 0 1 0 1 0
[3,] 1 0 1 0 1
[4,] 0 1 0 1 0
> sum(transformed)
[1] 10
Fair enough: out of this matrix of 20 consecutive integers, 10 are
odd numbers. The modulo operator has worked in a vectorized fashion: it
has been applied to every element of its input, producing an output of
the same dimensionality. What’s more, we can actually extract from
mat4 all its elements with zero remainder in the division
by 3, with the single following selection command:
> mat4[mat4 %% 3 == 0]
[1] 6 12 3 18 9 15
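If you need the positions rather than the values, the which function returns the (column-major) indices at which the logical condition holds:

```r
mat4 <- matrix(data=1:20, ncol=5, byrow=TRUE)
which(mat4 %% 3 == 0)  # column-major indices of the elements divisible by 3
# [1]  2  7  9 12 14 19
```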
As in most programming languages, in R the boolean operator
to test for equality is ==, obviously because
= is one of the operators used for variable assignment. We
are using it in the command above to select a subset of the indices in
mat4, extracting from it the elements whose Euclidean
division by 3 yields a zero remainder (i.e. we are extracting the
elements of mat4 that are divisible by 3). Now let us try
to use the same kind of selection pattern to count the number of A’s and
G’s in a large matrix made of nucleotides. Our 10x42 matrix is made of
repeated patterns of five nucleotides (A, C, A, T, G): there will be
twice as many A’s as any of the other three nucleotides.
> mat5=matrix(c('A','C','A','T','G'),nrow=10,ncol=42,byrow=T)
> mat5[1,] # just to check the first line...
[1] "A" "C" "A" "T" "G" "A" "C" "A" "T" "G" "A" "C" "A" "T" "G" "A" "C" "A" "T"
[20] "G" "A" "C" "A" "T" "G" "A" "C" "A" "T" "G" "A" "C" "A" "T" "G" "A" "C" "A"
[39] "T" "G" "A" "C"
> mat5 == "G" # vectorized comparison to "G"
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[2,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[3,] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[5,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[6,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[7,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[8,] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[9,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[10,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[3,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[6,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[7,] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[8,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[9,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[10,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
[1,] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[2,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[3,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[4,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[5,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[6,] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[7,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[8,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[9,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[10,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[,37] [,38] [,39] [,40] [,41] [,42]
[1,] FALSE FALSE FALSE TRUE FALSE FALSE
[2,] FALSE TRUE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE TRUE FALSE
[4,] FALSE FALSE TRUE FALSE FALSE FALSE
[5,] TRUE FALSE FALSE FALSE FALSE TRUE
[6,] FALSE FALSE FALSE TRUE FALSE FALSE
[7,] FALSE TRUE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE TRUE FALSE
[9,] FALSE FALSE TRUE FALSE FALSE FALSE
[10,] TRUE FALSE FALSE FALSE FALSE TRUE
> length(mat5[mat5=="G"])
[1] 84
Note that the two boolean/logical values TRUE and
FALSE are automatically translated into 1 and 0
respectively when forced into an arithmetic operation, for instance the
sum operation illustrated below. This makes it extremely
convenient to count the number of elements in a vector that
fulfill a given requirement:
> sum(mat5=="G")
[1] 84
> sum(mat5=="A") # correct result: 168 = 2 * 84
[1] 168
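By the same coercion, mean applied to a logical array directly gives the proportion of elements satisfying the condition:

```r
mat5 <- matrix(c('A','C','A','T','G'), nrow=10, ncol=42, byrow=TRUE)
mean(mat5 == "A")  # proportion of A's: 168 out of 420, i.e. 0.4
```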
It is worth commenting further here about the selection scheme based
on a test (i.e. an expression evaluating to a boolean value or an
array of boolean values, i.e. an R object in logical mode)
enclosed into square brackets appended to the name of a
variable. While in the above examples the array we took a subset from
(mat4 or mat5) also appeared in the
selection pattern (e.g. mat5[mat5=='G']), this is not a
constraint imposed by R. One can extract elements from a vector
according to a criterion determined on some other vector, possibly
unrelated to the first one. For example:
> myvec = 1:20
> filter = rep(c(T,F,F,T,F),4) # logical vector unrelated to myvec...
> length(filter) # ... but with same size
[1] 20
> myvec[filter]
[1] 1 4 6 9 11 14 16 19
A categorical variable can take only a finite number of different values, usually from quite a small set. Examples include:
* a nucleotide can be encoded as a categorical variable, taking one of four values represented e.g. with the four letters A, C, G and T;
* human blood groups are taken from the set {A, B, AB, O} (four categories or levels);
* following the Martin-Schultz scale, one could classify human iris colours into the seven following categories: amber, blue, brown, gray, green, hazel, red/violet;
* a question from a survey could accept the responses high, medium or low (three categories or levels);
* a disease allele can be either dominant or recessive (two categories or levels).
Please note that some categorical variables come with no specific logical ordering (e.g. human iris colours), while some others (we call them ordinal variables) do: for instance the high/medium/low categories.
In R, the different values a categorical variable can possibly take
are called levels. Defining categorical variables is
done through the factor function. One has to pay attention
to the fact that this function is at the same time the instantiation of
a vector and the declaration of the type of its
elements. Possible values that do not appear in the specific vector we
are declaring should be named through the argument called
levels.
For a start, let us define the blood groups of 8 patients:
> blood_groups = factor(c('A','B','A','B','A','O','AB','A'))
> blood_groups
[1] A B A B A O AB A
Levels: A AB B O
Please refer to the manpage for factor: in the example
above we only use the first argument of this function (called
x in the manpage). Here it is a vector made of strings, but
it could also have been a vector of integers. R automatically deduces
the four possible levels for this categorical variable, as in this
example at least one instance of each is present. The resulting
variable, blood_groups, is a vector plus
the information of the possible levels of the underlying categorical
variable. Note the difference with:
> blood_groups=c('A','B','A','B','A','O','AB','A')
> blood_groups
[1] "A" "B" "A" "B" "A" "O" "AB" "A"
In this latter example, blood_groups is simply a vector
of strings, with no underlying idea of categories.
If we need to tell R that levels exist that are not present in the
vector we define, we have to use the second argument of the
factor function, called levels in its
manpage:
> blood_groups=factor(c('A','B','A'), levels=c('A','B','AB','O'))
> blood_groups
[1] A B A
Levels: A B AB O
If we want to define ordinal data, we put the right
ordering in the levels argument and set the boolean option
ordered to TRUE. Note the informative way used
by R to display that ordering along with the vector of categorical
data:
> responses = factor(c('low','low','high','low'), levels=c('low','med','high'), ordered=T)
> responses
[1] low low high low
Levels: low < med < high
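One benefit of declaring the ordering is that the usual comparison operators become meaningful on such a factor:

```r
responses <- factor(c('low','low','high','low'),
                    levels=c('low','med','high'), ordered=TRUE)
responses > "low"       # FALSE FALSE TRUE FALSE
sum(responses > "low")  # exactly one response above "low"
```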
To sum it up, let us remember that factors are a kind of vector
that carries the information relative to the possible levels or
categories. The function levels gives you direct access to
these.
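For instance, on the blood_groups factor defined above (rebuilt here), levels returns all declared categories, including the ones with no occurrence, and the related function table counts the occurrences per level:

```r
blood_groups <- factor(c('A','B','A'), levels=c('A','B','AB','O'))
levels(blood_groups)  # "A" "B" "AB" "O"
table(blood_groups)   # counts per level; AB and O appear with count 0
```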
Data frames are complex data containers in R. While vectors (and
their matrix derivatives) cannot store data of different types (try to
build c(1,2,"three")), data frames can store different
types of data in different fields (i.e. different
columns). To get our first example of a data frame, and as we don’t know
yet how to import data from our data files, we load some data readily
available in R through the variable called iris. These are
observations of a number of plant specimens, all belonging to the
Iris genus. For each specimen, geometrical features of the
petal and sepal are recorded, as well as its classification name at the
species level. As this data frame is rather large, typing
iris on the command line dumps the whole data frame to the
screen in a rather inconvenient way. Let us introduce the very useful
and very versatile function str (for “structure”) that
provides insight into the structure of just any R object:
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> # and to get all the levels associated to the Species field:
> levels(iris$Species)
[1] "setosa" "versicolor" "virginica"
We learn that the iris variable is of type
data.frame and contains 150 observations
(records or rows) of 5 variables
(fields or columns). The first four fields contain
numeric variables and are named Sepal.Length,
Sepal.Width, Petal.Length and
Petal.Width, while the fifth field contains a categorical
variable, the name of the species. We are told that only three
different species (and thus 3 different levels) exist in the dataset:
Iris setosa, Iris versicolor and Iris
virginica. To get all the information relative to the levels of the
field named Species, we type
levels(iris$Species): the dollar sign is used to designate
a field of a data frame. Besides the information concerning the size of
the data container, field names, type of data, etc., str
displays the first values found in every field. Note that the internal
representation of categorical variables by R may rely on integers. These
are not to be typed in by the user, but they give you an idea of how
efficient categorical variable storage is in R.
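You can expose those internal integer codes with as.integer; a small sketch on a toy factor (the species names are borrowed from iris, but the vector itself is ours):

```r
species <- factor(c('setosa','versicolor','setosa','virginica'))
as.integer(species)                   # internal codes: 1 2 1 3
levels(species)[as.integer(species)]  # maps the codes back to the labels
```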
Another useful and versatile function to have a glance at a complex
set of data is summary, which gives an account of the
distribution of categorical variables as well as a basic statistical
summary of all numerical variables:
> summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
As we have already seen, the dollar sign is used to designate a field from a data frame:
> iris$Sepal.Length
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
Note that we could also use the good old square bracket notation to select that column: a data frame resembles a matrix where columns are designated not by their indices but by names (strings of characters):
> iris[,'Sepal.Length']
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
Selecting a range of lines works, as the lines (or records) in a data frame are numbered:
> iris[50:54,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
50 5.0 3.3 1.4 0.2 setosa
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
What if we now want to select some of the records (lines), based on some test(s) on their content? Below we extract all the records (i.e. lines, corresponding to specimens) for which the length of the sepal is at least 7.7. Note that nothing follows the comma enclosed in the square brackets: we select all columns.
> iris[iris$Sepal.Length>=7.7,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
123 7.7 2.8 6.7 2.0 virginica
132 7.9 3.8 6.4 2.0 virginica
136 7.7 3.0 6.1 2.3 virginica
Selecting lines with criteria calculated across several columns is
also possible. Below we select all the records where the sepal length is
greater than or equal to 7.7 AND the petal length is strictly greater
than 6.5. The logical AND operator is written &, the
logical OR is |, and the logical NOT is written !
(prepended):
> b <- iris[iris$Sepal.Length>=7.7 & iris$Petal.Length>6.5,]
> b
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
123 7.7 2.8 6.7 2.0 virginica
Note that the resulting object is still a data frame, where the original names of the records have remained:
> is.data.frame(b)
[1] TRUE
> b[1,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
118 7.7 3.8 6.7 2.2 virginica
> b['123',]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
123 7.7 2.8 6.7 2 virginica
> b[,2]
[1] 3.8 2.6 2.8
Here we see an interesting feature of data frames: their records
(lines) and fields (columns) are usually named with character strings,
but integer indices are still usable to address elements. If we want to
keep only a subset of all columns in our resulting data frame, we use
the subset function:
> subset1 = subset(iris, Sepal.Length==5.1, select=Sepal.Width)
> subset1
Sepal.Width
1 3.5
18 3.5
20 3.8
22 3.7
24 3.3
40 3.4
45 3.8
47 3.8
99 2.5
Above, the second argument of subset is the boolean
selection operator, while the third argument (select) trims
columns out of the output.
> subset1 = subset(iris,abs(Sepal.Length-5)<=0.1,select=c('Sepal.Length','Sepal.Width'))
> subset1
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
5 5.0 3.6
8 5.0 3.4
10 4.9 3.1
18 5.1 3.5
20 5.1 3.8
22 5.1 3.7
24 5.1 3.3
26 5.0 3.0
27 5.0 3.4
35 4.9 3.1
36 5.0 3.2
38 4.9 3.6
40 5.1 3.4
41 5.0 3.5
44 5.0 3.5
45 5.1 3.8
47 5.1 3.8
50 5.0 3.3
58 4.9 2.4
61 5.0 2.0
94 5.0 2.3
99 5.1 2.5
107 4.9 2.5
Here we use the subset command to extract the specimens
with a sepal length less than 0.1 away from 5.0, and on these we select
only the two columns relative to sepal dimensions.
In R, two very generic commands exist that allow you to combine data
row-wise or column-wise. They are called respectively rbind
and cbind
. With cbind
you may for instance add
an additional column to a dataframe. With rbind
you could
for instance concatenate the records of two similar arrays of data into
a single one. Let’s start with a simple example: say you want to add, as a first column of the dataframe iris, random id numbers for every specimen. We will use the random number generation function runif, with no guarantee not to get duplicates, though. On the first line below, runif(150, min=1, max=100000) generates a series of 150 numbers from the uniform probability distribution on the interval \([1, 100000]\). Of these random numbers, floor gives the numerical floor (the largest integer value no greater than the number). We then combine this new column with the existing iris data frame:
> my_column = floor(runif(150, min=1, max=100000))
> cbind(my_column, iris) -> iris2
> str(iris2)
'data.frame': 150 obs. of 6 variables:
$ my_column : num 48173 91295 60679 45880 93772 ...
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Pay attention that the additional column is automatically named after its source variable. You can always alter the column names later on, accessing them through the names() function, which can be used both to read and to set the names:
> names(iris2) -> oldnames
> oldnames[1] <- "ID"
> names(iris2) <- oldnames
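As a shortcut, the renaming can also be done in a single step, indexing directly into the result of names() without any intermediate variable (a minimal sketch that rebuilds iris2 as above):

```r
# Rebuild iris2 as above: a random-ID column prepended to iris
my_column <- floor(runif(150, min = 1, max = 100000))
iris2 <- cbind(my_column, iris)
# Rename the first column in place, in one step:
names(iris2)[1] <- "ID"
names(iris2)   # "ID" followed by the five original iris column names
```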
You can add additional rows to an existing dataframe with
rbind
, for instance adding the column averages to any
dataframe made of numeric values:
> av1 = mean(iris2[,1]);av2=mean(iris2[,2])
> av3=mean(iris2[,3]);av4=mean(iris2[,4]);av5=mean(iris2[,5])
> av6=names(which.max(table(iris2[,6]))) # to get the most common category in this factor
> rbind(iris2,c(av1,av2,av3,av4,av5,av6)) -> iris3
> tail(iris3, n=2)
ID Sepal.Length Sepal.Width Petal.Length
150 73373 5.9 3 5.1
151 53344.2866666667 5.84333333333333 3.05733333333333 3.758
Petal.Width Species
150 1.8 virginica
151 1.19933333333333 setosa
So it seems that we are fine: the line containing the average values
of all the columns has been added at the bottom of the dataframe. But
let’s look at the structure of our new iris3
variable:
> str(iris3)
'data.frame': 151 obs. of 6 variables:
$ ID : chr "48173" "91295" "60679" "45880" ...
$ Sepal.Length: chr "5.1" "4.9" "4.7" "4.6" ...
$ Sepal.Width : chr "3.5" "3" "3.2" "3.1" ...
$ Petal.Length: chr "1.4" "1.4" "1.3" "1.5" ...
$ Petal.Width : chr "0.2" "0.2" "0.2" "0.2" ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
What happened here? Silently, all the numeric variables have been converted to strings (values of mode “character”). This is worth explaining in detail, as it stems from the differences between data types in R.
First of all, we have to be aware that vectors (built for instance through the use of the c command) cannot contain mixed data types. Whenever one wants to build a vector from data of different modes, R automatically coerces some of the data to the most generic mode (usually the “character” mode, since most basic objects accept conversion to strings by means of the as.character() function). For instance, here:
> mode(av1)
[1] "numeric"
> mode(av6)
[1] "character"
> av1
[1] 53344.29
> c(av1,av2,av3,av4,av5,av6) -> last_row
> last_row
[1] "53344.2866666667" "5.84333333333333" "3.05733333333333" "3.758"
[5] "1.19933333333333" "setosa"
> mode(last_row)
[1] "character"
last_row, as an R vector, has a single mode: here it is made of strings. Very logically then, when we used rbind to concatenate this row to the iris2 table, the contagion spread to the dataframe, because we were adding character values to columns formerly of numeric mode.
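One way to avoid this silent coercion, sketched below, is to append a one-row data frame instead of a vector: contrary to a vector, a data frame can hold columns of different modes, so rbind can then match each column type by type:

```r
# Build the extra row as a one-row data frame, keeping numbers numeric
# (for simplicity we work on the original iris, without the ID column):
new_row <- data.frame(Sepal.Length = mean(iris$Sepal.Length),
                      Sepal.Width  = mean(iris$Sepal.Width),
                      Petal.Length = mean(iris$Petal.Length),
                      Petal.Width  = mean(iris$Petal.Width),
                      Species      = names(which.max(table(iris$Species))))
iris4 <- rbind(iris, new_row)
str(iris4)   # the four measurement columns are still numeric
```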
Most frequently your work session with R will start with importing some data from a datafile that you have somewhere on your computer. We are going to see how to build an R object from a file, filling the object (it will be the data structure R calls a data frame) with the data present in that file.
Whenever R is running, it has an internal variable indicating its current working directory. If you invoked R from a shell, this will usually be the working directory of the shell at the moment when you launched R. Otherwise it can be your home directory, or any other directory where R has read and write permissions. You may check R’s current working directory with the function getwd():
> getwd() # obviously you will get a different value here...
[1] "/home/jbde/Trainings/R/r-and-tidyverse-intro-ag-research"
> mydir = getwd()
The current working directory may be set to some other value through
the setwd
command:
> setwd("~") # setting it to my home directory...
> getwd()
[1] "/home/jbde"
> setwd(mydir) # setting the current directory back to original value
Whenever you specify some filename during your R session, R will
search for that file from the current working
directory, unless of course you give it in the form of an
absolute path (i.e. starting at the root of the
filesystem, with an initial slash /
). So if you just give
the name of a file (and not an absolute or relative path), R will expect
to find the file you’re talking about in its current working directory.
This is also where it will write files.
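A small sketch of this resolution rule, using a throwaway file in R’s temporary directory (the filename scratch.txt is made up for the example):

```r
olddir <- getwd()                    # remember where we are
setwd(tempdir())                     # move to the session's temp directory
writeLines("hello", "scratch.txt")   # bare filename: written into tempdir()
found <- file.exists("scratch.txt")  # TRUE: resolved from the working directory
setwd(olddir)                        # restore the original working directory
found
```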
The read.table function

Although there are several specific R functions to read data from files, these are nothing but pre-configured refinements of the generic read.table function. We will thus focus on this latter only. It is meant to read data from a file, according to some specifications (whether there is a header line, what the field separator is, etc.), and to write the content into an R object called a data frame.
As a first example of input file, let us take the file
fev_dataset.txt
provided along with this tutorial. This
file contains 655 lines (one header line and 654 records). We reproduce
below its first 6 lines (for those who want to know, the shell command
to get this is head -n 6 fev_dataset.txt
):
Age FEV Ht Gender Smoke
9 1.7080 57.0 0 0
8 1.7240 67.5 0 0
7 1.7200 54.5 0 0
9 1.5580 53.0 1 0
9 1.8950 57.0 1 0
Each line comprises 5 fields: the age of the child (Age), the forced expiratory volume (FEV), the height (Ht), the gender (Gender, coded 0/1) and the exposure to parental smoking (Smoke, coded 0/1).
This dataset comes from a study originally performed by Rosner and colleagues to find out whether constant exposure to an environment where at least one of the parents is smoking had an impact on the respiratory capacity of young boys and girls. Of course, the age and height of a child are expected to play a significant role in the determination of the pulmonary capacity of that same child, hence the presence of these variables among the data collected.
Let us first try and use the simplest form of the function
read.table
:
> read.table("fev_dataset.txt") -> dat1
Okay, we have read the contents of the file and have stored it into a new data frame we called dat1. As R keeps silent after this, let us check that this worked as expected. As two exploratory content-checking commands, you may use head (to display the first items of a data structure) or str (to get an informative output concerning the internal structure of the object you query):
> is.data.frame(dat1)
[1] TRUE
> head(dat1)
V1 V2 V3 V4 V5
1 Age FEV Ht Gender Smoke
2 9 1.7080 57.0 0 0
3 8 1.7240 67.5 0 0
4 7 1.7200 54.5 0 0
5 9 1.5580 53.0 1 0
6 9 1.8950 57.0 1 0
> str(dat1)
'data.frame': 655 obs. of 5 variables:
$ V1: chr "Age" "9" "8" "7" ...
$ V2: chr "FEV" "1.7080" "1.7240" "1.7200" ...
$ V3: chr "Ht" "57.0" "67.5" "54.5" ...
$ V4: chr "Gender" "0" "0" "0" ...
$ V5: chr "Smoke" "0" "0" "0" ...
Now, that’s bad: R has seen a file of 655 records, not noticing that the first line is actually not a data record, but the header containing the names of the different fields (or columns). Instead R used an automatic naming convention, dubbing the columns V1, V2, etc. Notice also that it automatically numbers the rows.
The output of the str function is even more informative: we see that, as R had to fit into the same column some textual data (coming from the header) together with numerical data (the measurement datapoints), it coerced every column to character strings. To correct this, we have to tell R that the first line of the file should not be considered as a data record but as a header giving the names of the columns. This is done through the header option, which we set to TRUE:
> read.table("fev_dataset.txt", header=TRUE) -> dat1
> head(dat1)
Age FEV Ht Gender Smoke
1 9 1.708 57.0 0 0
2 8 1.724 67.5 0 0
3 7 1.720 54.5 0 0
4 9 1.558 53.0 1 0
5 9 1.895 57.0 1 0
6 8 2.336 61.0 0 0
> str(dat1)
'data.frame': 654 obs. of 5 variables:
$ Age : int 9 8 7 9 9 8 6 6 8 9 ...
$ FEV : num 1.71 1.72 1.72 1.56 1.9 ...
$ Ht : num 57 67.5 54.5 53 57 61 58 56 58.5 60 ...
$ Gender: int 0 0 0 1 1 0 0 0 0 0 ...
$ Smoke : int 0 0 0 0 0 0 0 0 0 0 ...
All is fine now, as R correctly detected that the records contain numerical values. If we really want to insist that the Gender and Smoke variables be treated as categorical, we can force the use of specific classes (or “types”) for the different columns:
> read.table("fev_dataset.txt", header=TRUE, colClasses=c("integer","numeric","numeric","factor","factor")) -> dat1
> str(dat1)
'data.frame': 654 obs. of 5 variables:
$ Age : int 9 8 7 9 9 8 6 6 8 9 ...
$ FEV : num 1.71 1.72 1.72 1.56 1.9 ...
$ Ht : num 57 67.5 54.5 53 57 61 58 56 58.5 60 ...
$ Gender: Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 1 ...
$ Smoke : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
It is important to consider that within your original file, some
characters mark the boundary between one field and the following, within
one record. These can be for instance tabulation characters (usually
represented \t
), any number of consecutive white spaces,
commas (,
), semicolons (;
) or colons
(:
). By default read.table
’s field separator
is any “white space”, which means one or more spaces, tabs, newlines or
carriage returns. In case this is not suitable for your input file, you
have to specify the correct field separator by means of the
sep
option: for a regular CSV file (Comma-Separated
Values), you would specify sep=","
as an option to
read.table
.
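As a sketch, here is a tiny round trip through a temporary CSV file (the filename is generated on the fly; it is not part of the tutorial’s data). Note that read.csv is simply read.table with header=TRUE and sep="," preset:

```r
f <- tempfile(fileext = ".csv")
writeLines(c("Age,Ht", "9,57.0", "8,67.5"), f)   # a minimal CSV file
d <- read.table(f, header = TRUE, sep = ",")
str(d)   # 2 obs. of 2 numeric variables, with the header used as names
```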
Also important to consider is the decimal separator in use for numerical data in your original datafile. Most people around the world use the dot (.) as the decimal point, but the convention in French-speaking countries, among others, is to use the comma (,). Such an encoding is correctly handled by R during importation when the user specifies the decimal point through the dec option, e.g. dec=",".
Sometimes your file contains missing values, for instance if you have
two consecutive commas (,,
) in a CSV file using the comma
as a field separator. R understands this and uses a special token called
NA
to encode missing values. Sometimes the people who
prepared the datafile also use another predefined token to indicate
missing values, for instance “Na”, “na” or “n/a”. In order for R to
translate these strings into its proper NA
values, you have
to indicate the possible NA-coding strings by means of the
na.strings
option. For instance:
> read.table("myfile.txt", header=TRUE, sep=",", na.strings=c("na", "n/a"))
In this case any value completely missing (two consecutive field
separators), as well as any string “na” or “n/a” as a field value, will
be understood by R as a missing value and translated into the unique
appropriate R token NA
.
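A quick sketch on a throwaway file, mixing an empty field with the “na” and “n/a” tokens:

```r
f <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,,3", "4,na,n/a"), f)   # three missing values in total
d <- read.table(f, header = TRUE, sep = ",", na.strings = c("na", "n/a"))
sum(is.na(d))   # all three were translated into NA
```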
Once you want to write some tabulated data (e.g. a vector, matrix or data frame) from R into a file in your filesystem, you may use the write.table function. For instance, to write the content of an R data frame called mydata in your current working session into a file called myfile in your current working directory, you will use:
> write.table(mydata, file="myfile")
In this case, the file will be written using the default field
separator for write.table
, which is a single space
character. You may explore the different formatting options by calling
the help of write.table
. Most of them are the equivalent of
what you can find for the reciprocal read.table
function.
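For instance, a sketch of a full round trip through a temporary file, writing the first rows of iris out as CSV and reading them back:

```r
f <- tempfile(fileext = ".csv")
# row.names=FALSE drops the row names; sep="," makes it a CSV file:
write.table(head(iris, 3), file = f, sep = ",", row.names = FALSE)
back <- read.table(f, header = TRUE, sep = ",")
dim(back)   # 3 rows and 5 columns, as written
```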
While many people use an advanced spreadsheet software (e.g. LibreOffice Calc, Apache OpenOffice Calc or Microsoft Excel) to prepare their data, unfortunately it is not possible to read data into R directly from the .ods, .xls or .xlsx formats. This is mainly because these formats are meant to accommodate complex data in several sheets, which possibly cannot fit into one single R object. When you want to import some data from such a format, you first have to convert it into some CSV format (notice that the field separator can be any other character of your choice, not necessarily a comma), using your favourite spreadsheet software. You will then be able to read the simpler CSV format into R with the read.table function.
For instance, after having transformed the file tutorial_data.xlsx into a comma-separated file that you call tutorial_data_commasep.csv, you would load it into a mydat data frame in your R session using the following command (the + sign at the beginning of the continuation lines is R’s continuation prompt, not something you type: the line breaks were necessary only to fit the lines into this pagewidth):
> mydat=read.table("tutorial_data_commasep.csv",
+ header=T,
+ sep=",",
+ colClasses=c("integer","factor","integer","numeric","factor",
+ "numeric","factor","integer","factor","factor"))
Comments in R
This is really what we should start with. On a commandline, the hash character (#) has a special meaning: the rest of the line, including the hash character itself, is simply ignored by R. This means that you can use it to enter comments in your code. Please comment your code generously, as it will make your life easier when you go back to saved files of R commands months after you first wrote them: you will understand what they mean exactly.
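A short sketch of decently commented code:

```r
# Mean sepal length of the setosa specimens in the built-in iris dataset.
setosa <- iris[iris$Species == "setosa", ]   # keep only the setosa rows
m <- mean(setosa$Sepal.Length)               # everything after '#' is ignored
m
```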