Introduction

This tutorial walks you through basic interactions with the R software package, from data manipulation to statistical tests and regression models. You may start an interactive session with R directly from the shell by typing R on the commandline. Keep in mind that your GNU/Linux environment is case-sensitive, so typing r will fail. Alternatively, you can start R through the fancier Graphical User Interface (GUI) called RStudio: type rstudio from the terminal, or (double-)click on the corresponding icon on your desktop.

During your interactive session with R, you type commands or expressions in the terminal (the frame called Console in RStudio), to which R replies either silently, by displaying some result, or by performing some other action (e.g. drawing a graph in a separate window frame). Whenever R is ready to listen to you, it invites you to type in a command with a small “greater than” sign at the very beginning of the last line in the Console. In the examples you will find throughout this tutorial, the interactions with R are reproduced in typewriter font. You don’t have to type the initial “greater than” sign displayed at the beginning of a commandline: it is simply the invite produced by R.

Once you are done with the present tutorial, if you want to deepen your understanding of how R works, start writing some basic programs in R, etc., I suggest you read the excellent tutorial written by Emmanuel Paradis and entitled R for beginners. It is a bit “old-school”, focusing on the base R programming style, but very instrumental in understanding how R works.

First interactions

Comments in R

This is really what we should start with. On a commandline, the hash character (#) has a special meaning: the rest of the line, including the hash character itself, is simply ignored by R. This means that you can use it to enter comments in your code. Please do comment your code extensively, as it will make your life easier when you go back to saved files of R commands months after you first wrote them: you will still understand exactly what they mean.
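
A minimal illustration (the output shown is what R prints for this small computation):

> 3 * 5    # everything after the hash sign is ignored by R
[1] 15
> # a line may also consist of a comment only: R then replies nothing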

Variable assignment

In programming languages, storing some kind of data into the “memory” of the computer (in fact, into the current symbolic workspace defined by your R session) is called assigning a value to a variable. When you assign a value (whatever its type: a string of characters, an integer, a boolean value, a floating-point number, etc.) to a variable, you create a new object in R. This is done through the assignment operator ->, written as a hyphen (-) followed by a “greater than” sign (>), so that it forms an arrow pointing from the value towards the variable. Taking care of swapping the two operands (variable and value), it can also be written the other way around, as <-:

> 5 -> a    # puts the value 5 into the variable a
> b <- 7    # puts the value 7 into the variable b 
> b - a     # asks for the evaluation (calculation) of b-a 
[1] 2
> 2         # nothing prevents you from evaluating a constant!
[1] 2

The “greater than” character that appears at the beginning of the line (in a terminal or in RStudio’s Console frame) is the invite. It means that R is waiting for you to enter a command. Some commands produce a silent output (e.g. assigning a value to a variable is a silent operation), others display a result. For example, subtracting the content of the variable called a from the content of the variable called b gives the single value 2 as an output. Notice that the [1] automatically prepended to the output line is an index: as R tends to see everything as a vectorized variable (you will see other examples very soon), it tells us that this 2 is the first element of the output vector produced by the operation b-a (here, a vector containing only one element). On the last line of the interaction above, you can see that typing a mere value and asking R to evaluate it by pressing the (Enter) key yields the exact same output with the exact same formatting.

Vectors and function calls

As we just said, vectors are essential data structures in R. One creates vectors simply by using the concatenation operator (or function), aptly named c:

> myvec <- c(1,3,5,7,84) # passing five integers as arguments to the function c
> myvec                  # myvec is now a vector containing five elements
[1]  1  3  5  7 84
> length(myvec)          # calling the function length on the object myvec
[1] 5

Here we have just used our first two functions in R. With the first command of this interaction block, we performed a function call to the function called c, giving it five integers as arguments. A function call in R is always written like this: the name of the function followed by a pair of parentheses enclosing a list of comma-separated arguments (possibly only one, or even none). Hence, the last operation we performed above is a function call to the function length, passing it only one argument, namely the myvec variable.

Remember that whenever you want to use a function, you have to write these parentheses, even if the function needs no argument. For instance, the function ls, when invoked (another word computer people use for “called”) with no argument, lists the content of your current “userspace” or “environment”: it returns a vector of strings giving the names of the objects present in your current R environment (i.e. the variables you defined earlier in your interactive session).

> ls()
[1] "a"     "b"     "myvec"

Beware! If you omit the parentheses, R will evaluate the function object itself and print its definition:

> ls
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
    pattern, sorted = TRUE) 
{
    if (!missing(name)) {
        pos <- tryCatch(name, error = function(e) e)
        if (inherits(pos, "error")) {
            name <- substitute(name)
            if (!is.character(name)) 
                name <- deparse(name)
            warning(gettextf("%s converted to character string", 
                sQuote(name)), domain = NA)
            pos <- name
        }
    }
    all.names <- .Internal(ls(envir, all.names, sorted))
    if (!missing(pattern)) {
        if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
            ll != length(grep("]", pattern, fixed = TRUE))) {
            if (pattern == "[") {
                pattern <- "\\["
                warning("replaced regular expression pattern '[' by  '\\\\['")
            }
            else if (length(grep("[^\\\\]\\[<-", pattern))) {
                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
            }
        }
        grep(pattern, all.names, value = TRUE)
    }
    else all.names
}
<bytecode: 0x55e9cec20c90>
<environment: namespace:base>

Getting some help

To get some help on a command, just type a question mark (?) immediately followed by the name of the command. If you are using R on the commandline, you will then enter a manpage-like environment, in which the space bar moves you to the next screen (or page), the b key goes back one page, and pressing q quits the help and brings you back to the commandline and its invite. Alternatively, depending on the configuration of your computer, you may get access to the HTML version of the help pages: that will be the case for most of us using RStudio, where the manpage is displayed in a nice HTML format in the bottom right pane. By the way, in RStudio you can also ask for manpages by clicking on the Help tab in the bottom right pane and then entering a search keyword in the search box in the upper right corner of that pane. You don’t need to type the question marks when using that search box.

> ?ls
> #.... viewing the help on the ls command in the help pane
> 
> ?length
> # .... viewing the help on the length command in the help pane
> 
> ??transpose
> #  .... viewing the help pages containing the transpose keyword in the help pane

The latter type of query (with the double question mark) is to be used when you are unsure of the name of the function you are seeking help about. It provides you with the list of help files containing the keyword in question, after which you can prepend the single question mark to the name of the identified command (or function) to get the relevant help page.

These help pages (also called manpages) are all built according to the same structure: after a short Description of the purpose of the command comes a section called Usage. This latter section presents the syntax you may use to build a function call, with the possible options to specify and their default values. For instance, in the help about the c function, the Usage section reads as follows:

c(..., recursive = FALSE)

We are told here that the c function first takes several arguments (an arbitrary number of them, denoted by the ellipsis ...) and then expects an optional parameter named recursive. We know it is optional because of the equal sign and the value specified thereafter: when the user omits a value for that parameter, R silently sets it to FALSE.

This is a general rule for the behaviour of functions in R: when an equal sign appears in the Usage section right after the name of an argument, it means that if the corresponding argument is omitted, R silently gives it the said value (i.e. the value appearing on the right of that equal sign). This makes function calls very flexible to write, as one can omit some of the arguments when one is fine with the default values.

For instance, you may try the following:

> list1 <- list(1,2)   # the list function builds a list from its argument
> list2 <- list(3,4)
> c(list1,list2)       # don't pay too much attention to the way R writes lists
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4
> c(list1,list2,recursive=FALSE) # produces the exact same output as the previous command
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4
> c(list1,list2,recursive=TRUE)
[1] 1 2 3 4

In other words, the function call actually executed when the user enters c(list1,list2) is c(list1,list2,recursive=FALSE): the default value for the argument has been used.

Back into the manpage, the section named Arguments explains the role and valid values of the different arguments (or options) of the function, and the Details section gives further in-depth explanations about the behaviour of the function and the results it yields.

Different data, different modes

R can manipulate different types of data. Whenever you input some data into R, these data are given a mode, and this mode affects the way R deals with them.

> mode(0.1)
[1] "numeric"
> mode("allo")
[1] "character"
> mode(2 > 5)
[1] "logical"
> mode(1+2i)
[1] "complex"
> mode(TRUE)
[1] "logical"
> mode(T)   # T is short for TRUE; FALSE may also be written F
[1] "logical"

Non-atomic constructs also inherit the mode of their components:

> a = c(1,1)   # yes, variable assignment can also be written in this even simpler way... ;)
> mode(a)
[1] "numeric"
> a = c("ab","ba")
> mode(a)
[1] "character"

The two logical values in R are TRUE and FALSE. You may instead use their shorter versions T and F. Attention: use these symbols “as is”. Enclosing them in quotes would not give the boolean values but mere strings (mode character).
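
A quick check of that caveat (outputs shown as R would print them):

> mode(FALSE)
[1] "logical"
> mode("TRUE")    # quoted: this is a mere string of characters
[1] "character"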

Data structures in R: vectors, matrices and data frames

We’ve already seen the c function to create vectors. A vector is a special R construct used to store an array of items all having the same mode. Vectors made of evenly spaced numeric items are called sequences. These are created with the seq command. The following three commands produce the very same result:

> c(1,2,3,4,5)
[1] 1 2 3 4 5
> seq(from=1, to=5, by=1)
[1] 1 2 3 4 5
> 1:5
[1] 1 2 3 4 5

Notice that the second command uses argument naming: seq(from=1, to=5, by=1). This is optional but convenient in many cases, and highly recommended in the specific case of seq. We will see other examples of function calls with explicit argument naming later on. In order to understand how the seq function works, please refer to its help page (command ?seq or help("seq")). There you can see (in the Usage section, a.k.a. the “synopsis” of the command/function) that the first three arguments expected by seq are named respectively from, to and by (the first one is the lower bound of the sequence to be generated, the second is the upper bound and the third is the step value).

Keep in mind the 1:5 trick to quickly build a sequence of consecutive integers. We will make use of it later. Also notice that seq is useful to sample an interval on the real line at regular steps, e.g. x = seq(from=0, to=1, by=0.01) builds a vector of 101 points evenly spaced on the real interval \([0,1]\), both boundaries included, and stores the result into the variable x.
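
You can check that claim directly (outputs shown as R would print them):

> x = seq(from=0, to=1, by=0.01)
> length(x)
[1] 101
> x[1:5]      # the first five points of the sampling
[1] 0.00 0.01 0.02 0.03 0.04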

Finally, we mention the rep function that helps you build a vector with a repetitive content. For instance, if we want to build a vector of 10 values all equal to 1:

> rep(1,10)
 [1] 1 1 1 1 1 1 1 1 1 1

This also works to repeat a pattern of length \(> 1\):

> rep(1:3,4)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3
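
The rep function also accepts an each argument (see ?rep) that repeats every element of the pattern in place before moving on to the next one:

> rep(1:3, each=4)
 [1] 1 1 1 1 2 2 2 2 3 3 3 3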

Indexing and altering elements

In a vector, each element is addressable by its index. The usual square bracket notation applies. Vector indices start at 1, and vector elements are mutable: one can alter directly one of the values in a vector without reassigning the whole vector.

> a = c(10,11,12,13)
> a[2]
[1] 11
> a[2] <- 20
> a
[1] 10 20 12 13

One can also take a slice from an existing vector, extracting consecutive elements:

> a[2:4]
[1] 20 12 13
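
As a side note (this is standard base R indexing; outputs shown as R would print them for the current content of a), the index vector need not contain consecutive values, and a negative index drops the corresponding element:

> a[c(1,4)]
[1] 10 13
> a[-1]       # everything except the first element
[1] 20 12 13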

Going multidimensional: matrices

Our first attempt to build a 2x2 matrix could be as follows:

> mat1=c(c(1,2),c(3,4))
> mat1
[1] 1 2 3 4
> length(mat1)
[1] 4

So it doesn’t work like that: the c command flattens all the arguments it is given to build only one long vector. We have to use the function matrix to build a matrix:

> mat1 = matrix(data=c(1,2,3,4), nrow=2, ncol=2)
> mat1
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Note that by default R fills in the data structure by columns: it fills up the first column with the first data and then proceeds to the second column with the remaining data, etc. This is because the default value for the byrow option of the matrix command is FALSE (see the manpage of matrix, e.g. by typing ?matrix after the invite). To proceed row-wise, you must explicitly specify byrow=T (see our examples further below).
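
For comparison, the same data with byrow set to TRUE fills the matrix row by row (output shown as R would print it):

> matrix(data=c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4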

We said earlier that vectorization is often implicit in R. Here are some examples:

> mat2 = matrix(data=5, nrow=2, ncol=3)
> mat2
     [,1] [,2] [,3]
[1,]    5    5    5
[2,]    5    5    5
> mat3 = matrix(data=c(1,2), nrow=3, ncol=2, byrow=T)
> mat3
     [,1] [,2]
[1,]    1    2
[2,]    1    2
[3,]    1    2

R automatically replicates the input data to fill the container up to its size (here, a 3x2 matrix).

A matrix is in fact a special form of vector: it has a length and a mode (and only one mode: try to mix strings and numbers in the same matrix and see what happens), but also an additional feature, its dimension, that one can query (and alter) through the function dim.

> length(mat3)
[1] 6
> mode(mat3)
[1] "numeric"
> dim(mat3)
[1] 3 2
> dim(mat3) = c(1,6)
> mat3
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    1    2    2    2

Accessing elements from a matrix is done also with the usual square bracket notation:

> mat4=matrix(data=1:20, ncol=5, byrow=T)
> mat4
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
> mat4[2,3]
[1] 8
> mat4[,3]
[1]  3  8 13 18
> mat4[1,]
[1] 1 2 3 4 5

These last two commands are very important to understand. These are the first and simplest examples of subsetting that we see: here we extract the third column of mat4, and then its first row. In both cases the results are vectors. mat4[,] would simply give the entire mat4. All sorts of selection patterns can be used in combination.

For instance:

> mat4[,2:4]
     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    7    8    9
[3,]   12   13   14
[4,]   17   18   19

Of course, as the resulting matrix is a new object, with no memory of the data container it originates from, indexing in it starts anew from 1.

Applying functions to variables with implicit vectorization

In R, the “modulo” operator (you know, that operator giving the remainder in the Euclidean division of its two operands) is written %%. Its two operands (on its left and on its right) are expected to be numbers:

> 17 %% 5
[1] 2

But in fact we can also use this operator to find out all the remainders modulo 2 from a matrix:

> transformed = mat4 %% 2
> transformed
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    1    0    1
[2,]    0    1    0    1    0
[3,]    1    0    1    0    1
[4,]    0    1    0    1    0
> sum(transformed)
[1] 10

Fair enough: out of this matrix of 20 consecutive integers, 10 are odd numbers. The modulo operator has worked in a vectorized fashion: it has been applied to every element of its input, producing an output of the same dimensionality. What’s more, we can extract from mat4 all the elements with zero remainder in the division by 3, with the single following selection command:

> mat4[mat4 %% 3 == 0]
[1]  6 12  3 18  9 15

As in most programming languages, the boolean operator to test for equality in R is ==, obviously because = is one of the operators used for variable assignment. We use it in the command above to select a subset of the elements of mat4, extracting those whose Euclidean division by 3 yields a zero remainder (i.e. the elements of mat4 that are divisible by 3). Now let us use the same kind of selection pattern to count the number of A’s and G’s in a large matrix made of nucleotides. Our 10x42 matrix is made of repeated patterns of five nucleotides (A, C, A, T, G): there will be twice as many A’s as any of the other three nucleotides.

> mat5=matrix(c('A','C','A','T','G'),nrow=10,ncol=42,byrow=T)
> mat5[1,] # just to check the first line...
 [1] "A" "C" "A" "T" "G" "A" "C" "A" "T" "G" "A" "C" "A" "T" "G" "A" "C" "A" "T"
[20] "G" "A" "C" "A" "T" "G" "A" "C" "A" "T" "G" "A" "C" "A" "T" "G" "A" "C" "A"
[39] "T" "G" "A" "C"
> mat5 == "G"  # vectorized comparison to "G"
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
 [1,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [2,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [3,]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 [4,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [5,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [6,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [7,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [8,]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 [9,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[10,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
 [1,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [2,]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 [3,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [4,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [5,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [6,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [7,]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 [8,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [9,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
[10,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
 [1,]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 [2,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [3,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [4,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [5,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [6,]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 [7,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [8,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [9,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[10,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
      [,37] [,38] [,39] [,40] [,41] [,42]
 [1,] FALSE FALSE FALSE  TRUE FALSE FALSE
 [2,] FALSE  TRUE FALSE FALSE FALSE FALSE
 [3,] FALSE FALSE FALSE FALSE  TRUE FALSE
 [4,] FALSE FALSE  TRUE FALSE FALSE FALSE
 [5,]  TRUE FALSE FALSE FALSE FALSE  TRUE
 [6,] FALSE FALSE FALSE  TRUE FALSE FALSE
 [7,] FALSE  TRUE FALSE FALSE FALSE FALSE
 [8,] FALSE FALSE FALSE FALSE  TRUE FALSE
 [9,] FALSE FALSE  TRUE FALSE FALSE FALSE
[10,]  TRUE FALSE FALSE FALSE FALSE  TRUE
> length(mat5[mat5=="G"])
[1] 84

Note that the two boolean/logical values TRUE and FALSE are automatically translated into 1 and 0 respectively when coerced into an arithmetic operation, for instance the sum operation illustrated below. This makes it extremely convenient to count the number of elements in a vector that fulfill a given requirement:

> sum(mat5=="G")
[1] 84
> sum(mat5=="A") # correct result: 168 = 2 * 84
[1] 168

It is worth commenting further here about the selection scheme based on a test (i.e. an expression evaluating to a boolean value or an array of boolean values, that is, an R object in logical mode) enclosed in square brackets appended to the name of a variable. While in the above examples the array we took a subset from (mat4 or mat5) also appeared in the selection pattern (e.g. mat5[mat5=='G']), this is not a constraint imposed by R. One can extract elements from a vector according to a criterion determined on some other vector, possibly unrelated to the first one. For example:

> myvec = 1:20
> filter = rep(c(T,F,F,T,F),4)      # logical vector unrelated to myvec...
> length(filter)                    # ... but with the same length
[1] 20
> myvec[filter]
[1]  1  4  6  9 11 14 16 19

Working with categorical variables: factors

A categorical variable can take only a finite number of different values, usually from quite a small set. Examples are for instance:

  * a nucleotide can be encoded as a categorical variable, taking one of four values represented e.g. with the four letters A, C, G and T;
  * human blood groups are taken from the set {A, B, AB, O} (four categories or levels);
  * from the Martin-Schultz scale, one could classify human iris colours into the seven following categories: amber, blue, brown, gray, green, hazel, red/violet;
  * a question from a survey could accept responses high, medium or low (three categories or levels);
  * a disease allele can be either dominant or recessive (two categories or levels).

Please note that some categorical variables come with no specific logical ordering (e.g. human iris colours), while some others (we call them ordinal variables) do: for instance the high/medium/low categories.

In R, the different values a categorical variable can possibly take are called levels. Defining categorical variables is done through the factor function. One has to pay attention to the fact that this function is at the same time the instantiation of a vector and the declaration of the type of its elements. Possible values that do not appear in the specific vector we are declaring should be named through the argument called levels.

For a start, let us define the blood groups of 8 patients:

> blood_groups = factor(c('A','B','A','B','A','O','AB','A'))
> blood_groups
[1] A  B  A  B  A  O  AB A 
Levels: A AB B O

Please refer to the manpage for factor: in the example above we only use the first argument of this function (called x in the manpage). Here it is a vector made of strings, but it could also have been a vector of integers. R automatically deduces the four possible levels for this categorical variable, as in this example at least one instance of each is present. The resulting variable, blood_groups, is a vector plus the information about the possible levels of the underlying categorical variable. Note the difference with:

> blood_groups=c('A','B','A','B','A','O','AB','A')
> blood_groups
[1] "A"  "B"  "A"  "B"  "A"  "O"  "AB" "A" 

In this latter example, blood_groups is simply a vector of strings, with no underlying idea of categories.

If we need to tell R that levels exist that are not present in the vector we define, we have to use the second argument of the factor function, called levels in its manpage:

> blood_groups=factor(c('A','B','A'), levels=c('A','B','AB','O'))
> blood_groups
[1] A B A
Levels: A B AB O

If we want to define ordinal data, we put the right ordering in the levels argument and set the boolean option ordered to TRUE. Note the informative way used by R to display that ordering along with the vector of categorical data:

> responses = factor(c('low','low','high','low'), levels=c('low','med','high'), ordered=T)
> responses
[1] low  low  high low 
Levels: low < med < high

To sum up, let us remember that factors are a kind of vector that also contains the information relative to the possible levels or categories. The function levels gives you direct access to these.
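
For instance, applied to the responses factor defined above (output shown as R would print it):

> levels(responses)
[1] "low"  "med"  "high"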

Data frames

Data frames are complex data containers in R. While vectors (and their matrix derivatives) cannot store data of different types (try to build c(1,2,"three")), data frames can store different types of data in different fields (i.e. different columns). To get our first example of a data frame, and as we don’t know yet how to import data from our data files, we load some data readily available in R through the variable called iris. These are observations of a number of plant specimens, all belonging to the Iris genus. For each specimen, geometrical features of the petal and sepal are recorded, as well as its classification name at the species level. As this data frame is rather large, typing iris on the commandline dumps the whole data frame to the screen in a rather inconvenient way. Let us instead introduce the very useful and very versatile function str (for “structure”) that provides insight into the structure of just about any R object:

> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> # and to get all the levels associated to the Species field:
> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica" 

We learn that the iris variable is of type data.frame and contains 150 observations (records or rows) of 5 variables (fields or columns). The first four fields contain numeric variables and are named Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, while the fifth field contains a categorical variable giving the name of the species. We are told that only three different species (and thus 3 different levels) exist in the dataset: Iris setosa, Iris versicolor and Iris virginica. To get all the information relative to the levels of the field named Species, we type levels(iris$Species): the dollar sign is used to designate a field of a data frame. Besides the information concerning the size of the data container, field names, type of data, etc., str displays the first values found in every field. Note that the internal representation of categorical variables by R relies on integers. These are not to be typed in by the user, but they give you an idea of how efficient categorical variable storage is in R.
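
To get a glimpse of that internal representation, one can extract a few values from the Species column and ask for their integer codes (outputs shown as R would print them):

> iris$Species[1:3]
[1] setosa setosa setosa
Levels: setosa versicolor virginica
> as.integer(iris$Species[1:3])   # the underlying integer codes
[1] 1 1 1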

Another useful and versatile function to have a glance at a complex set of data is summary, which gives an account of the distribution of categorical variables as well as a basic statistical summary of all numerical variables:

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Taking subsets out of a data frame

As we have already seen, the dollar sign is used to designate a field from a data frame:

> iris$Sepal.Length
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

Note that we could also use the good old square bracket notation to select that column: a data frame resembles a matrix whose columns can be designated not only by their indices but also by their names (strings of characters):

> iris[,'Sepal.Length']
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

Selecting a range of lines works, as the lines (or records) in a data frame are numbered:

> iris[50:54,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
50          5.0         3.3          1.4         0.2     setosa
51          7.0         3.2          4.7         1.4 versicolor
52          6.4         3.2          4.5         1.5 versicolor
53          6.9         3.1          4.9         1.5 versicolor
54          5.5         2.3          4.0         1.3 versicolor

What if we now want to select some of the records (lines), based on some test(s) on their content? Below we extract all the records (i.e. lines, corresponding to specimens) for which the length of the sepal is at least 7.7. Note that nothing follows the comma enclosed in the square brackets: we select all columns.

> iris[iris$Sepal.Length>=7.7,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
118          7.7         3.8          6.7         2.2 virginica
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica
132          7.9         3.8          6.4         2.0 virginica
136          7.7         3.0          6.1         2.3 virginica

Selecting lines with criteria calculated across several columns is also possible. Below we select all the records where the sepal length is greater than or equal to 7.7 AND the petal length is strictly greater than 6.5. The logical AND operator is written &, while the logical OR is | and the logical NOT is written ! (prepended):

> b <- iris[iris$Sepal.Length>=7.7 & iris$Petal.Length>6.5,]
> b
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
118          7.7         3.8          6.7         2.2 virginica
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica

Note that the resulting object is still a data frame, where the original names of the records have remained:

> is.data.frame(b)
[1] TRUE
> b[1,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
118          7.7         3.8          6.7         2.2 virginica
> b['123',]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
123          7.7         2.8          6.7           2 virginica
> b[,2]
[1] 3.8 2.6 2.8

Here we see an interesting feature of data frames: their records (lines) and fields (columns) are usually named with character strings, but integer indices are still usable to address elements. If we want to get only a subset of all columns in our resulting data frame, we use the subset function:

> subset1 = subset(iris, Sepal.Length==5.1, select=Sepal.Width)
> subset1
   Sepal.Width
1          3.5
18         3.5
20         3.8
22         3.7
24         3.3
40         3.4
45         3.8
47         3.8
99         2.5

Above, the second argument of subset is the boolean selection operator, while the third argument (select) trims columns out of the output.

> subset1 = subset(iris,abs(Sepal.Length-5)<=0.1,select=c('Sepal.Length','Sepal.Width'))
> subset1
    Sepal.Length Sepal.Width
1            5.1         3.5
2            4.9         3.0
5            5.0         3.6
8            5.0         3.4
10           4.9         3.1
18           5.1         3.5
20           5.1         3.8
22           5.1         3.7
24           5.1         3.3
26           5.0         3.0
27           5.0         3.4
35           4.9         3.1
36           5.0         3.2
38           4.9         3.6
40           5.1         3.4
41           5.0         3.5
44           5.0         3.5
45           5.1         3.8
47           5.1         3.8
50           5.0         3.3
58           4.9         2.4
61           5.0         2.0
94           5.0         2.3
99           5.1         2.5
107          4.9         2.5

Here we use the subset command to extract the specimens with a sepal length less than 0.1 away from 5.0, and on these we select only the two columns relative to sepal dimensions.

Combining rows and columns

In R, two very generic commands exist that allow you to combine data row-wise or column-wise. They are called rbind and cbind respectively. With cbind you may for instance add an additional column to a dataframe. With rbind you could for instance concatenate the records of two similar arrays of data into a single one. Let’s start with a simple example: say you want to add, as a first column of the dataframe iris, a random id number for every specimen. We will use the random number generation function runif, with no guarantee not to get duplicates, though. On the first line below, runif(150, min=1, max=100000) generates a series of 150 numbers drawn from the uniform probability distribution on the interval \([1, 100000]\). Of these random numbers, floor gives the numerical floor (the largest integer value not greater than the number). We then combine this new column with the existing iris data frame:

> my_column = floor(runif(150, min=1, max=100000))
> cbind(my_column, iris) -> iris2
> str(iris2)
'data.frame':   150 obs. of  6 variables:
 $ my_column   : num  48173 91295 60679 45880 93772 ...
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Note that the additional column is automatically named after its source variable. You can always alter the column names later on, accessing them through the names() function, which can be used both to read and to set the names:

> names(iris2) -> oldnames
> oldnames[1] <- "ID"
> names(iris2) <- oldnames
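
Note that the renaming could also be done in a single command, and we can check the result (output shown as R would print it):

> names(iris2)[1] <- "ID"     # equivalent one-liner
> names(iris2)[1]
[1] "ID"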

You can add additional rows to an existing dataframe with rbind; for instance, let us append a summary row containing the average of each numeric column and the most common category for the factor column:

> av1 = mean(iris2[,1]);av2=mean(iris2[,2])
> av3=mean(iris2[,3]);av4=mean(iris2[,4]);av5=mean(iris2[,5])
> av6=names(which.max(table(iris2[,6]))) # to get the most common category in this factor
> rbind(iris2,c(av1,av2,av3,av4,av5,av6)) -> iris3
> tail(iris3, n=2)
                  ID     Sepal.Length      Sepal.Width Petal.Length
150            73373              5.9                3          5.1
151 53344.2866666667 5.84333333333333 3.05733333333333        3.758
         Petal.Width   Species
150              1.8 virginica
151 1.19933333333333    setosa

So it seems that we are fine: the line containing the column averages (and the most common species) has been added at the bottom of the dataframe. But let’s look at the structure of our new iris3 variable:

> str(iris3)
'data.frame':   151 obs. of  6 variables:
 $ ID          : chr  "48173" "91295" "60679" "45880" ...
 $ Sepal.Length: chr  "5.1" "4.9" "4.7" "4.6" ...
 $ Sepal.Width : chr  "3.5" "3" "3.2" "3.1" ...
 $ Petal.Length: chr  "1.4" "1.4" "1.3" "1.5" ...
 $ Petal.Width : chr  "0.2" "0.2" "0.2" "0.2" ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

What happened here? Silently, all the numeric variables have been converted to strings (values of mode “character”). This is very interesting to explain here, as it draws on the differences between data types in R.

First of all, we have to be aware that vectors (built for instance through the c command) cannot contain mixed data types. Whenever one wants to build a vector from data of different modes, R automatically coerces some of the data to the most generic mode involved (usually the “character” mode, as lots of basic objects accept conversion to strings by means of the as.character() function). For instance, here:

> mode(av1)
[1] "numeric"
> mode(av6)
[1] "character"
> av1
[1] 53344.29
> c(av1,av2,av3,av4,av5,av6) -> last_row
> last_row
[1] "53344.2866666667" "5.84333333333333" "3.05733333333333" "3.758"           
[5] "1.19933333333333" "setosa"          
> mode(last_row)
[1] "character"

last_row, as an R vector, has only one single mode: here it is made of strings. Very logically then, when we used rbind to concatenate this row to the iris2 table, the contagion spread to the dataframe because we were adding character values to columns formerly of numeric mode.
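
The coercion hierarchy itself can be observed on minimal examples (outputs shown as R would print them):

> mode(c(1, TRUE))     # logical values are promoted to numeric
[1] "numeric"
> mode(c(1, "one"))    # numeric values are promoted to character
[1] "character"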

Data input/output in R: from/to files

Most frequently, your work session with R will start with importing some data from a datafile that you have somewhere on your computer. We are going to see how to build an R object from a file, filling the object (it will be the data structure R calls a data frame) with the data present in that file.

Current working directory

Whenever R is running, it has an internal variable indicating its current working directory. If you invoked R from a shell, this will usually be the working directory of the shell at the moment when you launched R. Otherwise it can be your home directory, or any other directory where R has read and write permissions. You may check R’s current working directory with the function getwd():

> getwd()      # obviously you will get a different value here...
[1] "/home/jbde/Trainings/R/r-and-tidyverse-intro-ag-research"
> mydir = getwd()

The current working directory may be set to some other value through the setwd command:

> setwd("~")      # setting it to my home directory...
> getwd()
[1] "/home/jbde"
> setwd(mydir)    # setting the current directory back to original value

Whenever you specify some filename during your R session, R will search for that file from the current working directory, unless of course you give it in the form of an absolute path (i.e. starting at the root of the filesystem, with an initial slash /). So if you just give the name of a file (and not an absolute or relative path), R will expect to find the file you’re talking about in its current working directory. This is also where it will write files.
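
Two handy base R functions let you check what R sees from there (the filename below is just a made-up example):

> list.files()                     # names of the files in the current working directory
> file.exists("some_data.txt")     # TRUE if that file is visible from the working directory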

The generic read.table function

Although there are several specific R functions to read data from files, these are nothing but refinements of the generic read.table function with preset defaults. We will thus focus on the latter only. It reads data from a file, according to some specifications (whether there is a header line, what the field separator is, etc.), and writes the content into an R object called a data frame.

As a first example of input file, let us take the file fev_dataset.txt provided along with this tutorial. This file contains 655 lines (one header line and 654 records). We reproduce below its first 6 lines (for those who want to know, the shell command to get this is head -n 6 fev_dataset.txt):

Age FEV Ht Gender Smoke
9 1.7080 57.0 0 0
8 1.7240 67.5 0 0
7 1.7200 54.5 0 0
9 1.5580 53.0 1 0
9 1.8950 57.0 1 0

Each line comprises 5 fields:

  1. age in years,
  2. FEV (Forced Expiratory Volume) expressed in liters,
  3. height in inches,
  4. gender (boys are coded 1, girls 0),
  5. whether the kid is exposed to smoking in their family environment (also a binary 0/1 variable).

This dataset comes from a study originally performed by Rosner and colleagues to find out whether constant exposure to an environment where at least one of the parents smokes has an impact on the respiratory capacity of young boys and girls. Of course, the age and height of a child are expected to play a significant role in determining the pulmonary capacity of that child, hence the presence of these variables among the data collected.

Let us first try and use the simplest form of the function read.table:

> read.table("fev_dataset.txt") -> dat1

Okay, we have read the contents of the file and stored them into a new data frame we called dat1. As R keeps silent after this, let us check that it worked as expected. As two exploratory content-checking commands, you may use head (to display the first items of a data structure) or str (to get an informative output concerning the internal structure of the object you query):

> is.data.frame(dat1)
[1] TRUE
> head(dat1)
   V1     V2   V3     V4    V5
1 Age    FEV   Ht Gender Smoke
2   9 1.7080 57.0      0     0
3   8 1.7240 67.5      0     0
4   7 1.7200 54.5      0     0
5   9 1.5580 53.0      1     0
6   9 1.8950 57.0      1     0
> str(dat1)
'data.frame':   655 obs. of  5 variables:
 $ V1: chr  "Age" "9" "8" "7" ...
 $ V2: chr  "FEV" "1.7080" "1.7240" "1.7200" ...
 $ V3: chr  "Ht" "57.0" "67.5" "54.5" ...
 $ V4: chr  "Gender" "0" "0" "0" ...
 $ V5: chr  "Smoke" "0" "0" "0" ...

Now, that’s bad: R has seen a file of 655 records, not noticing that the first line is actually not a data record but a header containing the names of the different fields (or columns). Instead, R used an automatic naming convention, dubbing the columns V1, V2, etc. Notice that it also automatically adds row numbers.

The output of the str function is even more informative: we see that, as R had to fit into the same column some textual data (coming from the header) as well as numerical data (the measurement datapoints), it interpreted every column as character strings. To correct this, we have to tell R that the first line of the file should not be considered as a data record but as a header giving the names of the columns. This is done through the header option, which we set to TRUE:

> read.table("fev_dataset.txt", header=TRUE) -> dat1
> head(dat1)
  Age   FEV   Ht Gender Smoke
1   9 1.708 57.0      0     0
2   8 1.724 67.5      0     0
3   7 1.720 54.5      0     0
4   9 1.558 53.0      1     0
5   9 1.895 57.0      1     0
6   8 2.336 61.0      0     0
> str(dat1)
'data.frame':   654 obs. of  5 variables:
 $ Age   : int  9 8 7 9 9 8 6 6 8 9 ...
 $ FEV   : num  1.71 1.72 1.72 1.56 1.9 ...
 $ Ht    : num  57 67.5 54.5 53 57 61 58 56 58.5 60 ...
 $ Gender: int  0 0 0 1 1 0 0 0 0 0 ...
 $ Smoke : int  0 0 0 0 0 0 0 0 0 0 ...

All is fine now, as R correctly detected that the records contain numerical values. If we really want to insist that the Gender and Smoke variables be treated as categorical, we can force the use of specific classes (or “types”) for the different columns, through the colClasses option:

> read.table("fev_dataset.txt", header=TRUE, colClasses=c("integer","numeric","numeric","factor","factor")) -> dat1
> str(dat1)
'data.frame':   654 obs. of  5 variables:
 $ Age   : int  9 8 7 9 9 8 6 6 8 9 ...
 $ FEV   : num  1.71 1.72 1.72 1.56 1.9 ...
 $ Ht    : num  57 67.5 54.5 53 57 61 58 56 58.5 60 ...
 $ Gender: Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 1 ...
 $ Smoke : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Field separator

It is important to consider which characters, within your original file, mark the boundary between one field and the next within a record. These can be for instance tabulation characters (usually represented \t), any number of consecutive white spaces, commas (,), semicolons (;) or colons (:). By default, read.table’s field separator is any “white space”, which means one or more spaces, tabs, newlines or carriage returns. In case this is not suitable for your input file, you have to specify the correct field separator by means of the sep option: for a regular CSV file (Comma-Separated Values), you would specify sep="," as an option to read.table.
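
As an illustration (the filename is a made-up example), the following two calls are equivalent, read.csv being one of the pre-configured refinements of read.table mentioned above:

> dat2 <- read.table("mydata.csv", header=TRUE, sep=",")
> dat2 <- read.csv("mydata.csv")   # header=TRUE and sep="," are its defaults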

Decimal separator

Also important to consider is the decimal separator in use for numerical data in your original datafile. Most people around the world use the dot (.) as the decimal point, but you may know that the convention in French-speaking countries is to use the comma (,) as the decimal point. Such an encoding is correctly handled by R during importation when the user specifies the decimal point through the dec option, e.g. dec=",".
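
Files exported from spreadsheets configured with a French locale often combine the comma as decimal point with the semicolon as field separator; a hypothetical example (the filename is made up):

> dat3 <- read.table("mesures_fr.csv", header=TRUE, sep=";", dec=",")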

Unknown or missing values

Sometimes your file contains missing values, for instance when you have two consecutive commas (,,) in a CSV file using the comma as a field separator. R understands this and uses a special token called NA to encode missing values. Sometimes the people who prepared the datafile also use another predefined token to indicate missing values, for instance “Na”, “na” or “n/a”. In order for R to translate these strings into its proper NA values, you have to indicate the possible NA-coding strings by means of the na.strings option. For instance:

> read.table("myfile.txt", header=TRUE, sep=",", na.strings=c("na", "n/a"))

In this case any value completely missing (two consecutive field separators), as well as any string “na” or “n/a” as a field value, will be understood by R as a missing value and translated into the unique appropriate R token NA.
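
Once such a file is imported, you can quickly check how many missing values were found; a small sketch, reusing the hypothetical file above and assuming the result is stored in a variable called dat_na (a made-up name):

> dat_na <- read.table("myfile.txt", header=TRUE, sep=",", na.strings=c("na", "n/a"))
> sum(is.na(dat_na))       # total number of missing values in the whole data frame
> summary(dat_na)          # NA counts also appear column by column in the summary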

Data output: writing into a file

When you want to write some tabulated data (e.g. a vector, matrix or data frame) from R into a file in your filesystem, you may use the write.table function. For instance, to write the content of an R data frame called mydata from your current working session into a file called myfile in your current working directory, you will use:

> write.table(mydata, file="myfile")

In this case, the file will be written using the default field separator for write.table, which is a single space character. You may explore the different formatting options by consulting the help page of write.table. Most of them are the counterparts of what you can find for the reciprocal read.table function.
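
For instance, to write a comma-separated file without the automatic row names (sep and row.names are documented options of write.table; the filename is again a made-up example):

> write.table(mydata, file="myfile.csv", sep=",", row.names=FALSE)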

Caution when importing Excel format into R!

While many people use advanced spreadsheet software (e.g. LibreOffice Calc, Apache OpenOffice Calc or Microsoft Excel) to prepare their data, unfortunately it is not possible to read data into R directly from the .ods, .xls or .xlsx formats with read.table. This is mainly because these formats are meant to accommodate complex data in several sheets, which possibly cannot fit into one single R object. When you want to import some data from such a format, you first have to convert it into some CSV format (notice that the field separator can be any other character of your choice, not necessarily a comma), using your favourite spreadsheet software. You will then be able to read the simpler CSV format into R with the read.table function.

For instance, after having transformed the file tutorial_data.xlsx into a comma-separated file that you call tutorial_data_commasep.csv, you would load it into a mydat data frame in your R session using the following command (the + sign at the beginning of the continuation lines is the prompt R displays when a command spans several lines; you don’t type it, just as you don’t type the initial “greater than” sign):

> mydat=read.table("tutorial_data_commasep.csv",
+                    header=T,
+                    sep=",",
+                    colClasses=c("integer","factor","integer","numeric","factor",
+                                 "numeric","factor","integer","factor","factor"))