Introduction to Data Types and Objects in R

Purpose

This is a set of notes I put together when I was introduced to R in a financial data analysis course last year. It highlights some of the things that I found to be distinctive about R, especially compared to more common general-purpose languages like C++, Java, or Python. I’m sharing these in the hope that they will help other folks with programming experience get their heads around R quickly. They are not intended to be an introduction to programming, nor a comprehensive reference to R. They may not even be correct. 🙂 Caveat lector; de gustibus non disputandum est; quidquid latine dictum sit, altum videtur.

The Official R Documentation

The main site for R is http://www.r-project.org/. The single most useful thing you’ll find here is the R reference index. This is the giant list of all the functions and methods in the standard packages. You’ll be using this a lot. Download it now. 🙂 You may also find the other documentation on their manuals page handy.

Vectors

R has a dozen or so basic data types, but the ones that you’ll be dealing with most often are vectors. Roughly, there are two different kinds: atomic vectors and lists (generic vectors).

Atomic vectors can be thought of as arrays of a single kind of value (integers, floating-point values, complex numbers, characters, etc.). A single number (e.g., 3) is treated as a vector with only one element; i.e., there is no more “fundamental” data type that atomic vectors are made out of.

Creating a single-element vector is trivial: 3, 3.0, "foo", etc. are all vectors. You could assign the vector containing only the number 3 to the variable foo as follows: foo<-3. (R stores a plain 3 as a floating-point value; write 3L if you specifically want an integer.)

The easiest way to directly create a multi-element vector is using the c() (concatenate) function. For example, to create a numeric vector containing the numbers 1, 2, and 3 and assign it to a variable foo: foo<-c(1,2,3). You can also assign “tags” or “names” to the elements of the vector: foo<-c(first=1, second=2, third=3).
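
Here is a minimal sketch of the vector-creation steps above. The variable names are just for illustration, and the typeof() calls are my addition to show what R actually stores.

```r
# A bare number is already a vector of length one.
single <- 3
length(single)        # 1
is.vector(single)     # TRUE

# c() concatenates its arguments into one atomic vector.
nums <- c(1, 2, 3)
typeof(nums)          # "double": plain numeric literals are doubles
ints <- c(1L, 2L, 3L)
typeof(ints)          # "integer": the L suffix requests integers

# Elements can carry names.
named <- c(first = 1, second = 2, third = 3)
names(named)          # "first" "second" "third"
```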

Lists may contain atomic vectors and other lists. You can create lists directly using the list() function, à la: foo<-list(1, 2.0, "three"). Again, you can name elements of the list; e.g.: foo<-list(first=1, second=2.0, third="three").
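
A short sketch of how lists differ from atomic vectors in practice. The names mixed, coerced, and nested are made up for the example; the comparison with c() is my addition, to show why you would reach for list() in the first place.

```r
# A list can mix types, and can hold other lists.
mixed <- list(1, 2.0, "three")
length(mixed)                    # 3
sapply(mixed, typeof)            # "double" "double" "character"

# c() on mixed types would instead coerce everything to one type.
coerced <- c(1, 2.0, "three")
typeof(coerced)                  # "character"

nested <- list(first = 1, inner = list(a = "x", b = TRUE))
str(nested)                      # compact display of the structure
```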

Indexing

Selecting elements from a vector is known as “indexing”. There are two ways of indexing both atomic vectors and lists, plus a way that only works on lists.

foo[index]

This style of indexing most commonly takes an integer vector as index, and returns the elements of foo at those positions (counting from 1, not 0). This is most easily explained by example. Assume that foo is a numeric vector containing 4, 5, and 6, in that order. Then:

foo[1] is 4
foo[c(1,2)] is a vector containing 4 and 5 (in that order)
foo[c(3,3,2,1)] is a vector containing 6, 6, 5, and 4 (in that order)
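
The examples above can be tried directly, as in this sketch. The last few lines go slightly beyond the text: negative and logical indices are standard R behavior, but they are not covered in these notes.

```r
foo <- c(4, 5, 6)

foo[1]              # 4
foo[c(1, 2)]        # 4 5
foo[c(3, 3, 2, 1)]  # 6 6 5 4

# Indices start at 1, not 0; index 0 selects nothing.
foo[0]              # numeric(0)

# Negative indices drop elements; logical vectors keep the TRUE positions.
foo[-1]             # 5 6
foo[foo > 4]        # 5 6
```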

foo[[index]]

In this case, index must be a single-element integer vector (call it n). This will return the nth element of foo. If foo is an atomic vector, the differences between this and the “[]” style of indexing are small, and the chances that you would need to use this are minimal. However, you will need it when working with lists. Again, an example: say we create a list named foo as follows: foo<-list(1,2,"three"). foo[1] will not return 1, as you would probably expect. It will return a list with a single element, which contains the vector 1. (If this seems pathological, consider foo[c(1,3)]. This can’t return anything but a list: 1 is a number, and "three" is a string.) If you want to actually “extract” an element from a list, you need to use the “[[]]” style of indexing. In our example, foo[[1]] will return 1.
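
A sketch of the difference, using the same list as in the text. The names one_elem_list, extracted, and sub are made up for the example.

```r
foo <- list(1, 2, "three")

one_elem_list <- foo[1]     # still a list, of length one
class(one_elem_list)        # "list"

extracted <- foo[[1]]       # the element itself
class(extracted)            # "numeric"

# Arithmetic works on the extracted value, but not on the one-element list.
extracted + 1               # 2
# foo[1] + 1                # error: non-numeric argument to binary operator

sub <- foo[c(1, 3)]         # a list holding 1 and "three"
length(sub)                 # 2
```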

foo$name

If you have named elements of a list, instead of using the “[[]]” style of indexing mentioned previously, you can access them by appending $name instead. If you create a list where the members have some kind of meaning, it is a good idea to give the elements names. For instance, if you had a list representing a rectangle, with the first element indicating the length, and the second the width, it might be a good idea to create it as follows: rect<-list(length=3, width=2). This would allow you to get the length of the rectangle using rect$length instead of rect[[1]], and the width using rect$width instead of rect[[2]].
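
The rectangle example, sketched out. The area calculation and the lookup of a missing name are my additions for illustration.

```r
rect <- list(length = 3, width = 2)

rect$length         # 3
rect$width          # 2
rect[["width"]]     # 2, [[ ]] also accepts a name
rect[[2]]           # 2, same element by position

area <- rect$length * rect$width
area                # 6

# Asking for a name that does not exist returns NULL rather than an error.
rect$height         # NULL
```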

Why lists are important: reverse engineering

A great many of the objects you’ll run across in R are really just lists. For example, when you do a linear regression using lm(), the object you get back that contains the results is a list with elements named “coefficients”, which is a vector containing the fitted coefficients; “residuals”, a vector of the residuals, etc. Knowing how to manipulate lists means you can poke around inside most objects in R. Generally you shouldn’t rely on names of elements in objects to access the data in them (you should use an appropriate method), but if you have an object and you don’t know what’s in it or what you can do with it, being able to look at its components may be very helpful.

In particular, the names() method is very useful for this. Let’s again use an example of a linear regression. I know that lm() is the function I want to use to do a linear regression, but I may not know what to do with the result object (in this example named foo) I get back from it:

y<-c(1,2,4)
x<-c(1,2,3)
foo<-lm(y~x)

If I just enter foo at the prompt, R will print out the most pertinent information contained in the object (the regression formula and the fitted coefficients). summary(foo) will give me more (residuals, etc.), but it turns out there’s more potentially useful information in the object than even that shows. names(foo) prints out the names of all the elements in foo:

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"

In addition to what we’ve already seen (coefficients, residuals, etc.), there are things like the fitted values that result from the regression, and foo$fitted.values will print them out. However, as I noted earlier, it’s a better idea to use a method to get this information, if one exists. A good guess for the appropriate method name is just that of the element. In this case, that works: fitted.values(foo) also prints out the fitted values.
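
Putting the whole exploration together, as a sketch. coef(), fitted(), and residuals() are the standard accessor functions in the stats package.

```r
y <- c(1, 2, 4)
x <- c(1, 2, 3)
foo <- lm(y ~ x)

is.list(foo)            # TRUE: the result really is a list underneath
names(foo)              # the element names listed above

# Reaching into the list directly...
foo$coefficients
foo$fitted.values

# ...and the accessor functions, which are the better habit.
coef(foo)
fitted(foo)             # fitted.values(foo) is an alias
residuals(foo)
```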

Attributes

(Almost) every data type in R can have attributes associated with it as well. To get the list of attributes for an object foo, you can use attributes(foo). You can get and set attributes on an object directly using the attr() function, à la:

attr(foo, "bar") # Returns the "bar" attribute of object foo
attr(foo, "bar")<-"baz" # Sets the "bar" attribute of foo to "baz"

However, much like accessing elements of an object, if a method exists to access an attribute, you should use it instead, and there’s a good chance that the appropriate method name is the same as the attribute name. In practice, there are only a few common attributes that you will care about:

names

This one we’ve already seen. This is a vector containing the names of the elements in an object, and should be accessed using names().

dim

This contains the dimensions of arrays and matrices, and is accessed using dim().

dimnames

This attribute allows you to name rows and columns in arrays and matrices. For example, creating a matrix foo as follows:

foo<-matrix(c(149.53, 147.49, 29.50, 29.16), nrow=2)
dimnames(foo)<-list(dates = c("1 Apr 2008", "2 Apr 2008"), ticker=c("AAPL", "MSFT"))

will give the following result if we display foo:

            ticker
dates          AAPL  MSFT
  1 Apr 2008 149.53 29.50
  2 Apr 2008 147.49 29.16

This attribute should be accessed using the dimnames() method.

class

This attribute tells you what class (or classes) an object belongs to; roughly, “what the object is”. For example, a matrix will have class “matrix” (reported alongside “array” in R 4.0 and later); a linear regression result will have the class “lm”. As you’ve probably guessed, this attribute is accessed using the class() method.
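
A quick sketch that touches several of these attributes at once. The "source" attribute is a made-up name, chosen only to show that attr() will store arbitrary metadata.

```r
foo <- matrix(1:6, nrow = 2)

attributes(foo)             # only $dim, which is 2 3
dim(foo)                    # 2 3

# attr() can attach arbitrary metadata; "source" is a made-up attribute name.
attr(foo, "source") <- "example data"
attr(foo, "source")         # "example data"

# Names on a vector are stored as an attribute too.
v <- c(first = 1, second = 2)
attributes(v)               # $names
names(v) <- c("a", "b")     # accessor functions can also set
names(v)                    # "a" "b"

class(lm(c(1, 2, 4) ~ c(1, 2, 3)))   # "lm"
```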

Objects in R

Like other languages you’re probably more familiar with (C++, Java, C#, Python, etc.), R has object-oriented features. Objects have data elements and are members of classes, you use methods to operate on them, etc. However, there’s one major difference between R and these other languages in terms of how objects work: methods are not part of the class definitions. Methods in R look just like normal functions; the difference is that they look at the class of the objects that they’re operating on, and may behave differently based on them.

For example, there is a method called summary(). As the name implies, it gives you a summary of the information in the object you pass it. What constitutes a meaningful summary of the information contained in, for instance, a regression result isn’t going to be the same as a summary of information contained in a numeric vector. What summary() does is check the class attribute of the object that’s handed to it and look to see if there’s a method for summarizing objects of that class. If there is, it calls it; otherwise, it has some default behaviors for handling simple types, like atomic vectors and lists. In the case of a regression result, it will call summary.lm() to produce the summary.
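
To see this dispatch in action, here is a minimal sketch that defines a summary method for a made-up class. The class name "rectangle" and the function summary.rectangle() are hypothetical, invented for this example.

```r
# Build a plain list and tag it with a made-up class name.
r <- list(length = 3, width = 2)
class(r) <- "rectangle"

# A method is an ordinary function named generic.class.
summary.rectangle <- function(object, ...) {
  cat("Rectangle", object$length, "x", object$width,
      "with area", object$length * object$width, "\n")
  invisible(object)
}

summary(r)              # dispatches to summary.rectangle()
summary(c(1, 2, 3))     # falls back to the default for numeric vectors

# methods() lists the methods a generic currently has available.
methods(summary)
```

Nothing about the rectangle "class" is declared anywhere; setting the class attribute and naming the function appropriately is all it takes.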

Also, it’s important not to confuse the type of an object with its class. Many objects that you’ll deal with have the type “list”: a regression result is a list, and so is a data frame. Others, such as matrices and time series, are atomic vectors with extra attributes attached. Either way, the class is what tells you, and the methods you call on these objects, how to interpret the underlying data. E.g., a list that has the class “lm” contains regression results, will have elements named “coefficients”, etc., and can have the residuals() method called on it.
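
You can check the type and class of an object with typeof() and class(), as in the sketch below. Note that a matrix turns out to be an atomic vector with a dim attribute rather than a list. The variable names are just for illustration.

```r
fit <- lm(c(1, 2, 4) ~ c(1, 2, 3))
typeof(fit)     # "list"
class(fit)      # "lm"

m <- matrix(c(1, 2, 3, 4), nrow = 2)
typeof(m)       # "double": a matrix is an atomic vector with a dim attribute
class(m)        # "matrix" "array" in R 4.0 and later; just "matrix" before

df <- data.frame(a = 1:2, b = c("x", "y"))
typeof(df)      # "list"
class(df)       # "data.frame"

# unclass() strips the class so you can see the underlying structure.
str(unclass(df))
```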

Two Common Data Types

Matrices

Matrices can be created directly using matrix(). The first argument you pass to matrix() should be the data you want in the matrix, in column-major order. You’ll almost certainly want to specify the number of rows or the number of columns in the matrix as well (otherwise you’ll end up with one column). For example, foo<-matrix(c(1,2,3,4), nrow=2, ncol=2) will give the following result when foo is printed out:

     [,1] [,2]
[1,]    1    3
[2,]    2    4

Normal arithmetic operations (+, -, *, /) on a matrix will be done element-wise. If you want to do matrix multiplication, you’ll need to use the %*% operator. There are a variety of matrix methods available to you (often with the same names as those in Matlab): det(), eigen(), chol(), qr(), etc.
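
A small sketch contrasting the element-wise and matrix products. t() and solve() are standard base R functions not mentioned above, included here because they tend to come up at the same time.

```r
a <- matrix(c(1, 2, 3, 4), nrow = 2)   # columns are (1, 2) and (3, 4)
b <- diag(2)                            # 2 x 2 identity matrix

a * a       # element-wise: rows (1, 9) and (4, 16)
a %*% a     # matrix product: rows (7, 15) and (10, 22)
a %*% b     # multiplying by the identity returns a

t(a)        # transpose
det(a)      # -2
solve(a)    # inverse
```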

You can access elements, rows, and columns of the matrix using []:

foo[r,c] will give you the element at row r and column c
foo[r,] will return the rth row
foo[,c] will return the cth column
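
These three forms, tried on the 2 x 2 matrix from earlier. The drop = FALSE argument and indexing by name are standard R features that go a little beyond the text; the row and column names are made up.

```r
foo <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

foo[1, 2]                   # 3
foo[2, ]                    # 2 4 (a plain vector, not a 1 x 2 matrix)
foo[, 1]                    # 1 2

# drop = FALSE keeps the result as a matrix.
dim(foo[2, , drop = FALSE]) # 1 2

# With dimnames set, names work as indices too.
dimnames(foo) <- list(c("r1", "r2"), c("c1", "c2"))
foo["r2", "c1"]             # 2
```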

Data frames

Data frames are lists of equal-length columns, with each column representing a different kind of information taken at the same times, places, etc. A simple example would be prices for a number of different stocks, sampled on the same days. When you read in data from an external source, it will typically be in a data frame. In particular, read.csv() will let you read in data from a file containing comma-separated values. If all the data in the data frame are the same type, you can convert it into a matrix using as.matrix().
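
A minimal sketch of a data frame, reusing the stock prices from the dimnames example. The column layout is my own choice, and read.csv() is given inline text here only so the example doesn't depend on a file.

```r
# Prices taken from the dimnames example above.
prices <- data.frame(
  date = c("1 Apr 2008", "2 Apr 2008"),
  AAPL = c(149.53, 147.49),
  MSFT = c(29.50, 29.16)
)

typeof(prices)          # "list": each column is one list element
names(prices)           # "date" "AAPL" "MSFT"
prices$AAPL             # one column, as a numeric vector
prices[2, "MSFT"]       # 29.16, matrix-style indexing also works

# Convert only the numeric columns, so the matrix stays numeric.
m <- as.matrix(prices[, c("AAPL", "MSFT")])
dim(m)                  # 2 2

# read.csv() returns a data frame; text= lets us skip creating a file.
from_csv <- read.csv(text = "date,AAPL\n1 Apr 2008,149.53")
class(from_csv)         # "data.frame"
```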

neal groothuis
