How to access a particular part of your dataset, extract or merge data?
It is easy in R to access only a part of an object. We might for example be interested only in the 5th value of a vector, or would like to perform operations on a specific column of a data frame. We might also want to extract a couple of columns or rows from a matrix, or simply add a column to an existing data frame. Let’s now see what R offers to do all of this.
R objects are easily subsettable. You can access any individual value or set of values in an R object, as long as you know how to get there. The commands might vary slightly depending on the type of object you are working with, but the general idea stays the same: you can access part of an object by its name (if it has one, like a column name) or its index.
The easiest way to select a subset in most R objects is to indicate the index (or indices) of the value(s) we want to access between square brackets “[ ]”. For example, if we want the 5th value in a vector:
color[5]
[1] "salmon2"
Or the value contained at the intersection of the 3rd row and the 2nd column of a data frame or matrix:
shirts[3,2]
[1] 4
Heck, we can even select a full column by leaving the row dimension empty:
shirts[,1]
[1] salmon orange pink powderblue salmon2 peach puff
Levels: orange peach puff pink powderblue salmon salmon2
We can not only extract those elements, but also modify them if we want:
color[5]
[1] "salmon2"
color[5]<-"yellow"
color[5]
[1] "yellow"
You know the name of the element you’re interested in within a data frame or a list? Awesome, let’s call it directly with a dollar sign: (I could have made a crappy money joke here, but I have better taste than that!)
shirts$hue
[1] salmon orange pink powderblue salmon2 peach puff
Levels: orange peach puff pink powderblue salmon salmon2
data$students
, , 1
[,1] [,2]
[1,] 0 0
[2,] 0 1
[3,] 1 0
[4,] 1 1
, , 2
[,1] [,2]
[1,] 0 0
[2,] 0 0
[3,] 1 1
[4,] 1 1
, , 3
[,1] [,2]
[1,] 0 0
[2,] 1 0
[3,] 1 1
[4,] 1 1
In the particular case of lists, to access an element by its index, you will need to use double brackets instead: “[[ ]]”
names(data)
[1] "shirts" "students" "somerawdata"
data[[1]]
color awesomeness
1 salmon 9
2 orange 7
3 pink 4
4 powderblue 3
5 salmon2 NA
6 peach puff 6
It is possible to use single brackets, but in this case, the returned element is a sub-list, and not the elementary object itself.
data[1]
$shirts
color awesomeness
1 salmon 9
2 orange 7
3 pink 4
4 powderblue 3
5 salmon2 NA
6 peach puff 6
class(data[[1]])
[1] "data.frame"
class(data[1])
[1] "list"
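Lists can also be accessed by name rather than by index. The lines below are a minimal sketch using a small made-up list ('mylist' and its contents are stand-ins, not the 'data' object shown above), just to illustrate that the double bracket and the dollar sign return the same thing:

```r
# Hypothetical list standing in for 'data' (contents simplified for illustration)
mylist <- list(shirts = data.frame(color = c("salmon", "orange"),
                                   awesomeness = c(9, 7)),
               somerawdata = c(1.2, 3.4, 5.6))

# These three calls return the same data frame
by_index  <- mylist[[1]]
by_name   <- mylist[["shirts"]]
by_dollar <- mylist$shirts

identical(by_index, by_name)    # TRUE
identical(by_name, by_dollar)   # TRUE

# Single brackets with a name still return a sub-list
class(mylist["shirts"])         # "list"
```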
Of course, it is possible to select several elements in an object at the same time, simply by concatenating the indices (or names) we want.
color
[1] "salmon" "orange" "pink" "powderblue" "yellow"
[6] "peach puff"
color[c(2,3,5)]
[1] "orange" "pink" "yellow"
Or to remove the ones we don’t want by “negating” them:
color[-c(2,3,5)]
[1] "salmon" "powderblue" "peach puff"
You don’t want the 5th row in the data frame “shirts”? Just say so, and R will take care of it:
shirts[-5,]
hue perfectness
1 salmon 9
2 orange 7
3 pink 4
4 powderblue 3
6 peach puff 6
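Names work just like indices inside the brackets. Here is a short sketch, assuming 'shirts' is rebuilt from the values printed above (the call to data.frame() is my reconstruction, not necessarily how the object was originally created):

```r
# Rebuilding 'shirts' from the values printed above (assumed construction)
shirts <- data.frame(
  hue = c("salmon", "orange", "pink", "powderblue", "salmon2", "peach puff"),
  perfectness = c(9, 7, 4, 3, NA, 6)
)

# One column selected by name: the result is a plain vector
one_column <- shirts[, "perfectness"]

# Several columns selected by name: the result is still a data frame
two_columns <- shirts[, c("hue", "perfectness")]

# Row indices and column names can be combined
first_two_hues <- shirts[c(1, 2), "hue"]
```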
R also allows you to use logical tests, and subsequently use the results of those tests to select the data matching those requirements. We will look at logical tests in more detail in a later chapter (chapter 5), but for the moment, let’s see how we can select values higher than a threshold. Want to know where all the positive values are in vector “vec”? No problem:
vec>0
[1] TRUE TRUE TRUE FALSE FALSE
We see here that the first 3 values pass our test, while the last 2 don’t. We can now use that to select them:
vec[vec>0]
[1] 2 3 10
or:
test=vec>0
vec[test]
[1] 2 3 10
Want to know which of Ryan Reynolds’ shirts have a perfectness score higher than 6? Let’s get them:
test_shirts <- shirts$perfectness>6
We just figured out which elements in the column “perfectness” are higher than 6.
Let’s select the rows in our data frame that have values in the “perfectness” column that match this test:
best_shirts=shirts[test_shirts,]
best_shirts
hue perfectness
1 salmon 9
2 orange 7
NA <NA> NA
Oh… What is this last line? Oh, right, we had a missing value in our original dataset, leading to a corresponding missing value in our test, leading to a missing value in our match, leading to NAs in our final output… Can you remove this last row? Here’s one way to do it:
best_shirts[-nrow(best_shirts),]
hue perfectness
1 salmon 9
2 orange 7
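Dropping the last row works here because the NA row happens to come last. As an alternative sketch (again assuming 'shirts' is rebuilt from the values printed earlier), you can avoid creating the NA row in the first place, either with which() or by testing for missing values explicitly:

```r
# Rebuilding 'shirts' from the values printed above (assumed construction)
shirts <- data.frame(
  hue = c("salmon", "orange", "pink", "powderblue", "salmon2", "peach puff"),
  perfectness = c(9, 7, 4, 3, NA, 6)
)

test_shirts <- shirts$perfectness > 6   # contains one NA (5th shirt)

# which() only returns the positions where the test is TRUE, so NAs are dropped
best_shirts <- shirts[which(test_shirts), ]

# Same idea, spelled out: the value must be non-missing AND higher than 6
best_shirts2 <- shirts[!is.na(shirts$perfectness) & shirts$perfectness > 6, ]
```

Both versions keep only the salmon and orange shirts, wherever the missing value sits in the data frame.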
Exercise 2.2 – Create a vector ‘SupVecA’ containing only the values of ‘vecA’ higher than the median of ‘vecA’ – Multiply the values in ‘vecB’ by the value in the 5th row of the 3rd column of ‘ex.data’
The perfect match
We’re interrupting your regular program for a newsflash: stop looking around on the internet! R will help you find the perfect (or less perfect if you’re so inclined) match for you! Do you want to find a particular value (or set of values) in a vector or data frame? Do you simply want to know if a vector shares some of your values? Look no further, R is here!
The function “match()” will find the position of the first occurrence of each value of a vector of interest in a target vector.
vec.of.interest=c(2,1,3,7)
target.vec=c(4,4,5,2,1,4,2,3)
match(vec.of.interest , target.vec )
[1] 4 5 8 NA
The results show us that the first element of our vector (2) appears for the 1st time at the 4th position in the target vector. They also let us know that the 4th element of our vector (7) doesn’t appear in the target vector, by returning NA.
A simpler version of this matching process only indicates whether the values in the vector of interest appear at any point in the target vector. Simply ask if the values of the vector of interest are “%in%” the target vector (the percentage signs are here to let R know that “in” is an operator).
vec.of.interest %in% target.vec
[1] TRUE TRUE TRUE FALSE
Here, our first 3 elements are present in the target vector (TRUE), and the last one is not (FALSE).
This, in turn, becomes practical if you want to subset a dataset based on matches in another dataset. Let’s say we have environmental records (e.g. precipitation, chemical concentrations) for some locations, and we want to extract only the sites for which we have species abundance records. We have recorded species abundance in 5 locations:
speciesdata=data.frame(location=c(2,3,6,7,9),
abundance=c(40,0,10,23,9) )
We have environmental information for 10 locations:
environment=data.frame(location=c(11,3,7,34,6,8,9,2,4,15),
prec=c(500,100,230,456,438,765,0,94,333,790),
conc=c(2.5,3.75,12.25,5,6,9.46,3.4,0,6,2))
speciesdata
location abundance
1 2 40
2 3 0
3 6 10
4 7 23
5 9 9
environment
location prec conc
1 11 500 2.50
2 3 100 3.75
3 7 230 12.25
4 34 456 5.00
5 6 438 6.00
6 8 765 9.46
7 9 0 3.40
8 2 94 0.00
9 4 333 6.00
10 15 790 2.00
Now, let’s assume we just want the environmental data for the locations for which we have species records. We simply have to match the locations!
matching.locations= match(speciesdata$location,environment$location)
matching.locations
[1] 8 2 5 3 7
Location #2 is the 8th element of the vector ‘environment$location’, and therefore the 8th row of the object ‘environment’ will contain all the information we want for this site! Why not extract that information then?!
myenvironmentdata=environment[matching.locations,]
We just selected the rows of ‘environment’ that were matching our locations, and reordered them to follow the pattern in our original object!
myenvironmentdata
location prec conc
8 2 94 0.00
2 3 100 3.75
5 6 438 6.00
3 7 230 12.25
7 9 0 3.40
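For comparison, here is a hedged sketch of the same extraction with “%in%” instead of match(), using the two data frames defined above. The difference is in the ordering: “%in%” keeps the rows in the order of ‘environment’, while match() reorders them to follow ‘speciesdata’.

```r
speciesdata <- data.frame(location = c(2, 3, 6, 7, 9),
                          abundance = c(40, 0, 10, 23, 9))
environment <- data.frame(location = c(11, 3, 7, 34, 6, 8, 9, 2, 4, 15),
                          prec = c(500, 100, 230, 456, 438, 765, 0, 94, 333, 790),
                          conc = c(2.5, 3.75, 12.25, 5, 6, 9.46, 3.4, 0, 6, 2))

# TRUE for each row of 'environment' whose location has a species record
keep <- environment$location %in% speciesdata$location

# Rows are kept in the order they appear in 'environment'
env_subset <- environment[keep, ]
env_subset$location   # 3 7 6 9 2
```

Another practical difference: if a location in ‘speciesdata’ were absent from ‘environment’, match() would return an NA for it (and an NA row in the subset), whereas “%in%” would simply skip it.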
We’ll see in one of the following sections how to merge those 2 datasets. Before that though, here are a couple more functions allowing us to find specific elements in a vector or data frame.
Want to know where the maximum or the minimum values are in a vector? Ask R with “which.max()” or “which.min()” respectively.
Example: which site had the lowest precipitation in our dataset?
which.min(environment$prec)
[1] 7
Easy to use this to extract all the information for that location now that we know which row it is in the corresponding data frame!
environment[which.min(environment$prec),]
location prec conc
7 9 0 3.4
Same thing, but for the maximum precipitation?
which.max(environment$prec)
[1] 10
environment[which.max(environment$prec),]
location prec conc
10 15 790 2
As a matter of fact, you can ask for any condition using the function “which()”, specifying the condition you want between the parentheses. Remember how we recorded the length of several bears’ teeth over the years? Each column corresponded to a different year, and each row to a different bear.
bear
col1 col2 col3 col4
[1,] 2.00 2.17 2.57 3.40
[2,] 1.50 1.98 2.14 2.93
[3,] 2.75 3.02 4.44 4.46
Want to know which bear during which year had teeth longer than 3? Just ask for it! When dealing with an array or data frame, don’t forget to ask R to return the array indices of the match by setting the argument ‘arr.ind’ to TRUE : “arr.ind=TRUE”
which(bear>3, arr.ind=TRUE)
row col
[1,] 3 2
[2,] 3 3
[3,] 1 4
[4,] 3 4
We have 4 cases where a bear had teeth longer than 3:
Bear # 3 during years 2, 3 and 4
Bear # 1 during year 4
And we can even access the actual teeth sizes now that we have the corresponding indices:
bear[which(bear>3, arr.ind=TRUE)]
[1] 3.02 4.44 3.40 4.46
Merging datasets and/or adding rows or columns is no more complicated than removing them.
You can bind 2 matrices/data frames together. If you want to put one over the other, they must have the same number of columns. If you want them “side by side”, they must have the same number of rows.
matrix_of_1=matrix(1,nrow=3,ncol=4)
matrix_of_2=matrix(2,nrow=3,ncol=5)
matrix_of_3=matrix(3,nrow=7,ncol=4)
Binding by rows (i.e. putting one matrix over the other):
rbind(matrix_of_1,matrix_of_3)
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 1 1 1 1
[3,] 1 1 1 1
[4,] 3 3 3 3
[5,] 3 3 3 3
[6,] 3 3 3 3
[7,] 3 3 3 3
[8,] 3 3 3 3
[9,] 3 3 3 3
[10,] 3 3 3 3
Binding by column (i.e. putting matrices next to each other):
cbind(matrix_of_1,matrix_of_2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 1 1 1 2 2 2 2 2
[2,] 1 1 1 1 2 2 2 2 2
[3,] 1 1 1 1 2 2 2 2 2
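What happens when the dimensions don’t line up? A quick sketch with the matrices defined above: 'matrix_of_1' has 4 columns and 'matrix_of_2' has 5, so stacking them fails. The use of try() here is just my way of catching the error for illustration.

```r
matrix_of_1 <- matrix(1, nrow = 3, ncol = 4)
matrix_of_2 <- matrix(2, nrow = 3, ncol = 5)

# Different numbers of columns: rbind() stops with an error, caught here by try()
result <- try(rbind(matrix_of_1, matrix_of_2), silent = TRUE)
inherits(result, "try-error")   # TRUE

# Checking the dimensions first avoids the error
if (nrow(matrix_of_1) == nrow(matrix_of_2)) {
  side_by_side <- cbind(matrix_of_1, matrix_of_2)
}
dim(side_by_side)   # 3 9
```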
You can also add a vector as a column or a row to an existing matrix or data frame the same way:
vector_of_4=c(4,4,4,4)
vector_of_5=c(5,5,5)
rbind(matrix_of_1,vector_of_4)
[,1] [,2] [,3] [,4]
1 1 1 1
1 1 1 1
1 1 1 1
vector_of_4 4 4 4 4
cbind(matrix_of_1,vector_of_5)
vector_of_5
[1,] 1 1 1 1 5
[2,] 1 1 1 1 5
[3,] 1 1 1 1 5
Or you can simply add a column to a data frame (and specify a column name at the same time):
shirts
hue perfectness
1 salmon 9
2 orange 7
3 pink 4
4 powderblue 3
5 salmon2 NA
6 peach puff 6
shirts$matching_short_color=c("beige","beige","beige","beige","beige","brown")
shirts
hue perfectness matching_short_color
1 salmon 9 beige
2 orange 7 beige
3 pink 4 beige
4 powderblue 3 beige
5 salmon2 NA beige
6 peach puff 6 brown
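Binding assumes the rows already line up. When two data frames share a key column instead, like ‘speciesdata’ and ‘environment’ with their ‘location’ column, base R’s merge() is one option. This is a minimal sketch with the default settings (only locations present in both data frames are kept); other choices, such as keeping unmatched rows, are possible via arguments like ‘all.x’.

```r
speciesdata <- data.frame(location = c(2, 3, 6, 7, 9),
                          abundance = c(40, 0, 10, 23, 9))
environment <- data.frame(location = c(11, 3, 7, 34, 6, 8, 9, 2, 4, 15),
                          prec = c(500, 100, 230, 456, 438, 765, 0, 94, 333, 790),
                          conc = c(2.5, 3.75, 12.25, 5, 6, 9.46, 3.4, 0, 6, 2))

# Join the two data frames on their shared 'location' column
merged <- merge(speciesdata, environment, by = "location")

names(merged)     # "location" "abundance" "prec" "conc"
merged$location   # 2 3 6 7 9 (sorted by the key by default)
```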
Exercise 2.3– Add a column named ‘forlater’ to ‘ex.data’ that is the result of 3 times the first column plus the second column
Efficiently applying functions to R objects
Be lazy… Go ahead, be lazy! R is perfect for that. Yes, be lazy, but be smart about it! Nobody wants to write 12 lines of code if you can get away with one. Let me provide some context.
You want to compute the standard deviation of all the rows or columns of a matrix/array? You want to figure out the maximum value of each one of the elements of a list? Or you want to easily add or average data depending on a factor without having to cycle through each factor’s level? Go ahead, do all of that, and be lazy/smart about it.
R got your back.
Basically, what I’m trying to say is that if you have to repeat the same operation on an R object, you just have to apply() yourself. What’s that? This looks like a function? Why, yes, it does. Using the set of functions apply(), lapply() and tapply(), you’ll be able to do exactly this. More specifically, apply() lets you ‘apply’ a function repeatedly over the different dimensions of an array/matrix, and lapply() will do this for each element of a list. On the other hand, tapply() will apply a function based on the levels of a specified factor (or set of). For each, you just have to specify which object to work on, which operation/function you want to apply, and if necessary the way to do it (i.e. over which dimension or factor).
Let’s take our previous examples back, and see how it works.
You want to compute the standard deviation of all the columns of a matrix or array?
apply(bear, MARGIN=2,FUN=sd) # applying the function sd() (via the FUN argument) over all the columns (i.e. the 2nd dimension, via the MARGIN argument) of the 'bear' matrix
0.6291529 0.5538050 1.2228246 0.7837304
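Working over rows instead only requires changing MARGIN to 1. Here is a short sketch, assuming the 'bear' matrix is rebuilt from the values printed earlier (the matrix() call is my reconstruction):

```r
# Rebuilding 'bear' from the values printed earlier (assumed construction)
bear <- matrix(c(2.00, 2.17, 2.57, 3.40,
                 1.50, 1.98, 2.14, 2.93,
                 2.75, 3.02, 4.44, 4.46),
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, c("col1", "col2", "col3", "col4")))

# MARGIN=1: one result per row, i.e. per bear
row_sd  <- apply(bear, MARGIN = 1, FUN = sd)
row_max <- apply(bear, MARGIN = 1, FUN = max)   # 3.40 2.93 4.46
```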
You want to figure out the maximum value of each one of the elements of a list?
disney.ratings=list(pixar=c(8,9,6,10,5),
starwars=c(5,6,1,2),
marvel=c(7,9))
lapply(disney.ratings, FUN=max) # applying the function max() (via the FUN argument) to each element of the list 'disney.ratings'
$pixar
[1] 10
$starwars
[1] 6
$marvel
[1] 9
Or you want to easily add or average data depending on a factor without having to cycle through each factor’s level?
laziness.score=c(3,5,1,2,5,2,4,3)
day=factor(c("mon","fri","mon","mon","fri","sun", "sun", "sun"))
tapply(laziness.score, INDEX=day, FUN=mean) # applying on the vector 'laziness.score' the function mean() (via the FUN argument), based on the different levels of the factor 'day' (specified via the INDEX argument)
fri mon sun
5 2 3
Note that in this last case, the INDEX argument has to be a vector of the same length as the vector you’re working on (or a list of vectors, each of the same length as the input one).
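To illustrate the “list of vectors” case, here is a small sketch that crosses 'day' with a second factor. The 'weather' factor is made up purely for this example.

```r
laziness.score <- c(3, 5, 1, 2, 5, 2, 4, 3)
day <- factor(c("mon", "fri", "mon", "mon", "fri", "sun", "sun", "sun"))

# Hypothetical second factor, invented for illustration only
weather <- factor(c("wet", "dry", "dry", "wet", "wet", "dry", "wet", "dry"))

# One mean per combination of day and weather, returned as a matrix
by_day_weather <- tapply(laziness.score, INDEX = list(day, weather), FUN = mean)

by_day_weather["mon", "wet"]   # 2.5 (mean of 3 and 2)
```

Rows of the result follow the levels of the first factor and columns those of the second; a combination with no data would show up as NA.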
Mastering these functions can seem intimidating at first, but it proves useful in the long term. So what? You’re now a brave R coder. A simple string of characters can’t scare you anymore! Go, apply yourself, and conquer the world!