One of the most common operations in data wrangling is joining two sets of data by a common variable. Probably the most popular method for this is the obscure
vlookup function in Excel (simply because it’s the most widely used software for data manipulaton). The closest alternative in base R is
merge and the dplyr package contains the join function family which is even more convenient. But there is a more simple and direct solution when only one variable needs to be added to a dataset.
Suppose we have two data frames,
df.b, and we wish to get the values of
df.a so that each
id gets their “own” value. There are various methods for joining, each yielding a different result. But in my experience left join on a single variable is the most frequent and this is what we will explore here.
Left join with ‘merge’
merge, we specify the arguments of the function, run it and then through some magic a new dataset with requested columns is created. Note that we don’t need to specify by which variable we wish to merge if variable names are the same.
Left join with ‘match’
A more hands-on approach involves first figuring out which rows in
df.a correspond to which rows in
df.b according to
match function allows us to do just that.
Now that we have the row numbers, we can simply return
df.b where the matches occur. A useful side effect is that we can define the name for the new variable while matching.
Now let’s compare the results.
We can see that the result is essentially the same. What
merge has done is rearranged the rows which is something we might not want to happen. So I encourage the use of
match when possible since it allows the addition of a single column without running a function over entire data sets.
This post was originally written in rmarkdown and the code can be found here.