Propensity score matching with R
In impact evaluations it is often necessary to measure the effect of a treatment, be it a support measure, a training program or some other intervention. To draw unbiased conclusions about the true effect, we need to take into account what would have happened without the treatment, i.e. evaluate the counterfactual situation. Subtracting the outcome of a control group from that of the treated observations allows us to do just that, but it is crucial to control for selection bias. Impact evaluations are usually carried out retrospectively, and many treatments do not permit random assignment of individuals into treatment and control groups. Thus, experimental research designs (most prominently randomized controlled trials) cannot be employed in such cases and we need to adopt non- or quasi-experimental methods. In the latter case, a pseudo control group is created so that it accurately reflects the counterfactual situation. Perhaps the most popular method for this purpose is propensity score matching (PSM), although the advantages of multivariate distance matching (MDM) (King and Nielsen 2016) must be noted. In PSM, the propensity of each observation to be treated (i.e. the propensity score) is estimated, and observations with close values of this estimate are then compared. This allows us to compare observations for which treatment can be the only source of differences in outcome, thus mitigating selection bias.
In order to carry out a PSM analysis we need a rather specific data set. Unfortunately, data that contains a non-experimental treatment and control group as well as confounding variables predicting both treatment and outcome is rather hard to come by. Thus, we'll be using the data set provided by Lalonde (1986) that includes the results of an employment and training program. This data set has previously been used by Sekhon (2011) to demonstrate some functions related to PSM. The data is available as separate tables for the treatment and control group observations, so we will load these tables and bind them into a data frame using only the columns that are present in both tables. The last variables contain the earnings of individuals in the respective years.
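A sketch of the binding step follows. The file paths of the real treated/control tables are not shown in the text, so tiny in-memory stand-in tables are used here instead; with the actual files, the two data frames would come from read.table calls.

```r
# Tiny stand-in tables for illustration; the real tables would be loaded with
# something like read.table("<path to treated table>") -- paths are assumptions.
treated <- data.frame(treat = 1, age = c(37, 22), educ = c(11, 9),
                      re75 = c(0, 0), re78 = c(9930, 3596))
control <- data.frame(treat = 0, age = c(23, 45), educ = c(10, 11),
                      re75 = c(0, 1532), re78 = c(0, 4056),
                      u75  = c(1, 0))  # extra column present in only one table

# Keep only the columns present in both tables, then bind the rows
common <- intersect(names(treated), names(control))
df <- rbind(treated[common], control[common])
```

Using intersect on the column names drops any variable that only one of the tables contains, so rbind can combine the rows without errors.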
We can use the by function to get the mean of different variables separately for treated and other individuals.
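As a sketch, the group-wise means can be computed like this (a tiny simulated data frame stands in for the Lalonde data):

```r
# Simulated stand-in for the Lalonde data
df <- data.frame(treat = c(1, 1, 0, 0),
                 age   = c(37, 22, 23, 45),
                 re75  = c(0, 0, 0, 1532))

# Mean of each variable, computed separately for treated (1) and others (0)
by(df[, c("age", "re75")], df$treat, colMeans)
```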
It is evident that our initial control group is unbalanced, since the distributions of the characteristics of individuals are very different. The mean change in earnings for treated individuals between 1975 and 1978 is 2910; for others it is 1196. Under a "naive" approach we would assume that treatment increased mean earnings by the difference between the two values, i.e. 1714. But treated individuals might have had a higher increase even without treatment, so by comparing against all non-treated individuals we would overestimate the effect of treatment. It is thus necessary to create a balanced pseudo control group in order to simulate the counterfactual situation.
Building propensity score model
Here we would like to assess the effect of treatment on earnings in 1978. We assume that age, years of education, race, marital status, academic degree and earnings in 1975 affect both treatment status and the outcome, so these need to be considered as confounding variables and will be used to calculate the propensity score.
Since treatment status is indicated as a binary (dummy) variable, we will fit a logistic regression model on the data. Then we perform a stepwise regression that seeks to minimize the value of the Akaike Information Criterion (AIC).
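The fitting step can be sketched as follows. The data here is simulated (the variable names mirror those used in the text, but the values are made up), so the fitted coefficients will not match the ones discussed below.

```r
# Simulated stand-in for the Lalonde data; variable names follow the text
set.seed(42)
n <- 500
df <- data.frame(
  age     = sample(17:55, n, replace = TRUE),
  educ    = sample(3:16, n, replace = TRUE),
  black   = rbinom(n, 1, 0.5),
  married = rbinom(n, 1, 0.2),
  nodegr  = rbinom(n, 1, 0.7),
  re75    = round(runif(n, 0, 25000))
)
# Make treatment depend on some of the covariates
df$treat <- rbinom(n, 1, plogis(-1 + 0.03 * df$age - 0.0001 * df$re75))

# Logistic regression predicting treatment from the confounders
full <- glm(treat ~ age + educ + black + married + nodegr + re75,
            data = df, family = binomial)

# Stepwise selection minimizing AIC
model <- step(full, trace = 0)
summary(model)
```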
According to AIC values, all of the variables improve the quality of the model.
Next, we examine coefficients.
The coefficient of education is not significantly different from zero, so we update the model to exclude this variable.
The dependent variable in logistic regression is binary, and the predicted values can likewise be converted into a binary outcome. Thus, we can draw a classification table that illustrates how well the model classifies the observations, taking the mean of the fitted values as the threshold for predicting treatment.
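A minimal sketch of such a classification table, using a small simulated model in place of the propensity model from the text:

```r
# Small simulated logistic model standing in for the propensity model
set.seed(1)
df <- data.frame(x = rnorm(200))
df$treat <- rbinom(200, 1, plogis(df$x))
model <- glm(treat ~ x, data = df, family = binomial)

# Classify using the mean fitted value as the threshold
threshold <- mean(fitted(model))
predicted <- as.numeric(fitted(model) >= threshold)
table(actual = df$treat, predicted = predicted)

# Overall proportion of correctly classified observations
mean(predicted == df$treat)
```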
The classification table indicates that the proportion of correctly classified treated observations is 1.
We also calculate a few other statistics that help us evaluate logistic regression model.
According to McFadden (1978), a McFadden's pseudo-R² value between 0.2 and 0.4 indicates an excellent model fit. The value we obtained is even higher. We can also reject the null hypothesis that all the coefficients in the model equal zero.
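Both statistics can be computed by comparing the fitted model against the intercept-only (null) model. This is a sketch on simulated data, not the model from the text:

```r
# Simulated stand-in for the propensity model
set.seed(1)
df <- data.frame(x = rnorm(300))
df$treat <- rbinom(300, 1, plogis(2 * df$x))
model <- glm(treat ~ x, data = df, family = binomial)
null  <- glm(treat ~ 1, data = df, family = binomial)

# McFadden's pseudo-R^2: 1 - logLik(model) / logLik(null)
pseudo_r2 <- 1 - as.numeric(logLik(model)) / as.numeric(logLik(null))

# Likelihood ratio test of the hypothesis that all coefficients are zero
lr <- as.numeric(2 * (logLik(model) - logLik(null)))
p  <- pchisq(lr, df = model$df.null - model$df.residual, lower.tail = FALSE)
c(pseudo_r2 = pseudo_r2, lr_stat = lr, p_value = p)
```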
This model performs well enough on our data to use the fitted values of the dependent variable as propensity score values for each individual.
Creating pseudo control group
Next we create a new variable and assign it the propensity scores from the model. We can again use the by function to examine summary statistics of the propensity scores separately for treated and other individuals.
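A sketch of this step (again with a small simulated model standing in for the real one):

```r
# Simulated stand-in model
set.seed(1)
df <- data.frame(x = rnorm(200))
df$treat <- rbinom(200, 1, plogis(df$x))
model <- glm(treat ~ x, data = df, family = binomial)

# Fitted values of the logistic model serve as propensity scores
df$pscore <- fitted(model)

# Summary statistics of the scores, separately for each group
by(df$pscore, df$treat, summary)
```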
The comparison of the two groups can also be plotted, and adding some jitter gives a better overview. A density plot might suit this kind of comparison even better, but plotting points is more intuitive.
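One way to draw such a jittered plot with base graphics (the scores here are simulated for illustration):

```r
# Simulated propensity scores for two groups
set.seed(1)
df <- data.frame(treat = rbinom(200, 1, 0.3))
df$pscore <- ifelse(df$treat == 1, rbeta(200, 4, 2), rbeta(200, 1, 4))

# Jitter the binary treatment indicator so the points do not overplot
plot(jitter(df$treat, amount = 0.1), df$pscore,
     xlab = "Group", ylab = "Propensity score", xaxt = "n")
axis(1, at = c(0, 1), labels = c("control", "treated"))
```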
As anticipated, the probability of treated observations to be treated is substantially higher since a great majority of non-treated individuals have a propensity score close to 0. This is an indication that the groups are unbalanced and it is necessary to create a pseudo control group for a more fair comparison.
There are various matching techniques that can be implemented to select appropriate individuals for the control group, e.g. nearest-neighbour, caliper, radius or kernel matching. Here we will use the first approach due to its simplicity. First we create a function that (1) takes the propensity score of each treated individual, (2) finds the row of the control group individual that has the most similar propensity score value and (3) returns a logical vector indicating control group members. Thus, the matching is done with replacement, i.e. one control group individual may be matched to more than one treated individual.
Now we apply this function to our data frame.
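A minimal nearest-neighbour matching function along the lines described above, applied to simulated stand-in data. This is a sketch, not the original implementation; the function name and data are assumptions.

```r
# Nearest-neighbour matching on the propensity score, with replacement.
# Returns a logical vector marking the rows chosen for the pseudo control group.
match_nn <- function(treat, pscore) {
  control_rows <- which(treat == 0)
  matched <- sapply(which(treat == 1), function(i) {
    # (2) row of the control individual with the closest propensity score
    control_rows[which.min(abs(pscore[control_rows] - pscore[i]))]
  })
  # (3) logical vector indicating pseudo control group members
  seq_along(treat) %in% matched
}

# Simulated stand-in data: 30 treated, 170 controls
set.seed(1)
df <- data.frame(treat = rep(c(1, 0), c(30, 170)))
df$pscore <- ifelse(df$treat == 1, rbeta(200, 4, 2), rbeta(200, 1, 4))

df$matched <- match_nn(df$treat, df$pscore)
sum(df$matched)  # at most 30; fewer if some controls are matched repeatedly
```

Because the matching is done with replacement, the pseudo control group may contain fewer unique individuals than there are treated observations.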
When plotting the propensity scores of the pseudo control group obtained with our matching algorithm, it is evident that its distribution is now much more similar to that of the treatment group.
In order to achieve a meaningful comparison of the treatment and control groups, the groups must be similar in terms of all the factors that influence treatment assignment. In other words, the groups need to be balanced. In our case, the groups need to be balanced with respect to the independent (confounding) variables that we included in the model predicting treatment. One simple way to check the balance is to compare the mean values in the treatment, control and pseudo control groups. Sometimes hypothesis testing is used to confirm the lack of difference in the distribution or in some particular statistic, but this is usually not recommended.
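A sketch of such a mean comparison, with simulated data and a random stand-in for the matching result:

```r
# Simulated stand-in data; df$matched marks the pseudo control group
set.seed(1)
df <- data.frame(treat = rep(c(1, 0), c(30, 170)),
                 age   = round(rnorm(200, 28, 7)),
                 re75  = round(runif(200, 0, 20000)))
df$matched <- df$treat == 0 & runif(200) < 0.2  # stand-in for matching output

# Means of the confounders in each of the three groups
vars <- c("age", "re75")
rbind(treated = colMeans(df[df$treat == 1, vars]),
      control = colMeans(df[df$treat == 0, vars]),
      pseudo  = colMeans(df[df$matched, vars]))
```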
Judging by the mean values, the balance has improved for several variables. At this point we might want to go back and adjust the propensity model or the matching algorithm, but here we are just providing an example. It is important to note that the confounding effect may be stronger for some variables than for others, so it is not always necessary to have the groups balanced across all of them.
Estimating the impact of treatment using control group
Here we apply a difference-in-differences approach, which means that we do not only compare the differences between groups but also changes over time. Comparing the change in earnings rather than merely the difference in values after treatment uses more information and thus gives a more accurate estimate of the effect of treatment. Let us calculate the change in mean earnings for all three groups.
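The calculation can be sketched like this (simulated earnings, with a random stand-in for the matching result):

```r
# Simulated stand-in data with earnings in 1975 and 1978
set.seed(1)
df <- data.frame(treat = rep(c(1, 0), c(30, 170)),
                 re75  = round(runif(200, 0, 10000)),
                 re78  = round(runif(200, 0, 15000)))
df$matched <- df$treat == 0 & runif(200) < 0.2  # stand-in for matching output

# Change in earnings between 1975 and 1978, mean per group
df$change <- df$re78 - df$re75
c(treated = mean(df$change[df$treat == 1]),
  control = mean(df$change[df$treat == 0]),
  pseudo  = mean(df$change[df$matched]))
```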
As previously explained, if we considered the original control group to reflect the counterfactual situation, we could claim that earnings increased by 1714 due to treatment. That seems like an overestimate. When we instead use the mean change in earnings of the pseudo control group, we conclude that earnings changed by -157, i.e. treatment actually had a slight negative effect on earnings according to the means of changes. These results might be better illustrated by t-tests.
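Such tests can be run with t.test, comparing the change in earnings of the treated group against the original control group and against the pseudo control group. Again, the data below is simulated for illustration only:

```r
# Simulated changes in earnings; the means roughly mimic the groups in the text
set.seed(1)
df <- data.frame(treat = rep(c(1, 0), c(30, 170)))
df$change  <- rnorm(200, mean = ifelse(df$treat == 1, 2900, 1200), sd = 5000)
df$matched <- df$treat == 0 & runif(200) < 0.2  # stand-in for matching output

# Treated vs. original control group
t.test(df$change[df$treat == 1], df$change[df$treat == 0])

# Treated vs. pseudo control group
t.test(df$change[df$treat == 1], df$change[df$matched])
```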
The p-values of the two t-tests also indicate that previously we could have attributed a substantial increase in mean earnings to treatment. After implementing PSM, we were unable to demonstrate a statistically significant mean effect of treatment, and thus we cannot conclude that treatment had an impact on earnings.
King, G., & Nielsen, R. (2016). Why propensity score should not be used for matching. Mimeo, (617), 32. Retrieved from http://gking.harvard.edu/publications/why-propensity-scores-should-not-be-used-formatching
Lalonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review, 76(4), 604–620.
Sekhon, J. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R. Journal of Statistical Software, 42(7), 1–52. https://doi.org/10.18637/jss.v042.i07
McFadden, D. L. (1978). Quantitative Methods for Analyzing Travel Behavior of Individuals: Some Recent Developments. In D. A. Hensher & P. R. Stopher (Eds.), Behavioural Travel Modelling (pp. 279–318). London: Croom Helm.