7 Multiple Linear Regression
Before you start working on this chapter, you need to do the following. If you need help with any of these steps, see Section 3.1.
Launch RStudio.
Load POL232.RData into your current R session.
Prepare an R script to save all your work in this chapter. I suggest you name it "POL232_Week#_YourLastName.R" or "POL232_Tutorial#_YourLastName.R", in which # is the number of the current week or tutorial session.
You also need to load the tidyverse package into your current R session (Section 1.4.2).
I suggest you actually write the R functions used below in your R script instead of copying and pasting them. See Section 3.1.5 for why.
I also suggest you sufficiently annotate your R script (i.e., leaving notes after the # sign) so that you can use your R script as your reference when you work on tutorial exercises or data analysis paper assignments. In other words, this R script will be your notes for this chapter.
7.1 Overview
While we have started learning statistical inference in lectures, in this chapter we still treat a linear regression model as a method for descriptive statistics only. We will cover statistical inference for linear regression in later chapters.
This chapter will introduce i) how to fit a multiple linear regression model with two or more independent variables on the right-hand side of the linear regression equation, and ii) dummification of a categorical independent variable in a linear regression model.
Note that we did not cover the second topic — dummification of a categorical independent variable — in lectures. I want you to learn this topic from this chapter. If you need an additional reference for this topic, you may read Kellstedt and Whitten’s Chapter 11.2.2 (3rd Edition, pp.251-254) or Chapter 10.2.2 (2nd Edition, pp.225-227). Recall that the 2nd edition of this book is available online from the UofT library.
7.2 Linear Regression for Multivariate Analysis
7.2.1 Prepare Variables
As in Chapter 6, let's create percep_economy_cps_n, a numeric version of percep_economy_cps, and union_d, a logical version of union.
ces2019 <- ces2019 |>
  mutate(percep_economy_cps = fct_relevel(percep_economy_cps,
                                          "(2) Worse", "(3) About the same", "(1) Better")) |>
  mutate(percep_economy_cps = fct_recode(percep_economy_cps,
                                         "Worse" = "(2) Worse",
                                         "Same" = "(3) About the same",
                                         "Better" = "(1) Better")) |>
  mutate(percep_economy_cps_n = as.numeric(percep_economy_cps))
ces2019 <- ces2019 |> # Change the name of categories of union to TRUE and FALSE.
  mutate(union_d = fct_recode(union, "TRUE" = "(1) Yes", "FALSE" = "(2) No")) |>
  mutate(union_d = as.logical(union_d)) # Then, apply the as.logical() function.
7.2.2 Multiple Linear Regression
Recall that we used the lm() function to fit a simple linear regression model. For example, the code below fits a simple linear regression model of trudeau_therm_cps on percep_economy_cps_n.
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n
8.128 18.982
To fit a multiple linear regression model, we continue to use the lm() function, but now we need to specify multiple independent variables on the right-hand side of the equation in the formula argument. More specifically, suppose that the model we want to fit is given by the following formula, in which we have three independent variables, \(X_1\), \(X_2\), and \(X_3\).
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + u \tag{7.1}\]
For this model, we specify the formula argument in the lm() function as formula = Y ~ X1 + X2 + X3. That is, ignoring the coefficients, we connect all the independent variables with a + sign on the right-hand side of the tilde, ~.
Let's fit a multiple linear regression model of trudeau_therm_cps on three independent variables, percep_economy_cps_n, ideology, and union_d, where percep_economy_cps_n corresponds to \(X_1\), ideology corresponds to \(X_2\), and union_d corresponds to \(X_3\) in Equation 7.1.
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology +
union_d, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n ideology
24.7569 17.1926 -2.7193
union_dTRUE
0.8752
As before, we can see the list of computed coefficients at the bottom of the output. The number below (Intercept) is the intercept of the linear regression model, an estimate of \(\beta_0\) in Equation 7.1. The number below percep_economy_cps_n is the coefficient on the percep_economy_cps_n variable, an estimate of \(\beta_1\) in Equation 7.1. The number below ideology is the coefficient on the ideology variable, an estimate of \(\beta_2\). Finally, the number below union_dTRUE is the coefficient on the union_d variable, an estimate of \(\beta_3\).
If we have more independent variables than shown in this example, we simply add them on the right-hand side of the tilde, ~, in the formula argument, connecting each of them with a + sign.
In the above example, the coefficient on percep_economy_cps_n differs between the simple and multiple linear regression models. This is because the coefficient on percep_economy_cps_n in the multiple linear regression model reflects the relationship between trudeau_therm_cps and percep_economy_cps_n controlling for the other two variables included in the model, ideology and union_d. That is, this coefficient indicates how much trudeau_therm_cps changes, on average, as percep_economy_cps_n increases by one unit, holding ideology and union_d constant.
In general, the coefficient on the same variable may differ between simple and multiple linear regression models if one or more of the control variables included in the multiple linear regression model are confounding variables for the relationship between the dependent variable and the main independent variable of our interest. In the present example, ideology, union_d, or both are confounding variables for the relationship between trudeau_therm_cps and percep_economy_cps_n. However, the change in the coefficient on percep_economy_cps_n between the simple linear regression model, 18.98, and the multiple linear regression model, 17.19, seems to be small, indicating that the confounding effect of these variables is relatively small in this sample.
7.3 Dummifying Ordinal Categorical Independent Variable
In the analysis above, we transformed percep_economy_cps into a numeric variable before we used it as an independent variable in the lm() function. Even without this transformation, R allows the use of a factor as an independent variable in the lm() function, but the result differs from the one we saw before for percep_economy_cps_n in Section 6.2.1 and Section 7.2.
In the code below, I use percep_economy_cps instead of percep_economy_cps_n as the only independent variable in a linear regression of trudeau_therm_cps.
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
25.13 23.93 36.76
In the output above, you see that percep_economy_cps is divided into two variables, percep_economy_cpsSame and percep_economy_cpsBetter, and a coefficient is reported for each of them. What R did here is dummify this categorical variable.
Dummifying a categorical variable means that each category of the categorical variable (except for one, which is called the reference category) is transformed into a dummy variable for that category. In the current example, from percep_economy_cps, which has three categories, R created two dummy variables, percep_economy_cpsSame and percep_economy_cpsBetter, indicating percep_economy_cps = Same and percep_economy_cps = Better, respectively (Worse is used as the reference category).
Table 7.1 below shows how the values and categories of these three variables correspond to each other.
| percep_economy_cps | Worse | Same | Better |
|---|---|---|---|
| percep_economy_cpsSame | 0 | 1 | 0 |
| percep_economy_cpsBetter | 0 | 0 | 1 |
As shown in this table, percep_economy_cpsSame is a dummy variable whose value equals 1 if the value of percep_economy_cps is Same and 0 if the value of percep_economy_cps is Worse or Better.
Similarly, percep_economy_cpsBetter is a dummy variable whose value equals 1 if the value of percep_economy_cps is Better and 0 if the value of percep_economy_cps is Worse or Same.
In the process of dummifying a categorical variable, a dummy variable is not created for one of the categories, which is used as the reference category (you will see why this is necessary later). In the example here, Worse is used as the reference category, and therefore a dummy variable is not created for this category.
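Although we keep R's default choice in this chapter, it may help to know that the reference category can be changed by reordering the factor levels before fitting the model. Below is a minimal sketch with a synthetic factor (not the ces2019 data), using base R's relevel():

```r
# A synthetic three-level factor; the first level is the default reference.
x <- factor(c("Worse", "Same", "Better"),
            levels = c("Worse", "Same", "Better"))
levels(x)   # "Worse" "Same" "Better": Worse would be the reference

# relevel() moves a chosen level to the first position.
x2 <- relevel(x, ref = "Same")
levels(x2)  # "Same" "Worse" "Better": Same would now be the reference
```

If x2 were used in lm(), the dummy variables would be created for Worse and Better instead.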
When dummifying a categorical variable stored as a factor, R uses the first level (category) of the variable as the reference category. Recall that we can check the levels of a factor with the levels() function.
levels(ces2019$percep_economy_cps)
[1] "Worse" "Same" "Better"
As the first level of percep_economy_cps is Worse, this level is used as the reference category, and dummy variables are created only for Same and Better, named percep_economy_cpsSame and percep_economy_cpsBetter using the name of the variable and the name of each level. Because one category is used as the reference category, if a categorical variable has L categories (L here is an arbitrary number), then dummy variables are created for L-1 categories. In the present example, there are three categories (L = 3) for percep_economy_cps; therefore, only two dummy variables (L-1 = 3-1 = 2) are created for this variable.
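To see this dummification directly, we can inspect the design matrix that lm() builds internally. Below is a small illustration with a synthetic factor (not the ces2019 data); model.matrix() is the base R function that performs this expansion:

```r
# A synthetic factor with L = 3 categories.
x <- factor(c("Worse", "Same", "Better", "Same"),
            levels = c("Worse", "Same", "Better"))

# model.matrix() expands x into an intercept plus L - 1 = 2 dummy columns;
# the first level, Worse, is the reference and gets no column of its own.
mm <- model.matrix(~ x)
colnames(mm)  # "(Intercept)" "xSame" "xBetter"
```

Each row of mm contains a 1 in the dummy column matching that observation's category, and all zeros in the dummy columns for an observation in the reference category.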
7.3.1 Fitted Values from the Linear Regression with Dummified Variables from Categorical Independent Variable
To understand further a linear regression model with the dummified variables created from a categorical independent variable, let's derive the fitted values of trudeau_therm_cps for each category of percep_economy_cps. Recall that a fitted value (or a predicted value) is the value of \(\widehat{Y}\) derived from a linear regression equation (e.g., \(\widehat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2\)) by substituting specific values of the independent variables (e.g., fixing \(X_1\) and \(X_2\) at certain values).
Below I have reproduced the output of the linear regression of trudeau_therm_cps on percep_economy_cps.
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
25.13 23.93 36.76
Based on this result, the fitted values of trudeau_therm_cps can be computed by the following formula.
\[ \begin{aligned} {\color{blue}25.13}~+~{\color{red}23.93}~\times &percep\_economy\_cpsSame\\ &+~{\color{violet}36.76}~\times percep\_economy\_cpsBetter \end{aligned} \tag{7.2}\]
If percep_economy_cps = Worse, then both percep_economy_cpsSame and percep_economy_cpsBetter must be zero (see Table 7.1). Therefore, substituting \(0\) for both variables in Equation 7.2, the fitted value of trudeau_therm_cps when percep_economy_cps = Worse is given by \({\color{blue}25.13}~+~{\color{red}23.93}\times0+{\color{violet}36.76}\times0 = {\color{blue}25.13}\). Because both dummy variables are zero, the fitted value of trudeau_therm_cps when percep_economy_cps = Worse is equal to the intercept of the linear regression model in Equation 7.2. This is the reason why no dummy variable is created for the reference category: this category doesn't need its own dummy variable, since the intercept of the linear regression model already represents its fitted value.
If percep_economy_cps = Same, then percep_economy_cpsSame = 1 and percep_economy_cpsBetter = 0 (see Table 7.1). Therefore, substituting these values in Equation 7.2, the fitted value of trudeau_therm_cps is given by \({\color{blue}25.13}~+~{\color{red}23.93}\times1+{\color{violet}36.76}\times0 = {\color{blue}25.13}~+~{\color{red}23.93} = {\color{red}49.06}\). This result implies that the coefficient on percep_economy_cpsSame, \({\color{red}23.93}\), indicates the difference in the fitted values of trudeau_therm_cps between when percep_economy_cps is Same and when percep_economy_cps is its reference category, Worse (i.e., \({\color{red}49.06} - {\color{blue}25.13} = {\color{red}23.93}\)).
Similarly, if percep_economy_cps = Better, then percep_economy_cpsSame = 0 and percep_economy_cpsBetter = 1 (see Table 7.1). Therefore, substituting these values in Equation 7.2, the fitted value of trudeau_therm_cps is given by \({\color{blue}25.13}~+~{\color{red}23.93}\times0+{\color{violet}36.76}\times1 = {\color{blue}25.13}~+~{\color{violet}36.76} = {\color{violet}61.89}\). Again, the coefficient on percep_economy_cpsBetter, \({\color{violet}36.76}\), indicates the difference in the fitted values between when percep_economy_cps is Better and when percep_economy_cps is its reference category, Worse (i.e., \({\color{violet}61.89} - {\color{blue}25.13} = {\color{violet}36.76}\)).
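The arithmetic above can be reproduced in R. The sketch below hardcodes the coefficients reported in the output for illustration (with the fitted model object, they could instead be extracted with coef()):

```r
# Coefficients from the regression output above.
b0       <- 25.13  # (Intercept): fitted value for the reference category, Worse
b_same   <- 23.93  # coefficient on percep_economy_cpsSame
b_better <- 36.76  # coefficient on percep_economy_cpsBetter

b0 + b_same * 0 + b_better * 0  # fitted value for Worse:  25.13
b0 + b_same * 1 + b_better * 0  # fitted value for Same:   49.06
b0 + b_same * 0 + b_better * 1  # fitted value for Better: 61.89
```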
7.3.2 How to Interpret the Linear Regression Coefficients on Dummified Variables from Categorical Independent Variable
More generally, we can say that the coefficient on a dummy variable created for a certain category of a categorical independent variable (say, \(X = a\)) indicates the difference between the fitted value (\(\widehat{Y}\)) for this category (\(X = a\)) and that for the reference category (\(X =\) reference category).
In other words, the coefficient on a dummy variable for a certain category of a categorical independent variable (\(X = a\)) tells us how the value of \(Y\) for the observations with \(X = a\) differs, on average, from the value of \(Y\) for those with \(X =\) reference category.
For example, in Section 7.3.1, we saw that the coefficient on percep_economy_cpsSame (\({\color{red}23.93}\)) indicates the difference in the fitted values of trudeau_therm_cps between when percep_economy_cps is Same and when percep_economy_cps is the reference category, Worse (\({\color{red}49.06} - {\color{blue}25.13} = {\color{red}23.93}\)). This coefficient, \({\color{red}23.93}\), can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Same is, on average, \({\color{red}23.93}\) points higher than for those whose perception of economy is the reference category, Worse.
We also saw that the coefficient on percep_economy_cpsBetter (\({\color{violet}36.76}\)) indicates the difference in the fitted values between when percep_economy_cps is Better and when percep_economy_cps is the reference category, Worse (\({\color{violet}61.89} - {\color{blue}25.13} = {\color{violet}36.76}\)). This coefficient, \({\color{violet}36.76}\), can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Better is, on average, \({\color{violet}36.76}\) points higher than for those whose perception of economy is the reference category, Worse.
As we saw in these examples, the coefficient on the dummy variable for a certain category of a categorical independent variable (\(X = a\)) should be understood as the difference between the average value of \(Y\) for this category (\(X = a\)) and the average value of \(Y\) for the reference category (\(X =\) reference category). This is the reason why the reference category is named that way.
As the above discussion has made clear, the coefficient on the dummy variable for a certain category of a categorical independent variable (\(X = a\)) gives us the comparison between the chosen category (\(X = a\)) and the reference category (\(X =\) reference category).
If we want to compare the chosen category (\(X = a\)) with another category that is not the reference category (say, \(X = b\)), then we need to compute the difference between their coefficients.
For example, suppose that we want to compare the average Trudeau thermometer between the respondents whose perception of economy is Better and those whose perception of economy is Same. The comparison is made by taking the difference of the fitted values of trudeau_therm_cps when percep_economy_cps is Better and when percep_economy_cps is Same. As we have seen, the fitted value of trudeau_therm_cps when percep_economy_cps is Better is \({\color{blue}25.13}~+~ {\color{violet}36.76} = {\color{violet}61.89}\), and the fitted value of trudeau_therm_cps when percep_economy_cps is Same is \({\color{blue}25.13}~+~{\color{red}23.93} = {\color{red}49.06}\). Therefore, the difference of these fitted values is \({\color{violet}61.89} - {\color{red}49.06} = {\color{brown}12.83}\).
However, this is equivalent to the difference between the coefficient on percep_economy_cpsBetter, \({\color{violet}36.76}\), and the coefficient on percep_economy_cpsSame, \({\color{red}23.93}\) (\({\color{violet}36.76} - {\color{red}23.93} = {\color{brown}12.83}\)). Therefore, we could have simply computed the difference between these two coefficients in the first place.
This result can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Better is, on average, \({\color{brown}12.83}\) points higher than for those whose perception of economy is Same.
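This comparison is a one-line computation from the coefficients reported above (hardcoded here for illustration):

```r
b_same   <- 23.93  # coefficient on percep_economy_cpsSame
b_better <- 36.76  # coefficient on percep_economy_cpsBetter
b_better - b_same  # Better vs. Same difference: 12.83
```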
7.3.3 With Additional Independent Variables
In Section 7.3.2, we included only the dummy variables for a categorical independent variable on the right-hand side of the linear regression model. We can also include additional independent variables on the right-hand side. For example, below I add ideology and union_d to the right-hand side of the linear regression model of trudeau_therm_cps on percep_economy_cps.
lm(formula = trudeau_therm_cps ~ percep_economy_cps + ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps + ideology +
union_d, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
39.3833 22.0875 33.8127
ideology union_dTRUE
-2.6315 0.7463
As before, in this linear regression analysis, percep_economy_cps is dummified into two dummy variables, percep_economy_cpsSame and percep_economy_cpsBetter. The interpretation of the coefficients on these dummy variables is the same as before, except that the other variables, ideology and union_d, are now controlled for.
The coefficient on percep_economy_cpsSame is \({\color{red}22.09}\). This can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Same is, on average, \({\color{red}22.09}\) points higher than for those whose perception of economy is the reference category, Worse, controlling for the respondents' political ideology and union membership.
The coefficient on percep_economy_cpsBetter is \({\color{violet}33.81}\). This coefficient can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Better is, on average, \({\color{violet}33.81}\) points higher than for those whose perception of economy is the reference category, Worse, holding their political ideology and union membership constant.
If we want to compare the respondents whose perception of economy is Better and those whose perception of economy is Same, we need to compute the difference between the coefficient on percep_economy_cpsBetter and the coefficient on percep_economy_cpsSame (\({\color{violet}33.81} - {\color{red}22.09} = {\color{brown}11.72}\)). As the difference is \({\color{brown}11.72}\), it can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Better is, on average, \({\color{brown}11.72}\) points higher than for those whose perception of economy is Same, with the respondents' political ideology and union membership held constant.
7.3.4 Difference from Numerical Version
Comparing the linear regression results for the numeric version and the dummified version of percep_economy_cps, we can see the difference between these two specifications.
For your quick reference, I have reproduced the outputs of the lm() function for a simple linear regression of trudeau_therm_cps with both versions of percep_economy_cps below.
(1) With the numeric version of percep_economy_cps
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n
8.128 18.982
(2) With the dummified version of percep_economy_cps
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
25.13 23.93 36.76
In the result of the simple linear regression model with the numeric version, percep_economy_cps_n, the average difference in the Trudeau thermometer is \(18.98\) points both between Better and Same and between Same and Worse. On the other hand, in the linear regression model with the dummified version, the average difference in the Trudeau thermometer is \({\color{violet}36.76} - {\color{red}23.93} = {\color{brown}12.83}\) points between Better and Same, while it is \({\color{red}23.93}\) points between Same and Worse.
Similarly, I have reproduced the outputs of the lm() function for a multiple linear regression of trudeau_therm_cps with both versions of percep_economy_cps below.
(1) With the numeric version of percep_economy_cps
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology +
union_d, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n ideology
24.7569 17.1926 -2.7193
union_dTRUE
0.8752
(2) With the dummified version of percep_economy_cps
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps + ideology +
union_d, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
39.3833 22.0875 33.8127
ideology union_dTRUE
-2.6315 0.7463
In the result of the multiple linear regression model with the numeric version, percep_economy_cps_n, the average difference in the Trudeau thermometer is \(17.19\) points both between Better and Same and between Same and Worse. On the other hand, in the result with the dummified version, the average difference in the Trudeau thermometer is \({\color{violet}33.81} - {\color{red}22.09} = {\color{brown}11.72}\) points between Better and Same, while it is \({\color{red}22.09}\) points between Same and Worse.
When we used the numeric version, percep_economy_cps_n, the difference in the Trudeau thermometer is the same whether we consider the difference between Better and Same or that between Same and Worse. On the other hand, when we used the dummified version, the difference in the Trudeau thermometer differs between the comparison of Better with Same and the comparison of Same with Worse.
In general, if we use a numeric version of an ordinal categorical variable, the average difference in the dependent variable between one category and its adjacent category is constrained to be the same for all pairs of adjacent categories, whereas if we use the dummified version of this variable, this difference is allowed to vary across categories. Hence, dummification may be considered a more flexible modeling strategy for an ordinal categorical variable than the use of a numeric version of this variable. This is a modeling choice we need to make when we include an ordinal categorical variable as an independent variable in our linear regression model.
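This difference in flexibility can be seen in a small simulation. The sketch below uses synthetic data (not ces2019), with a true step of 5 points between the first two categories and 15 points between the last two; the numeric coding forces one common slope, while the dummified version can recover the unequal steps:

```r
set.seed(1)
# An ordinal factor with three categories and deliberately unequal true steps.
x <- factor(sample(c("Low", "Mid", "High"), 300, replace = TRUE),
            levels = c("Low", "Mid", "High"))
y <- c(Low = 0, Mid = 5, High = 20)[as.character(x)] + rnorm(300)

coef(lm(y ~ as.numeric(x)))  # one slope: equal steps imposed between adjacent categories
coef(lm(y ~ x))              # xMid near 5, xHigh near 20: unequal steps allowed
```

The dummified fit reports each category's average difference from the reference (Low) separately, so the Mid and High steps need not be multiples of one common slope.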
7.4 Dummifying Nominal Categorical Independent Variable: Questions
For an ordinal categorical variable, we can consider dummifying it or transforming it into a numeric variable as alternatives, but for a nominal categorical variable, dummifying may be the only appropriate option. As the interpretation of the results remains the same as in Section 7.3, instead of repeating the explanation of how to interpret a model, I want you to work on the following questions on dummifying a nominal categorical independent variable. Answers to these questions will be made available in Section 7.5 after all tutorial sections have covered this chapter.
Let's use the province of residence of the respondents (province) in the ces2019 data frame as an example.
Question 1
What will be the reference category when we dummify province on the right-hand side of a linear regression model using the lm() function? Use the levels() function to find out.
Question 2
Fit a linear regression model of trudeau_therm_cps on province using the lm() function. Interpret the intercept and the coefficient on provinceON.
Question 3
Fit a linear regression model of trudeau_therm_cps on province, percep_economy_cps_n, ideology, and union_d using the lm() function. Interpret the coefficients on provinceON, provinceSK, and percep_economy_cps_n, respectively.
Question 4
Dummifying all categories except one is not the only way to create a dummy variable from a nominal categorical variable. An alternative is to create a dummy variable only for a single category or a subset of categories of a nominal categorical variable. In this question, we consider a single dummy variable for a subset of categories of province.
Suppose we theorized that the respondents' feelings about the incumbent Liberal prime minister are different between the Prairies (Alberta, Manitoba, and Saskatchewan) and the rest of the provinces. In this case, we want to dummify province such that prairies = 1 if the respondents lived in these provinces and prairies = 0 otherwise. We can create such a variable with the ifelse() function introduced in Section 6.3.4. Read that section again to review the basics of the ifelse() function. For the current purpose, you may specify the condition in the ifelse() function as follows.
ces2019 <- mutate(ces2019,
                  # Create a new variable "prairies" using the ifelse() function.
                  prairies = ifelse(province %in% c("AB","MB","SK"), "TRUE", "FALSE"))
# This ifelse() function assigns "TRUE" to observations if
# "province" is "AB", "MB" or "SK" but assigns "FALSE" otherwise.

# Then, we transform prairies into a logical variable.
ces2019 <- mutate(ces2019, prairies = as.logical(prairies))
In the above code, the condition for the ifelse() function is specified as province %in% c("AB","MB","SK"), which means that "province is either AB or MB or SK." If you want to specify the condition that a certain variable, say var, equals either A or B or C or D, you can use %in% to write the condition as var %in% c("A","B","C","D").
Let's check whether prairies was created as we wanted.
table(ces2019$prairies, ces2019$province, useNA = "always")
AB BC MB NB NL NS ON PE QC SK <NA>
FALSE 0 804 0 201 201 200 807 201 802 0 0
TRUE 282 0 261 0 0 0 0 0 0 262 0
<NA> 0 0 0 0 0 0 0 0 0 0 0
Fit a linear regression model of trudeau_therm_cps on prairies, percep_economy_cps_n, ideology, and union_d using the lm() function.
Interpret the coefficient on prairiesTRUE.
Question 5
We may also use multiple dummy variables corresponding to different subsets of a nominal categorical variable. Suppose we theorized that, in addition to the Prairies, the respondents’ feelings about the incumbent Liberal prime minister may also be different in the Maritimes (New Brunswick, Nova Scotia, and Prince Edward Island) from the rest of the provinces.
Create a dummy variable for the Maritimes, named maritimes, using the ifelse() function. Then, fit a linear regression model of trudeau_therm_cps on prairies, maritimes, percep_economy_cps_n, ideology, and union_d using the lm() function.
Interpret the coefficients on prairiesTRUE and maritimesTRUE, respectively.
7.5 Dummifying Nominal Categorical Independent Variable: Answers
Answers to the questions posed in Section 7.4 will be made available here after all tutorial sections have covered this chapter. Plan to come back and review these answers.
Let's use the province of residence of the respondents (province) in the ces2019 data frame as an example.
Question 1
What will be the reference category when we dummify province on the right-hand side of a linear regression model using the lm() function? Use the levels() function to find out.
Below is a list of levels of this variable.
levels(ces2019$province)
[1] "AB" "BC" "MB" "NB" "NL" "NS" "ON" "PE" "QC" "SK"
Because the first level is AB, Alberta will be used as the reference category.
Question 2
Fit a linear regression model of trudeau_therm_cps on province using the lm() function. Interpret the intercept and the coefficient on provinceON.
.
We fit a linear regression model of trudeau_therm_cps on province as below.
lm(formula = trudeau_therm_cps ~ province, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ province, data = ces2019)
Coefficients:
(Intercept) provinceBC provinceMB provinceNB provinceNL provinceNS
24.910 17.566 12.487 21.535 22.019 23.955
provinceON provincePE provinceQC provinceSK
24.083 24.600 24.225 3.426
The intercept is the fitted value for the reference category of this variable (province). Since the reference category is Alberta, the intercept is the fitted value for Alberta. The intercept in the above linear regression result is \(24.91\). This may be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 24.91 points for those respondents who lived in Alberta.”
The coefficient on a dummy variable for one category of a nominal categorical variable represents the difference in the average value of the dependent variable between the observations in this category and those in the reference category. Because the coefficient on provinceON is \(24.08\) and the reference category is Alberta, this coefficient may be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 24.08 points higher for those respondents who lived in Ontario than those who lived in Alberta.”
Question 3
Fit a linear regression model of trudeau_therm_cps on province, percep_economy_cps_n, ideology, and union_d using the lm() function. Interpret the coefficients on provinceON, provinceSK, and percep_economy_cps_n, respectively.
We fit a linear regression model of trudeau_therm_cps on province, percep_economy_cps_n, ideology, and union_d as below.
lm(formula = trudeau_therm_cps ~ province + percep_economy_cps_n
+ ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ province + percep_economy_cps_n +
ideology + union_d, data = ces2019)
Coefficients:
(Intercept) provinceBC provinceMB
15.2848 11.7557 10.8040
provinceNB provinceNL provinceNS
16.9362 18.4670 12.1679
provinceON provincePE provinceQC
15.3387 15.8607 12.3670
provinceSK percep_economy_cps_n ideology
3.4424 15.8163 -2.6628
union_dTRUE
0.5959
Because there are additional control variables, the coefficient on a dummy variable for one category of a nominal categorical variable now represents the difference in the average value of the dependent variable between the observations in this category and those in the reference category, controlling for the other variables in the model.
The coefficient on provinceON can be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 15.34 points higher for those respondents who lived in Ontario than those who lived in Alberta, controlling for the perception of economy, political ideology, and union membership.”
The coefficient on provinceSK can be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 3.44 points higher for those respondents who lived in Saskatchewan than those who lived in Alberta, holding the perception of economy, political ideology, and union membership constant.”
The coefficient on percep_economy_cps_n can be interpreted as below.
“Feeling thermometer about Trudeau is, on average, 15.82 points higher for those respondents whose perception of economy is better (the same) than those whose perception of economy is the same (worse), controlling for the provinces of residence, political ideology, and union membership.”
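If you store the fitted model in an object, you can pull out individual coefficients with coef() instead of reading them off the printed output. This is a minimal sketch; the object name fit is my own choice, and it assumes ces2019 is loaded in your session as above.

```r
# Store the fitted model, then extract named coefficients from it.
fit <- lm(trudeau_therm_cps ~ province + percep_economy_cps_n +
            ideology + union_d, data = ces2019)
coef(fit)["provinceON"]            # 15.3387 in the output above
coef(fit)["percep_economy_cps_n"]  # 15.8163 in the output above
```

Storing the model also lets you reuse it later (e.g. for summaries or predictions) without refitting.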
Question 4
Dummifying all categories except one is not the only way to create dummy variables from a nominal categorical variable. An alternative is to create a dummy variable for only a single category or a subset of categories of a nominal categorical variable. In this question, we consider a single dummy variable for a subset of categories of province.
Suppose we theorized that the respondents’ feelings about the incumbent Liberal prime minister are different between the Prairies (Alberta, Manitoba, and Saskatchewan) and the rest of the provinces. In this case, we want to dummify province such that prairies = 1 if the respondent lived in these provinces and prairies = 0 otherwise. We can create such a variable with the ifelse() function introduced in Section 6.3.4. Read that section again to review the basics of the ifelse() function. For the current purpose, you may specify the condition in the ifelse() function as follows.
ces2019 <- mutate(ces2019,
                  # Create a new variable "prairies" using the ifelse() function.
                  prairies = ifelse(province %in% c("AB","MB","SK"), "TRUE", "FALSE") )
                  # This ifelse() function assigns "TRUE" to observations if
                  # "province" is "AB", "MB" or "SK" but assigns "FALSE" otherwise.
# Then, we transform prairies into a logical variable.
ces2019 <- mutate(ces2019, prairies = as.logical(prairies) )
In the above code, the condition for the ifelse() function is specified as province %in% c("AB","MB","SK"), which means that “province is either AB or MB or SK.” If you want to specify a condition that a certain variable, say var, equals either A or B or C or D, you can use %in% to specify the condition as var %in% c("A","B","C","D").
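To see how %in% behaves, you can test it on a small vector. As an aside (a sketch, not required for this question), %in% itself already returns a logical vector, so the dummy could even be created in one step without ifelse() or as.logical().

```r
# %in% tests, element by element, whether each value on the left
# appears in the set on the right.
c("AB", "ON", "SK") %in% c("AB", "MB", "SK")
# [1]  TRUE FALSE  TRUE

# Because the result is already logical (TRUE/FALSE), an alternative
# one-step way to create the same dummy variable would be:
# ces2019 <- mutate(ces2019, prairies = province %in% c("AB", "MB", "SK"))
```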
Let’s check if prairies was created as we wanted.
table(ces2019$prairies, ces2019$province, useNA = "always")
AB BC MB NB NL NS ON PE QC SK <NA>
FALSE 0 804 0 201 201 200 807 201 802 0 0
TRUE 282 0 261 0 0 0 0 0 0 262 0
<NA> 0 0 0 0 0 0 0 0 0 0 0
Fit a linear regression model of trudeau_therm_cps on prairies, percep_economy_cps_n, ideology, and union_d using the lm() function. Interpret the coefficient on prairiesTRUE.
We fit a linear regression model of trudeau_therm_cps on prairies, percep_economy_cps_n, ideology, and union_d as below.
lm(formula = trudeau_therm_cps ~ prairies + percep_economy_cps_n
+ ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ prairies + percep_economy_cps_n +
ideology + union_d, data = ces2019)
Coefficients:
(Intercept) prairiesTRUE percep_economy_cps_n
28.6463 -9.0945 16.0341
ideology union_dTRUE
-2.6583 0.6047
Given this dummy variable for the Prairies, the reference category is all provinces other than the Prairies. Therefore, the coefficient on prairiesTRUE indicates the average difference in trudeau_therm_cps between the respondents who lived in the Prairies and those who lived in other provinces, controlling for the other variables.
The coefficient on prairiesTRUE may be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 9.09 points lower for those respondents who lived in the Prairies than those who lived in other provinces, controlling for the perception of economy, political ideology, and union membership.”
Question 5
We may also use multiple dummy variables corresponding to different subsets of the categories of a nominal categorical variable. Suppose we theorized that, in addition to the Prairies, the respondents’ feelings about the incumbent Liberal prime minister may also be different in the Maritimes (New Brunswick, Nova Scotia, and Prince Edward Island) from the rest of the provinces.
Create a dummy variable for the Maritimes, named maritimes, using the ifelse() function. Then, fit a linear regression model of trudeau_therm_cps on prairies, maritimes, percep_economy_cps_n, ideology, and union_d using the lm() function. Interpret the coefficients on prairiesTRUE and maritimesTRUE, respectively.
We create a new dummy variable maritimes as below.
ces2019 <- mutate(ces2019,
                  # Create a new variable "maritimes" using the ifelse() function.
                  maritimes = ifelse(province %in% c("NB","NS","PE"), "TRUE", "FALSE") )
                  # This ifelse() function assigns "TRUE" to observations if
                  # "province" is "NB", "NS" or "PE" but assigns "FALSE" otherwise.
# Then, we transform maritimes into a logical variable.
ces2019 <- mutate(ces2019, maritimes = as.logical(maritimes) )
Let’s check if maritimes was created as we wanted.
table(ces2019$maritimes, ces2019$province, useNA = "always")
AB BC MB NB NL NS ON PE QC SK <NA>
FALSE 282 804 261 0 201 0 807 0 802 262 0
TRUE 0 0 0 201 0 200 0 201 0 0 0
<NA> 0 0 0 0 0 0 0 0 0 0 0
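Optionally, you can also cross-tabulate the two dummy variables to confirm that they do not overlap. This is a quick sanity check of my own, assuming both prairies and maritimes have been created in ces2019 as above.

```r
# Cross-tabulate the two dummies; the TRUE/TRUE cell should be 0,
# since no province belongs to both the Prairies and the Maritimes.
table(ces2019$prairies, ces2019$maritimes)
```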
lm(formula = trudeau_therm_cps ~ prairies + maritimes + percep_economy_cps_n
+ ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ prairies + maritimes + percep_economy_cps_n +
ideology + union_d, data = ces2019)
Coefficients:
(Intercept) prairiesTRUE maritimesTRUE
28.3114 -8.8188 1.4265
percep_economy_cps_n ideology union_dTRUE
16.0503 -2.6527 0.6189
Because the above model includes dummy variables for the Prairies and the Maritimes, the reference category is the provinces other than those in the Prairies and the Maritimes. Therefore, the coefficient on prairiesTRUE indicates the average difference in trudeau_therm_cps between the respondents who lived in the Prairies and those who lived in the remaining provinces, and the coefficient on maritimesTRUE indicates the analogous difference for the Maritimes, controlling for the other variables.
The coefficients on prairiesTRUE and maritimesTRUE may be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 8.82 points lower for those respondents who lived in the Prairies than those who lived in the provinces other than the Prairies and the Maritimes, controlling for the perception of economy, political ideology, and union membership.”
“Feeling thermometer about Trudeau is, on average, 1.43 points higher for those respondents who lived in the Maritimes than those who lived in the provinces other than the Prairies and the Maritimes, holding the perception of economy, political ideology, and union membership constant.”
A caveat here is that, although it is not apparent in the output presented, the number of observations included differs between these simple and multiple linear regressions. We will come back to this issue later.↩︎
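You can check this yourself: lm() drops any row with a missing value on any variable in the model, so models with more variables typically use fewer observations. The nobs() function reports how many observations a fitted model actually used. A minimal sketch, assuming ces2019 is loaded as above; the object names are my own choices.

```r
# lm() applies listwise deletion: a row is dropped if it has a missing
# value on ANY variable in the model, so adding control variables with
# missing values shrinks the sample the model is fit on.
fit_simple   <- lm(trudeau_therm_cps ~ province, data = ces2019)
fit_multiple <- lm(trudeau_therm_cps ~ province + percep_economy_cps_n +
                     ideology + union_d, data = ces2019)
nobs(fit_simple)    # number of observations used by the simple model
nobs(fit_multiple)  # usually smaller if the added variables have NAs
```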