7 Multiple Linear Regression
Before you start working on this chapter, you need to do the following. If you need help with any of these steps, see Section 3.1.
Launch RStudio.
Load POL232.RData into your current R session.
Prepare an R script to save all your work in this chapter. I suggest you name it "POL232_Week#_YourLastName.R" or "POL232_Tutorial#_YourLastName.R", in which # is the number of the current week or tutorial session.
You also need to load the tidyverse package into your current R session (Section 1.4.2).
I suggest you actually write the R functions used below in your R script instead of copying and pasting them. See Section 3.1.5 for why.
I also suggest you sufficiently annotate your R script (i.e., leaving notes after the # sign) so that you can use your R script as your reference when you work on tutorial exercises or data analysis paper assignments. In other words, this R script will be your notes for this chapter.
7.1 Overview
While we have started learning statistical inference in lectures, in this chapter we still treat a linear regression model as a method for descriptive statistics only. We will cover statistical inference for linear regression in later chapters.
This chapter will introduce i) how to fit a multiple linear regression model with two or more independent variables on the right-hand side of the linear regression equation, and ii) dummification of a categorical independent variable in a linear regression model.
Note that we did not cover the second topic — dummification of a categorical independent variable — in lectures. I want you to learn this topic from this chapter. If you need an additional reference for this topic, you may read Kellstedt and Whitten’s Chapter 11.2.2 (3rd Edition, pp.251-254) or Chapter 10.2.2 (2nd Edition, pp.225-227). Recall that the 2nd edition of this book is available online from the UofT library.
7.2 Linear Regression for Multivariate Analysis
7.2.1 Prepare Variables
As in Chapter 6, let's create percep_economy_cps_n, a numeric version of percep_economy_cps, and union_d, a logical version of union.
ces2019 <- ces2019 |>
  mutate(percep_economy_cps = fct_relevel(percep_economy_cps,
                                          "(2) Worse", "(3) About the same", "(1) Better")) |>
  mutate(percep_economy_cps = fct_recode(percep_economy_cps,
                                         "Worse" = "(2) Worse",
                                         "Same" = "(3) About the same",
                                         "Better" = "(1) Better")) |>
  mutate(percep_economy_cps_n = as.numeric(percep_economy_cps))
ces2019 <- ces2019 |> # Change the name of categories of union to TRUE and FALSE.
  mutate(union_d = fct_recode(union, "TRUE" = "(1) Yes", "FALSE" = "(2) No")) |>
  mutate(union_d = as.logical(union_d)) # Then, apply the as.logical() function.
7.2.2 Multiple Linear Regression
Recall that we used the lm() function to fit a simple linear regression model. For example, the code below fits a simple linear regression model of trudeau_therm_cps on percep_economy_cps_n.
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n
8.128 18.982
To fit a multiple linear regression model, we continue to use the lm() function, but now we need to specify multiple independent variables on the right-hand side of the equation in the formula argument. More specifically, suppose that the model we want to fit is given by the following formula, in which we have three independent variables, \(X_1\), \(X_2\), and \(X_3\).
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + u \tag{7.1}\]
For this model, we specify the formula argument in the lm() function as formula = Y ~ X1 + X2 + X3. That is, ignoring the coefficients, we connect all the independent variables with a + sign on the right-hand side of the tilde, ~.
Let's fit a multiple linear regression model of trudeau_therm_cps on three independent variables, percep_economy_cps_n, ideology, and union_d, where percep_economy_cps_n corresponds to \(X_1\), ideology corresponds to \(X_2\), and union_d corresponds to \(X_3\) in Equation 7.1.
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology +
union_d, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n ideology
24.7569 17.1926 -2.7193
union_dTRUE
0.8752
As before, we can see the list of computed coefficients at the bottom of the output. The number below (Intercept) is the intercept of the linear regression model, an estimate of \(\beta_0\) in Equation 7.1. The number below percep_economy_cps_n is the coefficient on the percep_economy_cps_n variable, an estimate of \(\beta_1\) in Equation 7.1. The number below ideology is the coefficient on the ideology variable, an estimate of \(\beta_2\). Finally, the number below union_dTRUE is the coefficient on the union_d variable, an estimate of \(\beta_3\).
If we have more independent variables than shown in this example, we simply add them on the right-hand side of the tilde, ~, in the formula argument, connecting each of them with a + sign.
In the above example, the coefficient on percep_economy_cps_n differs between the simple and multiple linear regression models. This is because the coefficient on percep_economy_cps_n in the multiple linear regression model reflects the relationship between trudeau_therm_cps and percep_economy_cps_n controlling for the other two variables included in the model, ideology and union_d. That is, this coefficient indicates how much trudeau_therm_cps changes, on average, as percep_economy_cps_n increases by one unit, holding ideology and union_d constant.
In general, the coefficient on the same variable may differ between simple and multiple linear regression models if one or more of the control variables included in the multiple linear regression model are confounding variables for the relationship between the dependent variable and the main independent variable of our interest. In the present example, ideology, union_d, or both are confounding variables for the relationship between trudeau_therm_cps and percep_economy_cps_n. However, the change in the coefficient on percep_economy_cps_n between the simple linear regression model, 18.98, and the multiple linear regression model, 17.19, seems to be small, indicating that the confounding effect of these variables is relatively small in this sample.
7.3 Dummifying Ordinal Categorical Independent Variable
In the analysis above, we transformed percep_economy_cps into a numeric variable before we used it as an independent variable in the lm() function. Even without this transformation, R allows the use of a factor as an independent variable in the lm() function, but the result differs from the one we saw before for percep_economy_cps_n in Section 6.2.1 and Section 7.2.
In the code below, I use percep_economy_cps instead of percep_economy_cps_n as the only independent variable in a linear regression of trudeau_therm_cps.
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
25.13 23.93 36.76
In the output above, you see that percep_economy_cps is divided into two variables, percep_economy_cpsSame and percep_economy_cpsBetter, and a coefficient is reported for each of them. What R did here is dummify this categorical variable.
Dummifying a categorical variable means that each category of the categorical variable (except for one, which is called the reference category) is transformed into a dummy variable for that category. In the current example, from percep_economy_cps, which has three categories, R created two dummy variables, percep_economy_cpsSame and percep_economy_cpsBetter, indicating percep_economy_cps = Same and percep_economy_cps = Better, respectively (Worse is used as the reference category).
Table 7.1 below shows how the values and categories of these three variables correspond to each other.
| percep_economy_cps | Worse | Same | Better |
|---|---|---|---|
| percep_economy_cpsSame | 0 | 1 | 0 |
| percep_economy_cpsBetter | 0 | 0 | 1 |
As shown in this table, percep_economy_cpsSame is a dummy variable whose value equals 1 if the value of percep_economy_cps is Same and 0 if the value of percep_economy_cps is Worse or Better.
Similarly, percep_economy_cpsBetter is a dummy variable whose value equals 1 if the value of percep_economy_cps is Better and 0 if the value of percep_economy_cps is Worse or Same.
In the process of dummifying a categorical variable, a dummy variable is not created for one of the categories, which is used as the reference category (you will see why this is necessary later). In the example here, Worse is used as the reference category, and therefore a dummy variable is not created for this category.
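Although we keep R's default choice in this chapter, it may help to know that the reference category can be changed by reordering the factor levels before fitting the model. Below is a minimal sketch with a synthetic factor (not the ces2019 data), using base R's relevel():

```r
# A synthetic three-level factor; the first level is the default reference.
x <- factor(c("Worse", "Same", "Better"),
            levels = c("Worse", "Same", "Better"))
levels(x)   # "Worse" "Same" "Better": Worse would be the reference

# relevel() moves a chosen level to the first position.
x2 <- relevel(x, ref = "Same")
levels(x2)  # "Same" "Worse" "Better": Same would now be the reference
```

If x2 were used in lm(), the dummy variables would be created for Worse and Better instead.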
When dummifying a categorical variable stored as a factor, R uses the first level (category) of the variable as the reference category. Recall that we can check the levels of a factor with the levels() function.
levels(ces2019$percep_economy_cps)
[1] "Worse" "Same" "Better"
As the first level of percep_economy_cps is Worse, this level is used as the reference category, and dummy variables are created only for Same and Better, named percep_economy_cpsSame and percep_economy_cpsBetter using the name of the variable and the name of each level. Because one category is used as the reference category, if a categorical variable has L categories (L here is an arbitrary number), then dummy variables are created for L-1 categories. In the present example, there are three categories (L = 3) for percep_economy_cps; therefore, only two dummy variables (L-1 = 3-1 = 2) are created for this variable.
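To see this dummification directly, we can inspect the design matrix that lm() builds internally. Below is a small illustration with a synthetic factor (not the ces2019 data); model.matrix() is the base R function that performs this expansion:

```r
# A synthetic factor with L = 3 categories.
x <- factor(c("Worse", "Same", "Better", "Same"),
            levels = c("Worse", "Same", "Better"))

# model.matrix() expands x into an intercept plus L - 1 = 2 dummy columns;
# the first level, Worse, is the reference and gets no column of its own.
mm <- model.matrix(~ x)
colnames(mm)  # "(Intercept)" "xSame" "xBetter"
```

Each row of mm contains a 1 in the dummy column matching that observation's category, and all zeros in the dummy columns for an observation in the reference category.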
7.3.1 Fitted Values from the Linear Regression with Dummified Variables from Categorical Independent Variable
To understand further a linear regression model with the dummified variables created from a categorical independent variable, let's derive the fitted values of trudeau_therm_cps for each category of percep_economy_cps. Recall that a fitted value (or a predicted value) is the value of \(\widehat{Y}\) derived from a linear regression equation (e.g., \(\widehat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2\)) by substituting specific values of the independent variables (e.g., fixing \(X_1\) and \(X_2\) at certain values).
Below I have reproduced the output of the linear regression of trudeau_therm_cps on percep_economy_cps.
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
25.13 23.93 36.76
Based on this result, the fitted values of trudeau_therm_cps can be computed by the following formula.
\[ \begin{aligned} {\color{blue}25.13}~+~{\color{red}23.93}~\times &percep\_economy\_cpsSame\\ &+~{\color{violet}36.76}~\times percep\_economy\_cpsBetter \end{aligned} \tag{7.2}\]
If percep_economy_cps = Worse, then both percep_economy_cpsSame and percep_economy_cpsBetter must be zero (see Table 7.1). Therefore, substituting \(0\) for both variables in Equation 7.2, the fitted value of trudeau_therm_cps when percep_economy_cps = Worse is given by \({\color{blue}25.13}~+~{\color{red}23.93}\times0+{\color{violet}36.76}\times0 = {\color{blue}25.13}\). Because both dummy variables are zero, the fitted value of trudeau_therm_cps when percep_economy_cps = Worse is equal to the intercept of the linear regression model in Equation 7.2. This is the reason why no dummy variable is created for the reference category: this category doesn't need its own dummy variable, since the intercept of the linear regression model already represents its fitted value.
If percep_economy_cps = Same, then percep_economy_cpsSame = 1 and percep_economy_cpsBetter = 0 (see Table 7.1). Therefore, substituting these values in Equation 7.2, the fitted value of trudeau_therm_cps is given by \({\color{blue}25.13}~+~{\color{red}23.93}\times1+{\color{violet}36.76}\times0 = {\color{blue}25.13}~+~{\color{red}23.93} = {\color{red}49.06}\). This result implies that the coefficient on percep_economy_cpsSame, \({\color{red}23.93}\), indicates the difference in the fitted values of trudeau_therm_cps between when percep_economy_cps is Same and when percep_economy_cps is its reference category, Worse (i.e., \({\color{red}49.06} - {\color{blue}25.13} = {\color{red}23.93}\)).
Similarly, if percep_economy_cps = Better, then percep_economy_cpsSame = 0 and percep_economy_cpsBetter = 1 (see Table 7.1). Therefore, substituting these values in Equation 7.2, the fitted value of trudeau_therm_cps is given by \({\color{blue}25.13}~+~{\color{red}23.93}\times0+{\color{violet}36.76}\times1 = {\color{blue}25.13}~+~{\color{violet}36.76} = {\color{violet}61.89}\). Again, the coefficient on percep_economy_cpsBetter, \({\color{violet}36.76}\), indicates the difference in the fitted values between when percep_economy_cps is Better and when percep_economy_cps is its reference category, Worse (i.e., \({\color{violet}61.89} - {\color{blue}25.13} = {\color{violet}36.76}\)).
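The arithmetic above can be reproduced in R. The sketch below hardcodes the coefficients reported in the output for illustration (with the fitted model object, they could instead be extracted with coef()):

```r
# Coefficients from the regression output above.
b0       <- 25.13  # (Intercept): fitted value for the reference category, Worse
b_same   <- 23.93  # coefficient on percep_economy_cpsSame
b_better <- 36.76  # coefficient on percep_economy_cpsBetter

b0 + b_same * 0 + b_better * 0  # fitted value for Worse:  25.13
b0 + b_same * 1 + b_better * 0  # fitted value for Same:   49.06
b0 + b_same * 0 + b_better * 1  # fitted value for Better: 61.89
```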
7.3.2 How to Interpret the Linear Regression Coefficients on Dummified Variables from Categorical Independent Variable
More generally, we can say that the coefficient on a dummy variable created for a certain category of a categorical independent variable (say, \(X = a\)) indicates the difference between the fitted value (\(\widehat{Y}\)) for this category (\(X = a\)) and that for the reference category (\(X =\) reference category).
In other words, the coefficient on a dummy variable for a certain category of a categorical independent variable (\(X = a\)) tells us how the value of \(Y\) for the observations with \(X = a\) differs, on average, from the value of \(Y\) for those with \(X =\) reference category.
For example, in Section 7.3.1, we saw that the coefficient on percep_economy_cpsSame (\({\color{red}23.93}\)) indicates the difference in the fitted values of trudeau_therm_cps between when percep_economy_cps is Same and when percep_economy_cps is the reference category, Worse (\({\color{red}49.06} - {\color{blue}25.13} = {\color{red}23.93}\)). This coefficient, \({\color{red}23.93}\), can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Same is, on average, \({\color{red}23.93}\) points higher than for those whose perception of economy is the reference category, Worse.
We also saw that the coefficient on percep_economy_cpsBetter (\({\color{violet}36.76}\)) indicates the difference in the fitted values between when percep_economy_cps is Better and when percep_economy_cps is the reference category, Worse (\({\color{violet}61.89} - {\color{blue}25.13} = {\color{violet}36.76}\)). This coefficient, \({\color{violet}36.76}\), can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Better is, on average, \({\color{violet}36.76}\) points higher than for those whose perception of economy is the reference category, Worse.
As we saw in these examples, the coefficient on the dummy variable for a certain category of a categorical independent variable (\(X = a\)) should be understood as the difference between the average value of \(Y\) for this category (\(X = a\)) and the average value of \(Y\) for the reference category (\(X =\) reference category). This is the reason why the reference category is named that way.
As the above discussion has made clear, the coefficient on the dummy variable for a certain category of a categorical independent variable (\(X = a\)) gives us the comparison between the chosen category (\(X = a\)) and the reference category (\(X =\) reference category).
If we want to compare the chosen category (\(X = a\)) with another category that is not the reference category (say, \(X = b\)), then we need to compute the difference between their coefficients.
For example, suppose that we want to compare the average Trudeau thermometer between the respondents whose perception of economy is Better and those whose perception of economy is Same. The comparison is made by taking the difference of the fitted values of trudeau_therm_cps when percep_economy_cps is Better and when percep_economy_cps is Same. As we have seen, the fitted value of trudeau_therm_cps when percep_economy_cps is Better is \({\color{blue}25.13}~+~ {\color{violet}36.76} = {\color{violet}61.89}\), and the fitted value of trudeau_therm_cps when percep_economy_cps is Same is \({\color{blue}25.13}~+~{\color{red}23.93} = {\color{red}49.06}\). Therefore, the difference of these fitted values is \({\color{violet}61.89} - {\color{red}49.06} = {\color{brown}12.83}\).
However, this is equivalent to the difference between the coefficient on percep_economy_cpsBetter, \({\color{violet}36.76}\), and the coefficient on percep_economy_cpsSame, \({\color{red}23.93}\) (\({\color{violet}36.76} - {\color{red}23.93} = {\color{brown}12.83}\)). Therefore, we could have simply computed the difference between these two coefficients in the first place.
This result can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Better is, on average, \({\color{brown}12.83}\) points higher than for those whose perception of economy is Same.
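This comparison is a one-line computation from the coefficients reported above (hardcoded here for illustration):

```r
b_same   <- 23.93  # coefficient on percep_economy_cpsSame
b_better <- 36.76  # coefficient on percep_economy_cpsBetter
b_better - b_same  # Better vs. Same difference: 12.83
```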
7.3.3 With Additional Independent Variables
In Section 7.3.2, we included only the dummy variables for a categorical independent variable on the right-hand side of the linear regression model. We can also include additional independent variables on the right-hand side. For example, below I add ideology and union_d to the right-hand side of the linear regression model of trudeau_therm_cps on percep_economy_cps.
lm(formula = trudeau_therm_cps ~ percep_economy_cps + ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps + ideology +
union_d, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
39.3833 22.0875 33.8127
ideology union_dTRUE
-2.6315 0.7463
As before, in this linear regression analysis, percep_economy_cps is dummified into two dummy variables, percep_economy_cpsSame and percep_economy_cpsBetter. The interpretation of the coefficients on these dummy variables is the same as before, except that the other variables, ideology and union_d, are now controlled for.
The coefficient on percep_economy_cpsSame is \({\color{red}22.09}\). This can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Same is, on average, \({\color{red}22.09}\) points higher than for those whose perception of economy is the reference category, Worse, controlling for the respondents' political ideology and union membership.
The coefficient on percep_economy_cpsBetter is \({\color{violet}33.81}\). This coefficient can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Better is, on average, \({\color{violet}33.81}\) points higher than for those whose perception of economy is the reference category, Worse, holding their political ideology and union membership constant.
If we want to compare the respondents whose perception of economy is Better and those whose perception of economy is Same, we need to compute the difference between the coefficient on percep_economy_cpsBetter and the coefficient on percep_economy_cpsSame (\({\color{violet}33.81} - {\color{red}22.09} = {\color{brown}11.72}\)). As the difference is \({\color{brown}11.72}\), it can be interpreted to mean that the Trudeau thermometer for the respondents whose perception of economy is Better is, on average, \({\color{brown}11.72}\) points higher than for those whose perception of economy is Same, with the respondents' political ideology and union membership held constant.
7.3.4 Difference from Numerical Version
Comparing the linear regression results for the numeric version and the dummified version of percep_economy_cps, we can see the difference between these two specifications.
For your quick reference, I have reproduced the outputs of the lm() function for a simple linear regression of trudeau_therm_cps with both versions of percep_economy_cps below.
(1) With the numeric version of percep_economy_cps
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n
8.128 18.982
(2) With the dummified version of percep_economy_cps
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
25.13 23.93 36.76
In the result of the simple linear regression model with the numeric version, percep_economy_cps_n, the average difference in the Trudeau thermometer is \(18.98\) points both between Better and Same and between Same and Worse. On the other hand, in the linear regression model with the dummified version, the average difference in the Trudeau thermometer is \({\color{violet}36.76} - {\color{red}23.93} = {\color{brown}12.83}\) points between Better and Same, while it is \({\color{red}23.93}\) points between Same and Worse.
Similarly, I have reproduced the outputs of the lm() function for a multiple linear regression of trudeau_therm_cps with both versions of percep_economy_cps below.
(1) With the numeric version of percep_economy_cps
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology +
union_d, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n ideology
24.7569 17.1926 -2.7193
union_dTRUE
0.8752
(2) With the dummified version of percep_economy_cps
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps + ideology +
union_d, data = ces2019)
Coefficients:
(Intercept) percep_economy_cpsSame percep_economy_cpsBetter
39.3833 22.0875 33.8127
ideology union_dTRUE
-2.6315 0.7463
In the result of the multiple linear regression model with the numeric version, percep_economy_cps_n, the average difference in the Trudeau thermometer is \(17.19\) points both between Better and Same and between Same and Worse. On the other hand, in the result with the dummified version, the average difference in the Trudeau thermometer is \({\color{violet}33.81} - {\color{red}22.09} = {\color{brown}11.72}\) points between Better and Same, while it is \({\color{red}22.09}\) points between Same and Worse.
When we used the numeric version, percep_economy_cps_n, the difference in the Trudeau thermometer is the same whether we consider the difference between Better and Same or that between Same and Worse. On the other hand, when we used the dummified version, the difference in the Trudeau thermometer differs between the comparison of Better with Same and the comparison of Same with Worse.
In general, if we use a numeric version of an ordinal categorical variable, the average difference in the dependent variable between one category and its adjacent category is constrained to be the same for all pairs of adjacent categories, whereas if we use the dummified version of this variable, this difference is allowed to vary across categories. Hence, dummification may be considered a more flexible modeling strategy for an ordinal categorical variable than the use of a numeric version of this variable. This is a modeling choice we need to make when we include an ordinal categorical variable as an independent variable in our linear regression model.
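This difference in flexibility can be seen in a small simulation. The sketch below uses synthetic data (not ces2019), with a true step of 5 points between the first two categories and 15 points between the last two; the numeric coding forces one common slope, while the dummified version can recover the unequal steps:

```r
set.seed(1)
# An ordinal factor with three categories and deliberately unequal true steps.
x <- factor(sample(c("Low", "Mid", "High"), 300, replace = TRUE),
            levels = c("Low", "Mid", "High"))
y <- c(Low = 0, Mid = 5, High = 20)[as.character(x)] + rnorm(300)

coef(lm(y ~ as.numeric(x)))  # one slope: equal steps imposed between adjacent categories
coef(lm(y ~ x))              # xMid near 5, xHigh near 20: unequal steps allowed
```

The dummified fit reports each category's average difference from the reference (Low) separately, so the Mid and High steps need not be multiples of one common slope.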
7.4 Dummifying Nominal Categorical Independent Variable: Questions
For an ordinal categorical variable, we can consider dummifying it or transforming it into a numeric variable as alternatives, but for a nominal categorical variable, dummifying may be the only appropriate option. As the interpretation of the results remains the same as in Section 7.3, instead of repeating the explanation of how to interpret a model, I want you to work on the following questions on dummifying a nominal categorical independent variable. Answers to these questions will be made available in Section 7.5 after all tutorial sections have covered this chapter.
Let's use the province of residence of the respondents (province) in the ces2019 data frame as an example.
Question 1
What will be the reference category when we dummify province on the right-hand side of a linear regression model using the lm() function? Use the levels() function to find out.
Question 2
Fit a linear regression model of trudeau_therm_cps on province using the lm() function. Interpret the intercept and the coefficient on provinceON.
Question 3
Fit a linear regression model of trudeau_therm_cps on province, percep_economy_cps_n, ideology, and union_d using the lm() function. Interpret the coefficients on provinceON, provinceSK, and percep_economy_cps_n, respectively.
Question 4
Dummifying all categories except one is not the only way to create a dummy variable from a nominal categorical variable. An alternative is to create a dummy variable only for a single category or a subset of categories of a nominal categorical variable. In this question, we consider a single dummy variable for a subset of categories of province.
Suppose we theorized that the respondents' feelings about the incumbent Liberal prime minister are different between the Prairies (Alberta, Manitoba, and Saskatchewan) and the rest of the provinces. In this case, we want to dummify province such that prairies = 1 if the respondents lived in these provinces and prairies = 0 otherwise. We can create such a variable with the ifelse() function introduced in Section 6.3.4. Read that section again to review the basics of the ifelse() function. For the current purpose, you may specify the condition in the ifelse() function as follows.
ces2019 <- mutate(ces2019,
                  # Create a new variable "prairies" using the ifelse() function.
                  prairies = ifelse(province %in% c("AB","MB","SK"), "TRUE", "FALSE"))
# This ifelse() function assigns "TRUE" to observations if
# "province" is "AB", "MB" or "SK" but assigns "FALSE" otherwise.

# Then, we transform prairies into a logical variable.
ces2019 <- mutate(ces2019, prairies = as.logical(prairies))
In the above code, the condition for the ifelse() function is specified as province %in% c("AB","MB","SK"), which means that "province is either AB or MB or SK." If you want to specify the condition that a certain variable, say var, equals either A or B or C or D, you can use %in% to write the condition as var %in% c("A","B","C","D").
Let's check whether prairies was created as we wanted.
table(ces2019$prairies, ces2019$province, useNA = "always")
AB BC MB NB NL NS ON PE QC SK <NA>
FALSE 0 804 0 201 201 200 807 201 802 0 0
TRUE 282 0 261 0 0 0 0 0 0 262 0
<NA> 0 0 0 0 0 0 0 0 0 0 0
Fit a linear regression model of trudeau_therm_cps on prairies, percep_economy_cps_n, ideology, and union_d using the lm() function.
Interpret the coefficient on prairiesTRUE.
Question 5
We may also use multiple dummy variables corresponding to different subsets of a nominal categorical variable. Suppose we theorized that, in addition to the Prairies, the respondents’ feelings about the incumbent Liberal prime minister may also be different in the Maritimes (New Brunswick, Nova Scotia, and Prince Edward Island) from the rest of the provinces.
Create a dummy variable for the Maritimes, named maritimes, using the ifelse() function. Then, fit a linear regression model of trudeau_therm_cps on prairies, maritimes, percep_economy_cps_n, ideology, and union_d using the lm() function.
Interpret the coefficients on prairiesTRUE and maritimesTRUE, respectively.
7.5 Dummifying Nominal Categorical Independent Variable: Answers
Answers to the questions posed in Section 7.4 will be made available here after all tutorial sections have covered this chapter. Plan to come back and review these answers.
Let's use the province of residence of the respondents (province) in the ces2019 data frame as an example.
Question 1
What will be the reference category when we dummify province on the right-hand side of a linear regression model using the lm() function? Use the levels() function to find out.
Below is a list of levels of this variable.
levels(ces2019$province)
[1] "AB" "BC" "MB" "NB" "NL" "NS" "ON" "PE" "QC" "SK"
Because the first level is AB, Alberta will be used as the reference category.
Question 2
Fit a linear regression model of trudeau_therm_cps on province using the lm() function. Interpret the intercept and the coefficient on provinceON.
.
We fit a linear regression model of trudeau_therm_cps on province as below.
lm(formula = trudeau_therm_cps ~ province, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ province, data = ces2019)
Coefficients:
(Intercept) provinceBC provinceMB provinceNB provinceNL provinceNS
24.910 17.566 12.487 21.535 22.019 23.955
provinceON provincePE provinceQC provinceSK
24.083 24.600 24.225 3.426
The intercept is the fitted value for the reference category of this variable (province). Since the reference category is Alberta, the intercept is the fitted value for Alberta. The intercept in the above linear regression result is \(24.91\). This may be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 24.91 points for those respondents who lived in Alberta.”
The coefficient on a dummy variable for one category of a nominal categorical variable represents the difference in the average value of the dependent variable between the observations in this category and those in the reference category. Because the coefficient on provinceON is \(24.08\) and the reference category is Alberta, this coefficient may be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 24.08 points higher for those respondents who lived in Ontario than those who lived in Alberta.”
Question 3
Fit a linear regression model of trudeau_therm_cps on province, percep_economy_cps_n, ideology, and union_d using the lm() function. Interpret the coefficients on provinceON, provinceSK, and percep_economy_cps_n, respectively.
We fit a linear regression model of trudeau_therm_cps on province, percep_economy_cps_n, ideology, and union_d as below.
lm(formula = trudeau_therm_cps ~ province + percep_economy_cps_n
+ ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ province + percep_economy_cps_n +
ideology + union_d, data = ces2019)
Coefficients:
(Intercept) provinceBC provinceMB
15.2848 11.7557 10.8040
provinceNB provinceNL provinceNS
16.9362 18.4670 12.1679
provinceON provincePE provinceQC
15.3387 15.8607 12.3670
provinceSK percep_economy_cps_n ideology
3.4424 15.8163 -2.6628
union_dTRUE
0.5959
Because there are additional control variables, the coefficient on a dummy variable for one category of a nominal categorical variable now represents the difference in the average value of the dependent variable between the observations in this category and those in the reference category, controlling for the other variables in the model.
The coefficient on provinceON can be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 15.34 points higher for those respondents who lived in Ontario than those who lived in Alberta, controlling for the perception of economy, political ideology, and union membership.”
The coefficient on provinceSK can be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 3.44 points higher for those respondents who lived in Saskatchewan than those who lived in Alberta, holding the perception of economy, political ideology, and union membership constant.”
The coefficient on percep_economy_cps_n can be interpreted as below.
“Feeling thermometer about Trudeau is, on average, 15.82 points higher for those respondents whose perception of economy is better (the same) than those whose perception of economy is the same (worse), controlling for the provinces of residence, political ideology, and union membership.”
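If you store the fitted model in an object, you can pull out individual coefficients with coef() instead of reading them off the printed output. This is a minimal sketch; the object name fit is my own choice, and it assumes ces2019 is loaded in your session as above.

```r
# Store the fitted model, then extract named coefficients from it.
fit <- lm(trudeau_therm_cps ~ province + percep_economy_cps_n +
            ideology + union_d, data = ces2019)
coef(fit)["provinceON"]            # 15.3387 in the output above
coef(fit)["percep_economy_cps_n"]  # 15.8163 in the output above
```

Storing the model also lets you reuse it later (e.g. for summaries or predictions) without refitting.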
Question 4
Dummifying all categories except one is not the only way to create dummy variables from a nominal categorical variable. An alternative is to create a dummy variable for only a single category or a subset of categories of a nominal categorical variable. In this question, we consider a single dummy variable for a subset of categories of province.
Suppose we theorized that the respondents’ feelings about the incumbent Liberal prime minister are different between the Prairies (Alberta, Manitoba, and Saskatchewan) and the rest of the provinces. In this case, we want to dummify province such that prairies = 1 if the respondent lived in these provinces and prairies = 0 otherwise. We can create such a variable with the ifelse() function introduced in Section 6.3.4. Read that section again to review the basics of the ifelse() function. For the current purpose, you may specify the condition in the ifelse() function as follows.
ces2019 <- mutate(ces2019,
                  # Create a new variable "prairies" using the ifelse() function.
                  prairies = ifelse(province %in% c("AB","MB","SK"), "TRUE", "FALSE") )
                  # This ifelse() function assigns "TRUE" to observations if
                  # "province" is "AB", "MB" or "SK" but assigns "FALSE" otherwise.
# Then, we transform prairies into a logical variable.
ces2019 <- mutate(ces2019, prairies = as.logical(prairies) )
In the above code, the condition for the ifelse() function is specified as province %in% c("AB","MB","SK"), which means that “province is either AB or MB or SK.” If you want to specify a condition that a certain variable, say var, equals either A or B or C or D, you can use %in% to specify the condition as var %in% c("A","B","C","D").
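To see how %in% behaves, you can test it on a small vector. As an aside (a sketch, not required for this question), %in% itself already returns a logical vector, so the dummy could even be created in one step without ifelse() or as.logical().

```r
# %in% tests, element by element, whether each value on the left
# appears in the set on the right.
c("AB", "ON", "SK") %in% c("AB", "MB", "SK")
# [1]  TRUE FALSE  TRUE

# Because the result is already logical (TRUE/FALSE), an alternative
# one-step way to create the same dummy variable would be:
# ces2019 <- mutate(ces2019, prairies = province %in% c("AB", "MB", "SK"))
```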
Let’s check if prairies was created as we wanted.
table(ces2019$prairies, ces2019$province, useNA = "always")
AB BC MB NB NL NS ON PE QC SK <NA>
FALSE 0 804 0 201 201 200 807 201 802 0 0
TRUE 282 0 261 0 0 0 0 0 0 262 0
<NA> 0 0 0 0 0 0 0 0 0 0 0
Fit a linear regression model of trudeau_therm_cps on prairies, percep_economy_cps_n, ideology, and union_d using the lm() function. Interpret the coefficient on prairiesTRUE.
We fit a linear regression model of trudeau_therm_cps on prairies, percep_economy_cps_n, ideology, and union_d as below.
lm(formula = trudeau_therm_cps ~ prairies + percep_economy_cps_n
+ ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ prairies + percep_economy_cps_n +
ideology + union_d, data = ces2019)
Coefficients:
(Intercept) prairiesTRUE percep_economy_cps_n
28.6463 -9.0945 16.0341
ideology union_dTRUE
-2.6583 0.6047
Given this dummy variable for the Prairies, the reference category is all provinces other than the Prairies. Therefore, the coefficient on prairiesTRUE indicates the average difference in trudeau_therm_cps between the respondents who lived in the Prairies and those who lived in other provinces, controlling for the other variables.
The coefficient on prairiesTRUE may be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 9.09 points lower for those respondents who lived in the Prairies than those who lived in other provinces, controlling for the perception of economy, political ideology, and union membership.”
Question 5
We may also use multiple dummy variables corresponding to different subsets of the categories of a nominal categorical variable. Suppose we theorized that, in addition to the Prairies, the respondents’ feelings about the incumbent Liberal prime minister may also be different in the Maritimes (New Brunswick, Nova Scotia, and Prince Edward Island) from the rest of the provinces.
Create a dummy variable for the Maritimes, named maritimes, using the ifelse() function. Then, fit a linear regression model of trudeau_therm_cps on prairies, maritimes, percep_economy_cps_n, ideology, and union_d using the lm() function. Interpret the coefficients on prairiesTRUE and maritimesTRUE, respectively.
We create a new dummy variable maritimes as below.
ces2019 <- mutate(ces2019,
                  # Create a new variable "maritimes" using the ifelse() function.
                  maritimes = ifelse(province %in% c("NB","NS","PE"), "TRUE", "FALSE") )
                  # This ifelse() function assigns "TRUE" to observations if
                  # "province" is "NB", "NS" or "PE" but assigns "FALSE" otherwise.
# Then, we transform maritimes into a logical variable.
ces2019 <- mutate(ces2019, maritimes = as.logical(maritimes) )
Let’s check if maritimes was created as we wanted.
table(ces2019$maritimes, ces2019$province, useNA = "always")
AB BC MB NB NL NS ON PE QC SK <NA>
FALSE 282 804 261 0 201 0 807 0 802 262 0
TRUE 0 0 0 201 0 200 0 201 0 0 0
<NA> 0 0 0 0 0 0 0 0 0 0 0
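Optionally, you can also cross-tabulate the two dummy variables to confirm that they do not overlap. This is a quick sanity check of my own, assuming both prairies and maritimes have been created in ces2019 as above.

```r
# Cross-tabulate the two dummies; the TRUE/TRUE cell should be 0,
# since no province belongs to both the Prairies and the Maritimes.
table(ces2019$prairies, ces2019$maritimes)
```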
lm(formula = trudeau_therm_cps ~ prairies + maritimes + percep_economy_cps_n
+ ideology + union_d, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ prairies + maritimes + percep_economy_cps_n +
ideology + union_d, data = ces2019)
Coefficients:
(Intercept) prairiesTRUE maritimesTRUE
28.3114 -8.8188 1.4265
percep_economy_cps_n ideology union_dTRUE
16.0503 -2.6527 0.6189
Because the above model includes dummy variables for the Prairies and the Maritimes, the reference category is the provinces other than those in the Prairies and the Maritimes. Therefore, the coefficient on prairiesTRUE indicates the average difference in trudeau_therm_cps between the respondents who lived in the Prairies and those who lived in the remaining provinces, and the coefficient on maritimesTRUE indicates the analogous difference for the Maritimes, controlling for the other variables.
The coefficients on prairiesTRUE and maritimesTRUE may be interpreted as follows.
“Feeling thermometer about Trudeau is, on average, 8.82 points lower for those respondents who lived in the Prairies than those who lived in the provinces other than the Prairies and the Maritimes, controlling for the perception of economy, political ideology, and union membership.”
“Feeling thermometer about Trudeau is, on average, 1.43 points higher for those respondents who lived in the Maritimes than those who lived in the provinces other than the Prairies and the Maritimes, holding the perception of economy, political ideology, and union membership constant.”
A caveat here is that, although it is not apparent in the output presented, the number of observations included differs between these simple and multiple linear regressions. We will come back to this issue later.↩︎
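You can check this yourself: lm() drops any row with a missing value on any variable in the model, so models with more variables typically use fewer observations. The nobs() function reports how many observations a fitted model actually used. A minimal sketch, assuming ces2019 is loaded as above; the object names are my own choices.

```r
# lm() applies listwise deletion: a row is dropped if it has a missing
# value on ANY variable in the model, so adding control variables with
# missing values shrinks the sample the model is fit on.
fit_simple   <- lm(trudeau_therm_cps ~ province, data = ces2019)
fit_multiple <- lm(trudeau_therm_cps ~ province + percep_economy_cps_n +
                     ideology + union_d, data = ces2019)
nobs(fit_simple)    # number of observations used by the simple model
nobs(fit_multiple)  # usually smaller if the added variables have NAs
```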