8  Statistical Inference for Linear Regression

Before you start working on this chapter, you need to do the following. If you need help with any of these steps, see Section 3.1.

  library(tidyverse)

8.1 Prepare Variables

As in the previous chapters, create percep_economy_cps_n, which is a numeric version of percep_economy_cps, and union_d, a logical version of union.

  ces2019 <- ces2019 |> 
      mutate(percep_economy_cps = fct_relevel(percep_economy_cps, 
              "(2) Worse", "(3) About the same", "(1) Better") ) |> 
      mutate(percep_economy_cps = fct_recode(percep_economy_cps, 
                                "Worse" = "(2) Worse", 
                                "Same" = "(3) About the same", 
                                "Better" = "(1) Better") ) |> 
      mutate(percep_economy_cps_n = as.numeric(percep_economy_cps))
  ces2019 <-  ces2019 |> # Change the name of categories of union to TRUE and FALSE.
      mutate( union_d = fct_recode(union, "TRUE" = "(1) Yes", "FALSE" = "(2) No") ) |> 
      mutate( union_d = as.logical(union_d) )  # Then, apply the as.logical() function.
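
It is a good idea to verify that the recoding worked as intended. One quick check (a sketch using the count() function from the tidyverse) cross-tabulates each original variable with its recoded version:

  # Each recoded category should line up with exactly one original category.
  ces2019 |> count(percep_economy_cps, percep_economy_cps_n)
  ces2019 |> count(union, union_d)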

8.2 Estimate Linear Regression Models

First, let’s estimate a simple linear regression model of trudeau_therm_cps on percep_economy_cps_n.

  lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)

Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)

Coefficients:
         (Intercept)  percep_economy_cps_n  
               8.128                18.982  

We can assign the result of the lm() function to a new object. For example, I assigned the above result to an object called model1 below.

  model1 <- lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)

Then, we can see the linear regression result stored in model1 by typing the name of this new object in the Console.

  model1

Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)

Coefficients:
         (Intercept)  percep_economy_cps_n  
               8.128                18.982  

Let’s assign the result of a multiple linear regression of trudeau_therm_cps on percep_economy_cps_n, ideology, and union_d to a new object, named model2.

  model2 <- lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology + union_d, 
                      data = ces2019)

Again, we can access this result by typing model2 in the Console.

  model2

Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology + 
    union_d, data = ces2019)

Coefficients:
         (Intercept)  percep_economy_cps_n              ideology  
             24.7569               17.1926               -2.7193  
         union_dTRUE  
              0.8752  

There is more information in the result of the lm() function than what appears in the R console this way. We can access it, for example, with the summary() function. The basic syntax of the summary() function is to include the output of the lm() function as its argument (i.e., inside the parentheses).

  summary(model2)

Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology + 
    union_d, data = ces2019)

Residuals:
   Min     1Q Median     3Q    Max 
-71.77 -20.63   0.86  19.87  85.24 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           24.7569     2.2629  10.940   <2e-16 ***
percep_economy_cps_n  17.1926     0.7728  22.248   <2e-16 ***
ideology              -2.7193     0.2549 -10.668   <2e-16 ***
union_dTRUE            0.8752     1.3498   0.648    0.517    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.59 on 2342 degrees of freedom
  (1675 observations deleted due to missingness)
Multiple R-squared:  0.2416,    Adjusted R-squared:  0.2407 
F-statistic: 248.8 on 3 and 2342 DF,  p-value: < 2.2e-16

Most of the output here concerns statistical inference. We can interpret the result of statistical inference for a linear regression model from this output of the summary() function, which may be considered the default way to examine a linear regression result in R. I covered how to interpret this table in my lecture. Here I will introduce another function, summ(), from a package called jtools.

8.3 Summarize Linear Regression Results: summ()

Install the jtools package. Refer to Section 1.4.2 if you need a refresher on how to install a new package.

  install.packages("jtools")

Then, load jtools.

  library(jtools)

Here we use the summ() function from the jtools package to display the detailed results of the lm() function. This function gives us information similar to what the summary() function provides. Arguably, the output from the summ() function is more readable than the output from the summary() function.

The basic syntax of the summ() function is the same as that of the summary() function: include the output of the lm() function as its argument. Let's display the detailed information of the linear regression result stored in model2 using the summ() function.

  summ(model2)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression 

MODEL FIT:
F(3,2342) = 248.76, p = 0.00
R² = 0.24
Adj. R² = 0.24 

Standard errors: OLS
---------------------------------------------------------
                              Est.   S.E.   t val.      p
-------------------------- ------- ------ -------- ------
(Intercept)                  24.76   2.26    10.94   0.00
percep_economy_cps_n         17.19   0.77    22.25   0.00
ideology                     -2.72   0.25   -10.67   0.00
union_dTRUE                   0.88   1.35     0.65   0.52
---------------------------------------------------------

There are three sets of information in the above output of the summ() function. The first set of information at the top of the output is MODEL INFO. The most useful piece of information here is the number of observations included in the model and the number of missing observations (those with missing values for at least one of the variables included in the model) excluded from the estimation.

The second set of information, in the middle of the output, is MODEL FIT. I will not cover these model-fit statistics in this class, except for R-squared (\(R^2\)), which I will briefly cover in my lecture later in the semester.

The third set of information, at the bottom, is a linear regression table. The first column of the table lists the intercept and the variables in the model. Est. in the second column contains the point estimates of the linear regression coefficients for the intercept and the variables listed in the first column. S.E. in the third column contains the standard errors of those coefficients. Many academic articles on empirical research in political science report these statistics.

The information in the fourth and fifth columns concerns hypothesis testing. t val. in the fourth column is called a t-statistic or t-value, which is the point estimate of a coefficient divided by its standard error. Note that the t-statistic is optional in my lecture, as it may be too technical for this class.
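
If you are curious, you can reproduce the t val. column yourself. The sketch below divides each point estimate by its standard error, using the coefficient matrix stored in the summary() output.

  # summary() stores the coefficient table as a matrix;
  # estimate / standard error reproduces the "t val." column.
  coefs <- summary(model2)$coefficients
  coefs[, "Estimate"] / coefs[, "Std. Error"]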

p in the fifth column is a p-value, from which we can interpret the statistical significance of each linear regression coefficient.

We can also change the number of decimal places that appear in the table by specifying the digits argument. For example, if we specify digits = 4, it will display numbers up to the fourth decimal place, as below.

  summ(model2, digits = 4)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression 

MODEL FIT:
F(3,2342) = 248.7557, p = 0.0000
R² = 0.2416
Adj. R² = 0.2407 

Standard errors: OLS
-----------------------------------------------------------------
                                Est.     S.E.     t val.        p
-------------------------- --------- -------- ---------- --------
(Intercept)                  24.7569   2.2629    10.9405   0.0000
percep_economy_cps_n         17.1926   0.7728    22.2481   0.0000
ideology                     -2.7193   0.2549   -10.6683   0.0000
union_dTRUE                   0.8752   1.3498     0.6483   0.5168
-----------------------------------------------------------------

We can also display the confidence intervals for the linear regression coefficients by setting confint = TRUE.

  summ(model2, confint = TRUE)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression 

MODEL FIT:
F(3,2342) = 248.76, p = 0.00
R² = 0.24
Adj. R² = 0.24 

Standard errors: OLS
------------------------------------------------------------------
                              Est.    2.5%   97.5%   t val.      p
-------------------------- ------- ------- ------- -------- ------
(Intercept)                  24.76   20.32   29.19    10.94   0.00
percep_economy_cps_n         17.19   15.68   18.71    22.25   0.00
ideology                     -2.72   -3.22   -2.22   -10.67   0.00
union_dTRUE                   0.88   -1.77    3.52     0.65   0.52
------------------------------------------------------------------

In the above output, 2.5% in the third column is the lower bound of a 95% confidence interval, and 97.5% in the fourth column is the upper bound. Note that 97.5% minus 2.5% is 95%; together, the two bounds form a 95% confidence interval.
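
If you prefer base R, the confint() function returns the same bounds directly from the model object; a minimal sketch:

  # 95% confidence intervals for all coefficients (defaults to level = 0.95).
  confint(model2)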

We can change the confidence level with the ci.width argument. We need to specify the confidence level as a number between 0 and 1, rather than as a percentage. Below, I specified ci.width = 0.99 to derive the 99% confidence intervals.

  summ(model2, confint = TRUE, ci.width = 0.99)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression 

MODEL FIT:
F(3,2342) = 248.76, p = 0.00
R² = 0.24
Adj. R² = 0.24 

Standard errors: OLS
------------------------------------------------------------------
                              Est.    0.5%   99.5%   t val.      p
-------------------------- ------- ------- ------- -------- ------
(Intercept)                  24.76   18.92   30.59    10.94   0.00
percep_economy_cps_n         17.19   15.20   19.18    22.25   0.00
ideology                     -2.72   -3.38   -2.06   -10.67   0.00
union_dTRUE                   0.88   -2.60    4.35     0.65   0.52
------------------------------------------------------------------

As before, we can also change the number of decimal places displayed by specifying the digits argument.

  summ(model2, confint = TRUE, ci.width = 0.99, digits = 4)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression 

MODEL FIT:
F(3,2342) = 248.7557, p = 0.0000
R² = 0.2416
Adj. R² = 0.2407 

Standard errors: OLS
----------------------------------------------------------------------------
                                Est.      0.5%     99.5%     t val.        p
-------------------------- --------- --------- --------- ---------- --------
(Intercept)                  24.7569   18.9234   30.5905    10.9405   0.0000
percep_economy_cps_n         17.1926   15.2004   19.1847    22.2481   0.0000
ideology                     -2.7193   -3.3764   -2.0622   -10.6683   0.0000
union_dTRUE                   0.8752   -2.6046    4.3549     0.6483   0.5168
----------------------------------------------------------------------------

As I covered in my lecture, when a \((100 - \alpha)\)% confidence interval for a linear regression coefficient does not include zero, we say that this coefficient is statistically significant (or statistically distinguishable from zero) at the \(\alpha\)% significance level. In the above example, since the 99% confidence intervals for all linear regression coefficients except union_dTRUE do not include zero, all coefficients except union_dTRUE are statistically significant at the 1% significance level.
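
We can also carry out this check in code. The sketch below uses base R's confint() to flag the coefficients whose 99% confidence intervals exclude zero:

  # TRUE when the 99% confidence interval lies entirely above or below zero,
  # i.e., the coefficient is statistically significant at the 1% level.
  ci99 <- confint(model2, level = 0.99)
  ci99[, 1] > 0 | ci99[, 2] < 0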

We can also interpret statistical significance from the p-values presented in the last column, labeled p.

If the p-value is smaller than \(a\) (where \(a\) is a number between 0 and 1), this suggests that the coefficient is statistically significant (or statistically distinguishable from zero) at the (\(a \times 100\))% significance level. For example, if the p-value is smaller than 0.05, the coefficient is statistically significant at the 5% significance level. In the above table, the p-values for all coefficients except union_dTRUE are displayed as 0.0000, which means that they are smaller than 0.00005 and were therefore rounded to 0.0000. Since these p-values are smaller than 0.05, these coefficients are statistically significant at the 5% significance level.

Moreover, these coefficients are also statistically significant at the 1% significance level as their p-values are smaller than 0.01. They are also statistically significant at the 0.1% significance level because they are smaller than 0.001. They are statistically significant even at the 0.01% significance level as the p-values are smaller than 0.0001.

On the other hand, the p-value for union_dTRUE is 0.52, indicating that this coefficient is statistically insignificant (or not statistically distinguishable from zero) even at the 10% significance level.
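
As a companion check, the sketch below extracts the p-values from the summary() output and compares them to 0.05 directly:

  # TRUE for coefficients that are statistically significant at the 5% level.
  pvals <- summary(model2)$coefficients[, "Pr(>|t|)"]
  pvals < 0.05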

8.4 Linear Regression Table: export_summs()

We can also format the result of a linear regression into a nice table using the export_summs() function of the jtools package. To use this function, we also need the huxtable package.

First, install the huxtable package.

  install.packages("huxtable")

Then, load huxtable.

  library(huxtable)

Now we can use the export_summs() function from the jtools package. The basic syntax is the same as the summ() function.

  export_summs(model2)
---------------------------------------
                           Model 1
---------------------------------------
(Intercept)                24.76 ***
                           (2.26)
percep_economy_cps_n       17.19 ***
                           (0.77)
ideology                   -2.72 ***
                           (0.25)
union_dTRUE                 0.88
                           (1.35)
---------------------------------------
N                        2346
R2                          0.24
---------------------------------------
*** p < 0.001; ** p < 0.01; * p < 0.05.

This is similar to the popular format of a linear regression table we see in many political science research articles. As you can see in the table, the first column is a list of linear regression coefficients including an intercept. In the next column, we can see the point estimates of these coefficients. The numbers in parentheses are the standard errors for each coefficient. N is the number of observations included in the linear regression analysis, and R2 is the R-squared for this linear regression model (as I mentioned before, I will cover R-squared in a lecture later in the semester).

The star signs (asterisks) right next to each coefficient estimate indicate statistical significance of each coefficient. You can find the legend for these star signs at the bottom of the table, which states that

     *** p < 0.001, ** p < 0.01, and * p < 0.05.


p in this legend refers to the p-value.

This legend can be read as:

    *** indicates that the p-value is smaller than 0.001;
     ** indicates that the p-value is smaller than 0.01; and
      * indicates that the p-value is smaller than 0.05.


Therefore, if the coefficient is given the *** sign, this means that this coefficient is statistically significant at the 0.1% significance level;

if it is given **, this coefficient is statistically significant at the 1% significance level (but not statistically significant at the 0.1% significance level);

if it is given *, this coefficient is statistically significant at the 5% significance level (but not at the 1% significance level); and

if it is not given any stars, the coefficient is not statistically significant at the 5% or lower significance levels.

In the example above, all coefficients except for union_dTRUE are given ***, indicating that these coefficients are statistically significant at the 0.1% significance level. There is no star sign given to union_dTRUE, suggesting that this coefficient is not statistically significant even at the 5% significance level, the highest significance level reported in this table.

Below is another example, a linear regression of the Democratic candidate’s vote share in U.S. gubernatorial elections (ranney3_gub_prop) on the proportion of Democratic identifiers (democrat) in each state.

  lm(formula = ranney3_gub_prop*100 ~ democrat, data = usstates2010) |> 
    export_summs()
---------------------------------
                      Model 1
---------------------------------
(Intercept)           26.76 ***
                      (7.09)
democrat               0.66 **
                      (0.22)
---------------------------------
N                     49
R2                     0.16
---------------------------------
*** p < 0.001; ** p < 0.01; * p < 0.05.

In this table, the coefficient on democrat is given two stars (**), indicating that the coefficient on this variable is statistically significant at the 1% significance level (but not at the 0.1% significance level).

The export_summs() function can also create a table that juxtaposes simple and multiple linear regression models. For example, the following code creates a table with both the simple and multiple linear regression models for trudeau_therm_cps (recall that model1 is the simple linear regression model and model2 is the multiple linear regression model created earlier in Section 8.2).

  export_summs(model1, model2)
------------------------------------------------
                        Model 1     Model 2
------------------------------------------------
(Intercept)              8.13 ***   24.76 ***
                        (1.22)      (2.26)
percep_economy_cps_n    18.98 ***   17.19 ***
                        (0.61)      (0.77)
ideology                            -2.72 ***
                                    (0.25)
union_dTRUE                          0.88
                                    (1.35)
------------------------------------------------
N                       3859        2346
R2                       0.20        0.24
------------------------------------------------
*** p < 0.001; ** p < 0.01; * p < 0.05.

A table like this is convenient for comparing the coefficient estimates for a variable of interest across different models. In the present example, we may be interested in how the coefficient estimate for percep_economy_cps_n changes between the simple and multiple linear regression models.
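
As an aside, if you want more descriptive column headers than the default Model 1 and Model 2, the model.names argument relabels them. A minimal sketch, with labels of my choosing:

  # Label the columns "Simple" and "Multiple" instead of "Model 1"/"Model 2".
  export_summs(model1, model2,
               model.names = c("Simple", "Multiple"))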

We can replace the standard errors in the table with confidence intervals by specifying the following argument in the export_summs() function.

  export_summs(model1, model2,
               error_format = "[{conf.low}, {conf.high}]")
----------------------------------------------------------
                      Model 1            Model 2
----------------------------------------------------------
(Intercept)            8.13 ***          24.76 ***
                      [5.74, 10.52]      [20.32, 29.19]
percep_economy_cps_n  18.98 ***          17.19 ***
                      [17.79, 20.17]     [15.68, 18.71]
ideology                                 -2.72 ***
                                         [-3.22, -2.22]
union_dTRUE                               0.88
                                         [-1.77, 3.52]
----------------------------------------------------------
N                     3859               2346
R2                     0.20               0.24
----------------------------------------------------------
*** p < 0.001; ** p < 0.01; * p < 0.05.

In this table, the pair of numbers in square brackets below each coefficient estimate is the confidence interval for that coefficient. For example, [17.79, 20.17] below the coefficient estimate for percep_economy_cps_n in Model 1 is the 95% confidence interval for the coefficient on percep_economy_cps_n.

As before, we can change the confidence level using the ci.width argument and the number of decimal places using the digits argument. In the code below, I set the confidence level to 99% and the number of decimal places to four.

  export_summs(model1, model2,
               error_format = "[{conf.low}, {conf.high}]", 
               ci.width = 0.99, digits = 4)
------------------------------------------------------------------
                      Model 1                Model 2
------------------------------------------------------------------
(Intercept)            8.1281 ***            24.7569 ***
                      [5.7399, 10.5162]      [20.3195, 29.1944]
percep_economy_cps_n  18.9823 ***            17.1926 ***
                      [17.7940, 20.1705]     [15.6772, 18.7080]
ideology                                     -2.7193 ***
                                             [-3.2191, -2.2194]
union_dTRUE                                   0.8752
                                             [-1.7718, 3.5222]
------------------------------------------------------------------
N                     3859                   2346
R2                     0.2028                 0.2416
------------------------------------------------------------------
*** p < 0.001; ** p < 0.01; * p < 0.05.

8.5 Use Same Subset of Data Across Different Models: drop_na()

In the linear regression table presented above, which juxtaposed the results of the simple and multiple linear regression models of trudeau_therm_cps, you might have noticed that the number of observations differs between the models. In particular, 3,859 observations are included in the simple linear regression model (Model 1 in the table), while the number of observations drops to 2,346 in the multiple linear regression model (Model 2). The reason for this relatively large difference is that the variables are based on questions asked in different portions of the 2019 Canadian Election Study. More precisely, percep_economy_cps_n is based on a question asked in the Campaign Period Survey (CPS), while ideology is based on a question asked in the Post Election Survey (PES). The PES was administered to a subset of the respondents included in the CPS.
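
Before restricting the sample, it can be useful to see how much missingness each variable contributes. A quick sketch using across() from the tidyverse counts the missing values in each model variable:

  # Count missing values for each variable used in the models.
  ces2019 |> 
    summarise(across(c(trudeau_therm_cps, percep_economy_cps_n, ideology, union_d),
                     ~ sum(is.na(.x))))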

Given that the difference in the number of observations is relatively large, we may want to restrict the observations included in the simple linear regression model to those included in the multiple linear regression model. We can restrict our sample with the drop_na() function. The following code creates a new data frame, named ev_data, which includes only the respondents who answered all the questions related to trudeau_therm_cps, percep_economy_cps_n, ideology, and union_d.

  ev_data <- ces2019 |> 
              drop_na(trudeau_therm_cps, percep_economy_cps_n, ideology, union_d)

Let’s reestimate the simple and multiple linear regression models of trudeau_therm_cps on percep_economy_cps_n, using this new data frame.

  model1b <- lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ev_data)
  model2b <- lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology + union_d, 
                data = ev_data)

Let’s create a table for the results of both linear regression models using the export_summs() function.

  export_summs(model1b, model2b,
               error_format = "[{conf.low}, {conf.high}]")
-----------------------------------------------------------
                      Model 1           Model 2
-----------------------------------------------------------
(Intercept)            7.79 ***         24.76 ***
                      [4.67, 10.91]     [20.32, 29.19]
percep_economy_cps_n  18.96 ***         17.19 ***
                      [17.44, 20.47]    [15.68, 18.71]
ideology                                -2.72 ***
                                        [-3.22, -2.22]
union_dTRUE                              0.88
                                        [-1.77, 3.52]
-----------------------------------------------------------
N                     2346              2346
R2                     0.20              0.24
-----------------------------------------------------------
*** p < 0.001; ** p < 0.01; * p < 0.05.

Now we have the same number of observations in both models. The number of observations for the simple linear regression model is now much smaller (2,346) than before (3,859). However, the reestimated simple linear regression result is not much different from the previous result, in which the sample was not restricted to respondents without missing values for ideology and union_d.

8.6 Export Linear Regression Table to MS-Word Document

We can also export the table produced above to an MS-Word document using the export_summs() function. For this purpose, we also need the officer and flextable packages.

First, install both the officer and flextable packages.

  install.packages( c("officer", "flextable") )

Then, load officer and flextable.

  library(officer)
  library(flextable)

We can export the table to an MS-Word document by specifying the to.file and file.name arguments in the export_summs() function. The to.file argument should be "docx", as we want to produce an MS-Word document. You can specify any file name for the MS-Word file in the file.name argument. Here I used "table.docx".

  export_summs(model1b, model2b,
               error_format = "[{conf.low}, {conf.high}]",
               to.file = "docx", file.name = "table.docx")

An MS-Word document named “table.docx” should be created in your working directory. Open this Word document to check its content; the same table produced above should appear there. As you can see, the table uses the variable names from R. You should change these names to the labels you use in your paper, which should also be easier for readers to interpret. For example, you may replace percep_economy_cps_n with “Perception of Economy” or “PercepEcon” and union_dTRUE with “Union Membership” or “UNION.” If you use abbreviations like “PercepEcon” and “UNION,” you should first define what they mean in your paper.
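
Rather than editing the variable names in the Word document by hand, you can also relabel the coefficients in R before exporting. The sketch below uses the coefs argument of export_summs() (a named vector mapping display labels, here of my choosing, to the coefficient names used in R):

  # Names are the display labels; values are the coefficient names in R.
  export_summs(model1b, model2b,
               coefs = c("Intercept"             = "(Intercept)",
                         "Perception of Economy" = "percep_economy_cps_n",
                         "Ideology"              = "ideology",
                         "Union Membership"      = "union_dTRUE"),
               to.file = "docx", file.name = "table.docx")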