8 Statistical Inference for Linear Regression
Before you start working on this chapter, you need to do the following. If you need help with any of these steps, see Section 3.1.
Launch RStudio.
Load POL232.RData into your current R session.
Prepare an R script to save all your work in this chapter. I suggest you name it “POL232_Week#_YourLastName.R” or “POL232_Tutorial#_YourLastName.R”, in which # is the number of the current week or tutorial session.
You also need to load the tidyverse package into your current R session (Section 1.4.2).
I suggest you actually write the R functions used below in your R script instead of copying and pasting them. See Section 3.1.5 for why.
I also suggest you sufficiently annotate your R script (i.e. leaving notes after the
#
sign) so that you can use your R script as your reference when you work on tutorial exercises or data analysis paper assignments. In other words, this R script will be your notes for this chapter.
8.1 Prepare Variables
As in the previous chapters, create percep_economy_cps_n
, which is a numeric
version of percep_economy_cps
, and union_d
, a logical
version of union
.
ces2019 <- ces2019 |>
  mutate(percep_economy_cps = fct_relevel(percep_economy_cps,
                                          "(2) Worse", "(3) About the same", "(1) Better")) |>
  mutate(percep_economy_cps = fct_recode(percep_economy_cps,
                                         "Worse" = "(2) Worse",
                                         "Same" = "(3) About the same",
                                         "Better" = "(1) Better")) |>
  mutate(percep_economy_cps_n = as.numeric(percep_economy_cps))

ces2019 <- ces2019 |> # Change the name of categories of union to TRUE and FALSE.
  mutate(union_d = fct_recode(union, "TRUE" = "(1) Yes", "FALSE" = "(2) No")) |>
  mutate(union_d = as.logical(union_d)) # Then, apply the as.logical() function.
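If you want to confirm that the recoding worked as intended, you can cross-tabulate the old and new versions of each variable. This check is optional and assumes ces2019 is loaded in your session:

```r
# Optional check: each original category should map onto exactly one new value.
table(ces2019$percep_economy_cps, ces2019$percep_economy_cps_n, useNA = "ifany")
table(ces2019$union, ces2019$union_d, useNA = "ifany")
```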
8.2 Estimate Linear Regression Models
First, let’s estimate a simple linear regression model of trudeau_therm_cps
on percep_economy_cps_n
.
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n
8.128 18.982
We can assign the result of the lm()
function to a new object. For example, I assigned the above result to an object called model1
below.
model1 <- lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Then, we can see the linear regression result stored in model1
by typing the name of this new object in the Console.
model1
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n
8.128 18.982
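As a side note, once the result is stored in an object like model1, you can pull out just the coefficient estimates with the base-R coef() function. A minimal sketch, assuming model1 has been created as above:

```r
coef(model1)                          # named vector of both estimates
coef(model1)["percep_economy_cps_n"]  # just the slope coefficient
```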
Let’s assign the result of a multiple linear regression of trudeau_therm_cps on percep_economy_cps_n, ideology, and union_d to a new object, named model2.
model2 <- lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology + union_d,
             data = ces2019)
Again, we can access this result by typing model2
in the Console.
model2
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology +
union_d, data = ces2019)
Coefficients:
(Intercept) percep_economy_cps_n ideology
24.7569 17.1926 -2.7193
union_dTRUE
0.8752
There is more information in the result of the lm() function than what appears in the Console this way. We can access it, for example, with the summary() function. The basic syntax of the summary() function is to include the output of the lm() function as its argument (i.e., inside the parentheses).
summary(model2)
Call:
lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology +
union_d, data = ces2019)
Residuals:
Min 1Q Median 3Q Max
-71.77 -20.63 0.86 19.87 85.24
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.7569 2.2629 10.940 <2e-16 ***
percep_economy_cps_n 17.1926 0.7728 22.248 <2e-16 ***
ideology -2.7193 0.2549 -10.668 <2e-16 ***
union_dTRUE 0.8752 1.3498 0.648 0.517
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26.59 on 2342 degrees of freedom
(1675 observations deleted due to missingness)
Multiple R-squared: 0.2416, Adjusted R-squared: 0.2407
F-statistic: 248.8 on 3 and 2342 DF, p-value: < 2.2e-16
Most of the output here concerns statistical inference. We can interpret the results of statistical inference for a linear regression model from this output of the summary() function, which may be considered the default way to examine linear regression results in R. I covered how to interpret the results in this table in my lecture. Here I will introduce another function, summ(), from a package called jtools.
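Before moving on to jtools, note that the numbers printed by summary() can also be extracted as a matrix, which is handy if you want to reuse them in later computations. A sketch, assuming model2 has been estimated as above:

```r
# The coefficient table from summary() is a matrix with one row per term.
coef_table <- summary(model2)$coefficients
coef_table[, "Estimate"]    # point estimates
coef_table[, "Std. Error"]  # standard errors
```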
8.3 Summarize Linear Regression Results: summ()
Install the jtools
package. Refer to Section 1.4.2 if you want to recall more details on how to install a new package.
install.packages("jtools")
Then, load jtools
.
library(jtools)
Here we use the summ() function from the jtools package to derive detailed results from the lm() function. It gives us information similar to that derived by the summary() function. Arguably, the output from the summ() function is more readable than the output from the summary() function.
The basic syntax of the summ() function is the same as that of the summary() function — i.e., we include the output of the lm() function as the argument of summ(). Let’s show the detailed information of the linear regression result stored in model2 using the summ() function.
summ(model2)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression
MODEL FIT:
F(3,2342) = 248.76, p = 0.00
R² = 0.24
Adj. R² = 0.24
Standard errors:OLS
---------------------------------------------------------
Est. S.E. t val. p
-------------------------- ------- ------ -------- ------
(Intercept) 24.76 2.26 10.94 0.00
percep_economy_cps_n 17.19 0.77 22.25 0.00
ideology -2.72 0.25 -10.67 0.00
union_dTRUE 0.88 1.35 0.65 0.52
---------------------------------------------------------
There are three sets of information in the above output of the summ() function. The first set, at the top of the output, is MODEL INFO. The most useful pieces of information here are the number of observations included in the model and the number of missing observations (those with missing values for at least one of the variables included in the model) excluded from the estimation.
The second set, in the middle of the output, is MODEL FIT. I will not cover these model-fit statistics in this class, except for R-squared (\(R^2\)), which I will briefly cover in my lecture later in the semester.
The third set of information, at the bottom, is a linear regression table. The first column of the table lists the linear regression coefficients. Est. in the second column gives the point estimates of the coefficients for the intercept and the variables listed in the first column. S.E. in the third column gives the standard errors of those coefficients. Many academic articles on empirical research in political science report these statistics.
The information in the fourth and fifth columns concerns hypothesis testing. t val. in the fourth column is called a t-statistic or t-value, which is the point estimate of a coefficient divided by its standard error. Note that the t-statistic is optional in my lecture, as it may be too technical for this class. p in the fifth column is a p-value, from which we can interpret the statistical significance of each linear regression coefficient.
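To see that the t-value really is the point estimate divided by its standard error, you can reproduce the t val. column by hand. A sketch, assuming model2 has been estimated as above:

```r
coefs <- summary(model2)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]  # reproduces the "t value" column
# e.g., for ideology: -2.7193 / 0.2549 is approximately -10.67
```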
We can also change the number of decimal places that appear in the table by specifying the digits
argument. For example, if we specify digits = 4
, it will display numbers up to the fourth decimal place, as below.
summ(model2, digits = 4)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression
MODEL FIT:
F(3,2342) = 248.7557, p = 0.0000
R² = 0.2416
Adj. R² = 0.2407
Standard errors:OLS
-----------------------------------------------------------------
Est. S.E. t val. p
-------------------------- --------- -------- ---------- --------
(Intercept) 24.7569 2.2629 10.9405 0.0000
percep_economy_cps_n 17.1926 0.7728 22.2481 0.0000
ideology -2.7193 0.2549 -10.6683 0.0000
union_dTRUE 0.8752 1.3498 0.6483 0.5168
-----------------------------------------------------------------
We can also display the confidence intervals for the linear regression coefficients by setting confint = TRUE.
summ(model2, confint = TRUE)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression
MODEL FIT:
F(3,2342) = 248.76, p = 0.00
R² = 0.24
Adj. R² = 0.24
Standard errors:OLS
------------------------------------------------------------------
Est. 2.5% 97.5% t val. p
-------------------------- ------- ------- ------- -------- ------
(Intercept) 24.76 20.32 29.19 10.94 0.00
percep_economy_cps_n 17.19 15.68 18.71 22.25 0.00
ideology -2.72 -3.22 -2.22 -10.67 0.00
union_dTRUE 0.88 -1.77 3.52 0.65 0.52
------------------------------------------------------------------
In the above output, 2.5% in the third column is the lower bound of a 95% confidence interval, and 97.5% in the fourth column is the upper bound. Note that 95% of the distribution lies between the 2.5th and 97.5th percentiles, so together these bounds form a 95% confidence interval.
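Incidentally, base R computes the same intervals with the confint() function, which can serve as a cross-check on the summ() output. A sketch, assuming model2 has been estimated as above:

```r
confint(model2)                # 95% confidence intervals by default
confint(model2, level = 0.99)  # 99% confidence intervals
```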
We can change the confidence level with the ci.width argument. We need to specify the confidence level as a number between 0 and 1, rather than a percentage. Below, I specified ci.width = 0.99 to derive the 99% confidence intervals.
summ(model2, confint = TRUE, ci.width = 0.99)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression
MODEL FIT:
F(3,2342) = 248.76, p = 0.00
R² = 0.24
Adj. R² = 0.24
Standard errors:OLS
------------------------------------------------------------------
Est. 0.5% 99.5% t val. p
-------------------------- ------- ------- ------- -------- ------
(Intercept) 24.76 18.92 30.59 10.94 0.00
percep_economy_cps_n 17.19 15.20 19.18 22.25 0.00
ideology -2.72 -3.38 -2.06 -10.67 0.00
union_dTRUE 0.88 -2.60 4.35 0.65 0.52
------------------------------------------------------------------
As before, we can also change the number of decimal places by specifying the digits argument.
summ(model2, confint = TRUE, ci.width = 0.99, digits = 4)
MODEL INFO:
Observations: 2346 (1675 missing obs. deleted)
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression
MODEL FIT:
F(3,2342) = 248.7557, p = 0.0000
R² = 0.2416
Adj. R² = 0.2407
Standard errors:OLS
----------------------------------------------------------------------------
Est. 0.5% 99.5% t val. p
-------------------------- --------- --------- --------- ---------- --------
(Intercept) 24.7569 18.9234 30.5905 10.9405 0.0000
percep_economy_cps_n 17.1926 15.2004 19.1847 22.2481 0.0000
ideology -2.7193 -3.3764 -2.0622 -10.6683 0.0000
union_dTRUE 0.8752 -2.6046 4.3549 0.6483 0.5168
----------------------------------------------------------------------------
As I covered in my lecture, when a \((100 - \alpha)\)% confidence interval for a linear regression coefficient does not include zero, we say that this coefficient is statistically significant (or statistically distinguishable from zero) at the \(\alpha\)% significance level. In the above example, since the 99% confidence intervals for all linear regression coefficients except that for union_d do not include zero, all coefficients except union_d are statistically significant at the 1% significance level.
We can also interpret statistical significance from the p-values presented in the last column indicated by p
.
If the p-value is smaller than a (where a is a number between 0 and 1), this suggests that the coefficient is statistically significant (or statistically distinguishable from zero) at the (a \(\times\) 100)% significance level. For example, if the p-value is smaller than 0.05, the coefficient is statistically significant at the 5% significance level. In the above table, the p-values for all coefficients except union_d are displayed as 0.0000, which means that these p-values are smaller than 0.00005 and are therefore rounded to 0.0000 at four decimal places. Since these p-values are smaller than 0.05, these coefficients are statistically significant at the 5% significance level.
Moreover, these coefficients are also statistically significant at the 1% significance level as their p-values are smaller than 0.01. They are also statistically significant at the 0.1% significance level because they are smaller than 0.001. They are statistically significant even at the 0.01% significance level as the p-values are smaller than 0.0001.
On the other hand, the p-value for union_d
is 0.52, indicating that this coefficient is statistically insignificant (or not statistically distinguishable from zero) even at the 10% significance level.
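The comparisons described above can also be done directly in R. The sketch below, assuming model2 has been estimated as above, extracts the p-values and compares them to two common significance levels:

```r
p <- summary(model2)$coefficients[, "Pr(>|t|)"]
p < 0.05   # TRUE for every coefficient except union_dTRUE
p < 0.001  # same pattern at the 0.1% level
```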
8.4 Linear Regression Table: export_summs()
We can also format the result of a linear regression into a nice table using the export_summs()
function of the jtools
package. To use this function, we also need the huxtable
package.
First, install the huxtable
package.
install.packages("huxtable")
Then, load huxtable
.
library(huxtable)
Now we can use the export_summs()
function from the jtools
package. The basic syntax is the same as the summ()
function.
export_summs(model2)
Model 1 | |
---|---|
(Intercept) | 24.76 *** |
(2.26) | |
percep_economy_cps_n | 17.19 *** |
(0.77) | |
ideology | -2.72 *** |
(0.25) | |
union_dTRUE | 0.88 |
(1.35) | |
N | 2346 |
R2 | 0.24 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
This is similar to the popular format of a linear regression table we see in many political science research articles. As you can see in the table, the first column is a list of linear regression coefficients including an intercept. In the next column, we can see the point estimates of these coefficients. The numbers in parentheses are the standard errors for each coefficient. N
is the number of observations included in the linear regression analysis, and R2
is the R-squared for this linear regression model (as I mentioned before, I will cover R-squared in a lecture later in the semester).
The star signs (asterisks) right next to each coefficient estimate indicate statistical significance of each coefficient. You can find the legend for these star signs at the bottom of the table, which states that
*** p < 0.001, ** p < 0.01, and * p < 0.05
.
p
in this legend refers to the p-value.
This legend can be read as:
*** indicates that the p-value is smaller than 0.001;
** indicates that the p-value is smaller than 0.01; and
* indicates that the p-value is smaller than 0.05.
Therefore, if the coefficient is given the ***
sign, this means that this coefficient is statistically significant at the 0.1% significance level;
if it is given **
, this coefficient is statistically significant at the 1% significance level (but not statistically significant at the 0.1% significance level);
if it is given *
, this coefficient is statistically significant at the 5% significance level (but not at the 1% significance level); and
if it is not given any stars, the coefficient is not statistically significant at the 5% or lower significance levels.
In the example above, all coefficients except for union_dTRUE
are given ***
, indicating that these coefficients are statistically significant at the 0.1% significance level. There is no star sign given to union_dTRUE
, suggesting that this coefficient is not statistically significant even at the 5% significance level, the highest significance level reported in this table.
Below is another example: a linear regression of the Democratic candidate’s vote share in U.S. gubernatorial elections (ranney3_gub_prop, multiplied by 100 to express it as a percentage) on the proportion of Democratic identifiers (democrat) in each state.
lm(formula = ranney3_gub_prop*100 ~ democrat, data = usstates2010) |>
export_summs()
Model 1 | |
---|---|
(Intercept) | 26.76 *** |
(7.09) | |
democrat | 0.66 ** |
(0.22) | |
N | 49 |
R2 | 0.16 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
In this table, the coefficient on democrat
is given two stars (**
), indicating that the coefficient on this variable is statistically significant at the 1% significance level (but not at the 0.1% significance level).
The export_summs() function can also create a table that juxtaposes simple and multiple linear regression models. For example, the following code creates a table with both the simple and multiple linear regression models for trudeau_therm_cps (recall that model1 is the simple linear regression model and model2 is the multiple linear regression model created earlier in Section 8.2).
export_summs(model1, model2)
Model 1 | Model 2 | |
---|---|---|
(Intercept) | 8.13 *** | 24.76 *** |
(1.22) | (2.26) | |
percep_economy_cps_n | 18.98 *** | 17.19 *** |
(0.61) | (0.77) | |
ideology | -2.72 *** | |
(0.25) | ||
union_dTRUE | 0.88 | |
(1.35) | ||
N | 3859 | 2346 |
R2 | 0.20 | 0.24 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
A table like this is convenient for comparing the coefficient estimates for a variable of interest across different models. In the present example, we may be interested in how the coefficient estimate for percep_economy_cps_n changes between the simple and multiple linear regression models.
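If you prefer more descriptive column headers than Model 1 and Model 2, export_summs() accepts a model.names argument; the labels below are just illustrative:

```r
export_summs(model1, model2,
             model.names = c("Simple", "Multiple"))
```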
We can replace the standard errors in the table with confidence intervals by specifying the following argument in the export_summs()
function.
export_summs(model1, model2,
error_format = "[{conf.low}, {conf.high}]")
Model 1 | Model 2 | |
---|---|---|
(Intercept) | 8.13 *** | 24.76 *** |
[5.74, 10.52] | [20.32, 29.19] | |
percep_economy_cps_n | 18.98 *** | 17.19 *** |
[17.79, 20.17] | [15.68, 18.71] | |
ideology | -2.72 *** | |
[-3.22, -2.22] | ||
union_dTRUE | 0.88 | |
[-1.77, 3.52] | ||
N | 3859 | 2346 |
R2 | 0.20 | 0.24 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
In this table, the pair of numbers in square brackets below each coefficient estimate is the confidence interval for that coefficient. For example, [17.79, 20.17] below the coefficient estimate for percep_economy_cps_n in Model 1 is the 95% confidence interval for the coefficient on percep_economy_cps_n.
As before, we can change the confidence level using the ci.width argument and the number of decimal places using the digits argument. In the code below, I set the confidence level at 99% and the number of decimal places at four.
export_summs(model1, model2,
error_format = "[{conf.low}, {conf.high}]",
ci.width = 0.99, digits = 4)
Model 1 | Model 2 | |
---|---|---|
(Intercept) | 8.1281 *** | 24.7569 *** |
[5.7399, 10.5162] | [20.3195, 29.1944] | |
percep_economy_cps_n | 18.9823 *** | 17.1926 *** |
[17.7940, 20.1705] | [15.6772, 18.7080] | |
ideology | -2.7193 *** | |
[-3.2191, -2.2194] | ||
union_dTRUE | 0.8752 | |
[-1.7718, 3.5222] | ||
N | 3859 | 2346 |
R2 | 0.2028 | 0.2416 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
8.5 Use Same Subset of Data Across Different Models: drop_na()
In the linear regression table presented above, which juxtaposed the results of the simple and multiple linear regression models of trudeau_therm_cps, you might have noticed that the number of observations differs between the two models. In particular, 3,859 observations are included in the simple linear regression model (Model 1 in the table), while the number drops to 2,346 in the multiple linear regression model (Model 2). The reason for this relatively large difference is that the variables are based on questions asked in different portions of the 2019 Canadian Election Study. More precisely, percep_economy_cps_n is based on a question asked in the Campaign Period Survey (CPS), while ideology is based on a question asked in the Post Election Survey (PES). The PES was administered to a subset of the respondents included in the CPS.
Given that the difference in the number of observations is relatively large, we may want to restrict the observations included in the simple linear regression model to those included in the multiple linear regression model. We can restrict our sample with the drop_na() function. The following code creates a new data frame, named ev_data, which includes only the respondents who answered all the questions related to trudeau_therm_cps, percep_economy_cps_n, ideology, and union_d.
ev_data <- ces2019 |>
  drop_na(trudeau_therm_cps, percep_economy_cps_n, ideology, union_d)
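You can verify the size of the restricted sample before reestimating (assuming ev_data was just created as above):

```r
nrow(ev_data)  # should equal 2346, the N reported for model2 above
```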
Let’s reestimate the simple and multiple linear regression models of trudeau_therm_cps
on percep_economy_cps_n
, using this new data frame.
model1b <- lm(formula = trudeau_therm_cps ~ percep_economy_cps_n, data = ev_data)
model2b <- lm(formula = trudeau_therm_cps ~ percep_economy_cps_n + ideology + union_d,
              data = ev_data)
Let’s create a table for the results of both linear regression models using the export_summs()
function.
export_summs(model1b, model2b,
error_format = "[{conf.low}, {conf.high}]")
Model 1 | Model 2 | |
---|---|---|
(Intercept) | 7.79 *** | 24.76 *** |
[4.67, 10.91] | [20.32, 29.19] | |
percep_economy_cps_n | 18.96 *** | 17.19 *** |
[17.44, 20.47] | [15.68, 18.71] | |
ideology | -2.72 *** | |
[-3.22, -2.22] | ||
union_dTRUE | 0.88 | |
[-1.77, 3.52] | ||
N | 2346 | 2346 |
R2 | 0.20 | 0.24 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Now we have the same number of observations in both models. The number of observations for the simple linear regression model is now much smaller (2,346) than before (3,859). However, the reestimated simple linear regression result is not much different from the previous result, when the sample was not restricted to respondents without missing values for ideology and union_d.
8.6 Export Linear Regression Table to MS-Word Document
We can also produce an MS-Word document of the table produced above using the export_summs()
function. But for this purpose, we also need the officer
and flextable
packages.
First, install both the officer
and flextable
packages.
install.packages( c("officer", "flextable") )
Then, load officer
and flextable
.
library(officer)
library(flextable)
We can export the table to an MS-Word document by specifying the to.file and file.name arguments in the export_summs() function. The to.file argument should be "docx", as we want to produce an MS-Word document. You can specify any file name for the MS-Word file in the file.name argument. Here I used "table.docx".
export_summs(model1b, model2b,
error_format = "[{conf.low}, {conf.high}]",
to.file = "docx", file.name = "table.docx")
An MS-Word document named “table.docx” should be created in your working directory. Open this Word document to check its content. It should contain the same table as produced above. As you can see, the table uses the variable names from R. You should change these names to the labels you use in your paper, which should also be easier for readers to interpret. For example, you may replace percep_economy_cps_n with “Perception of Economy” or “PercepEcon” and union_dTRUE with “Union Membership” or “UNION.” If you use abbreviations like “PercepEcon” and “UNION,” you should first define what they mean in your paper.
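Rather than editing the Word document by hand, you can also rename coefficients directly in the export_summs() call via its coefs argument (which it passes through to huxtable’s huxreg()); the labels below are just examples:

```r
export_summs(model1b, model2b,
             coefs = c("Intercept" = "(Intercept)",
                       "Perception of Economy" = "percep_economy_cps_n",
                       "Ideology" = "ideology",
                       "Union Membership" = "union_dTRUE"),
             to.file = "docx", file.name = "table.docx")
```

Note that when coefs is specified, only the listed coefficients appear in the table, so the intercept is included explicitly here.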