9  Substantive Significance

Before you start working on this chapter, you need to do the following. If you need help with any of these steps, see Section 3.1.

  library(tidyverse)
  library(jtools)

9.1 Preparation

9.1.1 Prepare Variables

As in the previous chapters, create percep_economy_cps_n, which is a numeric version of percep_economy_cps, and union_d, a logical version of union.

  ces2019 <- ces2019 |> 
      mutate(percep_economy_cps = fct_relevel(percep_economy_cps, 
              "(2) Worse", "(3) About the same", "(1) Better") ) |> 
      mutate(percep_economy_cps = fct_recode(percep_economy_cps, 
                                "Worse" = "(2) Worse", 
                                "Same" = "(3) About the same", 
                                "Better" = "(1) Better") ) |> 
      mutate(percep_economy_cps_n = as.numeric(percep_economy_cps))
  ces2019 <-  ces2019 |> # Change the name of categories of union to TRUE and FALSE.
      mutate( union_d = fct_recode(union, "TRUE" = "(1) Yes", "FALSE" = "(2) No") ) |> 
      mutate( union_d = as.logical(union_d) )  # Then, apply the as.logical() function.

9.1.2 Prepare New Data Frame

Also construct a new data frame which includes non-missing observations for the variables used in the example linear regression model in this chapter. These variables are trudeau_therm_cps, percep_economy_cps_n, percep_economy_cps, ideology, union_d, and province. The new data frame is named my_data in the code below.

  my_data <- ces2019 |> 
      drop_na( trudeau_therm_cps, percep_economy_cps_n, percep_economy_cps, 
               ideology, union_d, province )

9.1.3 Estimate Example Model

As a primary example, I will use a multiple linear regression of trudeau_therm_cps on percep_economy_cps_n, ideology, union_d and province. Let’s estimate this model and assign it to a new object, named model.

  model <- lm(formula = trudeau_therm_cps ~ percep_economy_cps_n
                                              + ideology + union_d + province, 
                      data = my_data)

Note that the model includes four different types of independent variables:

  • a numeric version of an ordinal categorical variable (percep_economy_cps_n),

  • a quantitative variable (ideology),

  • a dummy variable stored as a logical (union_d), and

  • a nominal categorical variable stored as a factor (province).

I will discuss how we can evaluate the substantive significance for each type of variable.

Let’s take a look at the estimation result by the summ() function from the jtools package.

  summ(model, confint = TRUE, digits = 4)
MODEL INFO:
Observations: 2346
Dependent Variable: trudeau_therm_cps
Type: OLS linear regression 

MODEL FIT:
F(12,2333) = 69.7279, p = 0.0000
R² = 0.2640
Adj. R² = 0.2602 

Standard errors:OLS
----------------------------------------------------------------------------
                                Est.      2.5%     97.5%     t val.        p
-------------------------- --------- --------- --------- ---------- --------
(Intercept)                  15.2848    9.7034   20.8662     5.3702   0.0000
percep_economy_cps_n         15.8163   14.2630   17.3697    19.9667   0.0000
ideology                     -2.6628   -3.1577   -2.1678   -10.5489   0.0000
union_dTRUE                   0.5959   -2.0253    3.2171     0.4458   0.6558
provinceBC                   11.7557    7.1483   16.3631     5.0034   0.0000
provinceMB                   10.8040    5.1630   16.4450     3.7558   0.0002
provinceNB                   16.9362   10.7094   23.1630     5.3336   0.0000
provinceNL                   18.4670   12.1214   24.8126     5.7068   0.0000
provinceNS                   12.1679    5.9590   18.3769     3.8430   0.0001
provinceON                   15.3387   10.6994   19.9780     6.4835   0.0000
provincePE                   15.8607    9.6777   22.0437     5.0303   0.0000
provinceQC                   12.3670    7.6228   17.1111     5.1118   0.0000
provinceSK                    3.4424   -2.2297    9.1144     1.1901   0.2341
----------------------------------------------------------------------------

The linear regression table above shows the 95% confidence intervals for all coefficients. You can also read the statistical significance from the p-values in the rightmost column (p). If this number expressed as a percentage, (p \(\times\) 100)%, is smaller than a certain significance level (e.g., 5%, 1%, 0.1%), then the coefficient with this p-value is statistically significant at this significance level.
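The comparison described above can be sketched in a few lines of base R. The p-values are copied from the table above; the object names p_values, sig_levels, and sig_table are illustrative only and are not used elsewhere in this chapter.

```r
# p-values copied from the summ() output above (rounded to four digits there,
# so 0.0000 means "smaller than 0.00005", not exactly zero).
p_values <- c(percep_economy_cps_n = 0.0000, union_dTRUE = 0.6558,
              provinceMB = 0.0002, provinceSK = 0.2341)

# Significance levels expressed as proportions: 5%, 1%, and 0.1%.
sig_levels <- c("5%" = 0.05, "1%" = 0.01, "0.1%" = 0.001)

# Each cell is TRUE if the coefficient in that row is statistically
# significant at the level in that column.
sig_table <- outer(p_values, sig_levels, FUN = "<")
print(sig_table)
```

In this sketch, the rows for union_dTRUE and provinceSK are FALSE at every level, while provinceMB is TRUE even at the 0.1% level.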

If you want R to tell you which coefficients are statistically significant at the 5%, 1% or 0.1% significance level, you may use the summary() function. Or if you use the export_summs() function from the jtools package (which relies on the huxtable package), then you can see both confidence intervals and the signs for statistical significance together in a single table. Review Section 8.4 for details.

9.2 Substantive Significance

9.2.1 Basic Idea

When we have a certain substantive/theoretical expectation for the relationship of two variables in the population, the first step to examine this expectation is to see whether we have a point estimate of the coefficient of a variable of our interest in the expected direction (i.e., positive or negative) and what range of values its confidence interval encompasses. If the confidence interval does not include zero, we may say that the coefficient of our interest is statistically distinguishable from zero, or more simply, statistically significant.1 We have seen how to conduct these analyses in R in Chapter 8.

The second step is to examine whether the relationship found is substantively important or significant. The relationship may be substantively important/significant if the magnitude of the estimated relationship is meaningfully large in its specific context.2 There is no unique way to assess the substantive significance of the relationship. In general, you need to make an argument based on the specific nature of the relationship of your interest and the findings in your empirical research.

There may be many different ways to make such an argument, but for the final paper in this class, I suggest you estimate the change in the expected value of the dependent variable (\(E(Y|X)\)) corresponding to a meaningful change in the independent variable of your interest (\(X\)) based on the results of your linear regression analysis.

This is what I have suggested in my lecture. To recap, let me restate my suggestion below.

  1. Identify a substantively meaningful amount of change in \(X\). Let’s say this is identified as the change from \(X^*\) to \(X^{**}\). Then, it may be denoted as (\(\Delta X = X^{**} - X^*\)).
  2. Estimate the change in \(E(Y|X)\), the change in the conditional expectation/mean of Y, corresponding to this substantively meaningful change in \(X\). This change can be denoted by \(E(Y|X^{**}) - E(Y|X^{*}) = \beta \times \Delta X\), where \(\beta\) is the coefficient of \(X\).
  3. Estimate the confidence interval for this change in \(E(Y|X)\).
  4. Discuss the substantive magnitude of the relationship between \(Y\) and \(X\) based on both the point estimate and confidence interval for this change in \(E(Y|X)\).
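The four steps above can be sketched with base R alone. The data below are simulated purely for illustration (they are not the ces2019 data), and the names x, z, y, fit, delta_x, change_hat, and change_ci are hypothetical. The sketch assumes a linear model without interactions, so that the interval for the change is simply the interval for the coefficient rescaled by \(\Delta X\).

```r
set.seed(1)
n <- 500
x <- sample(1:3, n, replace = TRUE)   # hypothetical three-point X
z <- rnorm(n)                         # hypothetical control variable
y <- 10 + 15 * x - 3 * z + rnorm(n, sd = 20)
fit <- lm(y ~ x + z)

# Step 1: a meaningful change in X, here from X* = 1 to X** = 3.
delta_x <- 3 - 1

# Step 2: point estimate of the change in E(Y|X): beta-hat times delta X.
change_hat <- unname(coef(fit)["x"]) * delta_x

# Step 3: confidence interval for the change, obtained by rescaling the
# interval for beta. (With a negative delta_x the two ends would swap.)
change_ci <- confint(fit, parm = "x", level = 0.95) * delta_x

# Step 4 is the substantive discussion of these two quantities.
change_hat
change_ci
```

This hand computation is only a cross-check on the logic of the steps; the rest of the chapter obtains the same quantities with the emmeans package.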

As you can see in the above figure, this change in \(E(Y|X) = E(Y|X^{**}) - E(Y|X^{*})\) is the difference in the values of the dependent variable along a linear regression line in the population. This “difference (or change) in \(E(Y|X)\)” is the population parameter of our interest in the present context.

Then, we estimate the change in \(E(Y|X)\) in the population using the difference in \(\widehat{Y}\) or the difference in \(\widehat{E(Y|X)} = \widehat{E(Y|X^{**})} - \widehat{E(Y|X^{*})}\), which is the difference in the predicted/fitted values computed from our sample linear regression model or the difference in the values of the dependent variable along a linear regression line in our sample. In other words, the difference in the predicted/fitted values, \(\widehat{E(Y|X)}\) or \(\widehat{Y}\), is our estimator or estimate for the difference in \(E(Y|X)\) in the population.

Throughout this chapter, I will use “the difference (or change) in \(E(Y|X)\) (or the conditional expectation/mean)” and \(E(Y|X^{**}) - E(Y|X^{*})\) to refer to our population parameter.

Similarly, I will use “the difference (or change) in \(\widehat{E(Y|X)}\)” and \(\widehat{E(Y|X^{**})} - \widehat{E(Y|X^{*})}\) to refer to our estimator or estimate for the change in \(E(Y|X)\) in the population.

Note that the discussion so far implicitly assumes that we have only one independent variable, \(X\), on the right-hand side of a linear regression equation. However, we normally control for some other variables, which are collectively denoted by \(Z\) here. Therefore, the change in the conditional expectation of \(Y\) corresponding to the meaningful change in \(X\) is estimated holding all other variables \(Z\) constant at certain values, say \(Z^+\). Incorporating \(Z^+\), our population parameter can be denoted as “the difference (or change) in \(E(Y|X,Z^+) = E(Y|X^{**},Z^+) - E(Y|X^{*},Z^+)\)”.
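To see how \(Z^+\) enters this difference, consider a sketch with a single control variable \(Z\) in a linear model without interaction terms. The symbols \(\alpha\) (intercept) and \(\gamma\) (coefficient of \(Z\)) are introduced here only for illustration; \(\beta\) is the coefficient of \(X\) as before.

\[
\begin{aligned}
E(Y|X^{**},Z^+) - E(Y|X^{*},Z^+)
  &= (\alpha + \beta X^{**} + \gamma Z^+) - (\alpha + \beta X^{*} + \gamma Z^+) \\
  &= \beta \, (X^{**} - X^{*}) \\
  &= \beta \times \Delta X .
\end{aligned}
\]

Under these assumptions, the terms involving \(Z^+\) cancel, so the difference is the same whichever values the control variables are held at.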

For example, suppose we want to evaluate the substantive magnitude of the relationship between trudeau_therm_cps (\(Y\)) and percep_economy_cps_n (\(X\)) controlling for other variables (\(Z\)) based on the multiple linear regression estimated in Section 9.1.3. In this case, we first estimate the difference in the conditional expectations of trudeau_therm_cps (\(E(Y|X,Z)\)) between the individuals who perceived that the state of the national economy had gotten better (\(X = 3\)) and those who perceived that it had gotten worse (\(X = 1\)), holding all other variables constant (\(Z = Z^+\)). As an equation, the difference we estimate may be written as \(E(Y | X = 3, Z^+) - E(Y | X = 1, Z^+)\).

We should also derive the confidence interval for the difference in the conditional expectations of trudeau_therm_cps between the individuals who perceived that the state of economy had gotten better and those who perceived that it had gotten worse, holding other variables constant. Then, we will offer an argument about whether this difference in \(E(Y|X,Z^+)\) is meaningfully large in the particular context of this relationship based on its point estimate and confidence interval.

9.2.2 Estimation by the emmeans Package

To estimate the difference in \(E(Y|X,Z^+)\) and derive its confidence interval, we will use the emmeans() function and the contrast() function from the emmeans package.

First, install the emmeans package (Section 1.4).

  install.packages("emmeans")

Then, load this package.

  library(emmeans)

Our analysis is in two steps.

  • In the first step, we will use the emmeans() function to estimate \(E(Y|X,Z^+)\) for different values of \(X\), such as \(X^*\) and \(X^{**}\), holding other variables \(Z\) constant at \(Z^+\).

  • In the second step, we will supply the result of the emmeans() function to the contrast() function to conduct statistical inference for the difference in \(E(Y|X,Z^+) = E(Y|X^{**},Z^+) - E(Y|X^{*},Z^+)\).

I will explain each step in turn in the following two sections.

9.2.3 Estimate \(E(Y|X,Z^+)\): emmeans()

For the first step to estimate \(E(Y|X,Z^+)\), we use the emmeans() function. For our purpose, we specify the following three arguments in the emmeans() function.

  • The first argument (object) is the outcome of the lm() function, which is used to estimate the difference in \(E(Y|X,Z^+)\).

  • The second argument (specs) is the name of the independent variable of our interest, \(X\).

  • The third argument (at) is the values taken by the main independent variable of our interest, \(X\), and control variables, \(Z\). For \(X\), we will specify both \(X = X^*\) and \(X = X^{**}\). For all other control variables, \(Z\), we can specify \(Z = Z^+\), but we will see that the specification of the values for \(Z\) can be omitted for our purpose to compute the difference in \(E(Y|X,Z^+)\).

As we use the multiple linear regression model estimated in Section 9.1.3, \(Y\) is trudeau_therm_cps, \(X\) is percep_economy_cps_n, and \(Z\) includes ideology, union_d, and province.

Let’s estimate the conditional expectations of trudeau_therm_cps for the respondents who perceived the state of the economy had gotten better (percep_economy_cps_n = 3) and for those who perceived it had gotten worse (percep_economy_cps_n = 1), holding the other control variables constant as follows: ideology = 5, union_d = FALSE, and province = AB (i.e., a non-union member and Alberta resident with neutral ideology).

Here,

  • \(X = X^{**}\) corresponds to “percep_economy_cps_n = 3,”

  • \(X = X^*\) corresponds to “percep_economy_cps_n = 1,” and

  • \(Z = Z^+\) corresponds to “ideology = 5, union_d = FALSE, and province = AB”.

For simplicity, let’s estimate \(E(Y|X^*,Z^+)\) first. This is the conditional expectation of \(Y\) when the value of \(X\) is held at \(X^*\) and the values of \(Z\) at \(Z^+\).

  emmeans( object = model, # The first argument is the output from the lm() function.
           
           specs = "percep_economy_cps_n",  # The second argument is the independent 
                                            # variable of our interest.
           
              # In the third argument, we specify the values at which we want to hold 
              # both the main independent variable of our interest and the other 
              # control variables. We may include the equations for all independent  
              # variables within the list() function on the right-hand side of this 
              # argument.
           at = list(percep_economy_cps_n = 1, 
                     ideology = 5, union_d = FALSE, province = "AB") )
 percep_economy_cps_n emmean   SE   df lower.CL upper.CL
                    1   17.8 2.04 2333     13.8     21.8

Confidence level used: 0.95 

In the output above, you can see the point estimate of \(E(Y|X^*,Z^+)\) under emmean, its standard error under SE, the lower end of the 95% confidence interval for \(E(Y|X^*,Z^+)\) under lower.CL, and the upper end under upper.CL. I skip the “df” column as we don’t cover the degrees of freedom (“df”) in this class. According to this result, the conditional expectation of trudeau_therm_cps when percep_economy_cps_n = 1, ideology = 5, union_d = FALSE, and province = AB is estimated as 17.8, and its 95% confidence interval is [13.8, 21.8].

In the second argument (specs), we specified percep_economy_cps_n as the main independent variable, X, for which we want to estimate \(E(Y|X,Z^+)\) at different values.

If we specify multiple values for percep_economy_cps_n in the third argument (at), then the emmeans() function will estimate multiple \(E(Y|X,Z^+)\) corresponding to each of the values specified for percep_economy_cps_n.

Below I specified three values — 1, 2 and 3 — for percep_economy_cps_n using c() to list all these values.

  emmeans( object = model, 
           specs = "percep_economy_cps_n",  
           
              # In the third argument, we specify three values --- 1, 2 and 3 --- for 
              # percep_economy_cps_n.
           at = list(percep_economy_cps_n = c(1,2,3), 
                     ideology = 5, union_d = FALSE, province = "AB") )
 percep_economy_cps_n emmean   SE   df lower.CL upper.CL
                    1   17.8 2.04 2333     13.8     21.8
                    2   33.6 2.05 2333     29.6     37.6
                    3   49.4 2.33 2333     44.8     54.0

Confidence level used: 0.95 

Then, the emmeans() function estimated three conditional expectations of trudeau_therm_cps for each value specified for percep_economy_cps_n holding other variables at ideology = 5, union_d = FALSE, and province = AB. Specifically, the conditional expectation of trudeau_therm_cps is 17.8 when percep_economy_cps_n = 1, 33.6 when percep_economy_cps_n = 2, and 49.4 when percep_economy_cps_n = 3, holding other variables at the specified values.

Note that the emmeans() function estimates \(E(Y|X,Z^+)\) for each of the specified values only for the variable specified in the specs argument (\(X\)). For example, the code below lists multiple values for ideology in the at argument3 while percep_economy_cps_n is still specified in the specs argument. This code does not produce multiple \(E(Y|X,Z)\) for each of the specified values for ideology.

  emmeans( object = model, 
           specs = "percep_economy_cps_n",  
           
              # Now we specify the values from 0 to 10 for "ideology,"
              # but the specs argument is still "percep_economy_cps_n."
           at = list(percep_economy_cps_n = 1, 
                     ideology = c(0:10), union_d = FALSE, province = "AB") )
 percep_economy_cps_n emmean   SE   df lower.CL upper.CL
                    1   17.8 2.04 2333     13.8     21.8

Results are averaged over the levels of: ideology 
Confidence level used: 0.95 

The above command estimated only one \(E(Y|X,Z)\). There is also the message that “Results are averaged over the levels of: ideology”. What this means is that the emmeans() function first estimated multiple \(E(Y|X,Z)\) for each of the specified values from 0 to 10 of ideology with other variables fixed at the specified values and then took the average of \(E(Y|X,Z)\) across all the values of ideology from 0 to 10. If we want to estimate \(E(Y|X,Z)\) for each of these values of ideology separately, we should specify ideology in the specs argument as in the code below. Note that ideology is the main independent variable of our interest (\(X\)), and percep_economy_cps_n is one of the control variables (\(Z\)) in this case.

  emmeans( object = model, 
           specs = "ideology",  
              # The specs argument is now changed to "ideology."
           
           at = list(percep_economy_cps_n = 1, 
                     ideology = c(0:10), union_d = FALSE, province = "AB") )
 ideology emmean   SE   df lower.CL upper.CL
        0  31.10 2.54 2333   26.113    36.09
        1  28.44 2.40 2333   23.733    33.14
        2  25.78 2.27 2333   21.316    30.23
        3  23.11 2.17 2333   18.856    27.37
        4  20.45 2.09 2333   16.346    24.55
        5  17.79 2.04 2333   13.779    21.80
        6  15.12 2.03 2333   11.153    19.10
        7  12.46 2.04 2333    8.466    16.46
        8   9.80 2.08 2333    5.718    13.88
        9   7.14 2.15 2333    2.914    11.36
       10   4.47 2.25 2333    0.058     8.89

Confidence level used: 0.95 

Now the emmeans() function returned the estimation of \(E(Y|X,Z^+)\) for all the specified values of ideology (\(X\)), respectively, from 31.10 when ideology = 0 to 4.47 when ideology = 10 holding the other variables constant at the specified values (\(Z^+\)).
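The earlier output that was “averaged over the levels of: ideology” reported the same value (17.8) as the row for ideology = 5 in the table above. This is not a coincidence: in a model that is linear in ideology, averaging the predictions over the values 0 to 10 gives the prediction at their average, which is 5. The base R sketch below checks this by hand on simulated data (not the ces2019 data); the names dat, econ, ideo, therm, and fit are hypothetical.

```r
set.seed(2)
n <- 400
dat <- data.frame(econ = sample(1:3, n, replace = TRUE),
                  ideo = sample(0:10, n, replace = TRUE))
dat$therm <- 20 + 15 * dat$econ - 2.5 * dat$ideo + rnorm(n, sd = 20)
fit <- lm(therm ~ econ + ideo, data = dat)

# Predicted values at econ = 1 for each ideology value from 0 to 10.
grid <- data.frame(econ = 1, ideo = 0:10)
pred_each <- predict(fit, newdata = grid)

# Averaging the eleven predictions ...
avg_over_ideo <- mean(pred_each)

# ... gives the prediction at the average ideology value, mean(0:10) = 5.
pred_at_5 <- unname(predict(fit, newdata = data.frame(econ = 1, ideo = 5)))

all.equal(avg_over_ideo, pred_at_5)
```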

So far, we have specified the values for all variables included in our linear regression model (\(X\) and \(Z\)). For our purpose, however, it suffices to specify the values just for the independent variable of our interest (\(X\)). This is because what we are interested in is the difference in \(E(Y|X,Z)\) across different values of \(X\) holding other control variables \(Z\) constant. For this purpose, it does not matter which values we hold the control variables at. If the values of the control variables are not specified, the emmeans() function will use the means of these variables. In the code below, I specified the values for percep_economy_cps_n only in the third argument (at).

  emmeans( object = model, 
           specs = "percep_economy_cps_n",  
           
              # Now we specify the values for percep_economy_cps_n only.
           at = list( percep_economy_cps_n = c(1,2,3) )  )
 percep_economy_cps_n emmean    SE   df lower.CL upper.CL
                    1   29.6 1.020 2333     27.6     31.6
                    2   45.5 0.755 2333     44.0     46.9
                    3   61.3 1.160 2333     59.0     63.6

Results are averaged over the levels of: union_d, province 
Confidence level used: 0.95 

The estimated conditional expectations of trudeau_therm_cps for each value of percep_economy_cps_n in this output are different from those reported before. This is because the values of other control variables \(Z\) are held at different values. More specifically, without the values specified in the at argument, the values of the other variables \(Z\) are held at their means, if the variables are numeric. In the current example, as ideology is a numeric variable, its value is held at its mean. If \(Z\) are factor or logical variables, \(E(Y|X,Z)\) is first computed for each category of these variables in \(Z\), and the average of \(E(Y|X,Z)\) is computed across all these categories of \(Z\). Therefore, there is a message in the above output indicating “Results are averaged over the levels of: union_d, province.”
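The default treatment of control variables described above can be imitated by hand with predict(). The sketch below uses R's built-in mtcars data purely for illustration, with cyl converted to a factor; the names cars, fit, grid, and pred_by_level are hypothetical. It assumes, as the description above suggests, that the levels of a factor are weighted equally and that a numeric control is held at its sample mean.

```r
# Built-in mtcars data, used only as a stand-in; cyl is treated as a factor.
cars <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + hp + cyl, data = cars)

# One row per level of the factor control; the numeric control hp is held
# at its mean, and the variable of interest wt is held at 3.
grid <- expand.grid(wt = 3, hp = mean(cars$hp), cyl = levels(cars$cyl))
pred_by_level <- predict(fit, newdata = grid)

# Equally weighted average of the predictions across the levels of cyl.
mean(pred_by_level)
```

If the emmeans package is loaded, emmeans(fit, specs = "wt", at = list(wt = 3)) should report a matching value under these assumptions.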

Because the values at which we hold the control variables constant do not matter when we take the difference in \(E(Y|X)\) with the contrast() function introduced in the next section, we will specify the values only for the independent variable of our interest \(X\) in the at argument of the emmeans() function in the rest of this chapter.
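The claim that the values of the control variables do not affect the difference can be checked by hand. The sketch below again uses the built-in mtcars data as a stand-in, and diff_in_fit() is a hypothetical helper written for this illustration. The result holds for additive models such as the one in this chapter; it would not hold if \(X\) were interacted with a control variable.

```r
cars <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + hp + cyl, data = cars)

# Hypothetical helper: difference in fitted values between wt = 3 and wt = 2,
# with the control variables held at the supplied values.
diff_in_fit <- function(hp, cyl) {
  nd <- data.frame(wt = c(2, 3), hp = hp,
                   cyl = factor(cyl, levels = levels(cars$cyl)))
  p <- predict(fit, newdata = nd)
  unname(p[2] - p[1])
}

d1 <- diff_in_fit(hp = 100, cyl = "4")   # one set of control values
d2 <- diff_in_fit(hp = 250, cyl = "8")   # a very different set

# Both differences equal the coefficient on wt times the one-unit change.
c(d1, d2, unname(coef(fit)["wt"]))
```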

9.2.4 Statistical Inference for Difference in \(E(Y|X,Z^+)\): contrast()

In the second step, we will supply the result of the emmeans() function to the contrast() function to estimate the difference in \(E(Y|X)\) between different values of the independent variable of interest. In other words, we estimate \(E(Y|X^{**},Z^+) - E(Y|X^{*},Z^+)\) by this method.

Let’s estimate the difference in the conditional expectation of trudeau_therm_cps when percep_economy_cps_n = 1 (the respondent’s perception of economy is “Worse”) and when percep_economy_cps_n = 3 (the respondent’s perception of economy is “Better”), controlling for other variables.

Recall that we can estimate \(E(Y|X,Z^+)\) for each of these two values of percep_economy_cps_n as below. Note that I use the pipe operator (|>) in this code.

  model |>  
        # "model" will be piped to the emmeans() function as its first argument.
  
      emmeans( specs = "percep_economy_cps_n",  
               
               # We now specify two values --- 1 and 3 --- only for percep_economy_cps_n 
               # in the at argument.
               at = list( percep_economy_cps_n = c(1,3) ) )
 percep_economy_cps_n emmean   SE   df lower.CL upper.CL
                    1   29.6 1.02 2333     27.6     31.6
                    3   61.3 1.16 2333     59.0     63.6

Results are averaged over the levels of: union_d, province 
Confidence level used: 0.95 

We supply this output to the contrast() function to take the difference. We set the method argument of the contrast() function to pairwise, so that the contrast() function will estimate a pairwise difference in \(E(Y|X,Z^+)\) between the values of \(X\) specified in the emmeans() function.

  model |>   
      emmeans( specs = "percep_economy_cps_n",  
               at = list( percep_economy_cps_n = c(1,3) ) ) |> 
        # The output from the emmeans() function is piped to the contrast() function.

        # The method argument is set to "pairwise".
      contrast( method = "pairwise" )
 contrast                                      estimate   SE   df t.ratio
 percep_economy_cps_n1 - percep_economy_cps_n3    -31.6 1.58 2333 -19.967
 p.value
  <.0001

Results are averaged over the levels of: union_d, province 

In the above output, the first column, named contrast, lists a pairwise comparison of the values of the independent variable \(X\) that we specified. In this column, we can see percep_economy_cps_n1 - percep_economy_cps_n3, which means that the contrast here is 1 - 3 in the value of percep_economy_cps_n. This indicates that what was estimated here was the conditional expectation of trudeau_therm_cps when percep_economy_cps_n = 1 (“Worse”) minus the conditional expectation of trudeau_therm_cps when percep_economy_cps_n = 3 (“Better”), controlling for other variables. This is the expected change in trudeau_therm_cps when the percep_economy_cps variable changes from “Better” to “Worse”. Because the change in percep_economy_cps considered here corresponds to a two-unit decline of the value of percep_economy_cps_n, and the coefficient estimate on percep_economy_cps_n is positive (15.8163), the expected difference in trudeau_therm_cps in the current scenario is negative, \((-2) \times 15.8 = -31.6\). This is perhaps counterintuitive, because the estimated relationship between trudeau_therm_cps and percep_economy_cps_n is positive.

The problem here is that the pairwise difference computed was 1 - 3 in terms of the values of percep_economy_cps_n, or “Worse - Better” in terms of percep_economy_cps. However, what we want instead is a pairwise difference of 3 - 1 or “Better - Worse”. That is, we want to reverse the order of these values.

We can correct this in two ways. First, we may reverse the order of the values specified for percep_economy_cps_n in the at argument in the emmeans() function as in the following code.

  model |>   
      emmeans( specs = "percep_economy_cps_n",  
               at = list( percep_economy_cps_n = c(3,1) ) ) |> 
               # Now the values are specified as c(3,1) instead of c(1,3).               
  
      contrast( method = "pairwise")
 contrast                                      estimate   SE   df t.ratio
 percep_economy_cps_n3 - percep_economy_cps_n1     31.6 1.58 2333  19.967
 p.value
  <.0001

Results are averaged over the levels of: union_d, province 

Second, we may keep the original order of the values for percep_economy_cps_n in the emmeans() function, but change the method argument to "revpairwise" in the contrast() function. “revpairwise” stands for a reversed pairwise comparison. As its name suggests, if we use “revpairwise,” the contrast() function reverses the order of a pairwise comparison.

  model |>   
      emmeans( specs = "percep_economy_cps_n",  
               at = list(percep_economy_cps_n = c(1,3) ) ) |> 
               # Now the original order of the values --- c(1,3) --- is kept.               
  
        # The method argument is changed to "revpairwise".  
      contrast( method = "revpairwise")
 contrast                                      estimate   SE   df t.ratio
 percep_economy_cps_n3 - percep_economy_cps_n1     31.6 1.58 2333  19.967
 p.value
  <.0001

Results are averaged over the levels of: union_d, province 

Either way, now we can get the difference we want: the change in the conditional expectation of trudeau_therm_cps when percep_economy_cps_n changes from 1 to 3, or percep_economy_cps changes from “Worse” to “Better”, which may be computed as \(2 \times 15.8 = 31.6\).

In the above outputs, what follows after contrast and estimate are a few statistics for statistical inference about the estimated difference in \(E(Y|X,Z^+)\), such as its standard error (SE) and p-value (p.value). We can also derive the confidence interval for the difference in \(E(Y|X,Z^+)\) by setting the infer argument to TRUE in the contrast() function as in the code below.

  model |>   
      emmeans( specs = "percep_economy_cps_n",  
               at = list( percep_economy_cps_n = c(1,3) ) ) |> 
  
        # Now we add "infer = TRUE" in the contrast() function.
      contrast( method = "revpairwise", infer = TRUE )
 contrast                                      estimate   SE   df lower.CL
 percep_economy_cps_n3 - percep_economy_cps_n1     31.6 1.58 2333     28.5
 upper.CL t.ratio p.value
     34.7  19.967  <.0001

Results are averaged over the levels of: union_d, province 
Confidence level used: 0.95 

In this output, lower.CL indicates the lower end of the 95% confidence interval for the difference and upper.CL the upper end. According to this result, the 95% confidence interval for the difference in the conditional expectation of trudeau_therm_cps between those who perceived that the national economy had gotten better and those who perceived that it had gotten worse, controlling for the other variables included in the model, is [28.5, 34.7].

We can also change the confidence level by specifying a number between 0 and 1 in the level argument in the contrast() function. For example, if we specify level = 0.99 as in the code below, the contrast() function produces the 99% confidence interval for the difference in \(E(Y|X,Z^+)\).

  model |>   
      emmeans( specs = "percep_economy_cps_n",  
               at = list( percep_economy_cps_n = c(1,3) ) ) |> 
  
        # Now we add the level argument in the contrast() function.  
      contrast( method = "revpairwise", infer = TRUE, level = 0.99 )
 contrast                                      estimate   SE   df lower.CL
 percep_economy_cps_n3 - percep_economy_cps_n1     31.6 1.58 2333     27.5
 upper.CL t.ratio p.value
     35.7  19.967  <.0001

Results are averaged over the levels of: union_d, province 
Confidence level used: 0.99 
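As a cross-check on the two intervals reported above, they can be approximately reproduced by hand from the regression table in Section 9.1.3. The sketch below is base R only; the object names and the ci() helper are hypothetical, and the standard error is backed out from the rounded estimate and t value, so results agree only up to rounding. It uses the t distribution with the df shown in the output, which this class does not otherwise cover.

```r
# Numbers copied from the summ() output in Section 9.1.3.
beta_hat <- 15.8163   # coefficient on percep_economy_cps_n
t_val    <- 19.9667   # its t value
df_resid <- 2333      # df reported in the emmeans output

se_beta <- beta_hat / t_val   # standard error implied by Est. / t val.
delta_x <- 3 - 1              # change from "Worse" (1) to "Better" (3)

diff_hat <- beta_hat * delta_x       # estimated difference in E(Y|X,Z+)
se_diff  <- se_beta * abs(delta_x)   # its standard error

# Hypothetical helper: confidence interval at a given confidence level.
ci <- function(level) {
  crit <- qt(1 - (1 - level) / 2, df = df_resid)
  diff_hat + c(-1, 1) * crit * se_diff
}

round(c(estimate = diff_hat, SE = se_diff), 2)
round(ci(0.95), 1)
round(ci(0.99), 1)
```

The standard error of the difference is simply the coefficient's standard error multiplied by the size of the change in \(X\), which is why the t.ratio in the contrast() output equals the t value of the coefficient.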

In addition, we can also change the number of digits to appear. For this purpose, as in the code below, we need to use the summary() function and the print() function, consecutively, and we specify the number of digits we want in the print() function.

  model |>   
      emmeans( specs = "percep_economy_cps_n",  
               at = list( percep_economy_cps_n = c(1,3) ) ) |> 
      contrast( method = "revpairwise", infer = TRUE, level = 0.99 ) |> 
  
            # The output from the contrast() function is piped to the summary() function.
      summary() |> 
  
            # Then, the output from the summary() function is piped further to the print() 
            # function, in which we can specify the number of digits to appear 
            # in the output on the R Console.
      print( digits = 6 )
 contrast                                      estimate      SE   df lower.CL
 percep_economy_cps_n3 - percep_economy_cps_n1  31.6327 1.58427 2333  27.5485
 upper.CL t.ratio p.value
  35.7168  19.967  <.0001

Results are averaged over the levels of: union_d, province 
Confidence level used: 0.99 

9.2.5 Making Argument for Substantive Significance

I suggested before that you should estimate the difference in \(E(Y|X,Z^+)\) corresponding to a meaningful amount of change in the independent variable of your interest, \(X\). You should carefully consider what would be a meaningful change in your main independent variable \(X\). Note that there are no hard and fast rules to determine what a meaningful amount of change would be. Instead, this is something you need to carefully consider and make a reasonable argument. To think about what a meaningful amount of change would be for your independent variable \(X\), it is advisable to explore the distribution of \(X\) carefully by using visualization and/or its summary statistics.

As an example, I draw a bar chart for percep_economy_cps below. As you can see, although about half of the responses are concentrated on “Same,” the responses are reasonably spread across all three categories from “Worse” to “Better.” More specifically, of about 2,300 observations, approximately one third of the responses are “Worse” and a quarter are “Better”. This distribution of the responses may justify that the change from “Worse” to “Better” would be a meaningful change in percep_economy_cps in this sample, because there are reasonable fractions of observations in all these categories.
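A minimal sketch of this kind of exploration in base R is given below. The responses are hypothetical counts made up for illustration, not the actual ces2019 data, and the names percep, counts, and shares are assumptions of this sketch.

```r
# Hypothetical responses, NOT the actual ces2019 data.
percep <- factor(rep(c("Worse", "Same", "Better"), times = c(700, 1050, 550)),
                 levels = c("Worse", "Same", "Better"))

# Counts and shares of each category.
counts <- table(percep)
shares <- prop.table(counts)
round(shares, 2)

# A simple bar chart of the shares.
barplot(shares, ylab = "Proportion of responses")
```

With the data used in this chapter, the same steps would be applied to my_data$percep_economy_cps instead of the hypothetical percep vector.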

If most of the observations were concentrated on two consecutive categories, say “Worse” and “Same”, and only a tiny fraction of responses were in the last category, “Better”, as in the hypothetical bar chart below, then the change from “Worse” to “Better” might not have been justified as a meaningful change in percep_economy_cps in this hypothetical sample. In this case, a meaningful change may be from “Worse” to “Same” because this change encompasses most of the observations. The estimated change in the conditional expectations of trudeau_therm_cps corresponding to this change in percep_economy_cps is available immediately from the coefficient of percep_economy_cps_n as this is only a one-unit increase in percep_economy_cps_n.

Once you have determined the meaningful amount of change in \(X\), you can estimate the difference in \(E(Y|X,Z^+)\) by the emmeans() and contrast() functions, as discussed in Section 9.2.3 and Section 9.2.4. After that, you need to offer your argument about whether this estimated amount of change in \(E(Y|X,Z^+)\) is of substantively meaningful magnitude. The example above is relatively simple for this purpose because the estimated change in trudeau_therm_cps corresponding to the change in percep_economy_cps from “Worse” to “Better” is \(31.6\), which is about one-third of the entire range of trudeau_therm_cps, a large proportion.

When you evaluate the estimated difference in \(E(Y|X,Z^+)\) with respect to a meaningful change in \(X\), make sure you consider the entire range of its confidence interval. In the example above, the 99% confidence interval is [28.5, 34.7]. Even when we consider the lower end or the upper end of this interval, our conclusion will not change because either one is still about one-third of the entire range of trudeau_therm_cps. If the interval is so wide that the conclusion may differ depending on whether we consider the point estimate, the lower end of the interval, or the upper end, then we should reflect this difference in our discussion. For example, if the confidence interval for the estimated change in trudeau_therm_cps corresponding to the change in percep_economy_cps from “Worse” to “Better” were hypothetically [5.5, 57.7], then this expected change in trudeau_therm_cps would be as large as more than half of the entire range (57.7) or as small as about one twentieth of the entire range (5.5). Our discussion should reflect this inconclusiveness of the estimated magnitude in this hypothetical scenario.
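The comparison described above is simple arithmetic, and it may help to write it out. The sketch below uses only the numbers quoted in this section and assumes that trudeau_therm_cps runs from 0 to 100; the object names are mine and purely illustrative.

```r
# Values reported in the text above.
estimate <- 31.6
ci       <- c(lower = 28.5, upper = 34.7)
y_range  <- 100 - 0     # assumption: trudeau_therm_cps runs from 0 to 100

# Point estimate and both interval ends as shares of the range of Y.
shares <- c(lower = ci[["lower"]], estimate = estimate, upper = ci[["upper"]]) / y_range
shares

# The hypothetical wide interval discussed above.
wide_ci     <- c(lower = 5.5, upper = 57.7)
wide_shares <- wide_ci / y_range
wide_shares
```

In the first case all three shares are close to one-third, so the substantive conclusion is the same at either end of the interval. In the hypothetical case the two ends tell very different stories.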

In the above example, I compared the estimated difference in \(E(Y|X)\) with the entire range of the dependent variable, trudeau_therm_cps. Alternatively, you may compare the estimated difference in \(E(Y|X)\) with the IQR or the standard deviation of the dependent variable. If you use the IQR, then you compare the estimated difference in \(E(Y|X)\) with the amount of variation in \(Y\) in the middle half of your sample in terms of the value of \(Y\). If you use the standard deviation, then you compare it with a typical amount of deviation in the value of \(Y\) from the mean in your sample. You may also draw a histogram of your dependent variable and identify a meaningful range of values from the visual inspection of the distribution of \(Y\). It is of course fine if you can offer a reasonable discussion on the substantive magnitude of the difference in \(E(Y|X)\) without reference to a certain empirical range of the dependent variable. Again, there are no hard and fast rules to determine whether the estimated change in \(E(Y|X)\) is of substantive magnitude. Your creativity, logical and sensible reasoning, and convincing argumentation are called for.
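As a rough illustration of these alternative benchmarks, the sketch below computes the range, the IQR, and the standard deviation of a dependent variable and expresses an estimated difference relative to each. The vector y is simulated and only stands in for my_data$trudeau_therm_cps, so the resulting ratios are not results from the CES data.

```r
set.seed(1)
# Hypothetical stand-in for my_data$trudeau_therm_cps: 500 made-up scores on a 0-100 scale.
y <- round(runif(500, min = 0, max = 100))

estimate <- 31.6                  # estimated difference reported in the text

benchmarks <- c(
  range = diff(range(y)),         # maximum minus minimum
  iqr   = IQR(y),                 # spread of the middle half of the sample
  sd    = sd(y)                   # typical deviation from the mean
)

# Size of the estimated difference relative to each benchmark.
round(estimate / benchmarks, 2)
```

Which benchmark is most persuasive depends on the variable and the argument you want to make; the code only supplies the numbers.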

9.3 With Different Types of Independent Variable \(X\)

This section will discuss how we may consider a meaningful amount of change for different types of independent variable \(X\).

9.3.1 Ordinal Categorical Variable

9.3.1.1 Numeric Version

The main independent variable \(X\) in the example in the previous section (Section 9.2), percep_economy_cps_n, was a numeric version of an ordinal categorical variable. Therefore, if the independent variable of your interest is a numeric version of an ordinal categorical variable, then you may consider a meaningful change of this type of variable as we did in the previous section (Section 9.2).

9.3.1.2 Dummified Version

If the independent variable of your interest \(X\) is dummified for all categories except for one (i.e., a factor version of an ordinal categorical variable is included in the right-hand side of the linear regression model in R), you may consider it in a similar way to the example given below in the section on a nominal categorical variable (Section 9.3.4).

9.3.2 Quantitative Variable (numeric)

Now consider the difference in the conditional expectation of trudeau_therm_cps for a meaningful change in ideology controlling for other variables in the model estimated above. That is, now we consider ideology as our main independent variable \(X\).

Note that ideology is a quantitative variable.

9.3.2.1 Entire Range

One idea is to estimate the difference in \(E(Y|X)\) between the minimum and maximum values of ideology as we did for percep_economy_cps_n. As the minimum value of ideology is 0 (most leftist/liberal position), and the maximum is 10 (most rightist/conservative position), let’s specify these values for ideology in the at argument of the emmeans() function.

model |> 
  emmeans( specs = "ideology",  
           at = list( ideology = c(10, 0) ) ) |>
          # We specified the maximum and minimum values of ideology.
  
  contrast( method = "pairwise", infer = TRUE )
 contrast               estimate   SE   df lower.CL upper.CL t.ratio p.value
 ideology10 - ideology0    -26.6 2.52 2333    -31.6    -21.7 -10.549  <.0001

Results are averaged over the levels of: union_d, province 
Confidence level used: 0.95 

The estimated difference in the conditional mean of trudeau_therm_cps between the most conservative respondents (ideology = 10) and the most liberal respondents (ideology = 0), controlling for all other variables included in the model, is \(-26.6\) and its 95% confidence interval is [-31.6, -21.7].

As discussed before, we should carefully consider what a meaningful amount of change of ideology would be. For this purpose, let’s draw a bar chart of ideology below. Note that we draw a bar chart here because “ideology” takes only 11 values. If we examine a continuous variable or a discrete variable with many more values, it may be better to draw a histogram.

According to this bar chart, the respondents are spread across the entire range of ideology from the most leftist position (ideology = 0) to the most rightist position (ideology = 10). This observation may be used to justify the comparison of the conditional expectation of trudeau_therm_cps between the maximum and minimum values of ideology.

9.3.2.2 IQR

However, the bar chart also indicates that there are only a very small number of respondents in the extreme ideological positions, such as 0, 1, 9 and 10. We may want to use an alternative amount of change in ideology, which does not include these extreme ideological positions. For example, we may want to use the IQR of ideology, which represents the extent of variation in the values of ideology in the middle half of the sample. The IQR of ideology may be computed by the IQR() function.

  IQR(my_data$ideology, na.rm = TRUE)
[1] 3

For our purpose, however, we need the values of both lower and upper quartiles because we need to specify these values for ideology in the at argument of the emmeans() function. We can find the lower and upper quartiles of a variable by applying the summary() function to the variable.

  summary(my_data$ideology)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   4.000   5.000   5.061   7.000  10.000 

According to the above result, the lower quartile of ideology is 4 and the upper quartile is 7. Let’s specify these values in the at argument of the emmeans() function.

model |> 
  emmeans( specs = "ideology",  
           at = list( ideology = c(7, 4) ) ) |> 
                # We specified the upper and lower quartiles of ideology.  
  
  contrast( method = "pairwise", infer = TRUE )
 contrast              estimate    SE   df lower.CL upper.CL t.ratio p.value
 ideology7 - ideology4    -7.99 0.757 2333    -9.47     -6.5 -10.549  <.0001

Results are averaged over the levels of: union_d, province 
Confidence level used: 0.95 

The estimated difference in the conditional expectation of trudeau_therm_cps between the respondents with the upper quartile in ideology and those with the lower quartile is \(-7.99\) and its 95% confidence interval is [-9.47, -6.50].
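The quartiles read off the summary() output above can also be extracted with the base R quantile() function, which avoids copying numbers by hand. The sketch below uses a short made-up vector as a stand-in for my_data$ideology; the values are hypothetical, chosen so that the quartiles happen to be 4 and 7 as in the text.

```r
# Hypothetical stand-in for my_data$ideology (made-up values on the 0-10 scale).
ideology <- c(0, 2, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 7, 8, 10)

# Lower and upper quartiles.
quartiles <- quantile(ideology, probs = c(0.25, 0.75), na.rm = TRUE)
quartiles

# The gap between the two quartiles is the IQR.
all.equal(unname(diff(quartiles)), IQR(ideology))
```

The two values could then be supplied to the at argument, for example as at = list(ideology = rev(unname(quartiles))), to reproduce the upper-minus-lower ordering used above.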

9.3.2.3 One Standard Deviation Above and Below the Mean

Alternatively, we may want to use the standard deviation of ideology as a meaningful amount of change in this variable. We may use the sd() function to compute the standard deviation of a variable.

  sd(my_data$ideology)
[1] 2.21214

As the standard deviation represents a typical amount of deviation from the mean of a variable, we may consider the variation in ideology from one standard deviation below its mean to one standard deviation above its mean. The mean of ideology was reported above as \(5.06\) when we applied the summary() function to ideology. Given that the standard deviation of ideology reported above is \(2.21\), this range runs from one standard deviation below the mean (\(5.06 - 2.21 = 2.85\)) to one standard deviation above the mean (\(5.06 + 2.21 = 7.27\)). Let’s specify these values in the at argument of the emmeans() function.

model |> 
  emmeans( specs = "ideology",  
           at = list( ideology = c(7.27, 2.85) ) ) |> 
                # We specified one standard deviation above the mean 
                # and one standard deviation below the mean of ideology.  

  contrast( method = "pairwise", infer = TRUE ) |>       
  summary() |> 
  print( digits = 4 )
 contrast                    estimate    SE   df lower.CL upper.CL t.ratio
 ideology7.27 - ideology2.85   -11.77 1.116 2333   -13.96   -9.581 -10.549
 p.value
  <.0001

Results are averaged over the levels of: union_d, province 
Confidence level used: 0.95 

The estimated difference in the conditional mean of trudeau_therm_cps between the respondents with one standard deviation above the mean in ideology and those with one standard deviation below the mean is \(-11.77\), and its 95% confidence interval is [-13.96, -9.58].
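The two values passed to the at argument above were worked out by hand. A small sketch of the same arithmetic in R, using the mean and standard deviation printed earlier (the object names are mine):

```r
# Values reported above by summary() and sd().
ideology_mean <- 5.061
ideology_sd   <- 2.21214

# One standard deviation below and above the mean.
bounds <- ideology_mean + c(-1, 1) * ideology_sd
round(bounds, 2)
```

With the data at hand, the same bounds could be computed in one step as mean(my_data$ideology) + c(-1, 1) * sd(my_data$ideology).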

Note that ideology is a discrete variable which takes only integer values. It may not make sense to consider the values such as \(7.27\) and \(2.85\) as ideology never takes these values. In this case, it may be more appropriate to use other types of range discussed in the sections above (Section 9.3.2.1, Section 9.3.2.2) and below (Section 9.3.2.4).

If the independent variable of interest is continuous, then the use of standard deviation makes more sense, as decimals naturally appear in such a variable.

Here we considered a two standard deviation change of the independent variable \(X\), because we use the variation from one standard deviation below the mean to one standard deviation above the mean. Instead, we may also consider a one standard deviation change — e.g., the variation from the mean to one standard deviation above the mean. This amount of change is considered in my lecture when the substantive significance of economic voting in the U.S. presidential elections is discussed based on the national-level aggregate data.

9.3.2.4 From Visual Inspection

Another alternative may be the range chosen from a visual inspection of the distribution of \(X\). In the bar chart drawn above, we can see that the values of ideology in the medium range from 3 to 7, except for the middle point of 5, have a relatively similar number of respondents, which is almost twice the number of the scores right next to this range, 2 and 8. In other words, this medium range of 3 to 7 seems to stand out from the rest as the most popular range of left-right ideology. This observation may be used to justify the use of the difference in conditional expectation of trudeau_therm_cps in this range of ideology. In the code below, I used these values in the at argument of the emmeans() function.

model |> 
  emmeans( specs = "ideology",  
           at = list( ideology = c(7, 3) ) ) |> 
                # I derived these values from a visual inspection of
                # the bar chart of "ideology" drawn above.

  contrast( method = "pairwise", infer = TRUE ) |>       
  summary() |> 
  print( digits = 4 )
 contrast              estimate   SE   df lower.CL upper.CL t.ratio p.value
 ideology7 - ideology3   -10.65 1.01 2333   -12.63   -8.671 -10.549  <.0001

Results are averaged over the levels of: union_d, province 
Confidence level used: 0.95 

The estimated difference in the conditional expectation of trudeau_therm_cps in this visually identified middle range of ideology is \(-10.65\) and its 95% confidence interval is [-12.63, -8.67].

Again, there are no hard and fast rules to determine which one of the above ranges is most appropriate for ideology. You need to offer a reasonable and convincing justification for why you use a particular range. I would perhaps not use the mean plus/minus one standard deviation for ideology because ideology is discrete and never takes decimals. While the entire range of ideology is informative and legitimate in its own right, I would not rely solely on this range because the respondents with extreme values of ideology are very small in number. I would perhaps supplement the result for the entire range of ideology with the result for either the IQR of ideology or the range identified from a visual inspection.
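It may help to see why the t.ratio is \(-10.549\) in every output in this section. Because ideology enters the model as a single linear term with no interaction, each estimated difference is the ideology coefficient multiplied by the size of the change, and the standard error scales by the same factor. The sketch below checks this with base R on simulated data; the data frame toy and the helper pred_diff() are hypothetical and are not part of the chapter's analysis.

```r
set.seed(42)
# Hypothetical data (not the CES sample): y depends linearly on x and a control z.
n   <- 200
toy <- data.frame(x = sample(0:10, n, replace = TRUE), z = rnorm(n))
toy$y <- 60 - 2.5 * toy$x + 3 * toy$z + rnorm(n, sd = 10)

fit <- lm(y ~ x + z, data = toy)
b_x <- coef(fit)[["x"]]

# Predicted difference between x = hi and x = lo, holding z at the same value.
pred_diff <- function(hi, lo) {
  p <- predict(fit, newdata = data.frame(x = c(hi, lo), z = 0))
  unname(p[1] - p[2])
}

# Each difference equals the coefficient times the size of the change in x.
all.equal(pred_diff(10, 0), 10 * b_x)   # entire range
all.equal(pred_diff(7, 4),   3 * b_x)   # IQR-sized change
all.equal(pred_diff(7, 3),   4 * b_x)   # visually chosen range
```

The choice among the ranges therefore changes the scale of the reported difference, not the strength of the statistical evidence for it.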

9.3.3 Dummy Variable (logical)

If the independent variable of interest \(X\) is a dummy variable, the change from \(X = 0\) to \(X = 1\) is the appropriate meaningful change because a dummy variable takes no other values. Its estimate and confidence interval can be found immediately from the coefficient on the dummy independent variable \(X\) reported in the output from the summ() function.

In the output of the multiple linear regression model estimated in Section 9.1.3, the coefficient estimate of union_d is 0.60 and its 95% confidence interval is [-2.03, 3.21]. Therefore, the point estimate of the difference in the conditional expectations of trudeau_therm_cps between those who belong to the unions and those who don’t, holding other variables in the model constant, is 0.60 and its 95% confidence interval is [-2.03, 3.21].

Since this confidence interval includes zero, this is inconclusive evidence about whether there is any relationship between trudeau_therm_cps and union_d controlling for other variables. While it is inconclusive whether there is any positive or negative relationship, we can still say from this estimation result that even if there is either a positive or negative relationship between them, it is likely to be small, because the lowest end of the 95% confidence interval is only \(-2.03\) and the highest end is only \(3.21\). Given that the range of trudeau_therm_cps is between 0 and 100, and the responses are spread over the entire range of this variable, these numbers (\(-2.03\) and \(3.21\)) seem to indicate a very small magnitude of the difference between union members and non-members.
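If you prefer to pull the estimate and its interval out of the model object, the base R functions coef() and confint() can do so. The sketch below uses simulated data with a logical predictor; the data frame toy and the variable member are hypothetical stand-ins, and the numbers it prints are not CES results.

```r
set.seed(7)
# Hypothetical data (not the CES sample) with a logical predictor.
n   <- 300
toy <- data.frame(member = sample(c(TRUE, FALSE), n, replace = TRUE), z = rnorm(n))
toy$y <- 50 + 1 * toy$member + 5 * toy$z + rnorm(n, sd = 20)

fit <- lm(y ~ member + z, data = toy)

# Point estimate and 95% confidence interval for the TRUE-vs-FALSE difference.
est <- coef(fit)[["memberTRUE"]]
ci  <- confint(fit, parm = "memberTRUE", level = 0.95)
c(estimate = est, lower = ci[1, 1], upper = ci[1, 2])
```

Both ends of the interval can then be compared with the range of the dependent variable, as in the discussion of union_d above.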

9.3.4 Nominal Categorical Variable (factor)

If the independent variable of interest \(X\) is a nominal categorical variable, the variable should be dummified on the right-hand side of a linear regression model. Review Section 7.3 and Section 7.5 for detail. For a nominal categorical variable, a meaningful amount of change may be some substantively/theoretically meaningful comparison of the categories.

Consider province included in the multiple linear regression estimated in Section 9.1.3. As we saw in Section 7.3 and Section 7.5, the coefficient on each dummified variable represents the difference between \(E(Y|X,Z^+)\) of the dummified category and \(E(Y|X,Z^+)\) of the reference category, controlling for other variables included in the model. Therefore, if a meaningful comparison of the categories is between a certain category and the reference category, we can estimate the difference in \(E(Y|X,Z^+)\) immediately from the coefficient on the dummy variable for this category.

On the other hand, if a meaningful comparison of the categories is between certain categories other than the reference category, we need to estimate the difference in \(E(Y|X,Z^+)\) between these categories using the emmeans() and the contrast() functions.

Suppose we posit that a meaningful comparison is between Ontario and Saskatchewan. A choice like this, of course, needs to be justified. For example, we may argue that it is meaningful to compare a relatively urban province and a relatively rural province because urban areas tend to be more liberal while rural areas tend to be more conservative in developed democracies. Ontario is more urban than Saskatchewan because the biggest urban areas in Canada can be found in Ontario, and a larger proportion of the population lives in rural areas in Saskatchewan than in Ontario. Since neither category is the reference category, we should specify both categories in the at argument in the emmeans() function as below.

  model |>   
      emmeans( specs = "province",  
               at = list( province = c("ON","SK") ) ) |> 
              # Comparison is between "ON" and "SK."
  
      contrast( method = "pairwise", infer = TRUE ) |> 
      summary() |> 
      print( digits = 4 )
 contrast estimate   SE   df lower.CL upper.CL t.ratio p.value
 ON - SK      11.9 2.45 2333    7.092     16.7   4.855  <.0001

Results are averaged over the levels of: union_d 
Confidence level used: 0.95 

According to this result, the estimated difference in the conditional expectation of trudeau_therm_cps between Ontario and Saskatchewan is \(11.90\) and its 95% confidence interval is [7.09, 16.70].

Alternatively, suppose we posit that a meaningful comparison is between Ontario and Alberta. Because Alberta is the reference category of province, the difference in the conditional expectation of trudeau_therm_cps between Ontario and Alberta is the same as the coefficient on the dummy variable for Ontario (provinceON). According to the linear regression table reported in Section 9.1.3, this difference or the coefficient of provinceON is \(15.34\), and its 95% confidence interval is [10.70, 19.98].
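The two cases above, a comparison with the reference category and a comparison between two non-reference categories, can be sketched with base R on simulated data. The data frame toy, its three made-up categories, and all numbers are hypothetical; "AB" is set as the reference level to mirror the chapter.

```r
set.seed(3)
# Hypothetical data (not the CES sample): a three-category factor; "AB" is the reference level.
n   <- 300
toy <- data.frame(prov = factor(sample(c("AB", "ON", "SK"), n, replace = TRUE),
                                levels = c("AB", "ON", "SK")))
toy$y <- 40 + 15 * (toy$prov == "ON") + 3 * (toy$prov == "SK") + rnorm(n, sd = 20)

fit <- lm(y ~ prov, data = toy)
b   <- coef(fit)

# ON vs. AB (the reference): read directly from the coefficient.
on_vs_ab <- b[["provON"]]

# ON vs. SK (neither is the reference): difference between two coefficients.
on_vs_sk <- b[["provON"]] - b[["provSK"]]

# The same ON vs. SK difference, obtained from predicted values.
new_rows <- data.frame(prov = factor(c("ON", "SK"), levels = levels(toy$prov)))
p <- predict(fit, newdata = new_rows)
all.equal(unname(p[1] - p[2]), on_vs_sk)
```

This gives only the point estimate. The confidence interval for a difference between two non-reference categories depends on the covariance between the two coefficients, which is why the emmeans() and contrast() functions are the more convenient route.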


  1. Review the lecture materials for more details about “statistical significance.”

  2. It is possible that the estimated relationship seems relatively small in magnitude yet still substantively significant/important depending on its specific context. For example, suppose the magnitude of impact of a certain factor in local elections is found to be only about 10 votes, which seems very small, but if the number of votes needed to win an election is very small and turnout is low, even 10 votes could be decisive about who wins a race. In this case, although the magnitude of the impact seems relatively small, it is still meaningfully large in this particular context.

  3. c(0:10) produces a vector of consecutive numbers (integers) from 0 to 10.