library(tigris)

nyst <- states(cb = TRUE,
               year = 2010) %>%
  filter(NAME == "New York")

nyst_ct <- tracts(state = "NY",
                  cb = TRUE,
                  year = 2010)

ggplot(data = nyst) +
  geom_sf(color = "red",
          lwd = 2) +
  geom_sf(data = nyst_ct,
          fill = NA,
          color = "blue") +
  ggtitle(label = "New York State Census Tracts")
5 Macro demographic data analysis
Prior to the advent in the 1960s of large-scale social surveys like the General Social Survey (GSS), most demographic research was done not on individuals but on aggregates, because that's how data were available. If you look at texts such as Keyfitz (1968), all of the examples are national-level calculations, and many nations did not have sufficient data availability to produce quality statistical summaries of their populations. This resulted in publications such as the United Nations Population Division's famous Manual X (1983), which gave pragmatic formulas to measure a wide variety of demographic indicators at the national level using basic inputs, usually available from census summaries.
Paul Voss (2007) describes most demography (and certainly most demographic studies prior to the 1970s and 1980s) as macro demography. Voss also notes that prior to the availability of individual-level microdata, all demography was macro-demography, and most demographic studies were spatial in nature, because demographic data were only available for spatial units corresponding to administrative areas. Typical geographic areas would be counties, census tracts, ZIP codes, states, or nations.
In the macro-demographic perspective on demography, observations are typically places, areas, or some other aggregation of individuals. Often, we do not observe the individual people themselves. For example, if you had access to an aggregate count of deaths in a region, even if the deaths were classified by age and sex, you would still be dealing with data that ignore, or have no index to, the more nuanced characteristics of the individual decedents themselves. That being said, data such as these are invaluable, and most demographic summaries of individual-level data would aggregate based on the characteristics of the individuals anyway. The macro-scale principle is illustrated below, where all of the variables we observe are at a scale above the individual person.
Such macro-level propositions are hypothesized relationships among variables (\(\rightarrow\)) measured at a macro scale (\(Z\) and \(Y\)), which ignore individual-level data, mostly because we don't observe the individuals (\(x\) and \(y\)) in many of these kinds of analyses.
If all we looked at were the individuals within the population, we would be overwhelmed by the variation that we would see, and we wouldn't be doing statistics anymore; we would be trying to process a million anecdotes, and the plural of anecdote is not data. By aggregating across basic demographic groups, such as age and sex, demographers begin to tease apart the differences that we are interested in. If we go a little further and, data willing, aggregate not only across these fundamental demographic groups but also across some kind of place-based areal unit, then we are adding an extremely important part of human existence: the where part of where we live.
This presents an attractive view of populations, and data on places are typically more widely available, but there are caveats we must be aware of. If we are using purely aggregate data in our analysis, meaning that we do not have access to the individual-level microdata, then our ability to observe variation within a place is extremely limited, if not impossible.
The goal of this chapter is to illustrate how places are a special unit of analysis, and how the types of data we often see at the place level are very different from individual-level surveys. Additionally, the analysis of place-based data is similar to survey data in that places do not necessarily represent random observations, so analyzing data on places often requires special modifications to statistical models. In this chapter, I show how the linear regression model can be expanded in several ways, and I illustrate the generalized linear model as a very useful and extendable tool for analyzing data on places, especially when we are analyzing rates, as demographers often do.
5.2 Getting data on places
Typically when thinking about data on places, we are really referring to some sort of administrative geography, such as nations, states, regions, or census tracts. While these are often readily available (and I'll show some R packages that can easily get data from the web), we often have to use these as proxy measures of more interesting social spaces like neighborhoods and other types of activity spaces. These social spaces are harder to get data on, typically because they are more fluid in their definitions, and there is generally not a systematic effort to produce data on socially defined spaces at national scales. This is a big part of doing macro demography: defining the scale and the unit of analysis, both because we need to define the scope of our work, and because we are very much constrained by the availability of data for our projects.

For instance, I may want to look at national-scale inequality in mortality risk in neighborhoods in the United States, but I immediately face a couple of hurdles. No national data source identifies sub-city residential location for death certificates, and, moreover, what are neighborhoods? Again, they are probably some socially defined space that may not be available from a national-scale source. To get around this, we may have to settle for a state-level analysis, because state vital registration systems will often allow researchers to use more fine-scale geographic data on death certificates (such as the latitude/longitude of the decedent's residence), and once we have very fine-scale geographic data on the vital events, we could potentially find data on some more socially defined spaces, perhaps from cities, which often maintain geographic data on neighborhoods specific to that city. That is fine, but then you still run into the "what's my denominator" problem, where you have no baseline population data on the age and sex breakdown of the population, or even the population size of these places, because federal agencies don't produce estimates for such small-scale areas. This is frustrating. Often when advising students on their dissertation projects, I have to have this moment of truth where I lay out the problems of mixing geographic scales for their projects, and the hard reality of the lack of data on so many things they would like to study.

Often what happens is that we have to proxy our ideal places with places for which we can find data. You see this a lot in the population health literature, where people want to analyze neighborhoods but all they have are census tracts. Tracts aren't social spaces! They're arbitrary areas of 3 to 5 thousand people, redrawn every 10 years, that the Census Bureau uses to count people. Likewise, counties are very rich areas to find data for, but they are not really activity spaces or neighborhoods, although they may be areas that have some policy-making authority (such as county health departments) that could be relevant. States are also nice geographies; they are very large, so you lose the ability to contextualize behavior on a fine spatial scale, but states make a lot of decisions that affect the lives of their residents, often more than national decisions. States have become very popular units of analysis in the health literature again, primarily as a result of differential adoption of portions of the Patient Protection and Affordable Care Act of 2010 (Soni, Hendryx, and Simon 2017; Courtemanche et al. 2019).
This being said, many times when we do an analysis on places, that analysis has many limitations, which we must acknowledge. Analyses such as these are often called ecological analyses, because we are examining associations at the macro scale and we do not observe individual-level outcomes.
5.3 US contexts
The US Census Bureau produces a wide variety of geographic data products that are the most widely used forms of geographic data for demographic studies in the United States. The TIGER/Line files consist of geographic data with Census Bureau GEOIDs attached, so they can be linked to any number of federal statistical products. They do not contain demographic data themselves, but are easily linked to them. The tigris package in R provides a direct way to download any TIGER/Line file data type directly in an R session, as either a simple feature class or as a Spatial DataFrame (Walker 2021).
Using the tigris package is very easy, and its functions fit directly into the tidyverse as well. Below, I download two layers of information, first the state polygon for New York State, then the census tracts within the state, and overlay the two datasets on each other. The package has a function for each type of geography that you would want; for example, states() downloads state-level geographies and tracts() does the same for census tracts. The functions have some common arguments, including cb = TRUE/FALSE, so you can choose cartographic boundary files or not. Cartographic boundary files are lower-resolution, smaller files that are often used for thematic mapping. Also, the year = argument will allow you to get different annual vintages of the data. The tracts() function also allows you to obtain geographies for specific counties within a state.
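For instance, a minimal sketch (an assumed example, not from the text) requesting the 2010 cartographic boundary tracts for a single county, here New York County (FIPS code 061):

# Request tracts for one county only
nyc_tracts <- tracts(state = "NY",
                     county = "061",
                     cb = TRUE,
                     year = 2010)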
5.3.1 Tidycensus
Another package that provides access to the US Census Bureau decennial census summary files, the American Community Survey, Census population estimates, migration flow data, and Census Public Use Microdata Sample (PUMS) data is tidycensus (Walker and Herman 2021). The tidycensus package primarily allows users to use the Census Bureau's Application Programming Interface (API) to download Census summary file data for places within an R session. This removes the need to download separate files to your computer, and allows users to produce visualizations of Census data easily. The package is actively maintained and has several online tutorials on how to use it 1. Depending on which data source you are interested in, there are functions that allow extracts from them. The ACS data are accessed through the get_acs() function; likewise, the decennial census data are accessed using the get_decennial() function. The package also allows users to test for differences in ACS estimates, either across time or between areas, using the significance() function.
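As a brief sketch (not an example from the text), a get_decennial() call for county-level population might look like the following; "P001001" is assumed here to be the 2010 SF1 total population variable, which you can confirm with load_variables(2010, "sf1").

library(tidycensus)

# 2010 decennial census total population for New York counties
ny_pop <- get_decennial(geography = "county",
                        variables = "P001001",  # assumed 2010 SF1 total population code
                        year = 2010,
                        state = "NY")
head(ny_pop)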
The package requires users to obtain a developer API key from the Census Bureau's developer page2 and install it on their local computer. The package has a function that helps you install the key to your .Renviron file. It is used like this:
census_api_key(key = "yourkeyhere", install = TRUE)
which only needs to be done once.
A basic use of the tidycensus package is to get data and produce maps of the indicators. This is done easily because tidycensus fits directly into general dplyr and ggplot2 workflows. Below is an example of accessing 2018 5-year ACS data on poverty rate estimates for census tracts in New York County, New York. The syntax takes several arguments indicating what level of census geography you want, the year of the estimates, the details of states and counties you may want, and which ACS tables you want. Here I use the Data Profile table for the percentage estimate of families with incomes below the poverty line. The output = "wide" option is useful if you get multiple estimates, as it arranges them into columns, one for each estimate.
library(tidycensus)

nyny <- get_acs(geography = "tract",
                year = 2018,
                state = "NY",
                county = "061",
                variables = "DP03_0119PE",
                output = "wide",
                geometry = TRUE)
Getting data from the 2014-2018 5-year ACS
Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
Using the ACS Data Profile
The tabular output shows the Estimate column ending in E and the ACS margin of error column ending in M.
knitr::kable(x = head(nyny),
             format = "html")
GEOID | NAME | DP03_0119PE | DP03_0119PM | geometry |
---|---|---|---|---|
36061020101 | Census Tract 201.01, New York County, New York | 0.0 | 20.0 | MULTIPOLYGON (((-73.96155 4... |
36061020701 | Census Tract 207.01, New York County, New York | 24.1 | 17.0 | MULTIPOLYGON (((-73.95922 4... |
36061022200 | Census Tract 222, New York County, New York | 18.5 | 9.1 | MULTIPOLYGON (((-73.95068 4... |
36061022600 | Census Tract 226, New York County, New York | 15.5 | 9.4 | MULTIPOLYGON (((-73.94703 4... |
36061000600 | Census Tract 6, New York County, New York | 37.5 | 9.8 | MULTIPOLYGON (((-73.99256 4... |
36061001600 | Census Tract 16, New York County, New York | 22.2 | 8.5 | MULTIPOLYGON (((-73.99606 4... |
The geometry = TRUE option also downloads the TIGER/Line file for the requested geography and merges it to the ACS estimates. This allows you to immediately map the estimates for the requested geographies.
# Create map of estimates
nyny %>%
  rename(Poverty_Rt = DP03_0119PE) %>%
  ggplot(aes(fill = Poverty_Rt)) +
  geom_sf() +
  scale_fill_viridis_c() +
  ggtitle(label = "Poverty Rate in New York Census Tracts",
          subtitle = "2018 ACS Estimates")
The tidycensus package has a great array of functions, and its author, Kyle Walker, has published a book on using it, Analyzing US Census Data: Methods, Maps, and Models in R, which covers its many uses.
One common task that we should do when visualizing ACS estimates is to examine the coefficient of variation of the estimates. This gives us an idea of how stable the estimates are, which can be particularly problematic as we use smaller and smaller geographies in our analysis. Below, I calculate the coefficient of variation for the estimates and map it. To get the standard error of the ACS estimate, I divide the margin of error by 1.645, following Census Bureau recommendations (U.S. Census Bureau 2019).
nyny %>%
  mutate(cv = ifelse(test = DP03_0119PE == 0,
                     yes = 0,
                     no = (DP03_0119PM / 1.645) / DP03_0119PE)) %>%
  ggplot(aes(fill = cv)) +
  geom_sf() +
  scale_fill_viridis_c() +
  ggtitle(label = "Poverty Rate Coefficient of Variation\n in New York Census Tracts",
          subtitle = "2018 ACS Estimates")
which shows that the areas with the highest coefficients of variation are mostly adjacent to Central Park and on the lower west side of Manhattan. These are also the areas with the lowest poverty rates in the city, so the estimates have low precision because so few respondents report incomes below the poverty line.
5.3.2 IPUMS NHGIS
The IPUMS NHGIS project 3 is also a great source for demographic data on US places, and allows you to select many demographic tables for census data products going back to the 1790 census (Manson et al. 2021). When you perform an extract from the site, you can get both data tables and ESRI shapefiles for your requested geographies. The IPUMS staff have created several tutorials which go through how to construct a query from their site 4. Below, I use the sf library to read in the geographic data from IPUMS, read in the tabular data, and join them.
library(sf)

ipums_co <- read_sf("data/US_county_2020.shp")
Error: Cannot open "data/US_county_2020.shp"; The file doesn't seem to exist.
im_dat <- readr::read_csv("data/nhgis0025_ds231_2005_county.csv")
Rows: 3143 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): GISJOIN, AREANAME, STATE, STATEA, COUNTY, COUNTYA, DATAFLAG
dbl (5): YEAR, NOTECODE, AGWE001, AGWI001, AGWJ001
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
m_dat <- left_join(x = ipums_co,
                   y = im_dat,
                   by = c("GISJOIN" = "GISJOIN"))
Error in eval(expr, envir, enclos): object 'ipums_co' not found
m_dat %>%
  filter(STATE == "New York") %>%
  ggplot() +
  geom_sf(aes(fill = AGWJ001)) +
  scale_fill_viridis_c() +
  ggtitle(label = "Infant Mortality Rate per 10,000 Live Births",
          subtitle = "New York, 2005")
Error in eval(expr, envir, enclos): object 'm_dat' not found
5.3.3 International data
Sources of international data exist at numerous sites on the internet. Personally, I frequently use the DHS Spatial Data Repository 5 to access data from DHS-sampled countries. This repository allows you to obtain both spatial administrative boundary data and key indicators of maternal and child health at sub-national levels. Additionally, the rdhs package allows you to perform queries of the spatial data repository, as well as of the DHS microdata, directly via the DHS API from within an R session (Watson, FitzJohn, and Eaton 2019), assuming you have registered with the DHS and have an approved project with them.
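A minimal sketch of how such a query might begin, assuming you have a registered DHS project (the email and project name below are placeholders; see the rdhs vignettes for the full workflow):

library(rdhs)

# Store your DHS credentials (placeholders shown here)
set_rdhs_config(email = "you@example.com",
                project = "Your registered DHS project name")

# List available surveys for a country, e.g., Senegal, from 2015 onward
surveys <- dhs_surveys(countryIds = "SN",
                       surveyYearStart = 2015)
head(surveys)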
5.4 Statistical models for place-based data
Data on places are often analyzed in the same ways as data on individuals, with some notable complications. The remainder of this chapter introduces the regression framework for analyzing data at the macro level, first through a review of the linear model and its associated pitfalls, and then the generalized linear model, with a specific focus on the analysis of demographic count outcomes that are commonly observed for places.
In the example below, I use data from the U.S. Health Resources and Services Administration Area Health Resource File (AHRF), which is produced annually and includes a wealth of current and historical information on health infrastructure in U.S. counties, as well as data from the Census Bureau and the National Center for Health Statistics. The AHRF is publicly available, and we can read the data directly from the HHS website as a SAS-format .sas7bdat data set within a ZIP archive. R can download this file to your local computer and then read it using the commands below. I would strongly encourage you to consult the AHRF codebook available from the HRSA website6.
# Create a temporary file on your computer
temp <- tempfile()

# Download the SAS dataset as a ZIP compressed archive
download.file("https://data.hrsa.gov/DataDownload/AHRF/AHRF_2019-2020_SAS.zip", temp)

# Read the SAS data into R
ahrf <- haven::read_sas(unz(temp,
                            filename = "ahrf2020.sas7bdat"))
rm(temp)
Next, I remove many of the variables in the AHRF and recode several others. In the analysis examples that follow in this chapter, I will focus on the outcome of low birth weight births, measured at the county level.
library(tidyverse)

ahrf2 <- ahrf %>%
  mutate(cofips = f00004,
         coname = f00010,
         state = f00011,
         popn = f1198416,
         births1618 = f1254616,
         lowbw1618 = f1255316,
         fampov14 = f1443214,
         lbrate1618 = 1000*(f1255316/f1254616), # Rate per 1,000 births
         rucc = as.factor(f0002013),
         hpsa16 = case_when(.$f0978716 == 0 ~ 'no shortage',
                            .$f0978716 == 1 ~ 'whole county shortage',
                            .$f0978716 == 2 ~ 'partial county shortage'),
         obgyn15_pc = 1000*(f1168415 / f1198416)) %>%
  mutate(rucc = droplevels(rucc, "")) %>%
  dplyr::select(births1618,
                lowbw1618,
                lbrate1618,
                state,
                cofips,
                coname,
                popn,
                fampov14,
                rucc,
                hpsa16,
                obgyn15_pc) %>%
  filter(complete.cases(.)) %>%
  as.data.frame()
In order to make a nice-looking map of the outcome, I use the tigris package to fetch geographic data for US states and counties, then merge the county data to the AHRF data using left_join().
options(tigris_class = "sf")
library(tigris)
library(sf)

usco <- counties(cb = T, year = 2016)
usco$cofips <- usco$GEOID

sts <- states(cb = T, year = 2016)
sts <- st_boundary(sts) %>%
  filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78")) %>%
  st_transform(crs = 2163)

ahrf_m <- left_join(usco, ahrf2,
                    by = "cofips") %>%
  filter(is.na(lbrate1618) == F,
         !STATEFP %in% c("02", "15", "60", "66", "69", "72", "78")) %>%
  st_transform(crs = 2163)

glimpse(ahrf_m)
There are a total of 2,418 observations in the data, because the HRSA restricts some counties with small numbers of births from the data.
Here is a ggplot() histogram of the low birth weight rate for US counties.
ahrf_m %>%
  ggplot() +
  geom_histogram(aes(x = lbrate1618)) +
  labs(title = "Distribution of Low Birth Weight Rates in US Counties",
       subtitle = "2016 - 2018") +
  xlab("Rate per 1,000 Live Births") +
  ylab("Frequency")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here, we do a basic map of the outcome variable for the continental US, and see that the highest rates of low birth weight births in the US occur in the southeastern areas of the country. Notice that I do not color the boundaries of the counties, in order to maximize the reader's ability to see the variation; instead, I show lines between states by overlaying the sts layer from above. I also add the cartographic options of a scale bar and a north arrow, which I personally believe should be on any map shown to the public.
library(tmap)
tm_shape(ahrf_m)+
tm_polygons(col = "lbrate1618",
border.col = NULL,
title="Low Birth Weight Rt",
palette="Blues",
style="quantile",
n=5,
showNA=T, colorNA = "grey50")+
tm_format(format= "World",
main.title="US Low Birth Weight Rate by County",
legend.position = c("left", "bottom"),
main.title.position =c("center"))+
tm_scale_bar(position = c(.1,0))+
tm_compass()+
tm_shape(sts)+
tm_lines( col = "black")
When doing analysis of place-based data, maps are almost a fundamental aspect of the analysis and often convey much more information about the distribution of the outcome than either distribution plots or summary statistics.
5.5 The linear model framework
Probably the most used, and often abused, statistical model is the linear regression model, sometimes called the OLS model because it is typically estimated by the method of least squares. I do not plan on spending a lot of real estate in this book talking about the linear model, mostly because lots of times it doesn't get us very far, and there are much more thorough books on this subject, one of my personal favorites being John Fox's text on applied regression (Fox 2016).
This model is typically shown as:
\[y_i = \beta_0 +\sum_k \beta_k x_{ki} + \epsilon_i\]
with the \(\beta\)’s being parameters that define the linear relationship between the independent variables \(x_k\), and \(y\), and \(\epsilon_i\) being the unexplained, or residual portion of \(y\) that is included in the model. The model has several assumptions that we need to worry about, first being normality of the residuals, or
\[\epsilon_i \sim Normal(0, \sigma_\epsilon)\]
Where \(\sigma_\epsilon ^2\) is the residual variance, or mean square error of the model. Under the strict assumptions of the linear model, the Gauss-Markov theorem says that the unbiased, minimum variance estimate of the \(\beta\)'s is:
\[ \beta_k = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} = \frac{Cov(x,y)}{Var(x)} \]
Which is often shown in the more compact matrix form:
\[\beta_k = (X'X)^{-1} X'Y\]
We could just as directly write the model in its distributional form as:
\[ y_i \sim Normal(\beta_0 +\sum_k \beta_k x_{ki}, \sigma_\epsilon) \]
or even as:
\[ y_i \sim Normal(X' \beta, \sigma_\epsilon) \]
Which I prefer because it sets up the regression equation as the linear predictor, or linear combination (in the linear algebra sense) of the predictors and the model parameters for the mean of the outcome. This term, linear predictor, is a useful one, because as you get more and more accustomed to regression, you will see this same structure in every regression model you ever do. This is the fundamental workhorse of regression, where we get an estimated value for every combination of the observed predictors. Moreover, below when I present the Generalized Linear Model, it will be apparent that the linear predictor can be placed within a number of so-called link functions to ensure that the mean of the outcome agrees with the assumed distribution for the outcome.
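Before moving to estimation in R, a quick numerical check (a minimal sketch with simulated data, not the county data used below) shows that the matrix formula above reproduces what lm() returns:

# Simulate a simple bivariate relationship
set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

X <- cbind(1, x)                          # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

beta_hat                                  # matrix-formula estimates
coef(lm(y ~ x))                           # should match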
5.6 Estimating the linear model in R
The linear model is included in the base R stats package, and is accessed by the lm() function. In the example below, I use data from the U.S. Health Resources and Services Administration Area Health Resource File (AHRF) to estimate a model of the county low birth weight rate as a function of the poverty rate, the rurality of the county, and whether the county is a healthcare shortage area. This is a mixture of continuous (or partially continuous) and categorical predictors.
The basic lm() model syntax is specified as lm(y ~ x_1 + x_2, data = dataname), with the ~ operator representing the formula for the model equation, with the outcome on the left and the predictors on the right. You also provide the name of the dataframe which contains the variables specified in the formula in a data= argument. For help on the function and to see the other potential arguments, use ?lm.
lm1 <- lm(lbrate1618 ~ fampov14 + rucc + hpsa16, data = ahrf_m)
This stores the model data and parameter estimates in the object called lm1. You can name the object anything you wish; just try to avoid using other R commands as object names. For instance, I wouldn't want to call an object mean or sd, because those are the names of functions. The basic way to see the model results is to use the summary() function on the model fit.
summary(lm1)
Call:
lm(formula = lbrate1618 ~ fampov14 + rucc + hpsa16, data = ahrf_m)
Residuals:
Min 1Q Median 3Q Max
-90.114 -10.345 -1.229 9.287 77.778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.26920 1.18901 52.371 < 2e-16 ***
fampov14 2.25805 0.06846 32.981 < 2e-16 ***
rucc02 -1.91854 1.20082 -1.598 0.110249
rucc03 -3.38658 1.25055 -2.708 0.006818 **
rucc04 -6.65483 1.40639 -4.732 2.36e-06 ***
rucc05 -5.96283 1.93739 -3.078 0.002110 **
rucc06 -3.94180 1.14317 -3.448 0.000575 ***
rucc07 -4.00730 1.28166 -3.127 0.001790 **
rucc08 0.45112 2.11062 0.214 0.830770
rucc09 -3.88365 2.29118 -1.695 0.090202 .
hpsa16partial county shortage -0.66219 1.00767 -0.657 0.511148
hpsa16whole county shortage 2.36214 1.25335 1.885 0.059601 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.36 on 2312 degrees of freedom
Multiple R-squared: 0.3697, Adjusted R-squared: 0.3667
F-statistic: 123.3 on 11 and 2312 DF, p-value: < 2.2e-16
We see that there is a significant positive association between family poverty and the low birth weight rate. We also see a tendency for more rural areas to have lower low birth weight rates than the largest cities (reference level = rucc01). There is a marginally significant association between a county being a healthcare shortage area and the low birth weight rate. Overall, the model explains about a third of the variation in the outcome, as seen in the adjusted R-squared value of .3667.
5.6.0.1 A note on interpretation
It is important to remember, when describing results for place-based data, to avoid using language centered on individuals. For instance, with reference to the fampov14 variable, we cannot say that families living in poverty are more likely to have a low birth weight birth; instead, we must focus our discussion on places with higher rates of poverty having higher rates of low birth weight births. Ascribing individual risk from an ecological analysis is an example of the ecological fallacy often seen when doing place-based analysis, and we must be aware of it when framing our questions and describing our results.
The gtsummary package (Sjoberg et al. 2021) provides a very nice interface to produce much better-looking summaries of models.

library(gtsummary)

lm1 %>%
  tbl_regression(add_estimate_to_reference_rows = TRUE)
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
% Families Below Poverty Level 2014-18 | 2.3 | 2.1, 2.4 | <0.001 |
rucc | |||
01 | 0.00 | — | |
02 | -1.9 | -4.3, 0.44 | 0.11 |
03 | -3.4 | -5.8, -0.93 | 0.007 |
04 | -6.7 | -9.4, -3.9 | <0.001 |
05 | -6.0 | -9.8, -2.2 | 0.002 |
06 | -3.9 | -6.2, -1.7 | <0.001 |
07 | -4.0 | -6.5, -1.5 | 0.002 |
08 | 0.45 | -3.7, 4.6 | 0.8 |
09 | -3.9 | -8.4, 0.61 | 0.090 |
hpsa16 | |||
no shortage | 0.00 | — | |
partial county shortage | -0.66 | -2.6, 1.3 | 0.5 |
whole county shortage | 2.4 | -0.10, 4.8 | 0.060 |
1 CI = Confidence Interval |
5.7 Assumptions of the OLS model
The linear model has several assumptions that we need to be concerned with; the big four are:
- Normality of residuals,
- Constant variance in residuals, or homoskedasticity,
- Linearity of the regression function, and
- Independence of observations
The normality assumption is linked to the distributional assumption underlying the linear regression model. This states that the model residuals, calculated as \(e_i = y_i - (\beta_0 +\sum_k \beta_k x_{ki})\), or more compactly as \(e_i = y_i - \hat{y_i}\), follow a normal distribution. If the errors around the mean function are not normally distributed, this can be an indicator that the linear model is not appropriate for the outcome under consideration. A commonly used graphical check of this is the quantile-quantile or Q-Q plot, which plots the residuals from a model against the hypothetical quantiles from a normal distribution.
We can check these for our model above easily:
hist(residuals(lm1))
plot(density(resid(lm1)),
main = "Density plot of the residuals")
curve(dnorm(x,0,sd(resid(lm1))),
col = "blue", lwd =2, add=TRUE)
plot(lm1, which = 2)
While the overall distribution of the residuals is fairly normal based on the histogram and the comparison to the normal density plot, the q-q plot shows that the tails of the distribution are not well modeled by the normal distribution because there are several observations that are too far below or above the theoretical line (dotted line).
The homoskedasticity assumption is also tied to the normal distributional assumption of the model. As seen above, if we write the model in its distributional form, \(y_i \sim Normal(X' \beta, \sigma_\epsilon)\), the term \(\sigma_\epsilon\) is a single parameter, meaning that we only have one of these in a linear regression model. This parameter determines the spread of the variation around the mean function. Larger values equal more spread around the mean, and smaller values equal less spread. A commonly used graphical procedure to detect lack of homoskedasticity, or heteroskedasticity, is an envelope plot, or a plot of the residuals against the fitted values from the model. Formal tests also exist, including the Breusch-Pagan test and the modified version of this test developed by Cook and Weisberg (1983).
A graphical check of this assumption is easily done from the model fit:
plot(lm1, which = c(1,3))
These plots show some evidence that the error variances are non-constant. The first plot has the very characteristic "fish" or "cone" shape, where the error variances increase as the fitted values increase. We can also do a formal test using functions from the car package:
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
ncvTest(lm1)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 469.4844, Df = 1, p = < 2.22e-16
Which again shows more statistical evidence of the non-constant variance in residuals.
5.7.0.1 Correction for non-constant variance
The assumption of homoskedasticity is important for two reasons: the first is related to prediction from the model, and the second is related to the test statistics derived from the model. In order to test our hypothesis that \(x\) is related to \(y\), we form the ratio of the \(\beta_1\) parameter to its standard error; this is typically either a \(z\), \(t\), or Wald \(\chi^2\) statistic, depending on which procedure you're using.
\[t = \frac{\hat{\beta_1}}{se(\beta_1)}\] The term \(se(\beta_1)\) is the estimated standard error of the parameter, and is calculated using the ratio of the residual standard deviation and the square root of the sum of squares of the \(x\):
\[se(\beta_1) = \frac{\sigma_{\epsilon}}{\sqrt{\sum(x - \bar{x})^2}}\]
or in the matrix terms:
\[Var(\beta) = \sigma_{\epsilon}^2(X'X)^{-1}\]
if the term \(\sigma_{\epsilon}\) is not constant then the standard error of each parameter in the model is incorrect. Corrections for heteroskedasticity are commonplace in the social sciences, and are usually attributed to White (1980) and MacKinnon and White (1985) with many additions since the original publication, notably Long and Ervin (2000). These corrections use the empirically observed error terms and avoid the assumption of common variance in all residuals.
The coeftest() function in the lmtest package is one option to correct for heteroskedasticity in regression models. It allows for various correction types, with the "HC3" type (Long and Ervin 2000) being the default for linear models. Below, I show the default tests assuming constant variance, and then the corrected tests.
library(sandwich)
library(lmtest)
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
coeftest(lm1)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.269198 1.189010 52.3706 < 2.2e-16 ***
fampov14 2.258054 0.068465 32.9813 < 2.2e-16 ***
rucc02 -1.918536 1.200819 -1.5977 0.1102488
rucc03 -3.386577 1.250553 -2.7081 0.0068176 **
rucc04 -6.654825 1.406391 -4.7318 2.359e-06 ***
rucc05 -5.962826 1.937392 -3.0778 0.0021101 **
rucc06 -3.941799 1.143175 -3.4481 0.0005746 ***
rucc07 -4.007304 1.281664 -3.1266 0.0017901 **
rucc08 0.451120 2.110622 0.2137 0.8307703
rucc09 -3.883651 2.291182 -1.6950 0.0902020 .
hpsa16partial county shortage -0.662194 1.007671 -0.6572 0.5111478
hpsa16whole county shortage 2.362138 1.253348 1.8847 0.0596007 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(lm1, vcov = vcovHC(lm1, type = "HC3"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.26920 1.14941 54.1749 < 2.2e-16 ***
fampov14 2.25805 0.11349 19.8966 < 2.2e-16 ***
rucc02 -1.91854 0.97531 -1.9671 0.0492896 *
rucc03 -3.38658 1.06364 -3.1839 0.0014722 **
rucc04 -6.65483 1.20922 -5.5034 4.137e-08 ***
rucc05 -5.96283 1.95622 -3.0481 0.0023288 **
rucc06 -3.94180 1.09704 -3.5931 0.0003336 ***
rucc07 -4.00730 1.30130 -3.0795 0.0020981 **
rucc08 0.45112 2.42803 0.1858 0.8526203
rucc09 -3.88365 3.29696 -1.1779 0.2389381
hpsa16partial county shortage -0.66219 0.85442 -0.7750 0.4384068
hpsa16whole county shortage 2.36214 1.26921 1.8611 0.0628553 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The takeaway in this example is that non-constant variance can affect the standard errors of the model parameter estimates, and in turn affect the test statistics that we base all of our hypothesis tests on. In the particulars of this model there is not a lot of change; the rucc02 parameter is barely significant when using the corrected standard errors, but otherwise we see a very similar pattern in terms of what is significant in the model.
5.7.0.2 Clustered standard errors
Another commonly used correction in regression modeling is the clustered standard error. These are commonplace, almost the default, in the Stata programming environment, and are widely used in the field of economics. Clustering of standard errors attempts to correct for clustering in the residuals from the regression model. Clustering can happen for a wide variety of reasons and, as we saw in the previous chapter on survey data analysis, is often an artifact of how the data are collected. With place-based data, we may have clustering because the places are close to each other, or because they share some other characteristic that we have not measured in our regression model. In the case of our regression model, and in our descriptive analysis of our outcome, the map shown earlier may indicate some form of spatial correlation in the outcome. While there are models to deal with such non-independence in place-based data, they are not a subject I will touch on here. Instead, we may use the state each county is in as a proxy for the spatial clustering in the outcome, as one example of a potential clustering term.
coeftest(lm1,
vcov = vcovCL(lm1, cluster = ahrf_m$state))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.26920 2.43737 25.5477 < 2.2e-16 ***
fampov14 2.25805 0.28830 7.8322 7.243e-15 ***
rucc02 -1.91854 1.15345 -1.6633 0.096387 .
rucc03 -3.38658 1.42381 -2.3785 0.017462 *
rucc04 -6.65483 1.35050 -4.9277 8.912e-07 ***
rucc05 -5.96283 2.60314 -2.2906 0.022075 *
rucc06 -3.94180 1.28117 -3.0767 0.002118 **
rucc07 -4.00730 2.11299 -1.8965 0.058017 .
rucc08 0.45112 2.59621 0.1738 0.862069
rucc09 -3.88365 4.73238 -0.8207 0.411927
hpsa16partial county shortage -0.66219 1.18695 -0.5579 0.576971
hpsa16whole county shortage 2.36214 1.80407 1.3093 0.190549
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, the rucc02 and rucc07 terms become marginally significant, and the healthcare shortage areas lose their marginal significance.
5.7.0.3 Weighted Least Squares
Another method of dealing with non-constant error variance is the method of weighted least squares. This method modifies the model somewhat to be:
\[y \sim Normal(\beta_0 +\sum_k \beta_k x_{ki}, \sigma_{\epsilon_i} )\]
Where the term \(\sigma_{\epsilon_i}\) represents the different variances for each observation. The weights in the model are often variables that represent an underlying factor that affects the variance in the estimates. In demography this is often the population size of a place, as places with smaller population sizes often have more volatility to their rate estimates. This approach has been used in the spatial demographic modeling of county mortality rates in the United States by several authors (McLaughlin et al. 2007; Sparks and Sparks 2010).
lm2 <- lm(lbrate1618 ~ fampov14 + rucc + hpsa16,
          data = ahrf_m,
          weights = popn)
summary(lm2)
Call:
lm(formula = lbrate1618 ~ fampov14 + rucc + hpsa16, data = ahrf_m,
weights = popn)
Weighted Residuals:
Min 1Q Median 3Q Max
-42957 -2122 -106 2333 27618
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.54503 0.98536 66.519 < 2e-16 ***
fampov14 1.83332 0.06338 28.928 < 2e-16 ***
rucc02 -0.63464 0.66623 -0.953 0.340901
rucc03 -1.67371 0.93793 -1.784 0.074479 .
rucc04 -4.38387 1.31520 -3.333 0.000872 ***
rucc05 -3.31398 2.16308 -1.532 0.125642
rucc06 -1.87074 1.35213 -1.384 0.166630
rucc07 -2.12596 1.80385 -1.179 0.238693
rucc08 3.92060 4.33118 0.905 0.365452
rucc09 -0.21442 4.93030 -0.043 0.965315
hpsa16partial county shortage -1.85070 0.93518 -1.979 0.047938 *
hpsa16whole county shortage 0.21786 1.65417 0.132 0.895229
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4620 on 2312 degrees of freedom
Multiple R-squared: 0.2832, Adjusted R-squared: 0.2798
F-statistic: 83.05 on 11 and 2312 DF, p-value: < 2.2e-16
We see that by including the population size weights in the model, most of the parameters are no longer significant in the analysis, but the weights have not dealt with the non-constant variance issue totally:
ncvTest(lm2)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 84.49228, Df = 1, p = < 2.22e-16
But the overall size of the non-constant variance test is much lower than it was for the original model.
5.7.1 Linearity assumption
The linearity assumption of the model assumes that the true underlying relationship in the data can be modeled using a linear combination of the predictors and the parameters. I think a lot of people believe this means that you cannot include square or polynomial terms in a regression model, but that is not the case. The assumption is concerned with the linearity of the parameters, not the predictor variables themselves. For example, the standard linear model with one predictor, \(x\), is written:
\[y = \beta_0 + \beta_1 x_1 + \epsilon\]
Which is clearly the equation for a straight line with y-intercept \(\beta_0\) and slope \(\beta_1\). We also see that the parameters combine in a linear (additive) fashion. This is the assumption of the model, and can also be seen when expressing this equation using vector notation
\[y = x' \beta\]
Because the term \(x' \beta\) is the inner product of the \(\beta\) parameters and the information from \(x\). If we include the square of \(x\) in the model:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon\] the same inner, additive combination of the \(\beta\)s and the \(x\)s still holds. If we constructed a model that was a non-linear function of the \(\beta\)s, such as:
\[y = \beta_0 + \beta_1 x_1 * e^{\beta_2 x_1^2} + \epsilon\] then the model would no longer be linear in the parameters, because we introduce a nonlinearity by exponentiating the \(\beta_2\) parameter inside the mean function (not that this is a real model, just an example).
Where this actually enters our experience is when our data are generated by a nonlinear process, such as a time series with seasonality, which may oscillate between seasons (such as temperature or rainfall). For instance, a situation such as this arises:
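A minimal sketch of such a data-generating process, assuming the object names t and y2 that the spline example below reuses:

# Simulate a noisy cosine curve
set.seed(1234)
t <- seq(from = 0, to = 4 * pi, length.out = 200)
y2 <- cos(t) + rnorm(n = length(t), mean = 0, sd = 0.25)

# A straight-line fit cannot recover the oscillation
lm_bad <- lm(y2 ~ t)
plot(t, y2, main = "Linear Fit to a Nonlinear Outcome")
abline(lm_bad, lty = 2)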
Here, the variable y2 is generated from a cosine curve with noise. Since the linear regression of y2 on t forces the prediction to change linearly with t, the model is absolutely unable to recover the underlying pattern in the data.
A note on splines
If the data clearly do not display a linear trend, I personally look to splines as a means to model the non-linearity in the relationship. Splines are a method of constructing a model by connecting either linear or non-linear functions across a series of breaks, or knots, along the data, at which the form of the function changes.
Mathematically, a spline function with knots can be written as:
\[ Y(x) = \begin{Bmatrix} F_1(x) \text{ for } x\in [x_1, x_2]\\ F_2(x) \text{ for } x\in [x_2, x_3]\\ \cdots \\ F_{k}(x) \text{ for } x\in [x_{k-1}, x_k]\\ \end{Bmatrix} \]
Where each of the \(F_k (x)\) functions implies a different form in the interval \(x\in [x_{k-1}, x_k]\), and the \(k\) breaks are at given knots in the data. Most splines are nonlinear functions, usually cubic polynomials, and the spline model combines a series of these polynomials to model nonlinearities in the relationship between predictors and outcomes. A relatively recent invention, Generalized Additive Models, or GAMs, are a way to model an outcome with both linear and non-linear terms together. In a GAM, the linear predictor of the model can be constructed as:
\[E(y)= \beta_0 + f(x_1) + \beta_1 x_2\] where the \(f(x_1)\) term is a regression spline of one of the variables. The models can be a mixture of linear and smooth terms. Here is an example of using a B-spline within the lm() model to fit a smooth regression function to the messy nonlinear data above.
library(splines)

sm <- lm(y2 ~ bs(t, df = 5))

plot(t, y2, t = "p",
     ylim = range(y2) * c(1, 1.2),
     main = "Nice Spline Model,\nFit to Nonlinear Outcome",
     ylab = "y",
     xlab = "x")

t <- seq(from = min(t), to = max(t), length.out = length(y2))

lines(t, predict(lm(y2 ~ bs(t, df = 5)),
                 data.frame(t = t), lwd = 1.5),
      col = 3)
I really think splines and GAMs are an excellent addition to the modeling world, and I have started teaching them in my own methods courses. In the world of demography especially, with age affecting everything, there is no need to constantly assume relationships are purely linear, and splines offer an excellent method to explore such relationships.
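As a sketch (not an example from the text), a GAM version of the county model from earlier, with a smooth term for the poverty rate, could be fit with the mgcv package:

library(mgcv)

# Smooth effect of poverty; rurality and shortage status remain categorical terms
gam1 <- gam(lbrate1618 ~ s(fampov14) + rucc + hpsa16,
            data = ahrf_m)
summary(gam1)
plot(gam1)  # visualize the estimated smooth term for fampov14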
If these assumptions are violated, then several things can happen. At best, our interpretation of the model coefficients could be wrong, meaning that, as seen above, our model would suggest one relationship from the data, but in fact because the model was misspecified, the relationship we discover is incorrect. Our poor linear model would predict a decline in the outcome, while the outcome itself is perfectly stationary, as shown by the dashed line.
In a more social science sense, the interpretation of the beta coefficient for the effect of \(x\) on \(y\) in this case would provide us with a false conclusion about the relationship in the data. This is a really dangerous outcome for us in social science, because answering such questions is why we are doing statistics in the first place.
The normality assumption above primarily affects predictions from the model. Since the normal distribution is bounded by \(-\infty\) and \(\infty\), violating it can easily lead to a prediction outside of the realm of possibility, say for a dichotomous outcome or a count, which cannot have predicted values outside of 0 and 1, or less than 0, respectively.
5.7.2 Generalized Least Squares
Once our data start to violate the assumptions of the linear model, the model becomes less and less useful. For instance, why make all of the strict assumptions about homoskedastic (I am contractually required by the statistics union to say this at least once) variances in the model residuals in order to use OLS, when you can use its friend and brother, Generalized Least Squares (GLS, of course), which allows you to make useful and pragmatic changes to the OLS model structure to accommodate all of the fun and annoying things about real data, while still using the normal distribution to model our outcomes.
Generalized Least Squares adds a lot more flexibility to modeling normally distributed outcomes, basically by allowing us to modify the fundamental equations above to accommodate unequal variances, or the use of covariates or stratification variables on variances. Another way to write the OLS model above would be:
\(y \sim Normal(X'\beta, I\sigma_\epsilon)\) Where \(I\) is the identity matrix, which implies that for each observation, the variances of the residuals are all the same:
\[ \sigma_{\epsilon} = I * \sigma_{\epsilon} = \begin{Bmatrix} 1& 0& 0 \\ 0& 1& 0 \\ 0& 0& 1\\ \end{Bmatrix} *\sigma_{\epsilon} = \begin{Bmatrix} \sigma_{\epsilon}& 0& 0 \\ 0& \sigma_{\epsilon}& 0 \\ 0& 0 & \sigma_{\epsilon} \\ \end{Bmatrix} \]
Which shows the common variance for the three residuals. GLS allows us to relax this constant variance assumption, by at the minimum allowing the variances to be a function of a weighting variable (which produces Weighted Least Squares), or some covariate. In the most basic presentation of this principle, this makes the residuals have some other, non-constant form of:
\(\sigma_{\epsilon} = \Omega\) which in turn modifies the estimation equation for the \(\beta\)’s to:
\(\beta_{k_{GLS}} = (X' \Omega^{-1} X)^{-1} X' \Omega^{-1} Y\) Applications of such models are more commonly seen in time series modeling and longitudinal analysis, where the residuals of the model often have an autoregressive form to allow individuals to be correlated with themselves over time, but when talking about place-based demography, more modifications of the model have been derived that allow for addressing another key assumption of independence among observations and nonconstant variances. This in fact is the realm of an entire field of econometrics, often called spatial econometrics (Anselin 1988; Chi and Zhu 2020; Elhorst 2014; LeSage and Pace 2009).
The gls() function in the nlme library is very flexible at modeling heteroskedasticity using several types of variance functions. The general principle of the gls model in terms of modeling heteroskedasticity is that the OLS model residual variance \(\sigma_{\epsilon}\) is no longer a constant, but a function of covariates, or a set of different variances based on a stratification variable. This generates a model for the residual variances of the form:
\[Var(\epsilon_i |\beta) = \sigma^2 g\]
Where the \(g\) function can be the effect of a covariate, which would allow the variance to increase or decrease as a function of that variable, or a set of strata, where the variance can be different in two or more groups. Below, I first show the model above fit using gls() instead of lm(), then extend the model to include heteroskedasticity based on the population size of the county, and then state-specific variance terms (Pinheiro and Bates 2000).
library(nlme)
Attaching package: 'nlme'
The following object is masked from 'package:dplyr':
collapse
lm1g <- gls(lbrate1618 ~ fampov14 + rucc + hpsa16,
            data = ahrf_m)
summary(lm1g)
Generalized least squares fit by REML
Model: lbrate1618 ~ fampov14 + rucc + hpsa16
Data: ahrf_m
AIC BIC logLik
19580.52 19655.22 -9777.261
Coefficients:
Value Std.Error t-value p-value
(Intercept) 62.26920 1.1890098 52.37063 0.0000
fampov14 2.25805 0.0684647 32.98129 0.0000
rucc02 -1.91854 1.2008189 -1.59769 0.1102
rucc03 -3.38658 1.2505531 -2.70806 0.0068
rucc04 -6.65483 1.4063909 -4.73185 0.0000
rucc05 -5.96283 1.9373921 -3.07776 0.0021
rucc06 -3.94180 1.1431747 -3.44812 0.0006
rucc07 -4.00730 1.2816639 -3.12664 0.0018
rucc08 0.45112 2.1106217 0.21374 0.8308
rucc09 -3.88365 2.2911822 -1.69504 0.0902
hpsa16partial county shortage -0.66219 1.0076706 -0.65715 0.5111
hpsa16whole county shortage 2.36214 1.2533476 1.88466 0.0596
Correlation:
(Intr) fmpv14 rucc02 rucc03 rucc04 rucc05 rucc06
fampov14 -0.381
rucc02 -0.364 -0.121
rucc03 -0.342 -0.114 0.462
rucc04 -0.298 -0.160 0.416 0.399
rucc05 -0.207 -0.144 0.305 0.293 0.269
rucc06 -0.317 -0.245 0.521 0.501 0.453 0.336
rucc07 -0.281 -0.242 0.467 0.449 0.410 0.304 0.525
rucc08 -0.140 -0.163 0.288 0.278 0.248 0.186 0.344
rucc09 -0.101 -0.226 0.274 0.265 0.241 0.182 0.334
hpsa16partial county shortage -0.560 -0.131 -0.068 -0.078 -0.043 -0.024 -0.070
hpsa16whole county shortage -0.390 -0.235 -0.035 -0.045 0.017 0.015 -0.105
rucc07 rucc08 rucc09 hps16pcs
fampov14
rucc02
rucc03
rucc04
rucc05
rucc06
rucc07
rucc08 0.298
rucc09 0.292 0.204
hpsa16partial county shortage -0.058 -0.051 -0.037
hpsa16whole county shortage -0.035 -0.132 -0.097 0.693
Standardized residuals:
Min Q1 Med Q3 Max
-5.50851038 -0.63237431 -0.07515361 0.56768444 4.75446234
Residual standard error: 16.35905
Degrees of freedom: 2324 total; 2312 residual
These are the same results, in terms of the regression coefficients, as returned by lm(). If we include a varFixed() term, we can regress the residual variance term on a covariate; in this case I select the population size.
lm2g <- gls(lbrate1618 ~ fampov14 + rucc + hpsa16,
            data = ahrf_m,
            weights = varFixed(~popn))
summary(lm2g)
Generalized least squares fit by REML
Model: lbrate1618 ~ fampov14 + rucc + hpsa16
Data: ahrf_m
AIC BIC logLik
21560.43 21635.12 -10767.21
Variance function:
Structure: fixed weights
Formula: ~popn
Coefficients:
Value Std.Error t-value p-value
(Intercept) 64.51684 1.7992819 35.85700 0.0000
fampov14 2.02451 0.0672245 30.11574 0.0000
rucc02 -3.65253 2.0140861 -1.81349 0.0699
rucc03 -3.74951 1.9836516 -1.89021 0.0589
rucc04 -6.48571 2.3336913 -2.77916 0.0055
rucc05 -6.03115 2.9497652 -2.04462 0.0410
rucc06 -2.53166 1.6616043 -1.52362 0.1277
rucc07 -2.30731 1.7300102 -1.33369 0.1824
rucc08 0.54783 2.1293518 0.25727 0.7970
rucc09 -2.85405 2.1820264 -1.30798 0.1910
hpsa16partial county shortage 0.18574 1.3009964 0.14277 0.8865
hpsa16whole county shortage 5.22180 1.4198726 3.67766 0.0002
Correlation:
(Intr) fmpv14 rucc02 rucc03 rucc04 rucc05 rucc06
fampov14 -0.269
rucc02 -0.537 -0.065
rucc03 -0.550 -0.045 0.536
rucc04 -0.466 -0.088 0.463 0.466
rucc05 -0.363 -0.085 0.367 0.370 0.328
rucc06 -0.603 -0.163 0.649 0.655 0.568 0.452
rucc07 -0.591 -0.169 0.627 0.631 0.555 0.441 0.776
rucc08 -0.431 -0.174 0.508 0.513 0.439 0.350 0.637
rucc09 -0.409 -0.234 0.500 0.503 0.435 0.348 0.632
hpsa16partial county shortage -0.464 -0.092 -0.062 -0.062 -0.063 -0.048 -0.109
hpsa16whole county shortage -0.429 -0.215 -0.016 -0.027 0.038 0.031 -0.050
rucc07 rucc08 rucc09 hps16pcs
fampov14
rucc02
rucc03
rucc04
rucc05
rucc06
rucc07
rucc08 0.609
rucc09 0.605 0.512
hpsa16partial county shortage -0.100 -0.101 -0.087
hpsa16whole county shortage 0.001 -0.099 -0.070 0.767
Standardized residuals:
Min Q1 Med Q3 Max
-7.55351050 -0.48060084 -0.06047952 0.31143188 8.68535299
Residual standard error: 0.1103656
Degrees of freedom: 2324 total; 2312 residual
This model shows some very different effects after controlling for non-constant variance. For instance, the whole county shortage effect is now much more significant in the model, compared to the lm1 model.

The final model includes a separate variance for each state using the varIdent() term.
lm3 <- gls(lbrate1618 ~ fampov14 + rucc + hpsa16,
           data = ahrf_m,
           weights = varIdent(form = ~1|factor(state)))
summary(lm3)
Generalized least squares fit by REML
Model: lbrate1618 ~ fampov14 + rucc + hpsa16
Data: ahrf_m
AIC BIC logLik
19232.57 19583.07 -9555.287
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | factor(state)
Parameter estimates:
21 01 04 05 06 08 09 11
1.0000000 1.6653572 1.4666877 1.3444082 1.4169906 2.3439093 0.4153761 0.9349384
12 13 16 17 18 48 49 50
0.8691605 1.4618211 1.2492319 0.8531463 0.7272325 1.0046612 0.9955712 0.6782617
22 23 24 25 26 27 28 29
1.5206083 0.5562842 1.0764207 0.3632039 0.8797361 0.8694446 2.1388815 1.0671070
30 33 34 35 36 37 38 39
0.8593438 0.3518410 0.5857512 1.6675040 0.8834383 1.2756652 1.3451314 0.7590721
40 41 42 44 45 46 47 51
0.9096844 0.9789355 0.7160118 0.7725377 1.4955639 2.2706999 0.9232718 1.4150226
53 54 55 10 32 19 20 56
1.1651279 1.4044153 0.8102840 0.8493449 1.3739156 0.8028595 0.9380637 1.5713111
31
1.1601422
Coefficients:
Value Std.Error t-value p-value
(Intercept) 61.00371 0.9636746 63.30323 0.0000
fampov14 2.20339 0.0626079 35.19354 0.0000
rucc02 -1.55250 0.9300585 -1.66925 0.0952
rucc03 -2.67599 0.9989933 -2.67869 0.0074
rucc04 -6.75071 1.0541928 -6.40368 0.0000
rucc05 -7.07431 1.6290634 -4.34256 0.0000
rucc06 -5.33031 0.8964666 -5.94591 0.0000
rucc07 -6.20026 1.0294913 -6.02264 0.0000
rucc08 -0.38337 1.8548174 -0.20669 0.8363
rucc09 -3.80880 2.0564547 -1.85212 0.0641
hpsa16partial county shortage -0.40513 0.8049182 -0.50332 0.6148
hpsa16whole county shortage 0.69954 1.0571871 0.66170 0.5082
Correlation:
(Intr) fmpv14 rucc02 rucc03 rucc04 rucc05 rucc06
fampov14 -0.425
rucc02 -0.330 -0.118
rucc03 -0.300 -0.112 0.415
rucc04 -0.281 -0.137 0.396 0.369
rucc05 -0.155 -0.132 0.262 0.244 0.236
rucc06 -0.275 -0.238 0.476 0.446 0.423 0.284
rucc07 -0.232 -0.230 0.417 0.390 0.374 0.253 0.467
rucc08 -0.105 -0.142 0.235 0.221 0.207 0.140 0.280
rucc09 -0.058 -0.215 0.222 0.209 0.199 0.138 0.272
hpsa16partial county shortage -0.592 -0.112 -0.037 -0.042 -0.028 -0.026 -0.037
hpsa16whole county shortage -0.390 -0.189 -0.015 -0.026 0.019 0.016 -0.106
rucc07 rucc08 rucc09 hps16pcs
fampov14
rucc02
rucc03
rucc04
rucc05
rucc06
rucc07
rucc08 0.236
rucc09 0.233 0.147
hpsa16partial county shortage -0.036 -0.032 -0.019
hpsa16whole county shortage -0.031 -0.110 -0.076 0.654
Standardized residuals:
Min Q1 Med Q3 Max
-3.71110304 -0.57404142 0.07995833 0.73280651 3.80484317
Residual standard error: 13.75474
Degrees of freedom: 2324 total; 2312 residual
This model also provides results similar to those from lm1. So, which model is better for this particular outcome? One way to examine relative model fit is to compare the Akaike Information Criterion (AIC) for the three models. The AIC consists of two components: a term showing the overall model deviance, or residual variation, and a penalty term for the number of parameters in the model. A general form of it is:
\[ AIC = -2LL(\theta) + 2k \] Where the term \(-2LL(\theta)\) is the model -2 Log likelihood, or deviance, and \(2k\) is a penalty term with \(k\) being the number of parameters.
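As a quick check of this formula, the AIC for one of the models above can be computed by hand from its log-likelihood and parameter count and compared with the built-in AIC() function; a minimal sketch, assuming the lm3 fit is still in the workspace:
ll <- logLik(lm3)                          # log-likelihood of the gls() fit
-2 * as.numeric(ll) + 2 * attr(ll, "df")   # -2LL + 2k, computed by hand
AIC(lm3)                                   # matches the built-in calculation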
Since the three models lm1g
, lm2
and lm3
are all fit by gls()
, and fit to the same dataset, we can compare them. We can even add another model with a smooth spline effect of poverty:
lm3s <- gls(lbrate1618 ~ bs(fampov14, df = 4) + rucc + hpsa16,
            data = ahrf_m,
            weights = varIdent(form = ~1 | factor(state)))
summary(lm3s)
Generalized least squares fit by REML
Model: lbrate1618 ~ bs(fampov14, df = 4) + rucc + hpsa16
Data: ahrf_m
AIC BIC logLik
19166.97 19534.62 -9519.486
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | factor(state)
Parameter estimates:
21 01 04 05 06 08 09 11
1.0000000 1.8015440 1.5283795 1.4399605 1.5393732 2.5102463 0.4584789 0.8961499
12 13 16 17 18 48 49 50
0.9169369 1.5786894 1.2845194 0.9010857 0.7815679 1.0169676 1.0666628 0.7347100
22 23 24 25 26 27 28 29
1.7020601 0.6180479 1.1587213 0.4065278 0.9556412 0.9113779 2.4411415 1.1501547
30 33 34 35 36 37 38 39
0.9098260 0.3629158 0.6330773 1.6758907 0.9583268 1.3320870 1.3647418 0.8178477
40 41 42 44 45 46 47 51
0.9867867 1.0609522 0.7586694 0.8150356 1.5564278 1.0001279 0.9815430 1.5006645
53 54 55 10 32 19 20 56
1.2497825 1.4843275 0.8554764 0.9154187 1.4561259 0.8426759 1.0163896 1.7112344
31
1.2457365
Coefficients:
Value Std.Error t-value p-value
(Intercept) 65.28605 2.227553 29.308421 0.0000
bs(fampov14, df = 4)1 4.58369 3.031480 1.512031 0.1307
bs(fampov14, df = 4)2 50.98375 3.137696 16.248784 0.0000
bs(fampov14, df = 4)3 72.74204 8.455071 8.603362 0.0000
bs(fampov14, df = 4)4 15.96517 11.310702 1.411510 0.1582
rucc02 -1.78937 0.947673 -1.888176 0.0591
rucc03 -2.99112 1.007407 -2.969130 0.0030
rucc04 -6.96833 1.065192 -6.541851 0.0000
rucc05 -7.95180 1.605224 -4.953700 0.0000
rucc06 -5.72170 0.906461 -6.312131 0.0000
rucc07 -6.38566 1.028234 -6.210321 0.0000
rucc08 -0.84020 1.849541 -0.454273 0.6497
rucc09 -3.84206 2.006058 -1.915231 0.0556
hpsa16partial county shortage -0.54729 0.798950 -0.685016 0.4934
hpsa16whole county shortage 0.60030 1.043145 0.575469 0.5650
Correlation:
(Intr) b(14,d=4)1 b(14,d=4)2 b(14,d=4)3
bs(fampov14, df = 4)1 -0.907
bs(fampov14, df = 4)2 -0.157 -0.005
bs(fampov14, df = 4)3 -0.527 0.664 -0.533
bs(fampov14, df = 4)4 -0.089 0.053 0.318 -0.304
rucc02 0.011 -0.166 -0.133 -0.073
rucc03 -0.008 -0.129 -0.146 -0.034
rucc04 -0.002 -0.132 -0.134 -0.060
rucc05 0.006 -0.088 -0.128 -0.034
rucc06 0.001 -0.142 -0.190 -0.081
rucc07 -0.011 -0.120 -0.139 -0.100
rucc08 -0.015 -0.040 -0.115 -0.019
rucc09 0.006 -0.063 -0.076 -0.095
hpsa16partial county shortage -0.166 -0.091 -0.123 -0.029
hpsa16whole county shortage -0.143 -0.042 -0.096 -0.055
b(14,d=4)4 rucc02 rucc03 rucc04 rucc05 rucc06
bs(fampov14, df = 4)1
bs(fampov14, df = 4)2
bs(fampov14, df = 4)3
bs(fampov14, df = 4)4
rucc02 -0.036
rucc03 -0.038 0.435
rucc04 -0.035 0.414 0.388
rucc05 -0.032 0.282 0.265 0.255
rucc06 -0.043 0.495 0.466 0.441 0.304
rucc07 -0.035 0.433 0.406 0.390 0.269 0.482
rucc08 -0.019 0.243 0.231 0.216 0.150 0.290
rucc09 -0.114 0.231 0.216 0.207 0.146 0.283
hpsa16partial county shortage -0.031 -0.012 -0.023 -0.009 -0.003 -0.013
hpsa16whole county shortage -0.054 -0.006 -0.019 0.026 0.027 -0.096
rucc07 rucc08 rucc09 hps16pcs
bs(fampov14, df = 4)1
bs(fampov14, df = 4)2
bs(fampov14, df = 4)3
bs(fampov14, df = 4)4
rucc02
rucc03
rucc04
rucc05
rucc06
rucc07
rucc08 0.243
rucc09 0.245 0.152
hpsa16partial county shortage -0.015 -0.022 -0.015
hpsa16whole county shortage -0.018 -0.107 -0.073 0.650
Standardized residuals:
Min Q1 Med Q3 Max
-3.66477973 -0.58141552 0.08014607 0.72668723 3.87024262
Residual standard error: 12.88614
Degrees of freedom: 2324 total; 2309 residual
AIC( lm1g, lm2g, lm3, lm3s)
Warning in AIC.default(lm1g, lm2g, lm3, lm3s): models are not all fitted to the
same number of observations
df AIC
lm1g 13 19580.52
lm2g 13 21560.43
lm3 61 19232.57
lm3s 64 19166.97
The AIC()
function calculates AIC for each model, and we see that lm3s
has the lowest AIC of the four, suggesting that the non-constant variance across states is a better representation than the effect of population size on the residual variance, and that the nonlinear effect of poverty is also present in this case.
5.7.2.1 Further model comparisons
R has a general method of comparing models using \(F\) tests or Likelihood Ratio Tests. These are often used when comparing nested models, where one model is a simplified version of another. We have such models above in our gls()
models. The lm1g
model is a simplified version of the lm3
model because it doesn’t contain the extra parameters modeling the unequal variances. The anova()
method can compare the models to see if the extra parameters are explaining the model deviance (or variation) better.
anova(lm1g, lm3)
Model df AIC BIC logLik Test L.Ratio p-value
lm1g 1 13 19580.52 19655.22 -9777.261
lm3 2 61 19232.57 19583.07 -9555.287 1 vs 2 443.9476 <.0001
Here the likelihood ratio test (L.Ratio) shows a significant improvement in the log-likelihood for model lm3, suggesting that it explains the data better than lm1g does. The likelihood ratio statistic is calculated as \(2*(LL_2 - LL_1)\), or in the case above \(2*(-9555.287 - (-9777.261))\). The anova() method is very useful for comparing alternative models and can be used on most of the models shown in this book.
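The likelihood ratio statistic reported by anova() can also be reproduced by hand from the two log-likelihoods; a minimal check, assuming lm1g and lm3 are still in the workspace:
# likelihood ratio statistic: twice the gain in log-likelihood from lm1g to lm3
as.numeric(2 * (logLik(lm3) - logLik(lm1g)))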
5.7.3 Predictions and marginal means
Working with fitted values from a regression is one of the least taught aspects of modeling. Since the models are fundamentally doing very fancy averaging, the estimated, or fitted values from the model can often be very useful to us as we try to explain the results of our models. Remember, the fitted values of the model are just:
\[ \hat{y}_i = \sum \hat{\beta_k} x_{ki} \]
or the linear combination of the estimated model parameters and the \(k\) observed predictors, \(x_{ki}\). Most models in R have a fitted()
method to extract the fitted values of the outcome variable for each observation.
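Equivalently, the fitted values are just the design matrix multiplied by the estimated coefficients; a minimal sketch using the lm1 OLS model (assuming it is still in the workspace):
X <- model.matrix(lm1)        # design matrix holding the observed predictors
head(X %*% coef(lm1))         # linear combination X * beta-hat
head(fitted(lm1))             # same values via the fitted() method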
For example, here are the first six fitted values from the original lm1
OLS model, the gls()
heteroskedastic model lm3
model, and the first six values of the outcome:
library(gt)
fits <- data.frame(
  name = head(ahrf_m$NAME),
  lm1 = head(fitted(lm1)),
  lm3 = head(fitted(lm3)),
  observed = head(ahrf_m$lbrate1618)
)

fits %>%
  gt()
name | lm1 | lm3 | observed |
---|---|---|---|
Anderson | 83.16599 | 79.91074 | 87.78626 |
Bullitt | 79.44563 | 78.00540 | 84.47205 |
Clark | 85.43028 | 84.16478 | 89.20188 |
Daviess | 86.20507 | 84.98880 | 81.26411 |
Hardin | 82.15580 | 81.27857 | 78.47222 |
Harrison | 83.16599 | 79.91074 | 81.89655 |
We can also easily construct the map of the observed, fitted values and residuals from the model using tmap
.
ahrf_m$fitted_lm3 <- predict(lm3)

ahrf_m$resid <- resid(lm3)
actual <- tm_shape(ahrf_m)+
  tm_polygons("lbrate1618",
              palette = "Blues",
              n = 6,
              style = "fisher",
              border.col = NULL,
              colorNA = "grey50")+
  tm_shape(sts)+
  tm_lines(col = "black")

fit <- tm_shape(ahrf_m)+
  tm_polygons("fitted_lm3",
              palette = "Blues",
              n = 6,
              style = "fisher",
              border.col = NULL,
              colorNA = "grey50")+
  tm_shape(sts)+
  tm_lines(col = "black")

resids <- tm_shape(ahrf_m)+
  tm_polygons("resid",
              palette = "RdBu",
              n = 6,
              style = "fisher",
              midpoint = NA,
              border.col = NULL,
              colorNA = "grey50")+
  tm_shape(sts)+
  tm_lines(col = "black")
tmap_arrange(actual, fit, resids, nrow=2)
This map layout shows the observed rates, the fitted rates from the lm3
model and the residuals from the model. Mapping residuals from models run on place-based data can show areas within the data where the model is consistently under or over-estimating the outcome. For instance in the example above, we see consistently high residuals in several of the south eastern states, and the mountain west, and several areas where the residuals are consistently negative, meaning we are over-estimating the rate. This can be instructive as to other variables we may be leaving out of the model that follow similar spatial distributions of the residuals. If such spatially patterned residuals are observed, it is generally a good idea to consider a statistical model that incorporates some kind of spatially explicit model specification.
5.7.3.1 More on predicted values
There are more systematic methods for generating predictions from a model that allow us to marginalize the estimates across other variables and to generate counterfactual, or hypothetical, rates using combinations of \(x\) values that may not be observed in the data. The emmeans
package is very good at this and also accommodates many types of models. For instance if we would like the marginal means for counties by their healthcare shortage area type, after controlling for the other variables in our model, we can request that.
library(emmeans)
rg <- ref_grid(lm3s)

mu1 <- emmeans(rg, specs = "hpsa16")

mu1 %>%
  as.data.frame() %>%
  gt()
hpsa16 | emmean | SE | df | lower.CL | upper.CL |
---|---|---|---|---|---|
no shortage | 83.82779 | 0.8441041 | 2261 | 82.17249 | 85.48309 |
partial county shortage | 83.28049 | 0.4890943 | 2261 | 82.32137 | 84.23962 |
whole county shortage | 84.42809 | 0.7923959 | 2261 | 82.87419 | 85.98198 |
Which unsurprisingly does not tell us much, because the health care shortage area variable in lm3
was not a significant predictor of the low birth weight rate. We can include more than one margin in the function, and we can supply specific values we wish to highlight. For instance, to compare hypothetical counties at several different poverty rates, we can use summary(ahrf_m$fampov14) to see the quartiles, mean, and range of the distribution, and use those values as our theoretical ones.
rg <- ref_grid(lm3,
               at = list(fampov14 = c(1.8, 7.8, 11.6, 14.3, 52.1)))

means <- emmeans(rg, specs = c("fampov14", "rucc"))

means %>%
  as.data.frame() %>%
  ggplot(aes(x = rucc, y = emmean)) +
  geom_line(aes(group = factor(fampov14), color = factor(fampov14))) +
  theme_classic()
This plot illustrates how the low birth weight rate differs by poverty rate across the rural-urban continuum. Such marginal means are very useful for illustrating the effects of covariates in regression models, and especially for illustrating interactions. For instance, if we estimate the model below, which interacts the poverty rate with the rural-urban continuum code:
lm3i <- gls(lbrate1618 ~ fampov14 * rucc + hpsa16,
            data = ahrf_m,
            weights = varIdent(form = ~1 | factor(state)))

rg <- ref_grid(lm3i,
               at = list(fampov14 = c(7.8, 11.6, 14.3)))

means <- emmeans(rg, specs = c("fampov14", "rucc"))

means %>%
  as.data.frame() %>%
  ggplot(aes(x = rucc, y = emmean)) +
  geom_line(aes(group = factor(fampov14), color = factor(fampov14))) +
  theme_classic()
We can see that in the most rural areas, (the 09
level of rucc
), the differences by poverty rate are less than in more metropolitan areas, such as the 02
or 03
levels.
5.7.3.2 Use of OLS for place-based models
The OLS model and its extensions are a very useful starting point when analyzing data on places. If for nothing more than the interpretive ease of its estimates, it presents a very attractive choice for ecological modeling. Extending the model through weighted and generalized least squares allows for more flexible modeling that accommodates the non-constant variance that often arises. Further extensions through the techniques of spatial econometrics allow for direct incorporation of spatially correlated and lagged effects of covariates and model error terms, to better deal with the idiosyncrasies of place-based data (Chi and Zhu 2020; Elhorst 2014; LeSage and Pace 2009). These methods have seen wide use in demographic research over the past twenty years. Despite the fundamental flexibility of the OLS model, it may still not be the best solution for modeling demographic rates. One glaring reason is that if our outcomes are measured as rates, which are effectively probabilities, then the model can easily produce predicted values that are negative or greater than one, either of which is a problem for a bounded outcome. In fact, this is why I am a strong proponent of not using the linear model for estimating probabilities. The next section of the book turns to the use of the Generalized Linear Model (Nelder and Wedderburn 1972; McCullagh and Nelder 1998) as an alternative modeling strategy, especially when considering place-based data measured either as rates or as relative risks.
5.8 Basics of Generalized Linear Models
Up until now, we have been relying on linear statistical models that assume the Normal distribution for our outcomes. A broader class of regression models is the Generalized Linear Model (Nelder and Wedderburn 1972; McCullagh and Nelder 1998), or GLM, which allows the estimation of a linear regression specification for outcomes that are not assumed to come from a Normal distribution. GLMs are a class of statistical models with three underlying components: a probability distribution appropriate to the outcome, a link function, and a linear predictor. The link function is a mathematical function that links the mean of the specified probability distribution to the linear predictor of regression parameters and covariates. For example, the Normal distribution used by the OLS model has mean \(\mu\), which is typically estimated using the linear mean function \(\mu = \beta_0 + \beta_1 x_1\), which describes the line that estimates the mean of the outcome variable as a linear function of the \(\beta\) parameters and the predictor variable \(x_1\). The OLS, or Gaussian GLM, uses an identity link, meaning there is no transformation of the linear mean function as it is connected to the mean of the outcome. This can be written as:
\[g(u) = g(E(Y)) = \beta_0 + \beta_1 x_1\]
Where \(g()\) is the link function, linking the mean of the Normal distribution to the linear mean function of the model. The equivalent GLM model to the lm1
model from the previous section is:
glm1 <- glm(lbrate1618 ~ fampov14 + rucc + hpsa16,
            data = ahrf_m,
            family = gaussian)
library(texreg)
Version: 1.38.6
Date: 2022-04-06
Author: Philip Leifeld (University of Essex)
Consider submitting praise using the praise or praise_interactive functions.
Please cite the JSS article in your publications -- see citation("texreg").
Attaching package: 'texreg'
The following object is masked from 'package:tidyr':
extract
texreg(list(glm1, lm1), file = "4_1.tex")
The table was written to the file '4_1.tex'.
lm1_t <- lm1 %>%
  tbl_regression()

glm1_t <- glm1 %>%
  tbl_regression()

t_m <- tbl_merge(
  tbls = list(lm1_t, glm1_t),
  tab_spanner = c("**OLS**", "**GLM**")
)

t_m
Characteristic | OLS | GLM | ||||
---|---|---|---|---|---|---|
Beta | 95% CI1 | p-value | Beta | 95% CI1 | p-value | |
% Families Below Poverty Level 2014-18 | 2.3 | 2.1, 2.4 | <0.001 | 2.3 | 2.1, 2.4 | <0.001 |
rucc | ||||||
01 | — | — | — | — | ||
02 | -1.9 | -4.3, 0.44 | 0.11 | -1.9 | -4.3, 0.44 | 0.11 |
03 | -3.4 | -5.8, -0.93 | 0.007 | -3.4 | -5.8, -0.94 | 0.007 |
04 | -6.7 | -9.4, -3.9 | <0.001 | -6.7 | -9.4, -3.9 | <0.001 |
05 | -6.0 | -9.8, -2.2 | 0.002 | -6.0 | -9.8, -2.2 | 0.002 |
06 | -3.9 | -6.2, -1.7 | <0.001 | -3.9 | -6.2, -1.7 | <0.001 |
07 | -4.0 | -6.5, -1.5 | 0.002 | -4.0 | -6.5, -1.5 | 0.002 |
08 | 0.45 | -3.7, 4.6 | 0.8 | 0.45 | -3.7, 4.6 | 0.8 |
09 | -3.9 | -8.4, 0.61 | 0.090 | -3.9 | -8.4, 0.61 | 0.090 |
hpsa16 | ||||||
no shortage | — | — | — | — | ||
partial county shortage | -0.66 | -2.6, 1.3 | 0.5 | -0.66 | -2.6, 1.3 | 0.5 |
whole county shortage | 2.4 | -0.10, 4.8 | 0.060 | 2.4 | -0.09, 4.8 | 0.060 |
1 CI = Confidence Interval |
This shows exactly the same results for both models, as it should. The output shown by summary(lm1) and summary(glm1) looks different, though, but the same quantities can be recovered.
summary(glm1)
Call:
glm(formula = lbrate1618 ~ fampov14 + rucc + hpsa16, family = gaussian,
data = ahrf_m)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.26920 1.18901 52.371 < 2e-16 ***
fampov14 2.25805 0.06846 32.981 < 2e-16 ***
rucc02 -1.91854 1.20082 -1.598 0.110249
rucc03 -3.38658 1.25055 -2.708 0.006818 **
rucc04 -6.65483 1.40639 -4.732 2.36e-06 ***
rucc05 -5.96283 1.93739 -3.078 0.002110 **
rucc06 -3.94180 1.14317 -3.448 0.000575 ***
rucc07 -4.00730 1.28166 -3.127 0.001790 **
rucc08 0.45112 2.11062 0.214 0.830770
rucc09 -3.88365 2.29118 -1.695 0.090202 .
hpsa16partial county shortage -0.66219 1.00767 -0.657 0.511148
hpsa16whole county shortage 2.36214 1.25335 1.885 0.059601 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 267.6184)
Null deviance: 981607 on 2323 degrees of freedom
Residual deviance: 618734 on 2312 degrees of freedom
AIC: 19599
Number of Fisher Scoring iterations: 2
This output shows the same coefficients, and hypothesis test results compared to summary(lm1)
, but the residual variances are reported differently. The GLM summary reports the Null and Residual deviance instead of the Residual standard errors reported by summary(lm1)
. If we take the residual deviance and divide it by the residual degrees of freedom, and take the square root, we get the residual standard error reported by summary(lm1)
:
sqrt(glm1$deviance/glm1$df.residual)
[1] 16.35905
summary(lm1)$sigma
[1] 16.35905
The deviance in the GLM model is calculated in the same way as the residual sums of squares:
sum((fitted(lm1)-ahrf_m$lbrate1618 )^2)
[1] 618733.7
glm1$deviance
[1] 618733.7
We are not limited to the identity link for the Gaussian GLM; for instance, the logarithmic link function changes the model to:
\[ \ln(Y) = \beta_0 + \beta_1 x_1 \\ Y = \exp(\beta_0 + \beta_1 x_1) \\ Y = \exp(\beta_0) * \exp(\beta_1 x_1) \] so that, under the log link, the model is no longer additive in the parameters but multiplicative.
glm2 <- glm(lbrate1618 ~ fampov14 + rucc + hpsa16,
            data = ahrf_m,
            family = gaussian(link = "log"))
AIC(glm1, glm2)
df AIC
glm1 13 19599.34
glm2 13 19687.44
In this case, the identity link function is preferred because of the lower AIC.
Different distributions use different link functions.
The identity link function is appropriate for the Normal distribution, because this distribution can take any value from \(- \infty\) to \(\infty\), and so the linear mean function can also take those values, theoretically. Other distributions may not have this wide of a numeric range, so appropriate link functions have to be used to transform the linear mean function to the scale of the mean of a particular distribution. The most common distributions for the generalized linear model and their common link functions are shown below, along with common expressions for the mean and variance of their respective distributions.
Distribution | Mean | Variance | Link Function | Range of Outcome |
---|---|---|---|---|
Gaussian | \(\mu\) | \(\sigma^2\) | Identity | \((-\infty , \infty)\) |
Binomial | \(\pi\) | \(n\pi(1-\pi)\) | \(log \left (\frac{\pi}{1-\pi} \right )\) | \(\frac{0,1,2,...n}{n}\) |
Poisson | \(\lambda\) | \(\lambda\) | \(log (\lambda)\) | \((0,1,2,...)\) |
Gamma | \(\mu\) | \(\phi \mu^2\) | \(log (\mu)\) | \((0, \infty)\) |
Negative Binomial | \(n(1-p)/p\) | \(n(1-p)/p^2\) | \(log (\mu)\) | \((0,1,2,...)\) |
Student-t | \(\mu\) | \(\frac{\sigma^2 \nu}{\nu-2}\) | Identity | \((-\infty , \infty)\) |
While these are not all possible distributions for the GLM, these are distributions that are both widely used and commonly present not only in R but in other software as well. The VGAM
package adds a much wider selection of both univariate and bivariate distributions for discrete and continuous outcomes.
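In R, each family object carries its link function and inverse link, so these transformations can be inspected directly; a small base-R illustration:
binomial()$link                 # "logit"
binomial()$linkfun(0.25)        # log(0.25/0.75), the log-odds of 0.25
poisson()$linkfun(10)           # log(10)
poisson()$linkinv(log(10))      # inverse link recovers the mean, 10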
5.8.1 Binomial Distribution
You have probably seen the binomial distribution either in a basic statistics course (remember the coin flips?) or in the context of a logistic regression model. There are two ways the binomial distribution is typically used. The first is the context of logistic regression, where a special case of the binomial, called the Bernoulli distribution, is used. This is the case of the binomial with a single coin flip, where you are trying to estimate the probability that it comes up heads (or tails). This is said to be a single trial, and the outcome is either 1 or 0 (heads or tails). We will spend time in chapter 5 discussing the logistic regression model in the context of individual-level data.
The second way the binomial is used is when you have multiple trials, and you’re trying to estimate the probability of the event occurring over these trials. In this case, your number of trials, \(n\) can be large, and your number of successes, \(y\) is the random variable under consideration. This usage of the binomial has a wide applicability for place-based demographic analysis, as the basic distribution for a demographic rate. I will commonly refer to this as the count-based binomial distribution.
The mean of the binomial distribution is a proportion or a probability, \(\pi\), which tells you the probability that the event of interest occurs. Any model using the binomial distribution will be geared towards estimating this probability. The good thing is that, when we have count data, not just 1's and 0's, the same thing happens. The ratio of successes (\(y\)) to trials (\(n\)) is used to estimate \(\pi\), and we build a model for that mean rate:
\[\text{Binomial} \binom{n}{y} = \frac{y}{n} = \pi = \text{some function of predictors}\]
The ratio \(\frac{y}{n}\) is a rate or probability, and as such has very strict bounds. Probabilities cannot be less than 0 or greater than 1, so again, we should not use the Normal distribution here, since it is valid for all real numbers. Instead, we are using the binomial, but we still run into the problem of having a strictly bounded value, \(\pi\) that we are trying to estimate with a linear function.
Enter the link function again.
The binomial distribution typically uses either a logit or probit link function, but others such as the complementary log-log link function are also used in certain circumstances. For now we will use the logit function.
The logit transforms the probability, \(\pi\), which is bounded on the interval \([0,1]\), onto a new unbounded interval similar to that of the normal distribution, \((-\infty, \infty)\). The transformation is known as the log-odds transformation, or logit for short. The odds of an event happening are the probability that something happens, divided by the probability that it does not happen, in this case:
\[\text{odds}({\pi}) = \frac{\pi}{(1-\pi)}\]
The odds are bounded on the interval \([0, \infty)\); when we take the natural log of the odds, the value is transformed onto the unbounded interval \((-\infty, \infty)\).
\[\text{log-odds }({\pi}) = log \left ( \frac{\pi}{(1-\pi)} \right) \]
This can be modeled using a linear function of covariates now, without worrying about the original boundary problem:
\[log \left ( \frac{\pi}{1-\pi} \right) = \beta_0 +\beta_1 x_1\]
or more compactly:
\[logit (\pi) = \beta_0 +\beta_1 x_1\]
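Base R provides this transformation and its inverse directly as qlogis() and plogis(); a quick illustration of the mapping between probabilities and log-odds:
p <- 0.25
log(p / (1 - p))     # log-odds written out: -1.0986...
qlogis(p)            # the same logit transformation
plogis(qlogis(p))    # inverse logit recovers the probability, 0.25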
5.8.1.1 Binomial regression
The glm()
function can estimate the count binomial model using the syntax cbind( y, n-y)
in the outcome portion of the model formula.
glmb <- glm(cbind(lowbw1618, births1618 - lowbw1618) ~ fampov14 + rucc + hpsa16,
            data = ahrf_m,
            family = binomial)

glmb %>%
  tbl_regression()
Characteristic | log(OR)1 | 95% CI1 | p-value |
---|---|---|---|
% Families Below Poverty Level 2014-18 | 0.02 | 0.02, 0.02 | <0.001 |
rucc | |||
01 | — | — | |
02 | -0.01 | -0.02, -0.01 | 0.003 |
03 | -0.02 | -0.04, -0.01 | <0.001 |
04 | -0.06 | -0.08, -0.04 | <0.001 |
05 | -0.04 | -0.07, -0.01 | 0.004 |
06 | -0.02 | -0.04, -0.01 | 0.014 |
07 | -0.03 | -0.06, -0.01 | 0.016 |
08 | 0.04 | -0.02, 0.10 | 0.2 |
09 | -0.03 | -0.10, 0.03 | 0.3 |
hpsa16 | |||
no shortage | — | — | |
partial county shortage | -0.02 | -0.03, 0.00 | 0.011 |
whole county shortage | 0.00 | -0.02, 0.03 | 0.8 |
1 OR = Odds Ratio, CI = Confidence Interval |
The output above shows the results of the model. The coefficients are on the log-odds scale, and typically would be converted to an odds-ratio by exponentiating them.
glmb %>%
  tbl_regression(exponentiate = TRUE)
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
% Families Below Poverty Level 2014-18 | 1.02 | 1.02, 1.02 | <0.001 |
rucc | |||
01 | — | — | |
02 | 0.99 | 0.98, 1.0 | 0.003 |
03 | 0.98 | 0.96, 0.99 | <0.001 |
04 | 0.95 | 0.93, 0.96 | <0.001 |
05 | 0.96 | 0.93, 0.99 | 0.004 |
06 | 0.98 | 0.96, 1.0 | 0.014 |
07 | 0.97 | 0.94, 0.99 | 0.016 |
08 | 1.05 | 0.98, 1.11 | 0.2 |
09 | 0.97 | 0.90, 1.03 | 0.3 |
hpsa16 | |||
no shortage | — | — | |
partial county shortage | 0.98 | 0.97, 1.00 | 0.011 |
whole county shortage | 1.00 | 0.98, 1.03 | 0.8 |
1 OR = Odds Ratio, CI = Confidence Interval |
In this case, the odds ratio interpretation is not as clear as in the Bernoulli case with individual-level data. When I interpret the coefficients for the count binomial, I describe them as percent changes in the mean. For example, the fampov14 odds ratio is 1.02, and I describe this result as: the low birth weight rate increases by 2 percent for every 1 percentage point increase in the poverty rate. We can see this by using the fitted values from emmeans
, here I generate two cases where fampov14
is exactly 1 percentage point different, and you can see the difference in the estimated rates.
rg <- ref_grid(glmb,
               at = list(fampov14 = c(5, 6)))

emmeans(rg,
        specs = "fampov14",
        type = "response")
fampov14 prob SE df asymp.LCL asymp.UCL
5 0.0735 0.000474 Inf 0.0726 0.0745
6 0.0750 0.000470 Inf 0.0741 0.0760
Results are averaged over the levels of: rucc, hpsa16
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
We can calculate the percentage change in these two estimates:
(.0750 - .0735)/.0750
[1] 0.02
and confirm that it is 2 percent.
5.8.1.2 Application of the binomial to age standardization
The Binomial is very useful for conducting standardization of rates between groups to measure the differences. To show an example of how to do age standardization, I use data from the CDC Wonder Compressed Mortality file. The data are for the states of Texas and California for the year 2016, and are the numbers of deaths and population at risk in 13 age groups.
txca <- readr::read_delim("data/CMF_TX_CA_age.txt",
                          delim = "\t",
                          quote = "\"",
                          skip = 1,
                          col_names = c("State",
                                        "State_Code",
                                        "Age_Group",
                                        "Age_Group_Code",
                                        "Deaths",
                                        "Population",
                                        "Crude.Rate")
                          ) %>%
  filter(Age_Group != "Not Stated") %>%
  mutate(Population = as.numeric(Population),
         Deaths = as.numeric(Deaths))
Rows: 28 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (7): State, State_Code, Age_Group, Age_Group_Code, Deaths, Population, C...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
txca %>%
  gt()
State | State_Code | Age_Group | Age_Group_Code | Deaths | Population | Crude.Rate |
---|---|---|---|---|---|---|
California | 06 | < 1 year | 1 | 2057 | 498832 | 412.4 |
California | 06 | 1-4 years | 1-4 | 397 | 1988540 | 20.0 |
California | 06 | 5-9 years | 5-9 | 253 | 2539626 | 10.0 |
California | 06 | 10-14 years | 10-14 | 277 | 2522475 | 11.0 |
California | 06 | 15-19 years | 15-19 | 1076 | 2579986 | 41.7 |
California | 06 | 20-24 years | 20-24 | 2090 | 2812191 | 74.3 |
California | 06 | 25-34 years | 25-34 | 5201 | 5917785 | 87.9 |
California | 06 | 35-44 years | 35-44 | 7032 | 5159932 | 136.3 |
California | 06 | 45-54 years | 45-54 | 16370 | 5195297 | 315.1 |
California | 06 | 55-64 years | 55-64 | 34176 | 4688718 | 728.9 |
California | 06 | 65-74 years | 65-74 | 45834 | 3089002 | 1483.8 |
California | 06 | 75-84 years | 75-84 | 59121 | 1535300 | 3850.8 |
California | 06 | 85+ years | 85+ | 88331 | 722333 | 12228.6 |
Texas | 48 | < 1 year | 1 | 2287 | 405899 | 563.4 |
Texas | 48 | 1-4 years | 1-4 | 433 | 1613272 | 26.8 |
Texas | 48 | 5-9 years | 5-9 | 276 | 2038319 | 13.5 |
Texas | 48 | 10-14 years | 10-14 | 319 | 2029062 | 15.7 |
Texas | 48 | 15-19 years | 15-19 | 999 | 1970588 | 50.7 |
Texas | 48 | 20-24 years | 20-24 | 1829 | 2005169 | 91.2 |
Texas | 48 | 25-34 years | 25-34 | 4405 | 4085728 | 107.8 |
Texas | 48 | 35-44 years | 35-44 | 6292 | 3726287 | 168.9 |
Texas | 48 | 45-54 years | 45-54 | 13636 | 3519013 | 387.5 |
Texas | 48 | 55-64 years | 55-64 | 28663 | 3116019 | 919.9 |
Texas | 48 | 65-74 years | 65-74 | 37506 | 2008449 | 1867.4 |
Texas | 48 | 75-84 years | 75-84 | 44144 | 957001 | 4612.7 |
Texas | 48 | 85+ years | 85+ | 51173 | 387790 | 13196.1 |
txca %>%
  group_by(State) %>%
  summarise(p_pop = Population / sum(Population)) %>%
  ungroup() %>%
  mutate(age = forcats::fct_relevel(txca$Age_Group,
                                    "< 1 year",
                                    "1-4 years",
                                    "5-9 years",
                                    "10-14 years",
                                    "15-19 years",
                                    "20-24 years",
                                    "25-34 years",
                                    "35-44 years",
                                    "45-54 years",
                                    "55-64 years",
                                    "65-74 years",
                                    "75-84 years",
                                    "85+ years")) %>%
  ggplot(aes(x = age, y = p_pop, group = State, color = State)) +
  geom_line(lwd = 2) +
  ylab("% in Age") +
  xlab("Age group") +
  ggtitle("Age distribution in Texas and California") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'State'. You can override using the
`.groups` argument.
We see that the age distribution of Texas is slightly younger than that of California; the large peak at the 25-34 year age group occurs because the data adopt 10-year age intervals after age 25.
To do age standardization using the regression framework, we have to control for the differences in the age structure of the two populations (Texas and California) by regressing the mortality rate on the age structure; the difference between the states after doing this is the difference in the age-standardized rates.
txca <- txca %>%
  mutate(Age_Group = forcats::fct_relevel(txca$Age_Group,
                                          "< 1 year",
                                          "1-4 years",
                                          "5-9 years",
                                          "10-14 years",
                                          "15-19 years",
                                          "20-24 years",
                                          "25-34 years",
                                          "35-44 years",
                                          "45-54 years",
                                          "55-64 years",
                                          "65-74 years",
                                          "75-84 years",
                                          "85+ years"))
glmb_s <- glm(cbind(Deaths, Population - Deaths) ~ factor(Age_Group) + State,
              data = txca,
              family = binomial)

glmb_s %>%
  tbl_regression(exp = TRUE)
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
factor(Age_Group) | |||
< 1 year | — | — | |
1-4 years | 0.05 | 0.04, 0.05 | <0.001 |
5-9 years | 0.02 | 0.02, 0.03 | <0.001 |
10-14 years | 0.03 | 0.02, 0.03 | <0.001 |
15-19 years | 0.09 | 0.09, 0.10 | <0.001 |
20-24 years | 0.17 | 0.16, 0.18 | <0.001 |
25-34 years | 0.20 | 0.19, 0.21 | <0.001 |
35-44 years | 0.31 | 0.30, 0.32 | <0.001 |
45-54 years | 0.72 | 0.70, 0.75 | <0.001 |
55-64 years | 1.70 | 1.65, 1.75 | <0.001 |
65-74 years | 3.48 | 3.38, 3.59 | <0.001 |
75-84 years | 9.07 | 8.80, 9.35 | <0.001 |
85+ years | 30.4 | 29.5, 31.3 | <0.001 |
State | |||
California | — | — | |
Texas | 1.20 | 1.19, 1.21 | <0.001 |
1 OR = Odds Ratio, CI = Confidence Interval |
Which shows that Texas has a standardized mortality rate 20 percent higher than California.
5.9 Poisson distribution
Another distribution commonly used in the analysis of place-based data is the Poisson distribution. The Poisson is applicable to outcomes that are non-negative integers, and it is commonly used in epidemiology as a model of the relative risk of disease or mortality. The Poisson has a single parameter, the mean, \(\lambda\), which is really the average count of the outcome (\(y\)). We have several ways of modeling a count outcome with the Poisson:
- Pure count model If each area or place has the same total area, risk set, or population size, then we can model the mean as-is. This would lead to a model that looks like:
\[log(\lambda)= \beta_0 + \beta_1 x_1\]
When we see the \(\beta_1\) parameter for this model in computer output, it is on the log scale, since that is the scale of the outcome for the Poisson model. In order to interpret \(\beta_1\), we have to exponentiate it. When we do this, the parameter is interpreted as the percentage change in the mean of the outcome for a 1 unit change in \(x_1\). For instance, if we estimate a model and see in the output that \(\beta_1 = \text{.025}\), then \(\exp(\beta_1) = \text{exp}(\text{.025}) = \text{1.025}\), so for a 1 unit increase in \(x_1\), the mean of \(y\) increases by a factor of 1.025, or 2.5 percent. So if the mean of \(y\) is 10 when \(x_1 = 0\), then the mean is \(10*1.025\), or \(10.25\), when \(x_1 = 1\). This application of the Poisson is rare in demographic research because places rarely have equal populations or areas, so the rate model or the relative risk model is much more commonly used.
- Rate model The second type of modeling strategy used with the Poisson model is for a rate of occurrence. This model includes an offset term to incorporate unequal population sizes, and it is the most common way such data are analyzed in demographic research. The offset term can be thought of as the denominator for the rate, and we can show how it is included in the model.
If \(n\) is the population size for each place, then, we want to do a regression on the rate of occurrence of our outcome. The rate is typically expressed as a proportion, or probability \(rate = \frac{y}{n}\), as seen in the Binomial distribution earlier:
\[ log(y/n)= \beta_0 + \beta_1 x_1 \\ log(y) - log(n)= \beta_0 + \beta_1 x_1\\ log(y)= \beta_0 + \beta_1 x_1 + log(n) \]
Similar to the example from before, when interpreting the effect of \(\beta_1\) in this model, we also have to exponentiate it. In this case, the interpretation would not be related to the overall count, but to the rate of occurrence. So, if as before, the \(\beta_1 = \text{.025}\), then \(\exp(\beta_1) = \text{exp}(\text{.025}) = \text{1.025}\), or for a 1 unit increase in \(x_1\), the rate of occurrence of \(y\) increases by a factor of 1.025.
This model includes the natural log of the population size in the offset()
function on the right side of the model formula. R will not estimate a regression coefficient for this term, and as in the equation above, the term just represents a scale factor for the outcome.
Note on offsets It is important to ensure that all of the populations in a particular analysis have non-zero counts, because if the model sees \(log(0)\) in the data, estimation will fail, since the result is not a finite number.
log(0)
[1] -Inf
The Poisson model with a population offset is specified as:
glmp_s <- glm(Deaths ~ offset(log(Population)) + factor(Age_Group) + State,
              data = txca,
              family = poisson)

glmp_s %>%
  tbl_regression(exp = TRUE)
Characteristic | IRR1 | 95% CI1 | p-value |
---|---|---|---|
factor(Age_Group) | |||
< 1 year | — | — | |
1-4 years | 0.05 | 0.04, 0.05 | <0.001 |
5-9 years | 0.02 | 0.02, 0.03 | <0.001 |
10-14 years | 0.03 | 0.03, 0.03 | <0.001 |
15-19 years | 0.10 | 0.09, 0.10 | <0.001 |
20-24 years | 0.17 | 0.16, 0.18 | <0.001 |
25-34 years | 0.20 | 0.19, 0.21 | <0.001 |
35-44 years | 0.31 | 0.30, 0.32 | <0.001 |
45-54 years | 0.72 | 0.70, 0.75 | <0.001 |
55-64 years | 1.69 | 1.64, 1.74 | <0.001 |
65-74 years | 3.44 | 3.33, 3.54 | <0.001 |
75-84 years | 8.73 | 8.47, 9.00 | <0.001 |
85+ years | 26.6 | 25.8, 27.4 | <0.001 |
State | |||
California | — | — | |
Texas | 1.19 | 1.18, 1.19 | <0.001 |
1 IRR = Incidence Rate Ratio, CI = Confidence Interval |
This result is very close to that from the Binomial model: after age standardization, Texas has a 19 percent higher overall mortality rate than California.
5.10 Relative risk analysis
The third type of model for the Poisson distribution focuses on the idea of the relative risk of an event, and uses the Standardized risk ratio as its currency.
- The Standardized risk ratio incorporates differential exposure due to population size as an expected count of the outcome in the offset term, and is typically seen in epidemiological studies. The expected count, \(E\), incorporates the different population sizes of each area by estimating the number of events that would occur if the area followed a given rate of occurrence. The expected count is calculated by multiplying the average rate of occurrence, \(r\), by the population size, \(n\): \(E_i = r * n_i\), where \(r = \frac{\sum y_i}{\sum n_i}\) is the overall rate in the population. This method is commonly referred to as internal standardization because we are using the data at hand to estimate the overall rate of occurrence, versus using a rate from some other published source.
The model for the mean of the outcome would look like this:
\[log(y)= \beta_0 + \beta_1 x_1 + log(E)\].
This model is specified very similarly to the rate model above. First, I show how to calculate the expected number of deaths in the data. A naive method of calculating the expected counts is to use the crude death rate as \(r\); alternatively, we can use age-specific death rates.
# crude death rate
txca$E <- txca$Population * (sum(txca$Deaths / sum(txca$Population)))
In this calculation, the Population
variable is multiplied by \(r\), which is the sum of all deaths, divided by all populations.
The age-specific expected count is a little more involved. We first have to sum all deaths and populations by age, then calculate the age specific rate, then join this back to the original data based on the Age_Group
variable. This is not the only way to do it, but it works for this example. Alternatively, we could obtain another age schedule of mortality rates, merge it to these data, and standardize our data to that mortality schedule.
#Age specific death rate
txca2 <- txca %>%
  group_by(Age_Group) %>%
  summarise(ndeaths = sum(Deaths), npop = sum(Population)) %>%
  mutate(r_age = ndeaths / npop) %>%
  ungroup() %>%
  left_join(., txca, by = "Age_Group") %>%
  arrange(State, Age_Group) %>%
  mutate(E_age = Population * r_age)

txca2 %>%
  select(State, Age_Group, Deaths, Population, E, E_age) %>%
  gt()
State | Age_Group | Deaths | Population | E | E_age |
---|---|---|---|---|---|
California | < 1 year | 2057 | 498832 | 3375.789 | 2395.1055 |
California | 1-4 years | 397 | 1988540 | 13457.219 | 458.2383 |
California | 5-9 years | 253 | 2539626 | 17186.631 | 293.4640 |
California | 10-14 years | 277 | 2522475 | 17070.564 | 330.3049 |
California | 15-19 years | 1076 | 2579986 | 17459.763 | 1176.4386 |
California | 20-24 years | 2090 | 2812191 | 19031.184 | 2287.7627 |
California | 25-34 years | 5201 | 5917785 | 40047.939 | 5682.6280 |
California | 35-44 years | 7032 | 5159932 | 34919.255 | 7736.8039 |
California | 45-54 years | 16370 | 5195297 | 35158.583 | 17888.9759 |
California | 55-64 years | 34176 | 4688718 | 31730.368 | 37750.7084 |
California | 65-74 years | 45834 | 3089002 | 20904.471 | 50503.1685 |
California | 75-84 years | 59121 | 1535300 | 10389.969 | 63613.0044 |
California | 85+ years | 88331 | 722333 | 4888.307 | 90772.2323 |
Texas | < 1 year | 2287 | 405899 | 2746.875 | 1948.8945 |
Texas | 1-4 years | 433 | 1613272 | 10917.635 | 371.7617 |
Texas | 5-9 years | 276 | 2038319 | 13794.093 | 235.5360 |
Texas | 10-14 years | 319 | 2029062 | 13731.447 | 265.6951 |
Texas | 15-19 years | 999 | 1970588 | 13335.731 | 898.5614 |
Texas | 20-24 years | 1829 | 2005169 | 13569.754 | 1631.2373 |
Texas | 25-34 years | 4405 | 4085728 | 27649.701 | 3923.3720 |
Texas | 35-44 years | 6292 | 3726287 | 25217.225 | 5587.1961 |
Texas | 45-54 years | 13636 | 3519013 | 23814.522 | 12117.0241 |
Texas | 55-64 years | 28663 | 3116019 | 21087.305 | 25088.2916 |
Texas | 65-74 years | 37506 | 2008449 | 13591.951 | 32836.8315 |
Texas | 75-84 years | 44144 | 957001 | 6476.396 | 39651.9956 |
Texas | 85+ years | 51173 | 387790 | 2624.325 | 48731.7677 |
In this table, you can see that the age-specific expected counts of deaths (E_age) are much more in line with the age-specific mortality rates, and that they track the observed pattern of deaths much more closely than the expected counts derived from the crude death rate (E).
glmp_E <- glm(Deaths ~ offset(log(E)) + factor(Age_Group) + State,
              data = txca2,
              family = poisson)

glmp_Eage <- glm(Deaths ~ offset(log(E_age)) + factor(Age_Group) + State,
                 data = txca2,
                 family = poisson)

m1 <- glmp_E %>%
  tbl_regression(exp = TRUE)

m2 <- glmp_Eage %>%
  tbl_regression(exp = TRUE)

m_all <- tbl_merge(list(m1, m2))

m_all
Characteristic | Table 1 | Table 2 | ||||
---|---|---|---|---|---|---|
IRR1 | 95% CI1 | p-value | IRR1 | 95% CI1 | p-value | |
factor(Age_Group) | ||||||
< 1 year | — | — | — | — | ||
1-4 years | 0.05 | 0.04, 0.05 | <0.001 | 1.00 | 0.93, 1.08 | >0.9 |
5-9 years | 0.02 | 0.02, 0.03 | <0.001 | 1.00 | 0.91, 1.09 | >0.9 |
10-14 years | 0.03 | 0.03, 0.03 | <0.001 | 1.00 | 0.92, 1.09 | >0.9 |
15-19 years | 0.10 | 0.09, 0.10 | <0.001 | 1.00 | 0.95, 1.06 | >0.9 |
20-24 years | 0.17 | 0.16, 0.18 | <0.001 | 1.01 | 0.96, 1.05 | 0.8 |
25-34 years | 0.20 | 0.19, 0.21 | <0.001 | 1.01 | 0.97, 1.04 | 0.7 |
35-44 years | 0.31 | 0.30, 0.32 | <0.001 | 1.01 | 0.97, 1.04 | 0.8 |
45-54 years | 0.72 | 0.70, 0.75 | <0.001 | 1.01 | 0.98, 1.04 | 0.6 |
55-64 years | 1.69 | 1.64, 1.74 | <0.001 | 1.01 | 0.98, 1.04 | 0.6 |
65-74 years | 3.44 | 3.33, 3.54 | <0.001 | 1.01 | 0.98, 1.04 | 0.5 |
75-84 years | 8.73 | 8.47, 9.00 | <0.001 | 1.01 | 0.98, 1.04 | 0.5 |
85+ years | 26.6 | 25.8, 27.4 | <0.001 | 1.02 | 0.99, 1.05 | 0.3 |
State | ||||||
California | — | — | — | — | ||
Texas | 1.19 | 1.18, 1.19 | <0.001 | 1.19 | 1.18, 1.19 | <0.001 |
1 IRR = Incidence Rate Ratio, CI = Confidence Interval |
In terms of the difference between the states, this result is identical to the rate model above, and the comparison is invariant to the choice of standard, although when the age-specific expected count is used the age pattern of mortality becomes insignificant in the model. This shows that, despite different denominator/offset terms, we can achieve the same comparison from either the rate model or the relative risk model.
5.11 Overdispersion
When using the Poisson GLM, you often run into overdispersion. What's overdispersion, you might ask? For the Poisson distribution, the mean and the variance are functions of one another (variance = mean for the Poisson). So when you have more variability in your data than the mean implies, you have overdispersion. This basically says that your data do not fit your model, and it is a problem because overdispersion typically leads to standard errors for our model parameters that are too small. But we can fit other models that do not make this assumption, or that allow for more variability.
5.11.0.1 Checking for overdispersion
An easy check on this is to compare the residual deviance to the residual degrees of freedom: their ratio should be about 1 if the model fits the data.
scale <- sqrt(glmp_E$deviance / glmp_E$df.residual)
scale
Here, we see that for the Poisson model the scale factor is over 6, which is evidence of overdispersion in the data. The residual deviance can also be used as a goodness-of-fit test for the model, because the deviance is approximately distributed as a \(\chi^2\) with degrees of freedom equal to the residual d.f. (n-p):
1-pchisq(glmp_E$deviance,
df = glmp_E$df.residual)
So, this p value is 0, which means the model does not fit the data. If the goodness of fit test had a p-value over 5 percent, we could conclude that the model in fact did fit the data.
5.12 Modeling Overdispersion via a Quasi distribution
For the Poisson and the Binomial, we can fit a “quasi” distribution that adds an extra parameter to allow the mean-variance relationship to not be constant. For Poisson we get:
\(Var(Y) = \lambda * \phi\), instead of \(Var(Y) = \lambda\)
This accessory parameter \(\phi\) allows us to include a rough proxy for a dispersion parameter for the distribution. Naturally, this is fixed at 1 for the ordinary Poisson model and estimated in the quasi models, so we can look to see whether it is much bigger than 1. If overdispersion is present and not accounted for, you could identify a relationship as being significant when it is not!
glmqp <- glm(Deaths ~ offset(log(E)) + factor(Age_Group) + State,
             data = txca,
             family = quasipoisson)

glmqp %>%
  tbl_regression(exp = TRUE)
Characteristic | IRR1 | 95% CI1 | p-value |
---|---|---|---|
factor(Age_Group) | |||
< 1 year | — | — | |
1-4 years | 0.05 | 0.03, 0.07 | <0.001 |
5-9 years | 0.02 | 0.01, 0.04 | <0.001 |
10-14 years | 0.03 | 0.02, 0.05 | <0.001 |
15-19 years | 0.10 | 0.07, 0.13 | <0.001 |
20-24 years | 0.17 | 0.13, 0.22 | <0.001 |
25-34 years | 0.20 | 0.16, 0.25 | <0.001 |
35-44 years | 0.31 | 0.25, 0.39 | <0.001 |
45-54 years | 0.72 | 0.59, 0.89 | 0.008 |
55-64 years | 1.69 | 1.40, 2.06 | <0.001 |
65-74 years | 3.44 | 2.85, 4.19 | <0.001 |
75-84 years | 8.73 | 7.25, 10.6 | <0.001 |
85+ years | 26.6 | 22.1, 32.4 | <0.001 |
State | |||
California | — | — | |
Texas | 1.19 | 1.14, 1.23 | <0.001 |
1 IRR = Incidence Rate Ratio, CI = Confidence Interval |
While the overall pattern and substantive interpretation of the regular Poisson and the quasiPoisson models are identical, the standard errors of the parameters are not. We can see this by forming their ratios. The standard errors of the model parameters can be extracted from the summary()$coefficients
of a model, specifically the second column of this table.
sum1 <- summary(glmp_E)$coef[, 2]
sum2 <- summary(glmqp)$coef[, 2]

data.frame(Poisson = sum1,
           QPoisson = sum2,
           coef = names(coef(glmp_E))) %>%
  ggplot(aes(y = QPoisson/Poisson, x = coef)) +
  geom_point() +
  ylab("Ratio of Standard Errors") +
  xlab("Parameter") +
  ggtitle("Ratio of Standard Errors in Poisson and QuasiPoisson Models") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
This plot shows that the ratio of the standard errors is the same for every parameter and is equal to the square root of the quasiPoisson dispersion parameter \(\phi\), which can be seen in summary(glmqp)$dispersion. In this case the ratio is 6.3079315, which is extremely close to the scale factor we calculated earlier. The quasiPoisson model has a disadvantage, however: per the ?glm documentation, it is fit via quasi-likelihood methods and as such cannot be compared to other models using AIC.
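That claim can be checked directly by comparing the square root of the estimated dispersion with the scale factor computed earlier; a minimal sketch, assuming glmqp and scale are still in the workspace:
sqrt(summary(glmqp)$dispersion)   # constant ratio between the two sets of SEs
scale                             # deviance-based scale factor from before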
5.12.1 Modeling dispersion properly
The dispmod
package implements the method of Breslow (1984), which adds an extra parameter to account for the overdispersion in the Poisson and is fit via regular maximum likelihood, so it is comparable to other models. Effectively, it takes a model fit via glm() with either a Binomial or Poisson family and performs a re-weighted estimate of the model parameters.
library(dispmod)
glmp_d <- glm.poisson.disp(glmp_E,
                           verbose = FALSE)

glmp_d %>%
  tbl_regression(exp = TRUE)
Characteristic | IRR1 | 95% CI1 | p-value |
---|---|---|---|
factor(Age_Group) | |||
< 1 year | — | — | |
1-4 years | 0.05 | 0.04, 0.05 | <0.001 |
5-9 years | 0.02 | 0.02, 0.03 | <0.001 |
10-14 years | 0.03 | 0.02, 0.03 | <0.001 |
15-19 years | 0.10 | 0.09, 0.10 | <0.001 |
20-24 years | 0.17 | 0.16, 0.19 | <0.001 |
25-34 years | 0.20 | 0.18, 0.22 | <0.001 |
35-44 years | 0.31 | 0.29, 0.34 | <0.001 |
45-54 years | 0.72 | 0.66, 0.79 | <0.001 |
55-64 years | 1.70 | 1.56, 1.85 | <0.001 |
65-74 years | 3.45 | 3.16, 3.76 | <0.001 |
75-84 years | 8.73 | 8.01, 9.52 | <0.001 |
85+ years | 26.4 | 24.2, 28.8 | <0.001 |
State | |||
California | — | — | |
Texas | 1.24 | 1.20, 1.29 | <0.001 |
1 IRR = Incidence Rate Ratio, CI = Confidence Interval |
In this example the difference between California and Texas is now larger at a 24% higher mortality rate in Texas. How does this model compare to the regular Poisson model?
AIC(glmp_E, glmp_d)
df AIC
glmp_E 14 773.7668
glmp_d 14 90.3267
The dispersed model shows a much lower AIC, suggesting that the dispersion accounted for in the Breslow model is likely important. We can see how the dispersed Poisson model's standard errors compare to those of the regular Poisson using the same kind of plot we saw earlier:
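A sketch of that comparison, reusing sum1 from above and assuming the dispmod refit (glmp_d) stores its coefficient table like an ordinary glm object:
sum3 <- summary(glmp_d)$coef[, 2]     # standard errors from the dispersed fit

data.frame(Poisson = sum1,
           Dispersed = sum3,
           coef = names(coef(glmp_E))) %>%
  ggplot(aes(y = Dispersed / Poisson, x = coef)) +
  geom_point() +
  ylab("Ratio of Standard Errors") +
  xlab("Parameter") +
  ggtitle("Ratio of Standard Errors in Poisson and Dispersed Poisson Models") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))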
In this case, the standard errors are not just a constant rescaling of the Poisson model errors; each has its own scaling, which shows that the dispersed Poisson model is doing something different from the quasiPoisson model, which applied a single scaling factor to all of the standard errors.
5.13 Other count models
The Poisson and Binomial models are probably the most commonly used models for rates and proportions, but other models exist that offer more flexibility, specifically the Beta distribution and the Negative Binomial distribution.
5.13.1 Beta distribution model
The Beta distribution is a commonly used model for the prior distribution for the mean of the Binomial, as it has values between 0 and 1, but it is not used as often as an outcome distribution. It is indeed quite flexible as an outcome distribution for proportions because of its range of support, but also because it has an additional dispersion parameter, unlike the Binomial, so the pitfalls of overdispersion often present in the Binomial can be avoided. The betareg
package (Grün, Kosmidis, and Zeileis 2012) implements the Beta regression model. The distribution has density:
\[
f(y; p, q) = \frac{\Gamma(p + q)}{\Gamma(p)\Gamma(q)}y^{p - 1}(1 - y)^{q - 1}
\] Where \(\Gamma\) is the Gamma function. The distribution, as parameterized in betareg
has mean and variance:
\[ \mu = p/(p+q) \\ Var = \mu (1-\mu)/ (1+\phi) \]
Where \(\phi\) is a precision parameter. The model is linked to the linear predictor using one of the same link functions as the Binomial model: logit, probit, or complementary log-log.
\[g(\mu_i) = X'\beta = \eta_i\] The Logit link is parameterized as:
\[g(\mu) = \text{logit}(\mu) = \log(\mu / (1-\mu))\] As described in their article on the betareg
package (Cribari-Neto and Zeileis 2010), the model can also include a separate model for dispersion, so you can model the variance in the data as well as the mean.
The betareg::betareg()
function implements the model, and here is an application of it to the low birth weight data used earlier:
library(betareg)
glm_beta <- betareg(I(lbrate1618/1000) ~ fampov14 + rucc + hpsa16,
                    data = ahrf_m)
summary(glm_beta)
Call:
betareg(formula = I(lbrate1618/1000) ~ fampov14 + rucc + hpsa16, data = ahrf_m)
Standardized weighted residuals 2:
Min 1Q Median 3Q Max
-7.0393 -0.6234 -0.0110 0.6338 3.7500
Coefficients (mean model with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6387868 0.0150612 -175.204 < 2e-16 ***
fampov14 0.0252868 0.0007936 31.862 < 2e-16 ***
rucc02 -0.0188140 0.0150716 -1.248 0.211917
rucc03 -0.0400036 0.0157581 -2.539 0.011129 *
rucc04 -0.0798596 0.0177856 -4.490 7.12e-06 ***
rucc05 -0.0734111 0.0242750 -3.024 0.002493 **
rucc06 -0.0494087 0.0142816 -3.460 0.000541 ***
rucc07 -0.0498176 0.0159746 -3.119 0.001817 **
rucc08 0.0038585 0.0252486 0.153 0.878541
rucc09 -0.0636793 0.0274465 -2.320 0.020334 *
hpsa16partial county shortage -0.0019801 0.0128139 -0.155 0.877196
hpsa16whole county shortage 0.0278359 0.0156183 1.782 0.074706 .
Phi coefficients (precision model with identity link):
Estimate Std. Error z value Pr(>|z|)
(phi) 304.733 8.948 34.06 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type of estimator: ML (maximum likelihood)
Log-likelihood: 6347 on 13 Df
Pseudo R-squared: 0.355
Number of iterations: 23 (BFGS) + 3 (Fisher scoring)
A separate model for the dispersion can be estimated by including a second formula, separated by a |, as:
library(betareg)
glm_beta_o <- betareg(I(lbrate1618/1000) ~ fampov14 + rucc + hpsa16 | scale(popn),
                      data = ahrf_m)
summary(glm_beta_o)
Call:
betareg(formula = I(lbrate1618/1000) ~ fampov14 + rucc + hpsa16 | scale(popn),
data = ahrf_m)
Standardized weighted residuals 2:
Min 1Q Median 3Q Max
-6.9251 -0.6169 -0.0023 0.6500 3.7111
Coefficients (mean model with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6421609 0.0149203 -177.085 < 2e-16 ***
fampov14 0.0251141 0.0007938 31.639 < 2e-16 ***
rucc02 -0.0124912 0.0146382 -0.853 0.39348
rucc03 -0.0327516 0.0154238 -2.123 0.03372 *
rucc04 -0.0721414 0.0175194 -4.118 3.83e-05 ***
rucc05 -0.0655256 0.0241470 -2.714 0.00666 **
rucc06 -0.0415704 0.0139537 -2.979 0.00289 **
rucc07 -0.0417441 0.0156940 -2.660 0.00782 **
rucc08 0.0118585 0.0252317 0.470 0.63837
rucc09 -0.0550535 0.0274518 -2.005 0.04491 *
hpsa16partial county shortage -0.0048007 0.0127864 -0.375 0.70733
hpsa16whole county shortage 0.0270952 0.0156837 1.728 0.08406 .
Phi coefficients (precision model with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.72714 0.02936 195.063 < 2e-16 ***
scale(popn) 0.08419 0.02929 2.874 0.00405 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type of estimator: ML (maximum likelihood)
Log-likelihood: 6355 on 14 Df
Pseudo R-squared: 0.3548
Number of iterations: 21 (BFGS) + 5 (Fisher scoring)
The AIC can be used to determine if the dispersion model is preferred to the regular beta regression.
AIC(glm_beta, glm_beta_o)
df AIC
glm_beta 13 -12668.24
glm_beta_o 14 -12682.02
In this case, the dispersion model is preferred, since its AIC is 13.78 points lower.
5.13.2 Negative binomial model
When overdispersion is present, many researchers will automatically turn to the Negative Binomial as an alternative. The Negative Binomial sounds like it is some alternative to the Binomial distribution, but it is more like the Poisson. It effectively adds a second parameter to a model for integer counts. The model has been parameterized in several ways. The most common are the NB1 and NB2 models. The difference is how the model allows the variance to increase with the mean. The NB1 model allows for variance to increase linearly with the mean:
\[ Y \sim NB (\lambda, \lambda+ \theta \lambda) \\ E(Y) = \lambda \\ \text{ } var(Y) = \lambda+\theta\lambda \\ \log(\lambda) = \eta \\ \eta = \beta_0 + \beta_1 x_1+ \log(n) \] The NB2 model, which is the most commonly implemented in software, allows the variance to increase as the square of the mean.
\[ Y \sim NB (\lambda, \lambda+ \theta \lambda^2) \\ E(Y) = \lambda \\ \text{ } var(Y) = \lambda+ \theta \lambda^2 \\ \log(\lambda) = \eta \\ \eta = \beta_0 + \beta_1 x_1+ \log(n) \]
The base R glm() does not offer a negative binomial family by default, but the standard MASS package (Venables and Ripley 2002) has the glm.nb() function, which fits the quadratic-variance (NB2) parameterization. An alternative is the highly flexible gamlss package (Rigby and Stasinopoulos 2005), whose NBII family gives the linear-variance (NB1) form, as well as a host of other distributions, plus the ability to model both the mean and the variance, similar to what betareg() achieves for the Beta distribution.
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:gtsummary':
select
The following object is masked from 'package:dplyr':
select
glmnb <- glm.nb(lowbw1618 ~ offset(log(births1618)) + fampov14 + rucc + hpsa16,
                data = ahrf_m)

glmnb %>%
  tbl_regression()
Characteristic | log(IRR)1 | 95% CI1 | p-value |
---|---|---|---|
% Families Below Poverty Level 2014-18 | 0.03 | 0.02, 0.03 | <0.001 |
rucc | |||
01 | — | — | |
02 | -0.02 | -0.04, 0.01 | 0.2 |
03 | -0.04 | -0.07, -0.02 | <0.001 |
04 | -0.08 | -0.11, -0.05 | <0.001 |
05 | -0.07 | -0.12, -0.03 | <0.001 |
06 | -0.06 | -0.08, -0.03 | <0.001 |
07 | -0.07 | -0.10, -0.04 | <0.001 |
08 | -0.01 | -0.07, 0.06 | 0.9 |
09 | -0.09 | -0.16, -0.01 | 0.021 |
hpsa16 | |||
no shortage | — | — | |
partial county shortage | -0.01 | -0.03, 0.01 | 0.4 |
whole county shortage | 0.00 | -0.03, 0.03 | >0.9 |
1 IRR = Incidence Rate Ratio, CI = Confidence Interval |
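The estimated dispersion parameter \(\theta\) from glm.nb() can be inspected directly from the fitted object; a small check, assuming the glmnb fit above (glm.nb() uses the variance function \(\mu + \mu^2/\theta\)):
glmnb$theta        # estimated theta for the variance function mu + mu^2/theta
glmnb$SE.theta     # its standard error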
The linear-variance (NB1) form of the model can be estimated using gamlss() with family = NBII:
library(gamlss)
Loading required package: gamlss.data
Attaching package: 'gamlss.data'
The following object is masked from 'package:datasets':
sleep
Loading required package: gamlss.dist
Loading required package: parallel
Registered S3 method overwritten by 'gamlss':
method from
print.ri bit
********** GAMLSS Version 5.4-12 **********
For more on GAMLSS look at https://www.gamlss.com/
Type gamlssNews() to see new features/changes/bug fixes.
glmnb2 <- gamlss(lowbw1618 ~ offset(log(births1618)) + fampov14 + rucc + hpsa16,
                 family = NBII,
                 data = ahrf_m)
GAMLSS-RS iteration 1: Global Deviance = 27778.63
GAMLSS-RS iteration 2: Global Deviance = 26730.85
GAMLSS-RS iteration 3: Global Deviance = 25316.94
GAMLSS-RS iteration 4: Global Deviance = 23326.65
GAMLSS-RS iteration 5: Global Deviance = 20580.7
GAMLSS-RS iteration 6: Global Deviance = 18480.22
GAMLSS-RS iteration 7: Global Deviance = 18277.94
GAMLSS-RS iteration 8: Global Deviance = 18277.03
GAMLSS-RS iteration 9: Global Deviance = 18277.03
GAMLSS-RS iteration 10: Global Deviance = 18277.03
summary(glmnb2)
******************************************************************
Family: c("NBII", "Negative Binomial type II")
Call: gamlss(formula = lowbw1618 ~ offset(log(births1618)) +
fampov14 + rucc + hpsa16, family = NBII, data = ahrf_m)
Fitting method: RS()
------------------------------------------------------------------
Mu link function: log
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.6762604 0.0122536 -218.407 <2e-16 ***
fampov14 0.0189986 0.0006901 27.532 <2e-16 ***
rucc02 -0.0124200 0.0080482 -1.543 0.1229
rucc03 -0.0166503 0.0113856 -1.462 0.1438
rucc04 -0.0401176 0.0163552 -2.453 0.0142 *
rucc05 -0.0291354 0.0254089 -1.147 0.2516
rucc06 0.0058763 0.0161169 0.365 0.7154
rucc07 0.0068020 0.0211064 0.322 0.7473
rucc08 0.0941871 0.0493463 1.909 0.0564 .
rucc09 0.0208654 0.0546114 0.382 0.7024
hpsa16partial county shortage -0.0203470 0.0117608 -1.730 0.0838 .
hpsa16whole county shortage 0.0082867 0.0195254 0.424 0.6713
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.72887 0.04354 16.74 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
------------------------------------------------------------------
No. of observations in the fit: 2324
Degrees of Freedom for the fit: 13
Residual Deg. of Freedom: 2311
at cycle: 10
Global Deviance: 18277.03
AIC: 18303.03
SBC: 18377.79
******************************************************************
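Because the dispersion (sigma) intercept is reported on the log link, it can be back-transformed to the dispersion scale; a minimal sketch, assuming the glmnb2 fit above:
# Exponentiate the log-link sigma intercept; from the output above this is
# roughly exp(0.729), or about 2.07
exp(coef(glmnb2, what = "sigma"))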
Between these two negative binomial specifications, the point estimates are broadly similar in direction, although the statistical significance of several rural-urban continuum categories differs between the fits. We can compare the models, starting with the Gaussian model and proceeding through the gamlss negative binomial model above, using the AIC() function. The Beta regression is left out of this comparison because its outcome is a proportion rather than a count, so its likelihood, and therefore its AIC, is on a different scale from the rest:
AIC(glm1, glmb, glmp_b, glmnb, glmnb2)
df AIC
glm1 13 19599.34
glmb 12 20980.41
glmp_b 12 20517.93
glmnb 13 16866.26
glmnb2 13 18303.03
We see that among these five model specifications, the Binomial fits the worst, and the negative binomial model estimated by glm.nb() fits the best, with the lowest AIC of the five.
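A small table of AIC differences can make this ranking easier to scan; the sketch below simply reuses the AIC() output above, where the delta column is each model's distance from the best-fitting model:
# Rank the candidate models by AIC; delta is the gap to the lowest AIC
aic_tab <- AIC(glm1, glmb, glmp_b, glmnb, glmnb2)
aic_tab$delta <- aic_tab$AIC - min(aic_tab$AIC)
aic_tab[order(aic_tab$AIC), ]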