casestudy-rural_urban.Rmd
Built with R 4.2.2
This case study compares different ways of calculating catchment area ratios between two counties in Virginia: Loudoun and Fairfax. These counties represent more rural (Loudoun) and more urban (Fairfax) environments.
Demand is defined as the total population in 2020, as estimated in the U.S. Census Bureau’s American Community Survey 5-year summaries.
Supply is defined as the yearly average of weekly average hospital beds, as reported by the U.S. Health and Human Services in their COVID-19 Hospital Capacity dataset.
Cost is defined as travel times between each region’s centroid and each hospital, as calculated within road networks by the Open Source Routing Machine using OpenStreetMap data from 2022, as provided by Geofabrik.
Data were prepared as part of a data commons project; the full script is available at uva-bi-sdad/dc.hifld.hosp/src/retrieve.R. Here, we will be loading in the files created from the original sources.
You can clone the repository, or directly download the files in the
data/working
directory, then point to the directory containing those files with the
dir
variable:
git clone https://github.com/uva-bi-sdad/dc.hifld.hosp.git
# this would be the directory when working from within the dc.hifld.hosp repository
dir <- "data/working/"
Now, we can load the files, which will results in 3 main objects:
hospitals
, population
, and
traveltimes
:
# main objects
hospitals <- read.csv(gzfile(paste0(dir, "hospitals.csv.xz")))
rownames(hospitals) <- hospitals$GEOID
population <- read.csv(gzfile(paste0(dir, "population_bg.csv.xz")))
rownames(population) <- population$GEOID
population$tract <- substring(population$GEOID, 1, 11)
traveltimes <- read.csv(
gzfile(paste0(dir, "traveltimes_2020.csv.xz")),
row.names = 1, check.names = FALSE
)
# tract-level comparisons
population_tract <- read.csv(gzfile(paste0(dir, "population_tr.csv.xz")))
rownames(population_tract) <- population_tract$GEOID
population_tract <- population_tract[unique(population$tract), ]
traveltimes_tract <- read.csv(
gzfile(paste0(dir, "traveltimes_2020_tract.csv.xz")),
row.names = 1, check.names = FALSE
)
We’ll also use a map to show comparisons between the focal regions:
# download map geographies
library(sf)
tracts <- st_read(paste0(
"https://raw.githubusercontent.com/uva-bi-sdad/sdc.geographies/main/VA/Census%20Geographies/",
"Tract/2020/data/distribution/va_geo_census_cb_2020_census_tracts.geojson"
), quiet = TRUE)
tracts <- tracts[grep("^51(?:107|059|600)", tracts$geoid), ]
block_groups <- st_read(paste0(
"https://raw.githubusercontent.com/uva-bi-sdad/sdc.geographies/main/VA/Census%20Geographies/",
"Block%20Group/2020/data/distribution/va_geo_census_cb_2020_census_block_groups.geojson"
), quiet = TRUE)
block_groups <- block_groups[grep("^51(?:107|059|600)", block_groups$geoid), ]
# make base maps
library(leaflet)
map_tracts <- leaflet(tracts, options = leafletOptions(attributionControl = FALSE)) |>
setView(-77.4, 38.99, 9) |>
addProviderTiles("CartoDB.Positron") |>
addScaleBar("bottomleft") |>
addMapPane("lines", zIndex = 410) |>
addMapPane("points", zIndex = 411) |>
addLayersControl(position = "topleft", overlayGroups = "Hospitals") |>
addCircles(
data = hospitals, color = "#000", fillColor = "#000", opacity = .8, lng = ~X, lat = ~Y,
label = ~ paste0("ID: ", GEOID, ", Beds: ", total_beds_7_day_avg),
group = "Hospitals", options = pathOptions(pane = "points")
)
map_block_groups <- leaflet(block_groups, options = leafletOptions(attributionControl = FALSE)) |>
setView(-77.4, 38.99, 9) |>
addProviderTiles("CartoDB.Positron") |>
addScaleBar("bottomleft") |>
addMapPane("lines", zIndex = 410) |>
addMapPane("points", zIndex = 411) |>
addLayersControl(position = "topleft", overlayGroups = "Hospitals") |>
addCircles(
data = hospitals, color = "#000", fillColor = "#000", opacity = .8, lng = ~X, lat = ~Y,
label = ~ paste0("ID: ", GEOID, ", Beds: ", total_beds_7_day_avg),
group = "Hospitals", options = pathOptions(pane = "points")
)
# specify colors
palette <- scico::scico(255, direction = -1, palette = "vik")
plot_colors <- c("#b69349", "#003e7d")
We’ll be looking at several variants of floating catchment areas, so, to better compare them, we’ll start with a baseline that represents the most thorough and up-to-date variant:
library(catchment)
baseline <- catchment_ratio(
population, hospitals, traveltimes,
weight = "gaussian", scale = 18, normalize_weight = TRUE, return_type = 1e5,
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
names(baseline) <- population$GEOID
In all of our catchment area calculations, we are referring to resources per person, but we do not calculate costs for each individual. In this respect, our calculations are always approximations.
To get a feel for how much impact this approximation has, we can compare cost calculations made at the highest resolution with population data available (block groups; for yearly estimates) with those made at a lower resolution (tracts).
# calculate floating catchment area ratios
resolutions <- data.frame(
# starting from block groups and aggregating up to tracts
higher = vapply(split(data.frame(
ratio = baseline,
population = population$population
), population$tract), function(bg) {
total <- sum(bg$population)
if (total) sum(bg$ratio * bg$population) / total else 0
}, 0),
# starting from tracts
lower = catchment_ratio(
population_tract, hospitals, traveltimes_tract,
weight = "gaussian", scale = 18, normalize_weight = TRUE, return_type = 1e5,
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
)
Overall, calculations made at higher and lower resolutions have a Person’s r of 0.769.
Looking specifically at our comparison counties…
# convenience functions to look at the focal counties
library(knitr)
get_focal_subset <- function(data, rowIds, counties = c(Loudoun = "^51107", Fairfax = "^51(?:600|059)")) {
comp <- colnames(data)[1:2]
sub <- do.call(rbind, unname(lapply(counties, function(county) {
d <- data[grep(county, rownames(data)), ]
n <- nrow(d)
d[[paste0("rank_", comp[[1]])]] <- order(-d[, 1]) / n * 100
d[[paste0("rank_", comp[[2]])]] <- order(-d[, 2]) / n * 100
d$County <- names(counties[counties == county])
d$difference <- d[, 1] - d[, 2]
d$rank_difference <- d[[paste0("rank_", comp[[1]])]] - d[[paste0("rank_", comp[[2]])]]
d
})))
data <- sub[rowIds, ]
list(data = data, pal = colorNumeric(palette, rep(max(abs(data$difference)), 2) * c(-1, 1)))
}
summary_contrast <- function(subset) {
data <- subset$data
summary <- vapply(split(data, data$County), function(d) {
c(
colMeans(d[, 1:2]), mean(d$difference), cor(d[, 1], d[, 2]), mean(abs(d$rank_difference))
)
}, numeric(5))
rownames(summary) <- c(
paste0("Average (", colnames(data)[1:2], ")"),
"Average Difference", "Correlation", "Average Change in Rank Percent"
)
kable(summary, digits = 3)
}
res_subset <- get_focal_subset(resolutions, tracts$geoid)
summary_contrast(res_subset)
Fairfax | Loudoun | |
---|---|---|
Average (higher) | 163.106 | 103.403 |
Average (lower) | 170.162 | 123.067 |
Average Difference | -7.056 | -19.663 |
Correlation | 0.896 | 0.897 |
Average Change in Rank Percent | 27.209 | 22.649 |
Average difference is in number of beds per 100k people; higher means more are estimated when starting from tracts.
map_tracts |>
addControl("Difference in Hospital Beds Per 100K People (Block Groups - Tract)", "topright") |>
addLegend("bottomright", res_subset$pal, res_subset$data$difference, opacity = 1) |>
addPolygons(
fillColor = res_subset$pal(res_subset$data$difference), fillOpacity = 1, weight = 1, color = "#000",
highlightOptions = highlightOptions(color = "#fff"), group = "Beds",
label = paste0("Block Groups - Tract: ", round(res_subset$data$difference, 3))
)
Higher values (blue) mean ratios are higher when made at the block group level.
library(splot)
splot(
higher ~ lower * County, res_subset$data,
colors = plot_colors,
title = FALSE, laby = "From Block Groups", labx = "From Tracts"
)
Ratios are even more highly correlated within county.
The only reason to use lower resolution geographies would be if (a) population data are not available at higher resolutions, or (b) there are practical computing limits.
Our next comparison will consider different means of calculating distance / travel time, which represents cost, and define the shape of the catchment areas.
The simplest means of calculating distance would be to measure straight lines between each consumer and provider location (resulting in perfectly circular catchment areas).
We use a routing machine to get estimated travel times within road networks (resulting in catchment areas shaped by the local road topology), which should be more accurate, but requires some setup and computing resources.
distances <- data.frame(
# with cost defined by travel times as routed within road networks
routed = baseline,
# with cost defined by Euclidean distances
straight = catchment_ratio(
population, hospitals,
weight = "gaussian", scale = .13, normalize_weight = TRUE, return_type = 1e5,
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
)
Overall, calculations made with routed and straight costs have a Person’s r of 0.959, though this is fairly sensitive to the weight function.
Looking specifically at our comparison counties…
dist_split <- get_focal_subset(distances, block_groups$geoid)
summary_contrast(dist_split)
Fairfax | Loudoun | |
---|---|---|
Average (routed) | 165.761 | 105.103 |
Average (straight) | 169.796 | 98.473 |
Average Difference | -4.036 | 6.630 |
Correlation | 0.911 | 0.962 |
Average Change in Rank Percent | 32.268 | 29.365 |
map_block_groups |>
addControl("Difference in Hospital Beds Per 100K People (Routed - Straight)", "topright") |>
addLegend("bottomright", dist_split$pal, dist_split$data$difference, opacity = 1) |>
addPolygons(
fillColor = dist_split$pal(dist_split$data$difference), fillOpacity = 1, weight = 1, color = "#000",
highlightOptions = highlightOptions(color = "#fff"), group = "Beds",
label = paste0("Routed - Straight: ", round(dist_split$data$difference, 3))
)
Higher values (blue) mean ratios are higher when costs are based on routed travel time.
splot(
routed ~ straight * County, dist_split$data,
colors = plot_colors,
title = FALSE, laby = "With Route Times", labx = "With Straight Distance"
)
Correlations within subsets remain high, with some interesting skewing.
The only reason to use straight distances in this context would be if good travel time data are not available.
There is a warping effect at the boundaries of catchment area calculation regions if the regions are artificially restricted, such as at state borders when consumers can freely access out-of-state providers. To mitigate such a warping effect, we include surrounding states as a buffer. To see how much of an effect this has, we can compare those overall calculations with ones made just within the focal region (here, the full set of states versus just Virginia).
buffer <- data.frame(
# including the full set of states (VA and surrounding)
full = baseline[population$state == "VA"],
# including just Virginia
none = catchment_ratio(
population[population$state == "VA", ], hospitals[hospitals$state == "VA", ], traveltimes,
weight = "gaussian", scale = 18, normalize_weight = TRUE, return_type = 1e5,
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
)
Overall, calculations made with routed and straight costs have a Person’s r of 0.971, though this is fairly sensitive to the weight function.
Looking specifically at our comparison counties…
buffer_split <- get_focal_subset(buffer, block_groups$geoid)
summary_contrast(buffer_split)
Fairfax | Loudoun | |
---|---|---|
Average (full) | 165.761 | 105.103 |
Average (none) | 146.391 | 104.814 |
Average Difference | 19.369 | 0.289 |
Correlation | 0.645 | 0.992 |
Average Change in Rank Percent | 32.327 | 24.880 |
map_block_groups |>
addControl("Difference in Hospital Beds Per 100K People (Full Buffer - No Buffer)", "topright") |>
addLegend("bottomright", buffer_split$pal, buffer_split$data$difference, opacity = 1) |>
addPolygons(
fillColor = buffer_split$pal(buffer_split$data$difference), fillOpacity = 1, weight = 1, color = "#000",
highlightOptions = highlightOptions(color = "#fff"), group = "Beds",
label = paste0("Full - None: ", round(buffer_split$data$difference, 3))
)
Higher values (blue) mean ratios are higher when the full buffer is used. As is to be expected, we see more differences along the state border.
splot(
full ~ none * County, buffer_split$data,
colors = plot_colors,
title = FALSE, laby = "Including Surrounding States", labx = "Only Virginia"
)
Ratios are very similar in Loudoun, but many block groups in Fairfax have higher ratios due to their proximity to D.C. and Maryland hospitals.
One aspect of weight is the step added to make the 3-step floating catchment area: Normalization adjustment. In this step, weights are multiplied by weights that have been normalized. Normalized weights sum to 1 across a consumer location’s connections, so multiplying by these dampens multiple connections, which works against the priority given to highly interconnected consumer locations.
steps <- data.frame(
# with adjusted weights
three = baseline,
# without adjusted weights
two = catchment_ratio(
population, hospitals, traveltimes,
weight = "gaussian", scale = 18, return_type = 1e5,
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
)
Overall, calculations made with and without norm adjustments have a Person’s r of 0.944.
Looking specifically at our comparison counties…
step_split <- get_focal_subset(steps, block_groups$geoid)
summary_contrast(step_split)
Fairfax | Loudoun | |
---|---|---|
Average (three) | 165.761 | 105.103 |
Average (two) | 171.687 | 82.727 |
Average Difference | -5.926 | 22.376 |
Correlation | 0.935 | 0.827 |
Average Change in Rank Percent | 32.662 | 29.134 |
map_block_groups |>
addControl("Difference in Hospital Beds Per 100K People (3-Step - 2-Step)", "topright") |>
addLegend("bottomright", step_split$pal, step_split$data$difference, opacity = 1) |>
addPolygons(
fillColor = step_split$pal(step_split$data$difference), fillOpacity = 1, weight = 1, color = "#000",
highlightOptions = highlightOptions(color = "#fff"), group = "Beds",
label = paste0("3-Step - 2-Step: ", round(step_split$data$difference, 3))
)
Higher values (blue) mean ratios are higher when weights have been adjusted. As expected, adjusting weights with their normalized versions shifts beds toward less interconnected regions, which particularly affects the more remote parts of Loudoun county.
splot(
three ~ two * County, step_split$data,
colors = plot_colors,
title = FALSE, laby = "With Norm-Adjusted Weights", labx = "With Original Weights"
)
The weight’s scale defines the size and vertical shape of the catchment areas. For example, we can compare the baseline scale (18) with a slightly bigger scale (25) across values representing travel times in minutes:
# just to illustrate
Travel_Time <- 1:70
Weight <- cbind(
catchment_weight(Travel_Time, "gaussian", scale = 18),
catchment_weight(Travel_Time, "gaussian", scale = 25)
)
colnames(Weight) <- c(18, 25)
splot(
Weight ~ Travel_Time,
points = FALSE, line = "con", colors = plot_colors,
note = FALSE, myl = c(-.1, 1), title = FALSE, leg.title = "Scale", add = {
steps <- c(10, 20, 30, 60)
adj <- c(-.07, .07)
for (i in 1:2) {
values <- cdat$`.^^.`[[i]]$y[steps]
points(values ~ steps, col = plot_colors[i])
text(steps, values + adj[i], prettyNum(values, digits = 2))
}
}
)
scale <- data.frame(
# with a weight scale of 18
smaller = baseline,
# with a weight scale of 25
larger = catchment_ratio(
population, hospitals, traveltimes,
weight = "gaussian", scale = 25, normalize_weight = TRUE, return_type = 1e5,
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
)
Overall, calculations made with smaller and larger scales have a Person’s r of 0.958.
Looking specifically at our comparison counties…
scale_split <- get_focal_subset(scale, block_groups$geoid)
summary_contrast(scale_split)
Fairfax | Loudoun | |
---|---|---|
Average (smaller) | 165.761 | 105.103 |
Average (larger) | 171.761 | 119.523 |
Average Difference | -6.001 | -14.420 |
Correlation | 0.977 | 0.934 |
Average Change in Rank Percent | 31.895 | 29.527 |
map_block_groups |>
addControl("Difference in Hospital Beds Per 100K People (Smaller - Larger)", "topright") |>
addLegend("bottomright", scale_split$pal, scale_split$data$difference, opacity = 1) |>
addPolygons(
fillColor = scale_split$pal(scale_split$data$difference), fillOpacity = 1, weight = 1, color = "#000",
highlightOptions = highlightOptions(color = "#fff"), group = "Beds",
label = paste0("Smaller - Larger: ", round(scale_split$data$difference, 3))
)
Higher values (blue) mean ratios are higher with the smaller scale. A larger scale works to more evenly distribute beds, generally shifting them from nearer to farther consumer locations.
splot(
smaller ~ larger * County, scale_split$data,
colors = plot_colors,
title = FALSE, laby = "With A Smaller Scale (18)", labx = "With A Larger Scale (25)"
)
This is incidentally very similar to the effect of norm adjusting weights because consumer locations with more provider connections also tend to be closer to those providers.
The hospital dataset includes average weekly bed use information, which we averaged over the year. This offers a reasonable analogue to the demand side of the catchment area ratios.
used <- hospitals$inpatient_beds_used_7_day_avg + hospitals$icu_beds_used_7_day_avg
used[is.na(used)] <- 0
First, we can look at how this actual use ratio lines up with the number of people that fall within each hospital’s travel-time-based catchment area:
baseline_population <- catchment_ratio(
population, hospitals, traveltimes,
weight = "gaussian", scale = 18, normalize_weight = TRUE, return_type = "demand",
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
We don’t know what portion of the population might actually use a hospital bed, but we would expect the total relevant population to correlate with the number of hospital beds used:
cor(baseline_population, used)
#> [1] 0.4881582
Now we can consider how this correlation is affected by each of the previously considered decisions:
variants <- data.frame(
baseline = baseline_population,
lower_resolution = catchment_ratio(
population_tract, hospitals, traveltimes_tract,
weight = "gaussian", scale = 18, normalize_weight = TRUE, return_type = "demand",
consumers_value = "population", providers_value = "total_beds_7_day_avg"
),
straight_distance = catchment_ratio(
population, hospitals,
weight = "gaussian", scale = .13, normalize_weight = TRUE, return_type = "demand",
consumers_value = "population", providers_value = "total_beds_7_day_avg"
),
no_norm = catchment_ratio(
population, hospitals, traveltimes,
weight = "gaussian", scale = 18, return_type = "demand",
consumers_value = "population", providers_value = "total_beds_7_day_avg"
),
bigger_radius = catchment_ratio(
population, hospitals, traveltimes,
weight = "gaussian", scale = 25, normalize_weight = TRUE, return_type = "demand",
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
)
cor(variants, data.frame(actual = used))
#> actual
#> baseline 0.4881582
#> lower_resolution 0.4981291
#> straight_distance 0.4866997
#> no_norm 0.3167594
#> bigger_radius 0.4754232
Since the unbuffered comparison only considers providers within Virginia, we’ll have to look at that separately:
variants_va <- cbind(
variants[hospitals$state == "VA", ],
unbuffered_population = catchment_ratio(
population[population$state == "VA", ], hospitals[hospitals$state == "VA", ], traveltimes,
weight = "gaussian", scale = 18, normalize_weight = TRUE, return_type = "demand",
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
)
cor(variants_va, data.frame(actual = used[hospitals$state == "VA"]))
#> actual
#> baseline 0.5645549
#> lower_resolution 0.5794173
#> straight_distance 0.5392701
#> no_norm 0.4380161
#> bigger_radius 0.5842672
#> unbuffered_population 0.5400037
We could use this information to tune a parameter, such as scale:
# we'll focus on a subset (just to make calculations faster),
# which we can define by distance from a focal set of providers
## identify an initial set of hospitals -- those close to our focal counties
focal_hospitals <- names(which(
!!colSums(traveltimes[grep("^51(?:107|059|600)", rownames(traveltimes)), ] < 120)
))
## identify regions within a minimal distance of those hospitals
focal_region_buffer <- names(which(!!rowSums(traveltimes[, focal_hospitals] < 120)))
## use that to define our tuning subset
tuning_traveltimes <- traveltimes[focal_region_buffer, ]
tuning_traveltimes <- tuning_traveltimes[, !!colSums(tuning_traveltimes < 120)]
tuning_population <- population[focal_region_buffer, ]
tuning_hospitals <- hospitals[colnames(tuning_traveltimes), ]
tuning_used <- tuning_hospitals$inpatient_beds_used_7_day_avg + tuning_hospitals$icu_beds_used_7_day_avg
tuning_used[is.na(tuning_used)] <- 0
# now we can see how the correlation seems to relate to scale
Scales <- c(1, 3, 5, 10, 15, 20, 40, 50)
scale_cors <- vapply(Scales, function(scale) {
demand <- catchment_ratio(
tuning_population, tuning_hospitals, tuning_traveltimes,
weight = "gaussian", scale = scale, normalize_weight = TRUE, return_type = "demand",
consumers_value = "population", providers_value = "total_beds_7_day_avg"
)
cor(demand, tuning_used)
}, 0)
splot(
scale_cors ~ Scales,
title = "Correlation Between Estimated Demand and Actual Use", laby = "Pearson's r",
sud = "Within a small buffer around the focal regions."
)