Redistribute Data: redistribute

Distribute data from a source frame to a target frame.

Usage

redistribute(source, target = NULL, map = list(), source_id = "GEOID",
  target_id = source_id, weight = NULL, source_variable = NULL,
  source_value = NULL, aggregate = NULL, weight_agg_method = "auto",
  rescale = TRUE, drop_extra_sources = FALSE, default_value = NA,
  outFile = NULL, overwrite = FALSE, make_intersect_map = FALSE,
  fill_targets = FALSE, overlaps = "keep", use_all = TRUE,
  return_geometry = TRUE, return_map = FALSE, verbose = FALSE)

Arguments

source: A matrix-like object you want to distribute from; usually this will be the real or more complete dataset, and is often at a lower resolution / higher level.
target: A matrix-like object you want to distribute to: usually this will be the dataset you want but isn't available, and is often at a higher resolution / lower level (for disaggregation). Can also be a single number, representing the number of initial characters of source IDs to derive target IDs from (useful for aggregating up nested groups).
map: A list with entries named with source IDs (or aligning with those IDs), containing vectors of associated target IDs (or indices of those IDs). Entries can also be numeric vectors with IDs as names, which will be used to weight the relationship. If IDs are related by substrings (the first characters of target IDs are source IDs), then a map can be automatically generated from them. If source and target contain sf geometries, a map will be made with st_intersects(source, target). If an intersects map is made, and source is being aggregated to target, and map entries contain multiple target IDs, those entries will be weighted by their proportion of overlap with the source area.
source_id, target_id: Name of a column in source / target, or a vector containing IDs. For source, this will default to the first column. For target, columns will be searched through for one that appears to relate to the source IDs, falling back to the first column.
weight: Name of a column, or a vector containing weights (or single value to apply to all cases), which apply to target when disaggregating, and source when aggregating. Defaults to unit weights (all weights are 1).
source_variable, source_value: If source is tall (with variables spread across rows rather than columns), specifies names of columns in source containing variable names and values for conversion.
aggregate: Logical; if specified, will determine whether to aggregate or disaggregate from source to target. Otherwise, this will be TRUE if there are more source observations than target observations.
weight_agg_method: Means of aggregating weight, in the case that target IDs contain duplicates. Options are "sum", "average", or "auto" (default; which will sum if weight is integer-like, and average otherwise).
rescale: Logical; if TRUE (default), adjusts target values after redistribution so that they match source totals. If FALSE, this adjustment is skipped.
drop_extra_sources: Logical; if TRUE, will remove any source rows that are not mapped to any target rows. Useful when inputting a source with regions outside of the target area, especially when rescale is TRUE.
default_value: Value assigned to any unmapped target ID.
outFile: Path to a CSV file in which to save results.
overwrite: Logical; if TRUE, will overwrite an existing outFile.
make_intersect_map: Logical; if TRUE, will opt to calculate an intersect-based map rather than an ID-based map, if both seem possible. If specified as FALSE, will never calculate an intersect-based map.
fill_targets: Logical; if TRUE, will make new target rows for any un-mapped source row.
overlaps: If specified and not TRUE or "keep" (default), will assign target entities that are mapped to multiple source entities to a single source entity. The value determines how entities with the same weight should be assigned, between "first", "last", and "random".
use_all: Logical; if TRUE (default), will redistribute map weights so they sum to 1. Otherwise, entities may be partially weighted.
return_geometry: Logical; if FALSE, will not set the returned data.frame's geometry to that of target, if it exists.
return_map: Logical; if TRUE, will only return the map, without performing the redistribution. Useful if you want to inspect an automatically created map, or use it in a later call.
verbose: Logical; if TRUE, will show status messages.
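
The roles of map, weight, and rescale can be illustrated without the package. The following base R sketch is not the package's implementation; it uses hypothetical data, assumes a prefix-based map (the first character of each target ID is its source ID), and splits each source value among its mapped targets in proportion to their weights.

```r
# Hypothetical data: two sources, five targets nested by ID prefix.
source <- data.frame(id = c("a", "b"), num = c(10, 20))
target <- data.frame(
  id = c("a1", "a2", "b1", "b2", "b3"),
  weight = c(1, 3, 1, 1, 2)
)

# Build a map shaped like the one described for `map`: a list named
# with source IDs, each entry holding the associated target IDs.
map <- split(target$id, substr(target$id, 1, 1))

# Disaggregate: each target receives its weight's share of the source value.
result <- do.call(rbind, lapply(names(map), function(sid) {
  rows <- target[target$id %in% map[[sid]], ]
  share <- rows$weight / sum(rows$weight)
  data.frame(id = rows$id, num = source$num[source$id == sid] * share)
}))
print(result)

# Target values add up to the source totals, which is the property
# that rescaling is described as preserving.
stopifnot(isTRUE(all.equal(sum(result$num), sum(source$num))))
```

Because the shares within each source sum to 1, the totals line up here without any further adjustment; with partial weights they might not, which is presumably where rescale and use_all come in.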

Value

A data.frame with a row for each target ID (identified by the first column, id), and a column for each variable from source.

Examples

# minimal example
source <- data.frame(a = 1, b = 2)
target <- 1:5
(redistribute(source, target, verbose = TRUE))
#> ℹ source IDs: 1
#> ℹ target IDs: `target` vector
#> ℹ map: all target IDs for single source
#> ℹ weights: 1
#> ℹ redistributing 2 variables from 1 source to 5 targets:
#> • (numb; 2) a, b
#> ℹ disaggregating...
#> ✔ done disaggregating [12ms]
#> 
#> ℹ checking totals
#> ✔ totals are aligned [7ms]
#> 
#>   id   a   b
#> 1  1 0.2 0.4
#> 2  2 0.2 0.4
#> 3  3 0.2 0.4
#> 4  4 0.2 0.4
#> 5  5 0.2 0.4
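
The values above follow from unit weights: with one source row and five equally weighted targets, each target receives one fifth of each source value. A quick base R check of that arithmetic, independent of the package, might look like this:

```r
source <- data.frame(a = 1, b = 2)
n_targets <- 5

# With unit weights, every target gets an equal share of each variable.
expected <- data.frame(
  id = seq_len(n_targets),
  a = source$a / n_targets,
  b = source$b / n_targets
)
print(expected)

# Column totals match the source values, mirroring the
# "totals are aligned" status message.
stopifnot(isTRUE(all.equal(colSums(expected[, c("a", "b")]), unlist(source))))
```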

# multi-entity example
source <- data.frame(id = c("a", "b"), cat = c("aaa", "bbb"), num = c(1, 2))
target <- data.frame(
  id = sample(paste0(c("a", "b"), rep(1:5, 2))),
  population = sample.int(1e5, 10)
)
(redistribute(source, target, verbose = TRUE))
#> ℹ source IDs: id column of `source`
#> ℹ target IDs: id column of `target`
#> ℹ map: first 1 character of target IDs
#> ℹ weights: 1
#> ℹ redistributing 2 variables from 2 sources to 10 targets:
#> • (numb; 1) num
#> • (char; 1) cat
#> ℹ disaggregating...
#> ✔ done disaggregating [6ms]
#> 
#> ℹ re-converting categorical levels
#> ℹ checking totals
#> ✔ totals are aligned [8ms]
#> 
#>    id cat num
#> 1  b2 bbb 0.4
#> 2  a4 aaa 0.2
#> 3  a5 aaa 0.2
#> 4  b4 bbb 0.4
#> 5  b5 bbb 0.4
#> 6  a1 aaa 0.2
#> 7  b3 bbb 0.4
#> 8  a3 aaa 0.2
#> 9  a2 aaa 0.2
#> 10 b1 bbb 0.4
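
Note that the call above reports unit weights, so the population column has no effect on the result. Per the weight argument, a target column can be named to weight the disaggregation. The sketch below works through the proportional split in base R using hypothetical fixed populations (rather than the random ones above), then aggregates back up by ID prefix. The package call in the final comment is an assumption about usage based on the argument description; its output is not shown.

```r
source <- data.frame(id = c("a", "b"), num = c(1, 2))
target <- data.frame(
  id = c("a1", "a2", "b1", "b2"),
  population = c(100, 300, 50, 150)
)

# Share of each target within its source group (first character of the ID).
group <- substr(target$id, 1, 1)
share <- target$population / ave(target$population, group, FUN = sum)

# Expected population-weighted values for each target.
expected <- source$num[match(group, source$id)] * share
print(data.frame(id = target$id, num = expected))

# Aggregating back up by ID prefix recovers the source values.
back <- tapply(expected, group, sum)
stopifnot(isTRUE(all.equal(as.numeric(back), source$num)))

# Presumed equivalent package call (not run here):
# redistribute(source, target, weight = "population")
```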