Distribute data from a source frame to a target frame.

Usage

redistribute(source, target = NULL, map = list(), source_id = "GEOID",
  target_id = source_id, weight = NULL, source_variable = NULL,
  source_value = NULL, aggregate = NULL, weight_agg_method = "auto",
  rescale = TRUE, drop_extra_sources = FALSE, default_value = NA,
  outFile = NULL, overwrite = FALSE, make_intersect_map = FALSE,
  fill_targets = FALSE, overlaps = "keep", use_all = TRUE,
  return_geometry = TRUE, return_map = FALSE, verbose = FALSE)

Arguments

source

A matrix-like object to distribute from. This is usually the observed or more complete dataset, and is often at a lower resolution (a higher level of aggregation).

target

A matrix-like object to distribute to. This is usually the dataset you want but do not have, and is often at a higher resolution (a lower level of aggregation) when disaggregating. Can also be a single number giving the number of initial characters of source IDs from which to derive target IDs (useful for aggregating up nested groups).
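To illustrate the numeric form of target, here is a minimal base R sketch of deriving target IDs from the leading characters of source IDs. The IDs and the variable names (source_ids, n_chars) are hypothetical, and this shows only the idea, not the package's internal code.

```r
# Hypothetical nested IDs: a 5-character parent ID followed by a 2-character suffix.
source_ids <- c("0100101", "0100102", "0100301", "0100302")

# Number of leading characters that identify the parent group
# (the kind of value that could be passed as `target`).
n_chars <- 5

# Each source ID is assigned to the target ID formed by its prefix.
target_ids <- substr(source_ids, 1, n_chars)
unique(target_ids)
```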

map

A list with entries named with source IDs (or aligning with those IDs), each containing a vector of associated target IDs (or indices of those IDs). Entries can also be numeric vectors with IDs as names, in which case the values are used to weight the relationship. If IDs are related by substrings (the first characters of target IDs are source IDs), a map can be generated from them automatically. If source and target contain sf geometries, a map will be made with st_intersects(source, target). If an intersects map is made, source is being aggregated to target, and map entries contain multiple target IDs, those entries will be weighted by their proportion of overlap with the source area.
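The two list forms described above can be sketched in base R as follows. The IDs and weights are made up for illustration, and building the map with split() is just one way to group IDs by prefix; it is not necessarily how the package builds its own map.

```r
# Hypothetical IDs in which the first character of each target ID is its source ID.
source_ids <- c("a", "b")
target_ids <- c("a1", "a2", "b1", "b2", "b3")

# Unweighted form: entries named with source IDs, each holding target IDs.
map <- split(target_ids, substr(target_ids, 1, 1))

# Weighted form: numeric vectors whose names are target IDs.
# The weight values here are arbitrary.
weighted_map <- list(
  a = c(a1 = 0.75, a2 = 0.25),
  b = c(b1 = 1, b2 = 1, b3 = 1)
)

str(map)
```

A list shaped like either of these is what the map argument describes; return_map = TRUE is documented as the way to inspect the map a call would construct.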

source_id, target_id

Name of a column in source / target, or a vector containing IDs. For source, this will default to the first column. For target, columns will be searched through for one that appears to relate to the source IDs, falling back to the first column.

weight

Name of a column, or a vector containing weights (or a single value to apply to all cases). Weights apply to target when disaggregating and to source when aggregating. Defaults to unit weights (all weights are 1).
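As a rough illustration of what weights do when disaggregating, the sketch below splits one source value across targets in proportion to a weight. Proportional splitting is an assumption here, chosen because it is consistent with the equal shares shown in the examples when all weights are 1; the population values are invented.

```r
# One source value to spread across three hypothetical targets.
source_value <- 100

# Hypothetical target weights (for example, population counts).
population <- c(t1 = 10, t2 = 30, t3 = 60)

# Each target receives a share proportional to its weight.
shares <- population / sum(population)
target_values <- source_value * shares

# With unit weights, every target receives an equal share instead.
unit_weights <- c(t1 = 1, t2 = 1, t3 = 1)
equal_values <- source_value * unit_weights / sum(unit_weights)

target_values
```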

source_variable, source_value

If source is tall (with variables spread across rows rather than columns), specifies names of columns in source containing variable names and values for conversion.
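For readers unsure what "tall" means here, this base R sketch shows a tall data frame and its wide equivalent. The column names (variable, value) are hypothetical, and stats::reshape() stands in for whatever conversion the function performs internally.

```r
# A tall source: one row per ID-variable pair.
tall <- data.frame(
  id = c("a", "a", "b", "b"),
  variable = c("x", "y", "x", "y"),
  value = c(1, 2, 3, 4)
)

# The equivalent wide layout: one row per ID, one column per variable.
wide <- reshape(
  tall,
  idvar = "id", timevar = "variable", v.names = "value",
  direction = "wide"
)

wide
```

With a source shaped like tall, the corresponding arguments would be source_variable = "variable" and source_value = "value".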

aggregate

Logical; if specified, will determine whether to aggregate or disaggregate from source to target. Otherwise, this will be TRUE if there are more source observations than target observations.
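The default rule stated above amounts to a row-count comparison, sketched here with toy frames. The variable name aggregate_default is hypothetical.

```r
# Two source rows and ten target rows, as in the multi-entity example.
source <- data.frame(id = c("a", "b"), num = c(1, 2))
target <- data.frame(id = paste0(c("a", "b"), rep(1:5, each = 2)))

# More source rows than target rows would imply aggregation;
# here there are fewer, so the direction is disaggregation.
aggregate_default <- nrow(source) > nrow(target)
aggregate_default
```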

weight_agg_method

Method used to aggregate weight when target IDs contain duplicates. Options are "sum", "average", or "auto" (the default, which sums if weight is integer-like and averages otherwise).
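One way to picture the "auto" rule is the sketch below. The helper names (is_integer_like, aggregate_weight) are hypothetical, and the integer-like test used here (every value equals its rounded value) is an assumption; the package may apply a different check.

```r
# Assumed integer-like test: all values are whole numbers.
is_integer_like <- function(x) all(x == round(x))

# Collapse weights for duplicated IDs: sum counts, average anything else.
aggregate_weight <- function(weight, ids) {
  fun <- if (is_integer_like(weight)) sum else mean
  tapply(weight, ids, fun)
}

ids <- c("a", "a", "b")
counts <- aggregate_weight(c(10, 20, 5), ids)     # whole numbers, so summed
rates <- aggregate_weight(c(0.2, 0.4, 0.5), ids)  # fractions, so averaged
```

Summing suits count-like weights such as populations, while averaging suits rate-like weights, which is presumably the motivation for the default.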

rescale

Logical; if FALSE, target values will not be adjusted after redistribution to match source totals.
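The effect of matching totals can be sketched as a proportional adjustment. This is an assumption about the form of the adjustment, shown with invented numbers; the documentation above only states that target values are adjusted to match source totals.

```r
# Total of a variable in the source.
source_total <- 100

# Hypothetical redistributed target values that fall short of that total.
target_values <- c(12, 30, 55)

# Scale every target value by the same factor so the totals agree.
rescaled <- target_values * source_total / sum(target_values)

sum(rescaled)
```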

drop_extra_sources

Logical; if TRUE, will remove any source rows that are not mapped to any target rows. Useful when inputting a source with regions outside of the target area, especially when rescale is TRUE.

default_value

Value to assign to any unmapped target ID.

outFile

Path to a CSV file in which to save results.

overwrite

Logical; if TRUE, will overwrite an existing outFile.

make_intersect_map

Logical; if TRUE, will opt to calculate an intersect-based map rather than an ID-based map, if both seem possible. If specified as FALSE, will never calculate an intersect-based map.

fill_targets

Logical; if TRUE, will make new target rows for any un-mapped source row.

overlaps

If specified and not TRUE or "keep" (default), will assign target entities that are mapped to multiple source entities to a single source entity. The value determines how entities with the same weight should be assigned, between "first", "last", and "random".
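The tie-breaking options can be sketched as follows. The function resolve_overlap is hypothetical, and the rule that the highest-weighted source wins before ties are considered is an assumption inferred from the description above.

```r
# Pick a single source for a target that is mapped to several sources.
# `weights` is a named numeric vector: names are source IDs.
resolve_overlap <- function(weights, ties = c("first", "last", "random")) {
  ties <- match.arg(ties)
  # Assumed rule: only the highest-weighted sources are candidates.
  candidates <- names(weights)[weights == max(weights)]
  switch(ties,
    first = candidates[1],
    last = candidates[length(candidates)],
    random = sample(candidates, 1)
  )
}

tied <- c(s1 = 0.5, s2 = 0.5, s3 = 0.2)
resolve_overlap(tied, "first")
resolve_overlap(tied, "last")
```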

use_all

Logical; if TRUE (default), will redistribute map weights so they sum to 1. Otherwise, entities may be partially weighted.
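A small sketch of the difference, assuming the normalization is applied within each map entry (the entry and its overlap proportions are invented):

```r
# Hypothetical map entry: overlap proportions that cover only 80% of the entity.
entry <- c(t1 = 0.3, t2 = 0.5)

# Normalized so the whole entity is used (weights sum to 1).
normalized <- entry / sum(entry)

# Left as-is, 20% of the entity's value would go unassigned.
unassigned <- 1 - sum(entry)

normalized
```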

return_geometry

Logical; if FALSE, will not set the returned data.frame's geometry to that of target, if it exists.

return_map

Logical; if TRUE, will only return the map, without performing the redistribution. Useful if you want to inspect an automatically created map, or use it in a later call.

verbose

Logical; if TRUE, will show status messages.

Value

A data.frame with a row for each target ID (identified by the first column, id), and a column for each variable from source.

Examples

# minimal example
source <- data.frame(a = 1, b = 2)
target <- 1:5
(redistribute(source, target, verbose = TRUE))
#>  source IDs: 1
#>  target IDs: `target` vector
#>  map: all target IDs for single source
#>  weights: 1
#>  redistributing 2 variables from 1 source to 5 targets:
#>  (numb; 2) a, b
#>  disaggregating...
#>  done disaggregating [12ms]
#> 
#>  checking totals
#>  totals are aligned [7ms]
#> 
#>   id   a   b
#> 1  1 0.2 0.4
#> 2  2 0.2 0.4
#> 3  3 0.2 0.4
#> 4  4 0.2 0.4
#> 5  5 0.2 0.4

# multi-entity example
source <- data.frame(id = c("a", "b"), cat = c("aaa", "bbb"), num = c(1, 2))
target <- data.frame(
  id = sample(paste0(c("a", "b"), rep(1:5, 2))),
  population = sample.int(1e5, 10)
)
(redistribute(source, target, verbose = TRUE))
#>  source IDs: id column of `source`
#>  target IDs: id column of `target`
#>  map: first 1 character of target IDs
#>  weights: 1
#>  redistributing 2 variables from 2 sources to 10 targets:
#>  (numb; 1) num
#>  (char; 1) cat
#>  disaggregating...
#>  done disaggregating [6ms]
#> 
#>  re-converting categorical levels
#>  checking totals
#>  totals are aligned [8ms]
#> 
#>    id cat num
#> 1  b2 bbb 0.4
#> 2  a4 aaa 0.2
#> 3  a5 aaa 0.2
#> 4  b4 bbb 0.4
#> 5  b5 bbb 0.4
#> 6  a1 aaa 0.2
#> 7  b3 bbb 0.4
#> 8  a3 aaa 0.2
#> 9  a2 aaa 0.2
#> 10 b1 bbb 0.4