
Unify multiple files, each of which contains a tall set of variables associated with regions.

Usage

data_reformat_sdad(files, out = NULL, variables = NULL, ids = NULL,
  value = "value", value_name = "measure", id = "geoid", time = "year",
  dataset = "region_type",
  entity_info = c(type = "region_type", name = "region_name"),
  measure_info = list(), metadata = NULL, formatters = NULL,
  compression = "xz", read_existing = TRUE, overwrite = FALSE,
  get_coverage = TRUE, verbose = TRUE)

Arguments

files

A character vector of file paths, or the path to a directory containing data files.

out

Path to a directory to write files to; if not specified, files will not be written.

variables

Vector of variable names (in the value_name column) to be included.

ids

Vector of IDs (in the id column) to be included.

value

Name of the column containing variable values.

value_name

Name of the column containing variable names; if this column is not present, each file is assumed to contain a single variable.

id

Column name of IDs which uniquely identify entities.

time

Column name of the variable representing time.

dataset

Column name used to separate entity scales.

entity_info

A list of variable names to extract and create an IDs map from (entity_info.json, created in the output directory). Entries can be named to rename the variables they refer to in entity features.

measure_info

Measure info to add file information to (as origin), and to write to out.

metadata

A matrix-like object with additional information associated with entities (such as region types and names), to be merged in by ID.

formatters

A list of functions to pass columns through, with names identifying those columns (e.g., list(region_name = function(x) sub(",.*$", "", x)) to strip text after a comma in the "region_name" column).

compression

A character specifying the type of compression to use on the created files: one of "gzip", "bzip2", or "xz". Set to FALSE to disable compression.

read_existing

Logical; if FALSE, will not read in existing sets.

overwrite

Logical; if TRUE, will overwrite existing reformatted files, even if they are newer than their source files.

get_coverage

Logical; if FALSE, will not calculate a summary of variable coverage (coverage.csv).

verbose

Logical; if FALSE, will not print status messages.

Value

An invisible list of the unified variable data, split by dataset.

Details

The basic assumption is that there are (a) entities which (b) exist in a hierarchy, and (c1) have a static set of features and (c2) a set of variable features which (d) are assessed at multiple time points.

For example (and generally), entities are (a) regions, with (b) smaller regions making up larger regions, which (c1) have names, and (c2) population and demographic counts (d) between 2009 and 2019.

Examples

dir <- paste0(tempdir(), "/reformat_example")
dir.create(dir, FALSE)

# minimal example
data <- data.frame(
  geoid = 1:10,
  value = 1
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
(data_reformat_sdad(dir))
#> ⠙ reading in 0/1 original file
#> ✔ reading in 1/1 original file [46ms]
#> 
#> ⠙ creating dataset dataset (ID 0/10)
#> ✔ created dataset dataset (10 IDs) [15ms]
#> 
#> $dataset
#>    ID time data
#> 1   1    1    1
#> 2   2    1    1
#> 3   3    1    1
#> 4   4    1    1
#> 5   5    1    1
#> 6   6    1    1
#> 7   7    1    1
#> 8   8    1    1
#> 9   9    1    1
#> 10 10    1    1
#> 
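
# time column (an illustrative sketch, not part of the original examples;
# with the default time = "year", a "year" column should supply the time
# dimension; output not shown here)
data <- data.frame(
  geoid = rep(1:5, 2),
  year = rep(c(2010, 2011), each = 5),
  value = 1
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
data_reformat_sdad(dir)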

# multiple variables
data <- data.frame(
  geoid = 1:10,
  value = 1,
  measure = paste0("v", 1:2)
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
(data_reformat_sdad(dir))
#> ⠙ reading in 0/1 original file
#> ✔ reading in 1/1 original file [14ms]
#> 
#> ⠙ creating dataset dataset (ID 0/10)
#> ✔ created dataset dataset (10 IDs) [16ms]
#> 
#> $dataset
#>    ID time v1 v2
#> 1   1    1  1 NA
#> 2   2    1 NA  1
#> 3   3    1  1 NA
#> 4   4    1 NA  1
#> 5   5    1  1 NA
#> 6   6    1 NA  1
#> 7   7    1  1 NA
#> 8   8    1 NA  1
#> 9   9    1  1 NA
#> 10 10    1 NA  1
#> 
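
# selecting variables and IDs (an illustrative sketch, not part of the original
# examples; variables filters on the value_name column, ids on the id column;
# output not shown here)
data <- data.frame(
  geoid = 1:10,
  value = 1,
  measure = paste0("v", 1:2)
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
data_reformat_sdad(dir, variables = "v1", ids = 1:5)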

# multiple datasets
data <- data.frame(
  geoid = 1:10,
  value = 1,
  measure = paste0("v", 1:2),
  region_type = rep(c("a", "b"), each = 5)
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
(data_reformat_sdad(dir))
#> ⠙ reading in 0/1 original file
#> ✔ reading in 1/1 original file [15ms]
#> 
#> ⠙ creating a dataset (ID 0/5)
#> ✔ created a dataset (5 IDs) [12ms]
#> 
#> ⠙ creating b dataset (ID 0/5)
#> ✔ created b dataset (5 IDs) [12ms]
#> 
#> $a
#>   ID time v1 v2
#> 1  1    1  1 NA
#> 2  2    1 NA  1
#> 3  3    1  1 NA
#> 4  4    1 NA  1
#> 5  5    1  1 NA
#> 
#> $b
#>   ID time v1 v2
#> 1  6    1 NA  1
#> 2  7    1  1 NA
#> 3  8    1 NA  1
#> 4  9    1  1 NA
#> 5 10    1 NA  1
#>
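
# writing output files (an illustrative sketch, not part of the original
# examples; with out set, reformatted files and entity_info.json, built from
# the default entity_info columns, should be written to that directory;
# region_name is an assumed example column; output not shown here)
data <- data.frame(
  geoid = 1:10,
  value = 1,
  measure = paste0("v", 1:2),
  region_type = rep(c("a", "b"), each = 5),
  region_name = paste0("Region ", 1:10, ", Extra")
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
out_dir <- paste0(dir, "/out")
dir.create(out_dir, FALSE)
data_reformat_sdad(
  dir,
  out = out_dir,
  formatters = list(region_name = function(x) sub(",.*$", "", x))
)
list.files(out_dir)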