
Unify multiple files, each of which contains a tall set of variables associated with regions.

Usage

data_reformat_sdad(files, out = NULL, variables = NULL, ids = NULL,
  value = "value", value_name = "measure", id = "geoid", time = "year",
  dataset = "region_type",
  entity_info = c(type = "region_type", name = "region_name"),
  measure_info = list(), metadata = NULL, formatters = NULL,
  compression = "xz", read_existing = TRUE, overwrite = FALSE,
  get_coverage = TRUE, verbose = TRUE)

Arguments

files

A character vector of file paths, or the path to a directory containing data files.

out

Path to a directory to write files to; if not specified, files will not be written.

variables

Vector of variable names (in the value_name column) to be included.

ids

Vector of IDs (in the id column) to be included.

value

Name of the column containing variable values.

value_name

Name of the column containing variable names; if this column is not present, each file is assumed to contain a single variable.

id

Column name of IDs which uniquely identify entities.

time

Column name of the variable representing time.

dataset

Column name used to separate entity scales.

entity_info

A list of variable names to extract and create an IDs map from (entity_info.json, created in the output directory). Entries can be named to rename the variables they refer to in entity features.

measure_info

Measure info to add file information to (as origin), and to write to out.

metadata

A matrix-like object with additional information associated with entities (such as region types and names), to be merged in by ID.

formatters

A list of functions to pass columns through, with names identifying those columns (e.g., list(region_name = function(x) sub(",.*$", "", x)) to strip text after a comma in the "region_name" column).

compression

A character specifying the type of compression to use on the created files: one of "gzip", "bzip2", or "xz". Set to FALSE to disable compression.

read_existing

Logical; if FALSE, will not read in existing sets.

overwrite

Logical; if TRUE, will overwrite existing reformatted files, even if they are newer than their source files.

get_coverage

Logical; if FALSE, will not calculate a summary of variable coverage (coverage.csv).

verbose

Logical; if FALSE, will not print status messages.

Value

An invisible list of the unified variable data, split by dataset.

Details

The basic assumption is that there are (a) entities which (b) exist in a hierarchy, and (c1) have a static set of features and (c2) a set of variable features which (d) are assessed at multiple time points.

For example (and generally), entities are (a) regions, with (b) smaller regions making up larger regions, which (c1) have names, and (c2) population and demographic counts (d) between 2009 and 2019.

Examples

dir <- paste0(tempdir(), "/reformat_example")
dir.create(dir, FALSE)

# minimal example
data <- data.frame(
  geoid = 1:10,
  value = 1
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
(data_reformat_sdad(dir))
#> ⠙ reading in 0/1 original file
#> ✔ reading in 1/1 original file [46ms]
#> 
#> ⠙ creating dataset dataset (ID 0/10)
#> ✔ created dataset dataset (10 IDs) [15ms]
#> 
#> $dataset
#>    ID time data
#> 1   1    1    1
#> 2   2    1    1
#> 3   3    1    1
#> 4   4    1    1
#> 5   5    1    1
#> 6   6    1    1
#> 7   7    1    1
#> 8   8    1    1
#> 9   9    1    1
#> 10 10    1    1
#> 
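
# time column (an illustrative sketch, not part of the original examples;
# with the default time = "year", a "year" column should supply the time
# dimension; output not shown here)
data <- data.frame(
  geoid = rep(1:5, 2),
  year = rep(c(2010, 2011), each = 5),
  value = 1
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
data_reformat_sdad(dir)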

# multiple variables
data <- data.frame(
  geoid = 1:10,
  value = 1,
  measure = paste0("v", 1:2)
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
(data_reformat_sdad(dir))
#> ⠙ reading in 0/1 original file
#> ✔ reading in 1/1 original file [14ms]
#> 
#> ⠙ creating dataset dataset (ID 0/10)
#> ✔ created dataset dataset (10 IDs) [16ms]
#> 
#> $dataset
#>    ID time v1 v2
#> 1   1    1  1 NA
#> 2   2    1 NA  1
#> 3   3    1  1 NA
#> 4   4    1 NA  1
#> 5   5    1  1 NA
#> 6   6    1 NA  1
#> 7   7    1  1 NA
#> 8   8    1 NA  1
#> 9   9    1  1 NA
#> 10 10    1 NA  1
#> 
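
# selecting variables and IDs (an illustrative sketch, not part of the original
# examples; variables filters on the value_name column, ids on the id column;
# output not shown here)
data <- data.frame(
  geoid = 1:10,
  value = 1,
  measure = paste0("v", 1:2)
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
data_reformat_sdad(dir, variables = "v1", ids = 1:5)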

# multiple datasets
data <- data.frame(
  geoid = 1:10,
  value = 1,
  measure = paste0("v", 1:2),
  region_type = rep(c("a", "b"), each = 5)
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
(data_reformat_sdad(dir))
#> ⠙ reading in 0/1 original file
#> ✔ reading in 1/1 original file [15ms]
#> 
#> ⠙ creating a dataset (ID 0/5)
#> ✔ created a dataset (5 IDs) [12ms]
#> 
#> ⠙ creating b dataset (ID 0/5)
#> ✔ created b dataset (5 IDs) [12ms]
#> 
#> $a
#>   ID time v1 v2
#> 1  1    1  1 NA
#> 2  2    1 NA  1
#> 3  3    1  1 NA
#> 4  4    1 NA  1
#> 5  5    1  1 NA
#> 
#> $b
#>   ID time v1 v2
#> 1  6    1 NA  1
#> 2  7    1  1 NA
#> 3  8    1 NA  1
#> 4  9    1  1 NA
#> 5 10    1 NA  1
#>
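
# writing output files (an illustrative sketch, not part of the original
# examples; with out set, reformatted files and entity_info.json, built from
# the default entity_info columns, should be written to that directory;
# region_name is an assumed example column; output not shown here)
data <- data.frame(
  geoid = 1:10,
  value = 1,
  measure = paste0("v", 1:2),
  region_type = rep(c("a", "b"), each = 5),
  region_name = paste0("Region ", 1:10, ", Extra")
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
out_dir <- paste0(dir, "/out")
dir.create(out_dir, FALSE)
data_reformat_sdad(
  dir,
  out = out_dir,
  formatters = list(region_name = function(x) sub(",.*$", "", x))
)
list.files(out_dir)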