Unify multiple files, each of which contains a tall set of variables associated with regions.
Usage
data_reformat_sdad(
  files, out = NULL, variables = NULL, ids = NULL,
  value = "value", value_name = "measure", id = "geoid", time = "year",
  dataset = "region_type",
  entity_info = c(type = "region_type", name = "region_name"),
  measure_info = list(), metadata = NULL, formatters = NULL,
  compression = "xz", read_existing = TRUE, overwrite = FALSE,
  get_coverage = TRUE, verbose = TRUE
)
Arguments
- files
A character vector of file paths, or the path to a directory containing data files.
- out
Path to a directory to write files to; if not specified, files will not be written.
- variables
Vector of variable names (in the value_name column) to be included.
- ids
Vector of IDs (in the id column) to be included.
- value
Name of the column containing variable values.
- value_name
Name of the column containing variable names; if this column is not present, each file is assumed to contain a single variable.
- id
Column name of IDs which uniquely identify entities.
- time
Column name of the variable representing time.
- dataset
Column name used to separate entity scales.
- entity_info
A list containing variable names to extract and create an ids map from (entity_info.json, created in the output directory). Entries can be named to rename the variables they refer to in entity features.
- measure_info
Measure info to add file information to (as origin), and to write to out.
- metadata
A matrix-like object with additional information associated with entities (such as region types and names), to be merged by id.
- formatters
A list of functions to pass columns through, with names identifying those columns (e.g., list(region_name = function(x) sub(",.*$", "", x)) to strip text after a comma in the "region_name" column); see the sketch after this list.
- compression
A character specifying the type of compression to use on the created files: one of "gzip", "bzip2", or "xz". Set to FALSE to disable compression.
- read_existing
Logical; if FALSE, will not read in existing sets.
- overwrite
Logical; if TRUE, will overwrite existing reformatted files, even if the source files are older than them.
- get_coverage
Logical; if FALSE, will not calculate a summary of variable coverage (coverage.csv).
- verbose
Logical; if FALSE, will not print status messages.
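For example, a call combining out, entity_info, and formatters might look like the following sketch (the paths are placeholders, and the input files are assumed to contain "region_type" and "region_name" columns, per the defaults):

# minimal sketch (not run); paths and column names are illustrative
data_reformat_sdad(
  "path/to/raw",                # directory of tall data files
  out = "path/to/reformatted",  # where unified files are written
  entity_info = c(type = "region_type", name = "region_name"),
  formatters = list(
    # strip anything after a comma in region names
    region_name = function(x) sub(",.*$", "", x)
  )
)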
Details
The basic assumption is that there are (a) entities which (b) exist in a hierarchy, and (c1) have a static set of features and (c2) a set of variable features which (d) are assessed at multiple time points.
For example (and generally), entities are (a) regions, with (b) smaller regions making up larger regions, which (c1) have names and (c2) population and demographic counts (d) assessed between 2009 and 2019.
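As a sketch, a tall input file matching these assumptions (using the default column names; the values are placeholders) might look like:

# illustrative tall input; column names follow the function's defaults
data.frame(
  geoid = c(1, 1, 2),                   # (a) entity IDs
  region_type = "county",               # (b) scale within the hierarchy
  region_name = c("One", "One", "Two"), # (c1) static feature
  year = c(2009, 2019, 2019),           # (d) time points
  measure = "population",               # variable name (value_name column)
  value = c(10, 12, 20)                 # (c2) variable value
)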
Examples
dir <- paste0(tempdir(), "/reformat_example")
dir.create(dir, FALSE)
# minimal example
data <- data.frame(
geoid = 1:10,
value = 1
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
(data_reformat_sdad(dir))
#> ⠙ reading in 0/1 original file
#> ✔ reading in 1/1 original file [46ms]
#>
#> ⠙ creating dataset dataset (ID 0/10)
#> ✔ created dataset dataset (10 IDs) [15ms]
#>
#> $dataset
#> ID time data
#> 1 1 1 1
#> 2 2 1 1
#> 3 3 1 1
#> 4 4 1 1
#> 5 5 1 1
#> 6 6 1 1
#> 7 7 1 1
#> 8 8 1 1
#> 9 9 1 1
#> 10 10 1 1
#>
# multiple variables
data <- data.frame(
geoid = 1:10,
value = 1,
measure = paste0("v", 1:2)
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
(data_reformat_sdad(dir))
#> ⠙ reading in 0/1 original file
#> ✔ reading in 1/1 original file [14ms]
#>
#> ⠙ creating dataset dataset (ID 0/10)
#> ✔ created dataset dataset (10 IDs) [16ms]
#>
#> $dataset
#> ID time v1 v2
#> 1 1 1 1 NA
#> 2 2 1 NA 1
#> 3 3 1 1 NA
#> 4 4 1 NA 1
#> 5 5 1 1 NA
#> 6 6 1 NA 1
#> 7 7 1 1 NA
#> 8 8 1 NA 1
#> 9 9 1 1 NA
#> 10 10 1 NA 1
#>
# multiple datasets
data <- data.frame(
geoid = 1:10,
value = 1,
measure = paste0("v", 1:2),
region_type = rep(c("a", "b"), each = 5)
)
write.csv(data, paste0(dir, "/data.csv"), row.names = FALSE)
(data_reformat_sdad(dir))
#> ⠙ reading in 0/1 original file
#> ✔ reading in 1/1 original file [15ms]
#>
#> ⠙ creating a dataset (ID 0/5)
#> ✔ created a dataset (5 IDs) [12ms]
#>
#> ⠙ creating b dataset (ID 0/5)
#> ✔ created b dataset (5 IDs) [12ms]
#>
#> $a
#> ID time v1 v2
#> 1 1 1 1 NA
#> 2 2 1 NA 1
#> 3 3 1 1 NA
#> 4 4 1 NA 1
#> 5 5 1 1 NA
#>
#> $b
#> ID time v1 v2
#> 1 6 1 NA 1
#> 2 7 1 1 NA
#> 3 8 1 NA 1
#> 4 9 1 1 NA
#> 5 10 1 NA 1
#>
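# sketch: writing reformatted files to a directory, merging in entity
# metadata and disabling compression (metadata values are placeholders)
meta <- data.frame(
  geoid = 1:10,
  region_type = rep(c("a", "b"), each = 5),
  region_name = paste("Region", 1:10)
)
data_reformat_sdad(
  dir,
  out = paste0(dir, "/reformatted"),
  metadata = meta,
  compression = FALSE
)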