Performs a series of checks to see if data in a given repository can be ingested by a datacommons project.
Usage
check_repository(dir = ".", search_pattern = "\\.csv(?:\\.[gbx]z2?)?$",
exclude = NULL, value = "value", value_name = "measure",
id = "geoid", time = "year", dataset = "region_type",
entity_info = c("region_type", "region_name"), check_values = TRUE,
attempt_repair = FALSE, write_infos = FALSE, verbose = TRUE)
Arguments
- dir
Root directory of the data repository.
- search_pattern
Regular expression used to search for data files.
- exclude
Subdirectories to exclude from the file search.
- value
Name of the column containing variable values.
- value_name
Name of the column containing variable names.
- id
Column name of IDs that uniquely identify entities.
- time
Column name of the variable representing time.
- dataset
Column name used to separate data into sets (such as by region), or a vector of datasets, with ids as names, used to map IDs to datasets.
- entity_info
A vector of variable names to go into making entity_info.json.
- check_values
Logical; if FALSE, will skip the more intensive checks on values. If not specified, these checks are skipped when there are more than 5 million rows in the given dataset, in which case TRUE will force them.
- attempt_repair
Logical; if TRUE, will attempt to fix most warnings in data files. Use with caution, as this will often remove rows (given NAs) and rewrite the file.
- write_infos
Logical; if TRUE, will save standardized and rendered versions of each measure info file.
- verbose
Logical; if FALSE, will not print status messages or check results.
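The default column names and file pattern above imply a tall layout, with one row per entity, time, and measure. The following is a minimal sketch of such a file in base R; the directory, file name, region names, and values are hypothetical, and the commented call at the end assumes the package providing check_repository() is attached.

```r
# Default file pattern, copied from the Usage section
search_pattern <- "\\.csv(?:\\.[gbx]z2?)?$"

# Hypothetical file names, to show which ones the pattern would pick up
files <- c("data.csv", "data.csv.gz", "data.csv.bz2", "data.csv.xz",
           "data.csv.zip", "notes.txt")
matched <- grepl(search_pattern, files, perl = TRUE)

# A tall data frame using the default column names
# (id = "geoid", time = "year", dataset = "region_type",
#  value_name = "measure", value = "value"); all values are made up
tall <- data.frame(
  geoid = c("51059", "51059", "51013", "51013"),
  region_type = "county",
  region_name = c("Fairfax County", "Fairfax County",
                  "Arlington County", "Arlington County"),
  year = c(2020L, 2021L, 2020L, 2021L),
  measure = "category_count",
  value = c(10, 12, 7, 9),
  stringsAsFactors = FALSE
)

# Write a gzip-compressed CSV into a temporary, hypothetical repository;
# row.names = FALSE avoids a blank column name in the header
repo <- file.path(tempdir(), "example_repo")
dir.create(repo, showWarnings = FALSE)
path <- file.path(repo, "example_2021_category.csv.gz")
con <- gzfile(path, "w")
write.csv(tall, con, row.names = FALSE)
close(con)

# Read it back, keeping IDs as characters
back <- read.csv(path, colClasses = c(geoid = "character"))

# Hypothetical call, not run here:
# results <- check_repository(repo)
```

Reading IDs back with an explicit character class keeps leading zeros and avoids numeric conversion of the ID column.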
Value
An invisible list of check results, in the form of paths to files and/or measure names. These may include general entries:
- info (always): All measurement information (measure_info.json) files found.
- data (always): All data files found.
- not_considered: Subset of data files that do not contain the minimal columns (id and value), and so are not checked further.
- summary (always): Summary of results.
Entries relating to issues with measure information (see data_measure_info) files:
- info_malformed: Files that are not in the expected format (a single object with named entries for each measure), but can be converted automatically.
- info_incomplete: Measure entries that are missing some of the required fields.
- info_invalid: Files that could not be read in (probably because they do not contain valid JSON).
- info_refs_names: Files with a _references entry with no names (where it should be a named list).
- info_refs_missing: Files with an entry in their _references entry that is missing one or more required entries (author, year, and/or title).
- info_refs_*: Files with an entry in their _references entry that has an entry (*) that is a list (where they should all be strings).
- info_refs_author_entry: Files with an entry in their _references entry that has an author entry that is missing a family entry.
- info_source_missing: Measures with an entry in their source entry that is missing a required entry (name and/or date_accessed).
- info_source_*: Measures with an entry (*) in their source entry that is a list (where they should all be strings).
- info_citation: Measures with a citation entry that cannot be found in any _references entries across measure info files within the repository.
- info_layer_source: Measures with an entry in their layer entry that is missing a source entry.
- info_layer_source_url: Measures with an entry in their layer entry that has a list source entry that is missing a url entry. source entries can either be a string containing a URL, or a list with a url entry.
- info_layer_filter: Measures with an entry in their layer entry that has a filter entry that is missing required entries (feature, operator, and/or value).
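A few of the reference-related rules above can be approximated in base R. This is only a sketch of the idea, not the function's own implementation: the measure names, reference keys, and field values are hypothetical, the measure info is written directly as the R list that parsing the JSON would produce, and authors are given as plain strings for simplicity.

```r
# Hypothetical measure info, as an R list
info <- list(
  category_count = list(citation = "doe2020"),
  category_rate = list(citation = "smith1999"),
  `_references` = list(
    doe2020 = list(author = "Doe, Jane", year = "2020",
                   title = "An Example Title"),
    incomplete = list(author = "Doe, Jane", title = "No Year Given")
  )
)

refs <- info[["_references"]]

# Rule like info_refs_names: _references should be a named list
refs_named <- !is.null(names(refs)) && all(names(refs) != "")

# Rule like info_refs_missing: each reference needs author, year, and title
required <- c("author", "year", "title")
missing_fields <- lapply(refs, function(ref) setdiff(required, names(ref)))
refs_missing <- names(Filter(length, missing_fields))

# Rule like info_citation: every cited key should exist in _references
measures <- info[!startsWith(names(info), "_")]
cited <- unlist(lapply(measures, function(m) m$citation), use.names = FALSE)
unknown_citations <- setdiff(cited, names(refs))
```

Here the reference named incomplete lacks a year, and category_rate cites a key that no reference defines, so each would be reported under the corresponding rule.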
Entries relating to data files with warnings:
- warn_compressed: Files that do not have compression extensions (.gz, .bz2, or .xz).
- warn_blank_colnames: Files with blank column names (often due to saving files with row names).
- warn_value_nas: Files that have NAs in their value columns; NAs here are redundant with the tall format, and so should be removed.
- warn_double_ints: Variable names that have an int type, but with values that have remainders.
- warn_small_percents: Variable names that have a percent type, but that are all under 1 (which are assumed to be raw proportions).
- warn_small_values: Variable names with many values (over 40%) that are under .00001, and no values under 0 or over 1. These values should be scaled in some way to be displayed reliably.
- warn_value_name_nas: Files that have NAs in their name column.
- warn_entity_info_nas: Files that have NAs in any of their entity_info columns.
- warn_dataset_nas: Files that have NAs in their dataset column.
- warn_time_nas: Files that have NAs in their time column.
- warn_id_nas: Files that have NAs in their id column.
- warn_scientific: Files with IDs that appear to have scientific notation (e.g., 1e+5); likely introduced when the ID column was converted from numbers to characters. IDs should always be saved as characters.
- warn_bg_agg: Files with IDs that appear to be census block group GEOIDs, that do not include their tract parents (i.e., IDs consisting of 12 digits, and there are no IDs consisting of their first 11 digits). These are automatically aggregated by site_build, but they should be manually aggregated.
- warn_tr_agg: Files with IDs that appear to be census tract GEOIDs, that do not include their county parents (i.e., IDs consisting of 11 digits, and there are no IDs consisting of their first 5 digits). These are automatically aggregated by site_build, but they should be manually aggregated.
- warn_missing_info: Measures in files that do not have a corresponding measure_info.json entry. Sometimes this happens because the entry includes a prefix that cannot be derived from the file name (which is the part after a year, such as category from set_geo_2015_category.csv.xz). It is recommended that entries not include prefixes, and that measure names be specific (e.g., category_count rather than count with a category:count entry).
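Two of the ID-related warnings above are easy to reproduce by hand. The sketch below uses hypothetical IDs and shows one way to look for block groups without tract parents and for IDs in scientific notation; the regular expressions and the per-ID granularity are assumptions, not the function's actual code.

```r
# Hypothetical IDs: two block groups with their tract present,
# one block group whose tract is absent, and one damaged ID
ids <- c("510594801001", "510594801002", "51059480100",
         "510131001001", "1e+05")

# Rule like warn_bg_agg: 12-digit IDs whose first 11 digits
# do not appear as an ID of their own
is_block_group <- grepl("^[0-9]{12}$", ids)
tract_parents <- substr(ids[is_block_group], 1, 11)
orphan_block_groups <- ids[is_block_group][!tract_parents %in% ids]

# Rule like warn_scientific: IDs that look like scientific notation
sci_pattern <- "^[0-9.]+e[+-]?[0-9]+$"
scientific_ids <- ids[grepl(sci_pattern, ids, ignore.case = TRUE)]

# One way to avoid the scientific form when IDs were read as numbers:
# format them explicitly instead of relying on implicit conversion
repaired <- formatC(100000, format = "d")
```

The same parent lookup applies to tracts by testing 11-digit IDs against their first 5 digits.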
Entries relating to data files with failures:
- fail_read: Files that could not be read in.
- fail_rows: Files containing no rows.
- fail_time: Files with no time column.
- fail_idlen_county: Files with "county" datasets with corresponding IDs that are not consistently 5 characters long.
- fail_idlen_tract: Files with "tract" datasets with corresponding IDs that are not consistently 11 characters long.
- fail_idlen_block_group: Files with "block group" datasets with corresponding IDs that are not consistently 12 characters long.
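The ID length failures can be anticipated before running the checks. The following sketch compares ID lengths to the lengths stated above for each dataset; the data frame is hypothetical, and treating any deviating ID as a failure is an assumption about what "not consistently" means.

```r
# Expected ID lengths by dataset, as stated in the failure descriptions
expected_length <- c(county = 5L, tract = 11L, "block group" = 12L)

# Hypothetical data: "1001" is a county ID that lost its leading zero
d <- data.frame(
  geoid = c("51059", "51013", "1001", "51059480100"),
  region_type = c("county", "county", "county", "tract"),
  stringsAsFactors = FALSE
)

# Compare each ID's length to the length expected for its dataset
id_length <- nchar(d$geoid)
wrong_length <- unname(id_length != expected_length[d$region_type])

# Datasets and IDs that would be expected to fail
failing_datasets <- unique(d$region_type[wrong_length])
failing_ids <- d$geoid[wrong_length]
```

A dropped leading zero, as in the third row, is a common cause of this kind of failure when IDs pass through a numeric type.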