Check Data Repositories — check_repository • community

Performs a series of checks to see if data in a given repository can be ingested by a datacommons project.

Usage

check_repository(dir = ".", search_pattern = "\\.csv(?:\\.[gbx]z2?)?$",
  exclude = NULL, value = "value", value_name = "measure",
  id = "geoid", time = "year", dataset = "region_type",
  entity_info = c("region_type", "region_name"), check_values = TRUE,
  attempt_repair = FALSE, write_infos = FALSE, verbose = TRUE)
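
To see which file names the default search_pattern picks up, the pattern can be tried against a few names with base R. This is only a sketch: the file names are made up, and perl = TRUE is used here for the non-capturing group, which may not be how the function applies the pattern internally.

```r
# Default pattern from the usage above.
search_pattern <- "\\.csv(?:\\.[gbx]z2?)?$"

# Hypothetical file names, for illustration only.
files <- c(
  "a.csv", "a.csv.gz", "a.csv.bz2", "a.csv.xz",
  "a.csv.zip", "a.tsv", "measure_info.json"
)

# TRUE for plain or gz/bz2/xz-compressed CSV files.
matched <- grepl(search_pattern, files, perl = TRUE)
setNames(matched, files)
```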

Arguments

dir: Root directory of the data repository.
search_pattern: Regular expression used to search for data files.
exclude: Subdirectories to exclude from the file search.
value: Name of the column containing variable values.
value_name: Name of the column containing variable names.
id: Column name of IDs that uniquely identify entities.
time: Column name of the variable representing time.
dataset: Column name used to separate data into sets (such as by region), or a vector of datasets, with ids as names, used to map IDs to datasets.
entity_info: A vector of variable names to go into making entity_info.json.
check_values: Logical; if TRUE, will perform more intensive checks on values. If not specified, these checks are skipped when the given dataset has more than 5 million rows, in which case explicitly setting TRUE will force them.
attempt_repair: Logical; if TRUE, will attempt to fix most warnings in data files. Use with caution, as this will often remove rows (given NAs) and rewrite the file.
write_infos: Logical; if TRUE, will save standardized and rendered versions of each measure info file.
verbose: Logical; if FALSE, will not print status messages or check results.
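
As an illustration of the column arguments, the following sketch writes a small tall-format file that uses the default column names (geoid, region_type, region_name, year, measure, value). The directory, file name, and values are hypothetical, and the final call is left commented out because it assumes the package is installed.

```r
# Hypothetical repository in a temporary directory.
repo <- file.path(tempdir(), "example_repo")
dir.create(repo, showWarnings = FALSE)

# Tall format: one row per entity, time point, and measure.
d <- data.frame(
  geoid = c("51059", "51059", "51013", "51013"), # IDs stored as characters
  region_type = "county",
  region_name = c(
    "Fairfax County", "Fairfax County",
    "Arlington County", "Arlington County"
  ),
  year = c(2014, 2015, 2014, 2015),
  measure = "category_count",
  value = c(10, 12, 7, 9),
  stringsAsFactors = FALSE
)

# Compressed, and without row names (which would add a blank column name).
path <- file.path(repo, "example_category.csv.gz")
write.csv(d, gzfile(path), row.names = FALSE)

# Read back, keeping IDs as characters.
check <- read.csv(gzfile(path), colClasses = c(geoid = "character"))

# Alternative form of the dataset argument: a vector of datasets
# with IDs as names.
dataset_map <- setNames(d$region_type, d$geoid)
dataset_map <- dataset_map[!duplicated(names(dataset_map))]

# With the package installed, the repository could then be checked with:
# community::check_repository(repo, dataset = dataset_map)
```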

Value

An invisible list of check results, in the form of paths to files and/or measure names. These may include general entries:

info (always): All measurement information (measure_info.json) files found.
data (always): All data files found.
not_considered: Subset of data files that do not contain the minimal columns (id and value), and so are not checked further.
summary (always): Summary of results.

or those relating to issues with measure information files (see data_measure_info):

info_malformed: Files that are not in the expected format (a single object with named entries for each measure), but can be converted automatically.
info_incomplete: Measure entries that are missing some of the required fields.
info_invalid: Files that could not be read in (probably because they do not contain valid JSON).
info_refs_names: Files with a _references entry with no names (where it should be a named list).
info_refs_missing: Files with an entry in their _references entry that is missing one or more required entries (author, year, and/or title).
info_refs_*: Files with an entry in their _references entry that has an entry (*) that is a list (where they should all be strings).
info_refs_author_entry: Files with an entry in their _references entry that has an author entry that is missing a family entry.
info_source_missing: Measures with an entry in their source entry that is missing a required entry (name and/or date_accessed).
info_source_*: Measures with an entry (*) in their source entry that is a list (where they should all be strings).
info_citation: Measures with a citation entry that cannot be found in any _references entries across measure info files within the repository.
info_layer_source: Measures with an entry in their layer entry that is missing a source entry.
info_layer_source_url: Measures with an entry in their layer entry that has a list source entry that is missing a url entry. A source entry can be either a string containing a URL, or a list with a url entry.
info_layer_filter: Measures with an entry in their layer entry that has a filter entry that is missing required entries (feature, operator, and/or value).
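
A few of the measure information checks above can be approximated in base R. This is a rough sketch, not the package's implementation: the nested list stands in for a parsed measure_info.json, the measure and reference names are invented, and the field names (citation, source) simply follow the wording above and may differ from the keys used in real files.

```r
# Hypothetical parsed measure info (nested lists in place of JSON).
info <- list(
  category_count = list(
    citation = c("smith2020", "lee2021"),          # lee2021 is not defined
    source = list(list(name = "Example Source"))   # no date_accessed
  ),
  `_references` = list(
    smith2020 = list(
      author = list(list(family = "Smith")),
      year = "2020", title = "An Example"
    ),
    doe2019 = list(                                 # no year, no family name
      author = list(list(given = "Jane")),
      title = "Another Example"
    )
  )
)

refs <- info[["_references"]]
measures <- info[names(info) != "_references"]

# Like info_refs_missing: references lacking required entries.
required_ref <- c("author", "year", "title")
refs_missing <- Filter(
  length, lapply(refs, function(r) setdiff(required_ref, names(r)))
)

# Like info_refs_author_entry: authors given as lists without a family entry.
author_no_family <- names(Filter(function(r) {
  any(vapply(
    r$author, function(a) is.list(a) && !"family" %in% names(a), logical(1)
  ))
}, refs))

# Like info_source_missing: sources lacking required entries.
required_source <- c("name", "date_accessed")
source_missing <- Filter(length, lapply(measures, function(m) {
  unique(unlist(lapply(m$source, function(s) setdiff(required_source, names(s)))))
}))

# Like info_citation: citations with no matching reference.
citation_unknown <- Filter(
  length, lapply(measures, function(m) setdiff(m$citation, names(refs)))
)
```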

or relating to data files with warnings:

warn_compressed: Files that do not have compression extensions (.gz, .bz2, or .xz).
warn_blank_colnames: Files with blank column names (often due to saving files with row names).
warn_value_nas: Files that have NAs in their value columns; NAs here are redundant with the tall format, and so, should be removed.
warn_double_ints: Variable names that have an int type, but with values that have remainders.
warn_small_percents: Variable names that have a percent type, but that are all under 1 (which are assumed to be raw proportions).
warn_small_values: Variable names with many values (over 40%) that are under .00001, and no values under 0 or over 1. These values should be scaled in some way to be displayed reliably.
warn_value_name_nas: Files that have NAs in their name column.
warn_entity_info_nas: Files that have NAs in any of their entity_info columns.
warn_dataset_nas: Files that have NAs in their dataset column.
warn_time_nas: Files that have NAs in their time column.
warn_id_nas: Files that have NAs in their id column.
warn_scientific: Files with IDs that appear to have scientific notation (e.g., 1e+5); likely introduced when the ID column was converted from numbers to characters -- IDs should always be saved as characters.
warn_bg_agg: Files with IDs that appear to be census block group GEOIDs, that do not include their tract parents (i.e., IDs consisting of 12 digits, and there are no IDs consisting of their first 11 digits). These are automatically aggregated by site_build, but they should be manually aggregated.
warn_tr_agg: Files with IDs that appear to be census tract GEOIDs, that do not include their county parents (i.e., IDs consisting of 11 digits, and there are no IDs consisting of their first 5 digits). These are automatically aggregated by site_build, but they should be manually aggregated.
warn_missing_info: Measures in files that do not have a corresponding measure_info.json entry. Sometimes this happens because the entry includes a prefix that cannot be derived from the file name (which is the part after a year, such as category from set_geo_2015_category.csv.xz). It is recommended that entries not include prefixes, and that measure names be specific (e.g., category_count rather than count with a category:count entry).
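
Some of the warning conditions above can be illustrated with base R. The sketch below is an approximation under stated assumptions, not the function's actual logic: the IDs, measure names, and types are invented, the regular expressions are guesses at what "appears to be" means, and the parent check is done per ID rather than per file.

```r
# Hypothetical IDs: three block groups, one tract, one damaged ID.
ids <- c(
  "510594801001", "510594801002", "51059480200", "510594802001", "5.1e+10"
)

# Like warn_scientific: IDs that look like scientific notation.
scientific <- grepl("^[0-9.]+e[+-]?[0-9]+$", ids, ignore.case = TRUE)

# Like warn_bg_agg: 12-digit IDs whose first 11 digits are not also present.
is_bg <- grepl("^[0-9]{12}$", ids)
bg_parents <- substr(ids[is_bg], 1, 11)
bg_without_tract <- ids[is_bg][!bg_parents %in% ids]

# Hypothetical tall data and measure types.
d <- data.frame(
  measure = c("n", "n", "pct", "pct"),
  value = c(10, 12.5, 0.12, 0.4),
  stringsAsFactors = FALSE
)
types <- c(n = "int", pct = "percent")
by_measure <- split(d$value, d$measure)
measure_types <- types[names(by_measure)]

# Like warn_double_ints: int measures with non-whole values.
double_ints <- names(Filter(
  function(v) any(v %% 1 != 0), by_measure[measure_types == "int"]
))

# Like warn_small_percents: percent measures with all values under 1.
small_percents <- names(Filter(
  function(v) all(v < 1), by_measure[measure_types == "percent"]
))
```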

or relating to data files with failures:

fail_read: Files that could not be read in.
fail_rows: Files containing no rows.
fail_time: Files with no time column.
fail_idlen_county: Files with "county" datasets with corresponding IDs that are not consistently 5 characters long.
fail_idlen_tract: Files with "tract" datasets with corresponding IDs that are not consistently 11 characters long.
fail_idlen_block_group: Files with "block group" datasets with corresponding IDs that are not consistently 12 characters long.
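
The ID length failures can be pictured as a comparison between each ID's character count and the length expected for its dataset. This is a minimal sketch with invented data, assuming dataset names of "county", "tract", and "block group" as described above; it is not taken from the package source.

```r
# Hypothetical data; "5101" is one character short for a county,
# as might happen if a leading zero were dropped.
d <- data.frame(
  geoid = c("51059", "51013", "5101", "51059480100", "510594801001"),
  region_type = c("county", "county", "county", "tract", "block group"),
  stringsAsFactors = FALSE
)

# Expected ID lengths by dataset, per the failure descriptions above.
expected_length <- c(county = 5, tract = 11, "block group" = 12)

# Compare each ID's length with the length expected for its dataset.
length_ok <- nchar(d$geoid) == unname(expected_length[d$region_type])

# Datasets with at least one ID of the wrong length.
failing_datasets <- unique(d$region_type[!length_ok])
```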

Examples

if (FALSE) {
# from a data repository
check_repository()

# to automatically fix most warnings
check_repository(attempt_repair = TRUE)
}