Performs a series of checks to see if data in a given repository can be ingested by a datacommons project.
Usage
check_repository(dir = ".", search_pattern = "\\.csv(?:\\.[gbx]z2?)?$",
exclude = NULL, value = "value", value_name = "measure",
id = "geoid", time = "year", dataset = "region_type",
entity_info = c("region_type", "region_name"), check_values = TRUE,
attempt_repair = FALSE, write_infos = FALSE, verbose = TRUE)
Arguments
- dir
Root directory of the data repository.
- search_pattern
Regular expression used to search for data files.
- exclude
Subdirectories to exclude from the file search.
- value
Name of the column containing variable values.
- value_name
Name of the column containing variable names.
- id
Column name of IDs that uniquely identify entities.
- time
Column name of the variable representing time.
- dataset
Column name used to separate data into sets (such as by region), or a vector of datasets, with IDs as names, used to map IDs to datasets.
- entity_info
A vector of variable names to go into making entity_info.json.
- check_values
Logical; if FALSE, will not perform more intensive checks on values. If not specified, these are skipped if there are more than 5 million rows in the given dataset, in which case TRUE will force checks.
- attempt_repair
Logical; if TRUE, will attempt to fix most warnings in data files. Use with caution, as this will often remove rows (given NAs) and rewrite the file.
- write_infos
Logical; if TRUE, will save standardized and rendered versions of each measure info file.
- verbose
Logical; if FALSE, will not print status messages or check results.
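To see which file names the default search_pattern selects, the regular expression can be tried directly in base R. This is a minimal sketch: the file names are made up for illustration, and perl = TRUE is passed only to make the non-capturing group syntax explicit. How check_repository applies the pattern internally is not shown here.

```r
# Default search_pattern, copied from the Usage section.
search_pattern <- "\\.csv(?:\\.[gbx]z2?)?$"

# Hypothetical file names, for illustration only.
files <- c(
  "data/set_2015.csv",
  "data/set_2015.csv.gz",
  "data/set_2015.csv.bz2",
  "data/set_2015.csv.xz",
  "data/set_2015.csv.zip",
  "data/set_2015.tsv",
  "data/measure_info.json"
)

# TRUE where the name ends in .csv, optionally followed by a
# compression extension such as .gz, .bz2, or .xz.
matched <- grepl(search_pattern, files, perl = TRUE)

# The plain and compressed CSV files are kept; the rest are ignored.
files[matched]
```

A custom search_pattern can be checked the same way before running the full set of checks.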
Value
An invisible list of check results, in the form of paths to files and/or measure names. These may include general entries:
- info (always): All measurement information (measure_info.json) files found.
- data (always): All data files found.
- not_considered: Subset of data files that do not contain the minimal columns (id and value), and so are not checked further.
- summary (always): Summary of results.
or those relating to issues with measure information (see data_measure_info) files:
- info_malformed: Files that are not in the expected format (a single object with named entries for each measure), but can be converted automatically.
- info_incomplete: Measure entries that are missing some of the required fields.
- info_invalid: Files that could not be read in (probably because they do not contain valid JSON).
- info_refs_names: Files with a _references entry with no names (where it should be a named list).
- info_refs_missing: Files with an entry in its _references entry that is missing one or more required entries (author, year, and/or title).
- info_refs_*: Files with an entry in its _references entry that has an entry (*) that is a list (where they should all be strings).
- info_refs_author_entry: Files with an entry in its _references entry that has an author entry that is missing a family entry.
- info_source_missing: Measures with an entry in its source entry that is missing a required entry (name and/or date_accessed).
- info_source_*: Measures with an entry (*) in its source entry that is a list (where they should all be strings).
- info_citation: Measures with a citation entry that cannot be found in any _references entries across measure info files within the repository.
- info_layer_source: Measures with an entry in its layer entry that is missing a source entry.
- info_layer_source_url: Measures with an entry in its layer entry that has a list source entry that is missing a url entry. source entries can either be a string containing a URL, or a list with a url entry.
- info_layer_filter: Measures with an entry in its layer entry that has a filter entry that is missing required entries (feature, operator, and/or value).
or relating to data files with warnings:
- warn_compressed: Files that do not have compression extensions (.gz, .bz2, or .xz).
- warn_blank_colnames: Files with blank column names (often due to saving files with row names).
- warn_value_nas: Files that have NAs in their value columns; NAs here are redundant with the tall format, and so should be removed.
- warn_double_ints: Variable names that have an int type, but with values that have remainders.
- warn_small_percents: Variable names that have a percent type, but that are all under 1 (which are assumed to be raw proportions).
- warn_small_values: Variable names with many values (over 40%) that are under .00001, and no values under 0 or over 1. These values should be scaled in some way to be displayed reliably.
- warn_value_name_nas: Files that have NAs in their name column.
- warn_entity_info_nas: Files that have NAs in any of their entity_info columns.
- warn_dataset_nas: Files that have NAs in their dataset column.
- warn_time_nas: Files that have NAs in their time column.
- warn_id_nas: Files that have NAs in their id column.
- warn_scientific: Files with IDs that appear to have scientific notation (e.g., 1e+5); likely introduced when the ID column was converted from numbers to characters. IDs should always be saved as characters.
- warn_bg_agg: Files with IDs that appear to be census block group GEOIDs, that do not include their tract parents (i.e., IDs consisting of 12 digits, and there are no IDs consisting of their first 11 digits). These are automatically aggregated by site_build, but they should be manually aggregated.
- warn_tr_agg: Files with IDs that appear to be census tract GEOIDs, that do not include their county parents (i.e., IDs consisting of 11 digits, and there are no IDs consisting of their first 5 digits). These are automatically aggregated by site_build, but they should be manually aggregated.
- warn_missing_info: Measures in files that do not have a corresponding measure_info.json entry. Sometimes this happens because the entry includes a prefix that cannot be derived from the file name (which is the part after a year, such as category from set_geo_2015_category.csv.xz). It is recommended that entries not include prefixes, and that measure names be specific (e.g., category_count rather than count with a category:count entry).
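The ID-related warnings above can be approximated in base R. The following is a rough sketch under stated assumptions: the IDs are invented, the regular expressions are one plausible way to detect each pattern, and the parent lookup is done per ID, which may differ from how check_repository evaluates a whole file.

```r
# Hypothetical IDs, stored as characters.
ids <- c("51059", "51059480100", "510594801001", "510594802002", "1e+05")

# warn_scientific: IDs that look like scientific notation.
scientific <- grepl("^[0-9.]+e[+-]?[0-9]+$", ids, ignore.case = TRUE)

# warn_bg_agg: 12-digit IDs whose first 11 digits (the tract) are absent.
block_groups <- ids[grepl("^[0-9]{12}$", ids)]
bg_without_tract <- block_groups[!substr(block_groups, 1, 11) %in% ids]

# warn_tr_agg: 11-digit IDs whose first 5 digits (the county) are absent.
tracts <- ids[grepl("^[0-9]{11}$", ids)]
tr_without_county <- tracts[!substr(tracts, 1, 5) %in% ids]

# IDs that would be flagged under this reading of each check.
list(
  scientific = ids[scientific],
  bg_without_tract = bg_without_tract,
  tr_without_county = tr_without_county
)
```

Reading ID columns as characters from the start avoids the scientific notation case entirely.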
or relating to data files with failures:
- fail_read: Files that could not be read in.
- fail_rows: Files containing no rows.
- fail_time: Files with no time column.
- fail_idlen_county: Files with "county" datasets with corresponding IDs that are not consistently 5 characters long.
- fail_idlen_tract: Files with "tract" datasets with corresponding IDs that are not consistently 11 characters long.
- fail_idlen_block_group: Files with "block group" datasets with corresponding IDs that are not consistently 12 characters long.
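The ID length failures can be checked ahead of time with a few lines of base R. This sketch assumes the default column names from the Usage section (geoid and region_type) and uses a made-up data frame. It illustrates the rule as described, not the function's actual implementation.

```r
# Expected ID lengths by dataset, taken from the failure descriptions.
expected_length <- c(county = 5, tract = 11, "block group" = 12)

# Hypothetical data in the tall format, using the default column names.
d <- data.frame(
  geoid = c("51059", "5106", "51059480100", "510594801001"),
  region_type = c("county", "county", "tract", "block group"),
  stringsAsFactors = FALSE
)

# For each dataset, TRUE if every ID has the expected number of characters.
consistent <- vapply(names(expected_length), function(set) {
  set_ids <- d$geoid[d$region_type == set]
  all(nchar(set_ids) == expected_length[[set]])
}, logical(1))

# Datasets whose IDs would fail the length check under this reading.
names(consistent)[!consistent]
```

Short county IDs like the one above often come from leading zeros being dropped when IDs are read as numbers.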