GitHub materials

Standards
Repo organization
- We have 11 data repositories, divided thematically by general data
topics
- sdc.broadband
- sdc.business_climate
- sdc.demographics
- sdc.education
- sdc.environment
- sdc.financial_well_being
- sdc.food
- sdc.health
- sdc.housing
- sdc.transportation
- sdc.public_safety
- Within a repository, data is organized in thematic topic
folders
- Within the most specific topic folder, we have (as necessary)
- code
- all code used to replicate a distribution dataset
- Use the following naming conventions
- ingest: files contain steps to acquire the data, write to /original
- prepare: files contain data manipulation, write to /working or /distribution
- data
- docs
- Supporting documentation for data or methods (e.g. literature or
technical reports)
- Within /code, /data, /docs, we have (as necessary)
- original
- working
- distribution
- /distribution datasets need to be compressed
- /distribution datasets need to follow the column names
- /distribution datasets need a measure_info file
- All /distribution datasets need /distribution code
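The ingest and prepare conventions above can be sketched in R as follows. This is only an illustration under stated assumptions: the topic folder name, the file names, and the choice of xz compression are all hypothetical (the guidance only says that /distribution datasets need to be compressed), and a small hand-made data frame stands in for a real download.

```r
# Hypothetical topic folder; a temporary directory stands in for a repository.
topic_dir <- file.path(tempdir(), "sdc.health", "example_topic")

# Create data/original, data/working and data/distribution as needed.
data_dirs <- file.path(topic_dir, "data", c("original", "working", "distribution"))
for (d in data_dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# ingest step: acquire the data and write it, unmodified, to /original.
# (A hand-made data frame stands in for the acquired data.)
raw <- data.frame(geoid = c("51013", "51059"), value = c(10, 20))
raw_path <- file.path(topic_dir, "data", "original", "example_raw.csv")
write.csv(raw, raw_path, row.names = FALSE)

# prepare step: manipulate the data, then write a compressed file to /distribution.
prepared <- read.csv(raw_path, colClasses = c(geoid = "character"))
prepared$value_doubled <- prepared$value * 2

# xz compression is an assumption; any agreed compression format would do.
out_path <- file.path(topic_dir, "data", "distribution", "example_prepared.csv.xz")
con <- xzfile(out_path, open = "w")
write.csv(prepared, con, row.names = FALSE)
close(con)
```

Base R detects the compression when reading, so `read.csv(out_path)` reads the distribution file back without extra steps.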
Table naming guidance
The naming convention for data tables is as follows:
<coverage_area>_<resolution>_<data_source>_<time_period>_<title>
For example, a table created from ACS 5-year data on health insurance
could look as follows:
va_bg_acs5_2015_adults_health_insured_by_sex
- Abbreviation Standards
- Coverage Area (2 characters for state/province or country, followed by
3 FIPS characters for sub-state/province areas)
- us, United States
- va, Virginia
- va013, Virginia, Arlington County
- Resolutions (2 characters)
- bl, census block
- bg, census block group
- tr, census tract
- nb, neighborhood
- ct, county
- hd, health district
- co, country
- pl, place locations
- pr, person data
- bz, business data
- Data Sources (up to 5 characters; this list will continually grow)
- acs5, American Community Survey 5-Year Data
- lodes, LEHD Origin-Destination Employment Statistics
- pseo, Post-Secondary Employment Outcomes
- qwi, Quarterly Workforce Indicators
- mcig, Mastercard Inclusive Growth Score
- hifld, Homeland Infrastructure Foundation-Level Data
- ookla, Ookla for Good
- webmd, WebMD
- sdad, (items that we have calculated)
- abc, census address block counts
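As a rough sketch, the table naming convention could be assembled and checked with a small helper like the one below. The function name `make_table_name` and its checks are hypothetical, not part of any SDC package; the checks assume that coverage areas are lowercase letters optionally followed by three FIPS digits, and only the resolution codes listed above are accepted.

```r
# Hypothetical helper: build a table name from its five components.
make_table_name <- function(coverage_area, resolution, data_source, time_period, title) {
  # Resolution codes from the abbreviation standards.
  resolutions <- c("bl", "bg", "tr", "nb", "ct", "hd", "co", "pl", "pr", "bz")
  stopifnot(
    # Two letters, optionally followed by three FIPS digits (e.g. va, va013).
    grepl("^[a-z]{2}([0-9]{3})?$", coverage_area),
    resolution %in% resolutions,
    # Data source abbreviations are up to 5 characters.
    nchar(data_source) <= 5
  )
  # Lowercase the title and separate its words with underscores.
  title <- gsub("[^a-z0-9]+", "_", tolower(trimws(title)))
  paste(coverage_area, resolution, data_source, time_period, title, sep = "_")
}

make_table_name("va", "bg", "acs5", 2015, "adults health insured by sex")
## [1] "va_bg_acs5_2015_adults_health_insured_by_sex"
```

An unknown resolution code or a data source longer than 5 characters stops with an error instead of producing a malformed name.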
Measure naming guidance
- Measures should be named to balance human and
machine-readability.
- Generally, the format for measures should be topic_method.
- Underscores should be used to separate words in a measure.
- Measures should be renamed to SDC style guidelines after we have
manipulated them.
- The living list of abbreviations is UNDER CONSTRUCTION.
Writing measure_info
- When writing measure_info, I would suggest starting with a copy of
an exemplar measure_info or a closely related measure_info
(e.g. describing data from the same source).
- You can edit measure_info from RStudio, your preferred code
editor, or the GitHub GUI (really wherever you like)
- It is important to avoid syntactical mistakes in your measure_info
- Use an editor that understands JSON syntax
- Use a JSON linter library (e.g. jsonlite::validate())
- Use an online JSON linter
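A minimal sketch of the linter approach, assuming the jsonlite package is installed. The helper name `check_measure_info` and the example keys (`example_measure`, `short_name`) are made up for illustration and are not the actual measure_info schema.

```r
library(jsonlite)

# Hypothetical helper: TRUE if the file parses as JSON, otherwise FALSE
# with the parser's error message reported.
check_measure_info <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  result <- jsonlite::validate(txt)
  if (!isTRUE(result)) {
    message("Invalid JSON in ", path, ": ", attr(result, "err"))
  }
  isTRUE(result)
}

# Two temporary files stand in for real measure_info files.
good <- tempfile(fileext = ".json")
writeLines('{"example_measure": {"short_name": "Example"}}', good)

bad <- tempfile(fileext = ".json")
writeLines('{"example_measure" {"short_name": "Example"}}', bad)  # missing colon

check_measure_info(good)
check_measure_info(bad)
```

Running a check like this before committing catches syntax errors, but it does not check that the content of the file is correct.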
How to set up environmental variables
In your home directory, create a file named “.Renviron”. Write the
names of your secrets and their values to this file, like this
# Environmental variables can be in quotes or not in quotes #
CENSUS_API_KEY="secret"
db_usr="secret"
db_pwd="secret"
DATAVERSE_KEY="secret"
DATAVERSE_SERVER="secret"
OSRM_SERVER="secret"
BEA_API_KEY="secret"
my_secret="secret"
R reads this file when your session starts. To
retrieve an environmental variable, execute this command in R
Sys.getenv("my_secret")
## [1] "secret"
In action, you might use environmental variables like this
options(osrm.server = Sys.getenv("OSRM_SERVER"))
You can also install your Census API key through tidycensus
library(tidycensus)
census_api_key("111111abc", install = TRUE, overwrite = TRUE)
## Your original .Renviron will be backed up and stored in your R HOME directory if needed.
## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY").
## To use now, restart R or run `readRenviron("~/.Renviron")`
## [1] "111111abc"
# First time, reload your environment so you can use the key without restarting R.
readRenviron("~/.Renviron")
# You can check it with:
Sys.getenv("CENSUS_API_KEY")
## [1] "111111abc"
Environmental variables are not only useful time savers, but they also
prevent us from committing secrets to our public repositories!
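Because Sys.getenv() returns an empty string for a variable that is not set, a script can continue silently with a missing secret. One way to guard against that is a small wrapper like the sketch below; the name `get_secret` is hypothetical and not part of any package.

```r
# Hypothetical wrapper: stop with a clear message if a variable is missing or empty.
get_secret <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop(
      "Environmental variable '", name, "' is not set; add it to ~/.Renviron ",
      "and run readRenviron(\"~/.Renviron\").",
      call. = FALSE
    )
  }
  value
}

# For demonstration only: set the variable for this session instead of via .Renviron.
Sys.setenv(my_secret = "secret")
get_secret("my_secret")
## [1] "secret"
```

With this wrapper, a call such as `options(osrm.server = get_secret("OSRM_SERVER"))` fails early when the variable is missing, rather than passing an empty string along.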