A data commons project is a means of handling data that is distributed across multiple repositories. This article covers the components of these projects, and describes how to build and maintain them.
See the Community Wiki for more technical detail.
The Social Data Commons project is a working example; its build.R script lists the steps used to build and update it.
Starting a Project
The init_datacommons function can be used to start a data commons project.
library(community)
dir <- tempdir()
init_datacommons(dir)
init_datacommons also builds a monitor site, so you can rerun it after updating a project to refresh the monitor site (like the Social Data Commons Monitor).
After initialization, creating or updating a data commons project involves 3 primary steps:
- Specify repositories, and clone/pull them in with datacommons_refresh.
- Index those repositories with datacommons_map_files.
- Specify a view, and run it with datacommons_view.
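Once a project is set up, these three steps can be strung together into a single update script. This is a sketch using the functions above; it assumes that repositories and a view named "view1" have already been specified:

```r
library(community)

dir <- "datacommons" # the project directory

# 1. clone or pull the listed repositories
datacommons_refresh(dir)

# 2. index data and measure info files across the repositories
datacommons_map_files(dir)

# 3. rebuild the unified datasets for the previously specified view
datacommons_view(commons = dir, name = "view1", output = paste0(dir, "/view1"))
```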
Repositories
The most basic component of a data commons project is the repository list, which points to the data repositories that make up the data commons.
You can specify these repositories through init_datacommons, or add them to either the commons.json or scripts/repos.txt list.
The listed repositories are then kept in the created repos subdirectory, as managed by the datacommons_refresh function.
For this example, we can add a single repository:
init_datacommons(dir, repos = "uva-bi-sdad/sdc.education")
datacommons_refresh(dir, verbose = FALSE)
Files
The most basic requirement of a data repository is that it contain a data file in a tall format, as initially handled by data_reformat_sdad. These should at least have columns containing IDs (default is geoid) and values (default is value).
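For instance, a tall-format table might look like the following (hypothetical values; the geoid and value column names follow the defaults mentioned above, while the measure and year columns are illustrative):

```r
# a minimal tall-format table: one row per ID-time combination for a
# given measure, with ID ("geoid") and value ("value") columns
tall <- data.frame(
  geoid = c("51059", "51059", "51013"),
  measure = "schools_2year_all",
  year = c(2019, 2020, 2019),
  value = c(0.40, 0.32, 0.51)
)
tall
```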
The datacommons_map_files function searches for files in each repository to make an index, which is used to create the views.
datacommons_map_files also looks for measure info files (such as those created by data_measure_info), which are collected and saved to cache/measure_info.json for use by the monitor site.
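After indexing, the collected measure info can be inspected from the cache. A sketch, assuming the jsonlite package is available and datacommons_map_files has been run:

```r
# read the measure info collected to the project's cache
measure_info <- jsonlite::read_json(paste0(dir, "/cache/measure_info.json"))

# variable names with collected metadata
names(measure_info)
```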
Once files are indexed, you can use the datacommons_find_variables function to search for variables within them:
datacommons_find_variables("2year", dir)[[1]][, c(1, 6)]
#> variable similarity
#> 11 schools_2year_all 0.5000000
#> 9 schools_2year_min_drivetime 0.4472136
#> 13 schools_2year_with_biomedical_program 0.4082483
Views
So far, the data commons project has collected and indexed existing repositories, but the goal of these projects is to build unified datasets from these repositories. This is done with views.
Views are essentially lists of variables and IDs, which form a subset of the broader data commons.
Views can be specified and run with the datacommons_view function.
The product of a view is a set of unified data files (containing the requested variables and IDs, if found) and a collected measure info file containing information about any of the included variables that were found. These need to be directed to an output directory:
output <- paste0(dir, "/view1")
datacommons_view(
commons = dir, name = "view1", output = output,
variables = "schools_2year_all", ids = "51059",
verbose = FALSE
)
Now, we can see what was added to the output directory:
list.files(output)
#> [1] "coverage.csv" "dataset.csv.xz" "manifest.json"
#> [4] "measure_info.json"
cat(readLines(paste0(output, "/manifest.json")), sep = "\n")
#> {
#> "uva-bi-sdad/sdc.education": {
#> "files": {
#> "Postsecondary/data/distribution/nces.csv.xz": {
#> "size": 11384852,
#> "sha": "bab886dcb05a3a7a681f8385fac86e19b247ce43",
#> "md5": "081bb20b8e6c7ba94f693715642102f4"
#> }
#> }
#> }
#> }
read.csv(paste0(output, "/dataset.csv.xz"))
#> ID time schools_2year_all
#> 1 51059 2013 0.4832574
#> 2 51059 2014 0.5187128
#> 3 51059 2015 0.5159749
#> 4 51059 2016 0.4286843
#> 5 51059 2017 0.4092479
#> 6 51059 2018 0.4107240
#> 7 51059 2019 0.4049785
#> 8 51059 2020 0.3249166
#> 9 51059 2021 0.2870052
Usually the output would be a data site project, with a build script that documents the new datasets and rebuilds the site (such as community_example/build.R).
Such a script can be set as the view's run_after, so that after the datasets are rebuilt, the site is also rebuilt.
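As a sketch of that setup, assuming run_after can be passed to datacommons_view and that build.R is a hypothetical script path within the output site project, the earlier view call might become:

```r
datacommons_view(
  commons = dir, name = "view1", output = output,
  variables = "schools_2year_all", ids = "51059",
  # hypothetical: script to run after the view's datasets are rebuilt
  run_after = "build.R"
)
```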