A data commons project is a means of handling data that is distributed across multiple repositories. This article covers the components of these projects, and describes how to build and maintain them.
See the Community Wiki for more technical detail.
The Social Data Commons project is a working example; its build.R script lists the steps used to build and update it.
Starting a Project
The init_datacommons function can be used to start a data commons project.
library(community)
dir <- tempdir()
init_datacommons(dir)
init_datacommons also builds a monitor site, so you can rerun it after updating a project to refresh the monitor site (like the Social Data Commons Monitor).
After initialization, creating or updating a data commons project involves 3 primary steps:
- Specify repositories, and clone/pull them in with datacommons_refresh.
- Index those repositories with datacommons_map_files.
- Specify a view, and run it with datacommons_view.
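Once a project is set up, these three steps can be strung together into a single update script. This is a sketch using the functions above; it assumes that repositories and a view named "view1" have already been specified:

```r
library(community)

dir <- "datacommons" # the project directory

# 1. clone or pull the listed repositories
datacommons_refresh(dir)

# 2. index data and measure info files across the repositories
datacommons_map_files(dir)

# 3. rebuild the unified datasets for the previously specified view
datacommons_view(commons = dir, name = "view1", output = paste0(dir, "/view1"))
```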
Repositories
The most basic component of a data commons project is the repository list, which points to the data repositories that make up the data commons.
You can specify these repositories through init_datacommons, or add them to either the commons.json or scripts/repos.txt list.
The listed repositories are then kept in the created repos subdirectory, as managed by the datacommons_refresh function.
For this example, we can add a single repository:
init_datacommons(dir, repos = "uva-bi-sdad/sdc.education")
datacommons_refresh(dir, verbose = FALSE)
Files
The most basic requirement of a data repository is that it contain a data file in a tall format, as initially handled by data_reformat_sdad. These should at least have columns containing IDs (default is geoid) and values (default is value).
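For instance, a tall-format table might look like the following (hypothetical values; the geoid and value column names follow the defaults mentioned above, while the measure and year columns are illustrative):

```r
# a minimal tall-format table: one row per ID-time combination for a
# given measure, with ID ("geoid") and value ("value") columns
tall <- data.frame(
  geoid = c("51059", "51059", "51013"),
  measure = "schools_2year_all",
  year = c(2019, 2020, 2019),
  value = c(0.40, 0.32, 0.51)
)
tall
```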
The datacommons_map_files function searches for files in each repository to make an index, which is used to create the views.
datacommons_map_files also looks for measure info files (such as those created by data_measure_info), which are collected and saved to cache/measure_info.json for use by the monitor site.
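After indexing, the collected measure info can be inspected from the cache. A sketch, assuming the jsonlite package is available and datacommons_map_files has been run:

```r
# read the measure info collected to the project's cache
measure_info <- jsonlite::read_json(paste0(dir, "/cache/measure_info.json"))

# variable names with collected metadata
names(measure_info)
```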
Once files are indexed, you can use the datacommons_find_variables function to search for variables within them:
datacommons_find_variables("2year", dir)[[1]][, c(1, 6)]
#> variable similarity
#> 11 schools_2year_all 0.5000000
#> 9 schools_2year_min_drivetime 0.4472136
#> 13 schools_2year_with_biomedical_program 0.4082483
Views
So far, the data commons project has collected and indexed existing repositories, but the goal of these projects is to build unified datasets from these repositories. This is done with views.
Views are essentially lists of variables and IDs, which form a subset of the broader data commons.
Views can be specified and run with the datacommons_view function.
The product of a view is a set of unified data files (containing the requested variables and IDs, if found) and a collected measure info file containing information about any of the included variables that were found. These need to be directed to an output directory:
output <- paste0(dir, "/view1")
datacommons_view(
commons = dir, name = "view1", output = output,
variables = "schools_2year_all", ids = "51059",
verbose = FALSE
)
Now, we can see what was added to the output directory:
list.files(output)
#> [1] "coverage.csv" "dataset.csv.xz" "manifest.json"
#> [4] "measure_info.json"
cat(readLines(paste0(output, "/manifest.json")), sep = "\n")
#> {
#> "uva-bi-sdad/sdc.education": {
#> "files": {
#> "Postsecondary/data/distribution/nces.csv.xz": {
#> "size": 11384852,
#> "sha": "bab886dcb05a3a7a681f8385fac86e19b247ce43",
#> "md5": "081bb20b8e6c7ba94f693715642102f4"
#> }
#> }
#> }
#> }
read.csv(paste0(output, "/dataset.csv.xz"))
#> ID time schools_2year_all
#> 1 51059 2013 0.4832574
#> 2 51059 2014 0.5187128
#> 3 51059 2015 0.5159749
#> 4 51059 2016 0.4286843
#> 5 51059 2017 0.4092479
#> 6 51059 2018 0.4107240
#> 7 51059 2019 0.4049785
#> 8 51059 2020 0.3249166
#> 9 51059 2021 0.2870052
Usually the output would be a data site project, with a build script that documents the new datasets and rebuilds the site (such as community_example/build.R).
Such a script can be set as the view's run_after, so that after the datasets are rebuilt, the site is also rebuilt.
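As a sketch of that setup, assuming run_after can be passed to datacommons_view and that build.R is a hypothetical script path within the output site project, the earlier view call might become:

```r
datacommons_view(
  commons = dir, name = "view1", output = output,
  variables = "schools_2year_all", ids = "51059",
  # hypothetical: script to run after the view's datasets are rebuilt
  run_after = "build.R"
)
```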