
Getting Started

GHOST.jl can be installed from the repository through:

using Pkg
Pkg.add(url = "")

To load the package, use:

using GHOST


GitHub can recognize certain licenses for repositories, per its documentation. From these machine-detectable licenses, we select the ones approved by the Open Source Initiative, based on the SPDX Working Group's SPDX License List data.


SPDX stands for Software Package Data Exchange, an open standard for communicating software bill of materials information (including components, licenses, copyrights, and security references).

The following licenses are machine-detectable OSI-approved licenses.

SPDX ID        Name
0BSD           BSD Zero Clause License
AFL-3.0        Academic Free License v3.0
AGPL-3.0       GNU Affero General Public License v3.0
Apache-2.0     Apache License 2.0
Artistic-2.0   Artistic License 2.0
BSD-2-Clause   BSD 2-Clause "Simplified" License
BSD-3-Clause   BSD 3-Clause "New" or "Revised" License
BSL-1.0        Boost Software License 1.0
CECILL-2.1     CeCILL Free Software License Agreement v2.1
ECL-2.0        Educational Community License v2.0
EPL-1.0        Eclipse Public License 1.0
EPL-2.0        Eclipse Public License 2.0
EUPL-1.1       European Union Public License 1.1
EUPL-1.2       European Union Public License 1.2
GPL-2.0        GNU General Public License v2.0 only
GPL-3.0        GNU General Public License v3.0 only
ISC            ISC License
LGPL-2.1       GNU Lesser General Public License v2.1 only
LGPL-3.0       GNU Lesser General Public License v3.0 only
LPPL-1.3c      LaTeX Project Public License v1.3c
MIT            MIT License
MPL-2.0        Mozilla Public License 2.0
MS-PL          Microsoft Public License
MS-RL          Microsoft Reciprocal License
NCSA           University of Illinois/NCSA Open Source License
OFL-1.1        SIL Open Font License 1.1
OSL-3.0        Open Software License 3.0
PostgreSQL     PostgreSQL License
UPL-1.0        Universal Permissive License v1.0
Unlicense      The Unlicense
Zlib           zlib License

Collection Strategy


We are interested in finding every repository on GitHub that fits the following criteria:

  • Is public
  • Has a machine detectable OSI-approved license
  • Is not a fork
  • Is not a mirror
  • Is not archived

The oldest repository by creation time on GitHub dates back to 2007-10-29T14:37:16+00.

In the GitHub search syntax, these criteria are expressed as

  search(query: "is:public fork:false mirror:false archived:false license:$spdx created:2007-10-29T14:37:16+00..2020-01-01T00:00:00+00", type: REPOSITORY) {

where $spdx is a license keyword (e.g., mit).
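As an illustration, the search string can be assembled from a license keyword and a creation-time interval. The sketch below is not taken from the package: `format_ts` and `search_query` are hypothetical helper names, and timestamps are assumed to be in UTC.

```julia
using Dates

# Format a timestamp the way it appears in the search string,
# e.g. 2007-10-29T14:37:16+00 (UTC is assumed).
format_ts(t::DateTime) = Dates.format(t, dateformat"yyyy-mm-dd\THH:MM:SS") * "+00"

# Hypothetical helper: build the search string for one license keyword
# and one creation-time interval.
function search_query(spdx::AbstractString, from::DateTime, to::DateTime)
    string("is:public fork:false mirror:false archived:false ",
           "license:", lowercase(spdx), " ",
           "created:", format_ts(from), "..", format_ts(to))
end

q = search_query("MIT", DateTime(2007, 10, 29, 14, 37, 16), DateTime(2020, 1, 1))
println(q)
```

Keeping the interval as two `DateTime` values, rather than a preformatted string, makes it easier to split or merge intervals later.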


GitHub returns at most 1,000 results per search connection. If a query matches more than 1,000 repositories, only the first 1,000 are accessible. To collect every repository of interest, we partition the queries by:

  • license (e.g., license:mit)
  • creation time (e.g., created:2010-01-01T00:00:00+00..2010-02-01T00:00:00+00)

We shrink the intervals until each result count is 1,000 or fewer.

created:2010-01-01T00:00:00+00..2010-01-02T00:00:00+00 1,850

created:2010-01-01T00:00:00+00..2010-01-01T12:00:00+00 998
created:2010-01-01T12:00:00+00..2010-01-02T00:00:00+00 952
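One way to shrink intervals is recursive halving, sketched below. This is a minimal illustration under stated assumptions, not necessarily how the package does it: `split_interval` is a hypothetical name, and `count_results` is a hypothetical callback standing in for a GitHub search request that returns the result count for an interval. The demo replaces it with a count over synthetic timestamps.

```julia
using Dates

# Recursively halve [from, to) until every piece has at most `cap` results.
# `count_results(from, to)` is a hypothetical callback; in practice it would
# issue a search request and return the reported repository count.
function split_interval(count_results, from::DateTime, to::DateTime; cap::Int = 1_000)
    n = count_results(from, to)
    span_ms = Dates.value(to - from)  # interval length in milliseconds
    # Stop when the interval fits, or when it is too short to halve further.
    if n <= cap || span_ms <= 1_000
        return [(from = from, to = to, count = n)]
    end
    mid = from + Millisecond(span_ms ÷ 2)
    vcat(split_interval(count_results, from, mid; cap = cap),
         split_interval(count_results, mid, to; cap = cap))
end

# Synthetic data for the demo: 1,850 creation times, one every 40 seconds.
created = [DateTime(2010, 1, 1) + Second(40 * i) for i in 0:1849]
count_fake(from, to) = count(t -> from <= t < to, created)

pieces = split_interval(count_fake, DateTime(2010, 1, 1), DateTime(2010, 1, 2))
for p in pieces
    println(p.from, " .. ", p.to, "  ", p.count)
end
```

Halving tends to produce more intervals than strictly necessary, which is why a pruning pass afterwards is useful.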

We then prune the intervals to obtain the smallest number of valid intervals that cover the full time period.

For example,

spdx  created                                        count  asof                 done
zlib  ["2007-10-29 00:00:00","2014-09-04 00:00:00")  999    2020-05-14 18:48:03  FALSE
zlib  ["2014-09-04 00:00:00","2016-12-09 00:00:00")  998    2020-05-14 18:48:03  FALSE
zlib  ["2016-12-09 00:00:00","2018-12-21 00:00:00")  998    2020-05-14 18:48:03  FALSE
zlib  ["2018-12-21 00:00:00","2020-01-01 00:00:00")  562    2020-05-14 18:48:03  FALSE

This is table gh_2007_2021.queries.

The queries table is used to store the queries and track their status. Once all the records for a query have been stored in the repos table, its done status becomes TRUE.
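The pruning step mentioned earlier can be sketched as a greedy left-to-right merge of adjacent intervals. This is an illustration only: `prune_intervals` is a hypothetical name, and the sketch assumes that the count of two adjacent, non-overlapping intervals is the sum of their counts, which may drift if repositories change between queries.

```julia
using Dates

# Greedily merge adjacent intervals while the combined count stays within `cap`.
# Each element is a named tuple (from, to, count). The input is assumed to be
# sorted by time, with every single interval already at or below `cap`.
function prune_intervals(intervals; cap::Int = 1_000)
    merged = eltype(intervals)[]
    for iv in intervals
        if !isempty(merged) && merged[end].to == iv.from &&
           merged[end].count + iv.count <= cap
            prev = merged[end]
            merged[end] = (from = prev.from, to = iv.to, count = prev.count + iv.count)
        else
            push!(merged, iv)
        end
    end
    merged
end

# Made-up counts for the demo.
fine = [
    (from = Date(2010, 1, 1), to = Date(2010, 1, 2), count = 300),
    (from = Date(2010, 1, 2), to = Date(2010, 1, 3), count = 400),
    (from = Date(2010, 1, 3), to = Date(2010, 1, 4), count = 350),
    (from = Date(2010, 1, 4), to = Date(2010, 1, 5), count = 200),
]
coarse = prune_intervals(fine)
foreach(println, coarse)
```

For a sorted, contiguous sequence, extending each merged interval as far as the cap allows yields the fewest groups, so a single pass is enough.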

Repository base branch

The commit data for a Git repository depends on the base branch.

The repos table contains the GitHub repository global node ID and the global node ID for the base branch of the repository.

id                                branch                            asof                    status
MDEwOlJlcG9zaXRvcnkyMzgzNTcxMTI=  MDM6UmVmMjM4MzU3MTEyOm1hc3Rlcg==  2020-05-14 19:49:10+00  Ready

This is table gh_2007_2021.repos.
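The two node IDs in the example row are Base64 strings. Decoding them, purely for illustration, shows that the branch ID refers to a specific ref of the repository. Node IDs are best treated as opaque identifiers, so nothing should depend on this layout.

```julia
using Base64

repo_id   = "MDEwOlJlcG9zaXRvcnkyMzgzNTcxMTI="
branch_id = "MDM6UmVmMjM4MzU3MTEyOm1hc3Rlcg=="

decoded_repo   = String(base64decode(repo_id))
decoded_branch = String(base64decode(branch_id))

println(decoded_repo)    # 010:Repository238357112
println(decoded_branch)  # 03:Ref238357112:master
```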

The various status values include:

  • Ready: We will commence collecting commit data from it.
  • Unavailable: Repository is not accessible (e.g., deleted or made private (NOT_FOUND), DMCA takedown).
  • Error: Something unexpected happened, such as someone force-pushing and changing the history during the scrape process.


For each repository, we query the commit data based on the time coverage of the data collection.

The commits table contains this data and is used to update the status of the repository commit data in the repos table.


Commit authors may appear with a NULL login, which indicates that the commit email does not match any email associated with a GitHub account.


Commit timestamps sometimes have implausible dates that predate version control (usually the Unix epoch). For those commits, we replace the value with the earliest commit date in that repository that seems valid.
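The replacement rule can be sketched as follows. The text does not say how a date is judged valid, so this sketch assumes a simple cutoff: `repair_timestamps` is a hypothetical name, and the default `cutoff` is an illustrative threshold, not the package's actual rule.

```julia
using Dates

# Replace implausible commit timestamps with the earliest plausible one in the
# same repository. A timestamp counts as plausible when it is on or after
# `cutoff`, which is an assumption made for this sketch.
function repair_timestamps(committed_at::Vector{DateTime};
                           cutoff::DateTime = DateTime(2005, 4, 7))
    valid = filter(t -> t >= cutoff, committed_at)
    # Without any plausible timestamp there is nothing to anchor on.
    isempty(valid) && return copy(committed_at)
    earliest = minimum(valid)
    [t >= cutoff ? t : earliest for t in committed_at]
end

raw = [DateTime(1970, 1, 1), DateTime(2015, 3, 2, 10), DateTime(2015, 3, 1, 9)]
fixed = repair_timestamps(raw)
foreach(println, fixed)
```

A fixed cutoff is the simplest choice. It would misclassify legitimately old history imported from earlier version control systems, so the threshold should be chosen with the data in mind.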

Relational Database

Table     Column           Description
licenses  spdx             Software Package Data Exchange License ID
licenses  name             Name of the license
queries   spdx             The SPDX license ID
queries   created          The time interval for the query
queries   count            How many results for the query
queries   asof             When was GitHub queried about the information?
queries   done             Have the repositories been collected?
repos     id               Repository ID
repos     spdx             SPDX license ID
repos     slug             Location of the repository
repos     createdat        When was the repository created on GitHub?
repos     description      Description of the repository
repos     primarylanguage  Primary language of the repository
repos     branch           Base branch ID
repos     commits          Number of commits in the branch until the end of the observation period
repos     asof             When was GitHub queried?
repos     status           Status of collection effort
commits   branch           Base branch ID (foreign key)
commits   id               Commit ID
commits   oid              Git Object ID (SHA1)
commits   committedat      When was it committed?
commits   authors_email    The email in the Git commit.
commits   authors_name     The name in the Git commit.
commits   authors_id       GitHub Author
commits   additions        The number of additions in this commit.
commits   deletions        The number of deletions in this commit.
commits   asof             When was GitHub queried?

How To Use

To use this package, refer to the API section in the documentation, the examples in the test suite, and the CI and pipeline scripts.


Additional documentation is forthcoming once the API is stabilized.


Using the package requires:

  • GitHub Personal Access Tokens with public access
  • Julia v1 (current release v1.5.3)
  • A PostgreSQL database connection (tested with v11-v13)