- 18F’s Melody Kramer interviewed me about this project and I went deeper into some of my takeaways.
- Open data strategist Rayna Stamboliyska translated parts of the blog into French for her DataColada newsletter.
Governments have been flocking to GitHub.
Their reasons are plenty: the promise of “private sector” tools, a conviction that publicly-funded code should be public, the company’s evangelism (and stickers), etc. Whatever the case, GitHub now hosts at least 600 government organizations, with over 9,000 public repositories between them.
I had a notion of the global ecosystem this activity has sprouted—the players and their interactions—but wanted to back it up with data.
So, using GitHub’s API, I compiled a database of government GitHub organizations, their repositories, members, and contributors and dove in.
Overall, reuse within the government GitHub “ecosystem” is uneven and limited.
Nearly all popular repositories (inside and outside of government) were created by US and UK national organizations. The bulk are standards or frameworks. Modular products, like data.gov.uk’s CKAN extensions, also seem relatively reusable.
Collaborative work and reuse are most concentrated within the large US and UK national-level networks. This may point to the importance of scale, “real world” interactions (e.g. talks, meet-ups, employees switching between organizations), and the alignment of policy priorities, timelines, licensing, and tech stacks.
14% of repositories have no further activity after being posted to GitHub. 46% remain under development a year after they were created.
I didn’t find a license file for half of the repositories. At least 13% use the MIT license. At least 8% use some version of the GPL. License choice varies geographically.
Government GitHub organizations are bringing some new users to the platform along with them. But 45% of the users predate the government organizations they contribute to.
Estonia has the most government repositories per capita at 72.8 per million residents (hover over and click to zoom in on the map up top).
Notes and Caveats
- The code to generate the database is on GitHub.
- The list of government GitHub organizations is certainly incomplete—add more if you know any!
- Unless otherwise specified, ‘repositories’ refers to repositories that are not themselves forks.
- All repository, membership, and contribution statistics include only public information. I assume most repositories are kept public. Organizational membership, however, is private by default.
- The GitHub API only checks for license files at the root of the repository. Some may be embedded in the README or placed in sub-folders.
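As a rough illustration of the caveats above, here is a minimal sketch of collecting one organization's public, non-fork repositories. It assumes the GitHub REST API's `GET /orgs/{org}/repos` endpoint and the `fork` field on each repository object. The function names are hypothetical and this is not the code from the repository mentioned above. Unauthenticated requests are rate-limited, so a real run would want a token.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_repo_page(org, page, per_page=100):
    """Fetch one page of an organization's public repositories (network call)."""
    url = f"{API_ROOT}/orgs/{org}/repos?type=public&per_page={per_page}&page={page}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_source_repos(org, fetch_page=fetch_repo_page):
    """Collect an organization's public repositories, skipping forks.

    `fetch_page` is injectable so the paging logic can be exercised offline.
    """
    repos, page = [], 1
    while True:
        batch = fetch_page(org, page)
        if not batch:  # an empty page means we have read everything
            break
        repos.extend(repo for repo in batch if not repo.get("fork"))
        page += 1
    return repos
```

Calling `list_source_repos("some-org")` (a placeholder name) would hit the live API. Passing a stub as `fetch_page` keeps the example testable without network access.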
The Government GitHub Network
Government development teams interact and influence one another through various channels. These include:
- sharing members or repository contributors;
- forking, starring, or cloning each other’s repositories (and possibly submitting pull requests);
- contributing issues;
- reading each other’s code, READMEs, and blogs; and
- talking to each other in person or on other platforms.
Because the GitHub API gives us access to membership, contribution, and forking relationships, I’ll focus on those.
This graph shows 277 nodes (or organizations) connected to one another with 1270 edges. If two organizations share an edge, they have at least one contributor in common.
Overall, 941 unique users account for 2751 individual contributor connections between the organizations (see user statistics).
323 organizations are “loners,” with no contributors shared with other organizations. These don’t appear in the graph.
The thicker edges represent more contributors in common. Nodes are sized by the number of other nodes they’re connected to (their “degree”). If you view the full size graph, each node links to its GitHub site.
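The construction described above can be sketched in a few lines. This is only an illustration with made-up organization and user names, assuming the contributor data has already been reduced to a mapping from organization to a set of logins. An edge's weight is the number of shared contributors and a node's degree is the number of organizations it is linked to.

```python
from itertools import combinations


def shared_contributor_graph(contributors_by_org):
    """Return (edges, degree) for organizations that share contributors.

    edges maps (org_a, org_b) to the number of contributors in common;
    degree maps each connected org to its number of neighbours.
    """
    edges = {}
    for org_a, org_b in combinations(sorted(contributors_by_org), 2):
        shared = contributors_by_org[org_a] & contributors_by_org[org_b]
        if shared:
            edges[(org_a, org_b)] = len(shared)

    degree = {}
    for org_a, org_b in edges:
        degree[org_a] = degree.get(org_a, 0) + 1
        degree[org_b] = degree.get(org_b, 0) + 1
    return edges, degree


# Hypothetical data: org-d shares nobody, so it is a "loner" and drops out.
example = {
    "org-a": {"alice", "bob"},
    "org-b": {"bob", "carol"},
    "org-c": {"alice", "bob"},
    "org-d": {"dave"},
}
edges, degree = shared_contributor_graph(example)
loners = set(example) - set(degree)
```

The same function works for the membership graph if the sets hold public members instead of contributors.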
Two main clusters stick out. Up top in bright green are UK national organizations. On the right in purple are the US federal organizations.
The City of Philadelphia has the most prominent non-national organization. You might also notice the DC Government in the mix, as well as the USGS/NOAA, Brazil, Canada, and Australia sub-networks.
It’s likely some of the connections aren’t real, but are the artifacts of cloned repositories. These may retain the original repository’s contribution history in addition to any new commits, but the API won’t mark them as forks.
This turns the contributor graph into a mish-mash of genuine collaboration of one organization’s members with another’s, non-members who contribute code to multiple organizations, and reuse.
We can do the same thing with shared members.
This graph of 96 nodes (organizations) is tied together using 148 edges. Behind these edges are 327 individual member connections from 137 unique users (see user statistics).
Again, we find the highly inter-linked US federal sub-network. There are also the smaller UK and Canada membership sub-networks.
GitHub makes membership private by default, so there are likely more member connections in reality. But, in general, it makes sense that this graph would be much sparser than the contribution graph.
Why is the US federal sub-network comparatively dense? Many of its organizations have large memberships, so there are more potential connections to be made. Some (like 18F) have a policy requiring that staff make their membership public. A number operate as consultancies to other federal agencies. And, from what I’ve seen, many “techies” enter the US federal government through one of these agencies and then hop around.
This graph shows organization forking connections. Arrows connect the forked repository’s source → to its destination.
121 organizations have forked other organizations’ repositories 223 times (138 edges). For comparison, 1858 forks come from non-government organizations or users and 9032 repositories are not forks.
Most of the graph’s forks go unreciprocated; only 8 government organizations (mainly US federal) have forked one another’s repositories (↔).
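To make the distinction between fork counts, edges, and reciprocated edges concrete, here is a small sketch. It assumes each fork has already been reduced to a (source organization, destination organization) pair. The names are invented for illustration.

```python
from collections import Counter


def fork_edges(forks):
    """Count forks per directed (source_org, destination_org) pair."""
    return Counter((src, dst) for src, dst in forks if src != dst)


def reciprocal_pairs(edges):
    """Return the unordered organization pairs that have forked each other."""
    return {
        tuple(sorted((src, dst)))
        for src, dst in edges
        if (dst, src) in edges
    }


# Hypothetical forks: org-b forked org-a twice, org-a forked org-b once,
# and org-a also forked something from org-c.
forks = [
    ("org-a", "org-b"),
    ("org-a", "org-b"),
    ("org-b", "org-a"),
    ("org-c", "org-a"),
]
edges = fork_edges(forks)
total_forks = sum(edges.values())  # number of fork events
mutual = reciprocal_pairs(edges)   # pairs linked in both directions
```

Here four fork events collapse into three directed edges, one pair of which is reciprocated, which is the same relationship as "223 times (138 edges)" in the text.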
data.gov.uk is in an unusual position: it is disconnected from the rest of the UK, yet it is the source for organizations in other countries (Romania, Estonia, Paraguay, Canada, and the US).
Open data projects, like data.gov.uk’s CKAN contributions, seem poised for cross-border reuse. This may be because use cases are quite standardized and modular extensions address any differences.
Organization, Repository & User Statistics
There are certainly many other questions to look into. Check out this repository to generate your own database (or reuse the one there).
The list I used included 600 government GitHub organizations. You can see their geographic distribution on the map up top.
Of note, not only does the UK's Government Digital Service make the top 10, but so does its GitHub organization for retired repositories! Neat appearances from the Norwegian Meteorological Institute, the Gemeinsamer Bibliotheksverbund in Germany, and the National Library of Finland. This includes only repositories that are not themselves forks.
|2||Government Digital Service||345|
|3||Ministry of Justice||344|
|5||Consumer Financial Protection Bureau||169|
|1||Government Digital Service||577|
|3||Ministry of Justice||212|
|4||National Geospatial-Intelligence Agency||189|
|5||U.S. General Services Administration||187|
Member counts are likely quite a bit higher in reality. Because GitHub defaults to private membership, in my experience many users don't switch their preferences to be public. I bet that this isn't intentional in most cases. GitHub would do well to make this option more obvious when you first join an organization.
|2||U.S. Geological Survey||131|
|3||Web Experience Toolkit (WET)||99|
|4||Consumer Financial Protection Bureau||63|
|5||Presidential Innovation Fellows||61|
No. times forked by others
|2||Government Digital Service||1515|
|3||Consumer Financial Protection Bureau||1418|
|4||The White House||1327|
|5||US Army Research Laboratory||1106|
No. repositories that are forks
|1||Government Digital Service||129|
|3||U.S. General Services Administration||75|
|4||Ministry of Justice||67|
The data showed 11113 public repositories—2081 forked repositories and 9032 non-forked.
Listed below are each license type, their frequency, the regions that most frequently use the license, and the percentage of each region's repositories with the license. I only include regions with at least 10 repositories and 2 organizations. Note: The GitHub API only looks for a license file at the root of the repository, so licenses embedded in the README or stored in a subfolder are marked as having no license. Italy's not really that bad!
No license file found
Italy 90.0% • Ecuador 88.89% • The Netherlands 84.43% • Bolivia 83.33% • Germany 74.38%

MIT License | 1192 repositories
U.K. 34.68% • International 30.77% • Belgium 28.36% • Lithuania 26.47% • Switzerland 20.27%

Other | 1179 repositories
Chile 27.08% • International 23.08% • Canada 21.33% • U.S. 18.91% • France 17.69%

Apache License 2.0 | 507 repositories
Australia 11.68% • Canada 10.14% • Japan 8.74% • U.K. 8.5% • U.S. 4.99%

Creative Commons Zero v1.0 Universal | 289 repositories
Japan 22.33% • Estonia 5.88% • U.S. 5.84% • Sweden 3.74% • Germany 1.65%

GNU General Public License v2.0 | 262 repositories
Norway 26.26% • Brazil 22.29% • Chile 20.83% • Venezuela 13.64% • Colombia 12.5%

GNU General Public License v3.0 | 249 repositories
Mexico 71.74% • Venezuela 43.18% • Belgium 20.9% • Colombia 12.5% • Finland 10.23%

GNU Affero General Public License v3.0 | 198 repositories
Sweden 33.64% • Switzerland 20.27% • France 16.15% • Finland 15.34% • Ecuador 11.11%

The Unlicense | 168 repositories
U.S. 3.88% • Argentina 1.19% • Sweden 0.93% • The Netherlands 0.82% • Brazil 0.6%

BSD 3-clause "New" or "Revised" License | 116 repositories
New Zealand 28.32% • Australia 10.22% • Estonia 7.35% • Chile 4.17% • The Netherlands 3.28%

GNU Lesser General Public License v3.0 | 65 repositories
Venezuela 6.82% • Colombia 6.25% • Singapore 5.26% • Belgium 1.49% • U.S. 1.11%

BSD 2-clause "Simplified" License | 27 repositories
Belgium 4.48% • Canada 1.05% • Japan 0.97% • Sweden 0.93% • The Netherlands 0.82%

GNU Lesser General Public License v2.1 | 22 repositories
Estonia 17.65% • Brazil 1.2% • Norway 0.56% • France 0.38% • U.S. 0.07%

Mozilla Public License 2.0 | 9 repositories
Chile 2.08% • Australia 0.36% • Canada 0.35% • U.K. 0.25% • U.S. 0.02%

ISC License | 7 repositories
Canada 0.35% • U.S. 0.14%

Eclipse Public License 1.0 | 4 repositories
Germany 0.83% • U.K. 0.05% • U.S. 0.05%

Artistic License 2.0 | 2 repositories
Finland 0.57% • U.S. 0.02%

Do What The F*ck You Want To Public License | 1 repository
France 0.38%

Microsoft Public License | 1 repository
U.K. 0.05%

Open Software License 3.0 | 1 repository
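The regional percentages above could be computed along these lines. This is a sketch under assumptions: each repository record is taken to carry a `region`, an `org`, and a `license` name (`None` when no license file was detected at the repository root), and those field names are hypothetical. The thresholds mirror the "at least 10 repositories and 2 organizations" rule.

```python
from collections import Counter, defaultdict


def license_share_by_region(repos, min_repos=10, min_orgs=2):
    """Percentage of each region's repositories per license.

    Regions below the repository or organization thresholds are left out.
    """
    by_region = defaultdict(list)
    for repo in repos:
        by_region[repo["region"]].append(repo)

    shares = {}
    for region, items in by_region.items():
        orgs = {repo["org"] for repo in items}
        if len(items) < min_repos or len(orgs) < min_orgs:
            continue
        counts = Counter(repo["license"] or "No license" for repo in items)
        shares[region] = {
            name: round(100 * count / len(items), 2)
            for name, count in counts.items()
        }
    return shares
```

Because a missing root-level license file is counted as "No license", a region whose teams keep licenses in the README or a subfolder will look worse here than it really is.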
Repository count by date created
Things have certainly picked up.
|1||nysenate/Newsclips||June 10, 2009|
|2||sfcta/androidtracks||November 30, 2009|
|3||HHS/pillbox_docs||December 12, 2009|
|4||erasme/check_cciss||March 03, 2010|
|5||usnationalarchives/fr2||April 12, 2010|
Of the 10 oldest, 4 have had commits in the past month.
Many government GitHub repositories have a fairly short development lifespan. 14% show no further development after they were first pushed to GitHub. However, 46% were under development a year in, 18% two years in, and 6% three years in. 11 repositories have Git histories earlier than their initial push date. Seems like a lot of development happens locally and without version control, then the repository is dumped on GitHub.
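One way to arrive at lifespan shares like these is to compare each repository's `created_at` and `pushed_at` timestamps from the GitHub API. The sketch below is an assumption about the method, not a record of it: it treats a repository as having "no further development" when its last push falls within a day of creation, and as "under development N years in" when the gap is at least N × 365 days.

```python
from datetime import datetime, timedelta

# Timestamp layout used by created_at / pushed_at in GitHub API responses.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def development_span(repo):
    """Time between a repository's creation and its most recent push."""
    created = datetime.strptime(repo["created_at"], TIMESTAMP_FORMAT)
    pushed = datetime.strptime(repo["pushed_at"], TIMESTAMP_FORMAT)
    return pushed - created


def lifespan_shares(repos, years=(1, 2, 3), idle_window=timedelta(days=1)):
    """Return (share idle after creation, {years: share still active})."""
    spans = [development_span(repo) for repo in repos]
    total = len(spans)
    idle = sum(span <= idle_window for span in spans) / total
    active = {
        y: sum(span >= timedelta(days=365 * y) for span in spans) / total
        for y in years
    }
    return idle, active


# Hypothetical repositories, not taken from the dataset.
sample = [
    {"created_at": "2013-01-01T00:00:00Z", "pushed_at": "2013-01-01T05:00:00Z"},
    {"created_at": "2013-01-01T00:00:00Z", "pushed_at": "2014-06-01T00:00:00Z"},
    {"created_at": "2012-01-01T00:00:00Z", "pushed_at": "2014-06-01T00:00:00Z"},
    {"created_at": "2013-01-01T00:00:00Z", "pushed_at": "2013-03-01T00:00:00Z"},
]
idle, active = lifespan_shares(sample)
```

Note that a young repository cannot yet have a multi-year span, so a stricter version would restrict each share to repositories old enough to qualify.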
Understandably, people are most interested in tools, standards, and frameworks. US Army's net-sec tool beating 18F was a surprise, though.
No. forks (gov → gov)
No. forks (gov → all)
No. forks (non-gov → gov)
Government forks little from within and little from without. The government.github.com repository is likely from organizations adding themselves to the "official" list.
The data showed 7887 public contributors to government repositories (that are not forks) and 1512 public members of government organizations.
User join date vs. organization creation date
Are government GitHub organizations spurring new users to join the platform? We don't know the date users joined an organization, so we'll have to proxy. The histograms below show the difference (in days) between 1) when a user joined GitHub and 2) the earliest creation date of all the government organizations to which they contribute or belong. There's clearly a bump in the center. That spike shows users who joined GitHub at the same time as the government organization with which they're involved. Some of those users who came to the platform later may also have joined a government organization the same day as their arrival. In each case, about half of users predate their organization. This says to me that public, social coding is generally new at an individual level—not just at an institutional one.
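The proxy described above boils down to one subtraction per user. A minimal sketch, assuming each user record holds a GitHub join date and the list of government organizations they contribute or belong to (the data layout and names are hypothetical):

```python
from datetime import date


def join_offsets(users, org_created):
    """Days between a user's join date and their earliest org's creation.

    Negative values mean the user was on GitHub before the organization.
    """
    offsets = {}
    for login, info in users.items():
        earliest = min(org_created[org] for org in info["orgs"])
        offsets[login] = (info["joined"] - earliest).days
    return offsets


def share_predating(offsets):
    """Fraction of users who joined GitHub before their earliest organization."""
    return sum(1 for days in offsets.values() if days < 0) / len(offsets)


# Hypothetical organizations and users.
org_created = {"org-a": date(2012, 5, 1), "org-b": date(2013, 1, 1)}
users = {
    "alice": {"joined": date(2011, 5, 1), "orgs": ["org-a", "org-b"]},
    "bob": {"joined": date(2013, 1, 1), "orgs": ["org-b"]},
    "carol": {"joined": date(2012, 5, 11), "orgs": ["org-a"]},
    "dave": {"joined": date(2010, 1, 1), "orgs": ["org-b"]},
}
offsets = join_offsets(users, org_created)
predating = share_predating(offsets)
```

A histogram of the values in `offsets` gives the kind of plot discussed here, with the spike at zero coming from users like `bob` who arrive on the same day as their organization.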
No. repositories contributed to
The top contributors (in terms of government repository count, at least) are all members of UK's Government Digital Service, 18F, or the Consumer Financial Protection Bureau. A majority of users are one-off contributors (see percentiles).
- 25th percentile: 1
- 50th percentile: 1
- 75th percentile: 3
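Percentiles like these can be read off with the standard library. The counts below are invented to show the shape of such a distribution, and the choice of the `inclusive` method is an assumption, since the post does not say how its percentiles were computed.

```python
from statistics import quantiles

# Hypothetical per-user counts of government repositories contributed to.
repo_counts = [1, 1, 1, 1, 1, 2, 3, 4, 9]

# n=4 splits the data into quartiles: 25th, 50th, and 75th percentiles.
p25, p50, p75 = quantiles(repo_counts, n=4, method="inclusive")
```

With a long-tailed distribution like this one, the median stays at 1 even though a few users contribute to many repositories.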
No. organizations a member of
All but one (mattbostock of the UK) of the top ten members are part of the web of U.S. federal programs—the USDS-18F-PIF-CFPB-White House connection. It's quite rare for a user to be a member of more than one organization.
- 25th percentile: 1
- 50th percentile: 1
- 75th percentile: 1
Made it down this far?
Huzzah! You deserve a cookie.