A new study of the global open-source platform, GitHub, offers key lessons on blockchain development—how projects have grown, what's likely to come next, and the implications for financial services firms.
We cannot predict the exact trajectory and impact of blockchain technology. But neither should we ignore its early stages of development, including its successes and failures. Tracking this young technology’s development could help it reach its potential to serve us.
Figuring out how foundational technologies, such as the Internet or mobile, morph and grow is not easy. New technologies often attract a wide variety of developers, including many freelancers from around the world. The sheer number of developers, the types of problems they are trying to solve, and the geographic spread can make it difficult to anticipate where any new technology is headed.
But perhaps the fundamental difference with blockchain development is that it has largely been orchestrated in the open-source environment. Bitcoin, the original blockchain system, was birthed in open source.
Accordingly, in an effort to better understand the development of blockchain and its ecosystem, we have conducted an extensive data analysis of blockchain projects in an open-source environment. Our study appears to be the first empirical attempt to understand the evolution of blockchain using metadata available on GitHub, a global software collaboration platform.
We chose GitHub because it is the largest known software collaboration platform in the world, with more than 68 million projects and 24 million participants (figure 1).1 GitHub also appears to host the most important projects for the blockchain community.2 The activity on GitHub provides a unique opportunity to identify who is behind blockchain’s development, what type of programming is powering it, where the talent resides, how networks and communities of projects and developers are organized, and what risk factors exist for investing resources into repositories.
Financial services firms seem to be leading the way in blockchain applicability; they currently have the most commercial use cases of blockchain in the marketplace. Our findings could help firms improve their ability to identify successful projects and opportunities based on how the blockchain ecosystem is evolving.
Unless otherwise cited, all data and statistics that we report on blockchain activity on GitHub in this paper are the result of our analysis of the GH Torrent project and the GitHub API (see the sidebar, “Study methodology”).
To conduct our study on GitHub, we utilized data collected by the GH Torrent project, a research initiative led by Georgios Gousios of Delft University of Technology, which monitors the GitHub public event timeline where all of the projects’ activities and modifications are recorded.3 After this initial process was completed, the information was stored in a relational database. The database compiled by GH Torrent comprises more than 4.7 billion rows of information. To identify relevant projects, we queried the GitHub API for keywords associated with blockchain projects. We used both data sources to identify and build our universe of blockchain projects. While our data is not exhaustive, it represents a very large sample of all the blockchain activity registered on GitHub.
To identify the most relevant projects in the blockchain space, we used the fields that GitHub provides through its API, such as project creation date, type of author who created the project, number of copies (forks), and number of watchers. For the analysis, we developed our own set of metrics using both GH Torrent and GitHub API data.
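As a rough illustration of the keyword query and field selection described above, the following Python sketch builds a request against GitHub's public repository search endpoint and keeps only the fields named in the text. The helper names (`build_search_url`, `extract_fields`, `search_repositories`) and the sample record are hypothetical, and this is not the study's actual pipeline. One caveat: in GitHub's REST responses, `watchers_count` has historically mirrored the star count, so it is used here only as a stand-in for "watchers."

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(keyword, page=1, per_page=100):
    """Return a repository-search URL for one keyword."""
    query = urllib.parse.urlencode(
        {"q": keyword, "page": page, "per_page": per_page}
    )
    return f"{SEARCH_URL}?{query}"


def extract_fields(repo):
    """Keep only the fields mentioned in the text from one repository record."""
    return {
        "name": repo["full_name"],
        "created_at": repo["created_at"],
        "author_type": repo["owner"]["type"],  # "User" or "Organization"
        "forks": repo["forks_count"],
        "watchers": repo["watchers_count"],  # see caveat in the lead-in
    }


def search_repositories(keyword):
    """Fetch one page of results. Makes a network call; unauthenticated
    requests are rate-limited by GitHub."""
    with urllib.request.urlopen(build_search_url(keyword)) as response:
        payload = json.load(response)
    return [extract_fields(item) for item in payload["items"]]


# Hypothetical record shaped like one item of a search response.
sample = {
    "full_name": "example-org/example-chain",
    "created_at": "2015-06-01T12:00:00Z",
    "owner": {"login": "example-org", "type": "Organization"},
    "forks_count": 42,
    "watchers_count": 310,
}
record = extract_fields(sample)
```

Calling `search_repositories("blockchain")` would return one page of such records; a fuller crawl would need pagination and authentication, which are omitted here.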
While sharing software code in a public forum can be traced to the 1950s, open-source platforms have only become hubs for software development within the last 30 years (figure 2).4 The Internet was a great enabler for scaling: Earlier, open-source activity had been mainly the realm of academia, but the Internet made it accessible to aficionados and experts of all stripes, amateur and professional, individual and commercial.5 That said, there was a dip in relevance of software development on open source for a period when commercial entities that secured licenses and patents placed high fences around software code.6 However, disruptive innovation has fostered an ever-increasing sharing economy, which has shifted important software development back to open-source platforms.7
Open source could be the ideal petri dish for attracting a critical mass of blockchain coding efforts, talent, and overlapping objectives that accelerate an ecosystem with common standards.8 It may also mitigate the cost that firms would pay to dedicate resources to a still largely experimental technology. Developing proofs of concept in an “intranet” blockchain learning platform does not seem as efficient as learning how to develop business solutions on an “Internet” blockchain.9 At the current evolutionary stage of blockchain technology, it is likely to be in a developer’s best interest to develop, or watch the development of, blockchain solutions on open source. Blockchain appears to have a better chance to more quickly achieve rigorous protocols and standardization through open-source collaboration, which could make developing permissioned blockchains easier and better.
Our primary unit of analysis on GitHub is the repository. A repository contains the relevant code and files behind projects, where the actual protocol and implementation of programs reside. Throughout this report we use the terms “repository” and “project” interchangeably. We will also be looking at the two types of project authors: users (individuals with no known affiliation to an institution) and organizations (accounts associated with financial services firms, start-ups, research centers, or software foundations).10
In the next three sections, we look at repositories—their authors, their chances of survival, and how they fit into communities and networks of communities; which programming languages are prevalent and why; and where talent resides. (See our interactive dashboard, where you can explore the GitHub ecosystem’s repositories, coding, and geographies in detail.)
The core code supporting Bitcoin was published in April 2009. Since then, the number of projects on GitHub related to blockchain has grown significantly, averaging more than 8,600 new projects a year. In 2016 alone, there were almost 27,000 new projects (figure 3).
The growth in the number of projects has been matched by the rapid growth of content produced to develop these blockchain technologies. Please see figure 4, and Repositories by year in our interactive dashboard.
In analyzing blockchain repositories and their content, we noticed that increasingly more organizations appear to be getting involved. In 2010, organizations developed less than 1 percent of all projects. By 2017, their blockchain projects accounted for 11 percent (organizations currently account for 7 percent of total—not just blockchain—software development on GitHub). And recent data about the rate at which commercial organizations can find success with blockchain initiatives through open source seems promising; some high-profile, large commercial entities are already doing so. (Please refer to Repositories by organization in our interactive dashboard.)
Of particular significance, some projects that organizations have developed have resulted in new platforms (such as Ethereum, Ripple, Corda, and Quorum) which some developers now use to build applications. Organization-owned projects also tend to be updated more frequently than those developed by users, and are reportedly five times more likely to be copied, implying that the blockchain community has deemed them most relevant.
When a project is copied, all of the content becomes available to the account that copied the project, thus working as a de facto knowledge-transfer mechanism. This process is commonly referred to as a citation network (see Appendix for network definitions),11 where projects that are most often copied occupy a more central role in the network of projects, which we refer to as project centrality. Under this rubric, some of the most central projects have been developed and maintained by organizations: Bitcoin core, the C++ and Go implementation of Ethereum, Python clients for Ethereum, and the Bitcoin Improvements Proposal. To interactively explore a depiction of the various networks in GitHub, please see Network visualization in our interactive dashboard.
When exploring the aforementioned interactive graph, keep in mind that the initial projects of Ethereum and Bitcoin are maintained by organizations (foundations), and that a vast number of blockchain projects and applications on GitHub are actually built on top of these two projects. In short, organization-led projects are the backbone code for thousands of other projects. Out of the 20 most central projects in the blockchain space measured by popularity, citation, and collaboration (see Appendix for network definitions), 18 were created and maintained by organizations (see table 1).
Organizational commitment in open source appears to be dominating the core development of blockchain because it is most likely more demanding and purposeful than individual participation in development. Once resources are put into place by an organization, there is generally more incentive to ensure that the project is successful. Given that organization participants are tied to one another beyond any particular project, there is often greater accountability to one another, which also drives ongoing development.
A community on open source is a group of developers with shared interests that develops and improves existing content. We identified 772 different blockchain communities on GitHub. Each community is typically defined by patterns of collaboration between its projects that can give rise to new applications. For example, the Ethereum platform was initially developed by two central figures in the Bitcoin project; their project has since evolved into the largest blockchain community, measured by active projects, on GitHub (see the sidebar, “Understanding the Ethereum ecosystem”).
In the blockchain space, communities of projects comprise at least 25 projects, with some large clusters including hundreds of projects (see Communities of repositories in our interactive dashboard).
By studying communities, we can explore how projects that have developed a specialization enable the creation of new applications. For example, we found that tools for enabling crowdsales and initial coin offerings (ICOs) are often connected to projects in large blockchain subcommunities: projects developing content for smart contracts, escrow accounts, and the core code behind Ethereum in the Go language. Not surprisingly, this seems to align with the tendency of many ICOs to be offered on top of the Ethereum blockchain (for more information on ICOs, please read “Initial coin offering: A new paradigm,” Deloitte12). Ethereum allows developers and start-ups to issue their own currency, including in the form of an ICO, on the Ethereum blockchain through smart contracts, which can substantially lower the barrier to entry for tokens and cryptocurrencies.13
An interesting example of how seemingly disparate communities connect is the Monero cryptocurrency, created in 2014. Monero has concertedly different attributes than Bitcoin regarding its level of privacy (no reuse of addresses allowed), scalability (no blocksize limit), and security (more forced decentralization).14 Still, the community that contains Monero and related projects has a strong connection to the community that contains the main Bitcoin repository.
It could be especially important for blockchain developers to pay close attention to communities. Our analysis reveals that many of the projects enriching the ecosystem, those that specialize in particular industries or types of applications in the blockchain space, have strong community affiliations.
The stark reality of open-source projects is that most are abandoned or do not achieve meaningful scale. Unfortunately, blockchain is not immune to this reality. Our analysis found that only 8 percent of projects are active, which we define as being updated at least once in the last six months. Here, organizations are a positive differentiator; while 7 percent of projects developed by users are active, 15 percent of projects developed by organizations are active.
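The activity definition above (updated at least once in the last six months) can be made concrete with a short Python sketch. The 183-day window, the function names, and the sample dates are assumptions chosen for illustration; the study's exact cutoff and data are not given here.

```python
from datetime import datetime, timedelta

# Roughly six months; the exact cutoff used in the study is not stated.
ACTIVE_WINDOW = timedelta(days=183)


def is_active(last_updated, as_of):
    """A project counts as active if it was updated within the window."""
    return as_of - last_updated <= ACTIVE_WINDOW


def active_share(projects, as_of):
    """Return {author_type: fraction of that type's projects that are active}.

    `projects` is an iterable of (author_type, last_updated) pairs.
    """
    totals, active = {}, {}
    for author_type, last_updated in projects:
        totals[author_type] = totals.get(author_type, 0) + 1
        if is_active(last_updated, as_of):
            active[author_type] = active.get(author_type, 0) + 1
    return {kind: active.get(kind, 0) / count for kind, count in totals.items()}


# Made-up sample data, not figures from the study.
as_of = datetime(2017, 7, 1)
projects = [
    ("user", datetime(2017, 5, 1)),
    ("user", datetime(2016, 1, 15)),
    ("user", datetime(2015, 9, 30)),
    ("user", datetime(2014, 2, 10)),
    ("organization", datetime(2017, 6, 15)),
    ("organization", datetime(2015, 3, 1)),
]
shares = active_share(projects, as_of)
```

Splitting the share by author type mirrors the user-versus-organization comparison in the text.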
The mortality rate of projects is often an essential factor in understanding project centrality and the emergence of protocols and best practices. For commercial purposes, since few projects will likely survive, understanding the factors that contribute to a project’s mortality may be an essential skill for firms wishing to piggyback on successful code, emulate successful projects, or build in-house capabilities.15 Note that about 90 percent of projects developed on GitHub become idle, and the average life span of a project is about one year, with the highest mortality rate occurring within the first six months. Our analysis revealed 11 variables associated with a project becoming inactive. Of these variables, organizations should consider the following three in particular:
For potential developers, the question that often surfaces first is, “How should we start?” To help answer that question, it can be important to find out what’s under the hood of existing projects. Although it is not the most popular language when measured by number of blockchain repositories, we found that C++ was used most in the ecosystem’s central repositories. This was not surprising, given that C++ has been used for some time in the financial services industry to develop applications that demand efficient memory management, speed, and reliability. For the heavy lifting behind cryptocurrency projects (Bitcoin included), C++ is still the favored language. And for the most central repositories on GitHub, C++ accounts for almost one-half of all the content (see Most popular languages in our interactive dashboard).
However, we also discovered that Go, the programming language developed by Google in 2009, appears to be gaining traction. It is now the second-largest language used for blockchain-related projects. Go seems to have rapidly evolved from a fringe language to one of the centerpieces of the GitHub blockchain ecosystem. Just two years ago, in 2015, less than 2 percent of all of the content of projects in the blockchain space was developed in Go. Programmers attribute the meteoric rise of Go to its simplicity and ability to scale.16 And while financial services firms do reportedly rely on the memory management, speed, and reliability of C++, scalability also appears to be an exceptionally high priority for financial services firms that interact and transact with multiple and diffuse stakeholders. It also seems telling that Ethereum and Hyperledger projects, which both involve integrating other technologies into blockchain to expand its use beyond cryptocurrencies, reportedly favor Go.
Given that an important issue that financial institutions face is hiring the necessary talent to develop, implement, or maintain new technologies, we thought it would be helpful to know where top blockchain talent contributing to GitHub resides. Most GitHub project owners—developers who start repositories—live in North America or Europe, with San Francisco having the largest concentration. Interestingly, the next two most popular cities to find project owners are two traditional financial services hubs: London and New York. (See figure 5 and Repositories by geography in our interactive dashboard.)
We found that projects coming from San Francisco are diverse; they include solutions for exchanges, wallets for cryptocurrencies, interfaces for different blockchains (for example, Ripple, Hyperledger, and Ethereum), and payment tools for cryptocurrencies, to name a few. The ecosystem in London is also varied, but features more projects connected to the Ethereum community, which would also imply more projects around accompanying technologies, such as digital identities, smart contracts, and open APIs. Participants in New York appear to be specializing in projects that are geared toward traditional financial services. It is also worth noting the high level of activity in China, specifically, Shanghai and Beijing. In both of these cities, most of the projects pertain to cryptocurrencies and cryptocurrency exchanges, with an emphasis on scalability.
The Ethereum project is a decentralized platform for blockchain applications based on smart contracts. In 2013, Vitalik Buterin, an active Bitcoin developer, proposed the idea that became Ethereum; his goal was to help create applications that use blockchain technology beyond the cryptocurrency sphere. From its inception, Ethereum was designed to be a blockchain protocol that could enable any application to be written on top of it.18 The Ethereum platform is composed of a virtual machine that executes smart contracts, known as the Ethereum Virtual Machine (EVM); a language used to write the instructions of the smart contracts (Solidity); and a token (Ether, or ETH) used to pay for transaction fees and computational services on the Ethereum network.19 (For an explanation of a smart contract, see “Getting smart about smart contracts,” Deloitte.com.) The fact that Ethereum is not centered on cryptocurrency could partly explain why this project became one of the cornerstones of the evolving broader blockchain ecosystem.
The Ethereum project was originally hosted, developed, and distributed through GitHub. To put the growth of Ethereum into perspective, in 2013, there were only three projects on GitHub related to Ethereum; in 2015, that number was 1,439; by mid-2017, it grew to 9,970. These projects have given rise to a wide variety of applications, such as identity management, crowdfunding and investment platforms, payments and remittances, new cryptocurrencies, and decentralized lending platforms.
Given the variety of financial and business applications developed from the Ethereum protocol, financial institutions, along with firms in other industries, have agreed to foster the development of applications and innovations around Ethereum.20 As interest in Ethereum continues to grow, the development of additional open-source solutions, coupled with the support of Fortune 500 firms, could result in a boom for Ethereum-based applications.
The data scientists of Deloitte developed and honed a methodology to analyze and organize GitHub data in order to better understand the evolution of a young, possibly transformative technology and its ecosystem. Our overall objective is to provide insights that help financial institutions make better, more informed decisions and avoid pitfalls.
From this effort, we have learned that financial services firms are involved in blockchain development on GitHub. There are essentially two types of participants on GitHub: the committer and the watcher. The committer makes commits, or contributions to code, while the watcher follows the development of a project without making code contributions. So far, few financial services firms’ employees are committers to projects on the firms’ behalf. There are, however, a few high-profile financial services firms that not only commit, but actually have their own projects running under their brand with significant commits.
Nonetheless, financial services firms seem predominantly engaged as watchers of projects on GitHub. It is difficult to get an actual number on these watchers, as they can be watching under handles or private email addresses. Regardless, our analysis can equip both financial services committers and watchers with perhaps a unique opportunity to gain access to a large and nuanced view of the blockchain ecosystem. Leveraging our analytical methodology, firms can now target multiple projects for possible involvement or learning, identify talent using a variety of metrics, and see how changes in protocol and trends point toward standardization and interoperability, all of which can deepen their understanding of blockchain’s evolution.
Specifically, our analysis may enable financial institutions, and other firms, to:
It is our hope that these findings can arm the financial services industry with the data it may need not only to better identify successful projects and opportunities based on how the blockchain ecosystem is evolving, but also to become influential participants themselves in how blockchain evolves.
We use several metrics commonly used in the field of network analysis, such as number of connections (degree), centrality (PageRank score), and clustering (community detection).21 We defined three types of network connections in our analysis:
Collaboration measures the contribution of projects to each other. To build this network, we identified the repositories that received collaborations from each other in our universe of blockchain projects rather than the entire GitHub set.
Citation measures the use of a project’s content by another project. Projects that are highly cited tend to have a high centrality score (see the next section). To build the network, we identified users who copied a repository, joining his or her projects with the project that he or she copied.
Followers measures the popularity of a project within other projects. To build the network, we identified users who followed a given repository in our universe and joined these users’ projects.
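To make the network construction more tangible, here is a minimal Python sketch of one possible reading of the citation network: each project owned by a user who copied a repository is joined, by a directed edge, to the repository that was copied. The function names, the data layout, and the sample records are hypothetical, and the study may have built its edges differently.

```python
def citation_edges(forks, owned_projects, universe):
    """Build directed edges (user's project -> copied repository).

    forks: iterable of (user, copied_repo) pairs
    owned_projects: dict mapping user -> list of that user's repositories
    universe: set of repositories considered blockchain projects
    """
    edges = set()
    for user, copied_repo in forks:
        if copied_repo not in universe:
            continue  # ignore copies of projects outside the universe
        for project in owned_projects.get(user, ()):
            if project in universe and project != copied_repo:
                edges.add((project, copied_repo))
    return edges


def in_degree(edges):
    """Count how many projects point at each repository."""
    counts = {}
    for _, target in edges:
        counts[target] = counts.get(target, 0) + 1
    return counts


# Made-up sample data for illustration only.
universe = {"org/core", "alice/wallet", "bob/explorer", "bob/miner"}
forks = [
    ("alice", "org/core"),
    ("bob", "org/core"),
    ("bob", "alice/wallet"),
    ("carol", "other/game"),
]
owned_projects = {
    "alice": ["alice/wallet"],
    "bob": ["bob/explorer", "bob/miner"],
    "carol": ["carol/site"],
}
edges = citation_edges(forks, owned_projects, universe)
degrees = in_degree(edges)
```

The followers network could be sketched the same way by swapping the list of copies for a list of (user, followed repository) pairs.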
To identify the most central repositories in our network, we used the PageRank (PR) score. Developed by Google, PageRank is a common metric to identify centrality in a network and has been widely used in several fields. We calculated the PR score for each of our three networks.22 Once we obtained the PR score, we ranked the projects based on the value of that metric. We repeated the process for the three networks and created a composite score defined accordingly:
Centrality Score values that are closer to 1 indicate a more central role in the network.
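The PageRank step described above can be illustrated with a short, dependency-free Python sketch that uses power iteration on a directed edge list. The damping factor of 0.85, the iteration count, and the sample edges are assumptions made for illustration. The composite score formula is not reproduced in this text, so the sketch stops at ranking projects within a single network.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Compute PageRank by power iteration on a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Made-up citation edges: (citing project, cited project).
edges = [
    ("wallet", "core"),
    ("explorer", "core"),
    ("miner", "core"),
    ("wallet", "explorer"),
]
scores = pagerank(edges)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Running the same function on the collaboration, citation, and followers edge lists would give the three per-network rankings that the text combines into its composite score.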
To identify communities in our network, we implemented a commonly used community detection algorithm for large graphs known as the fast greedy community detection algorithm.23 The algorithm iterates through the different network connections, adding projects to a community until a locally optimal value is reached. The algorithm repeats this process until there are no further improvements. We applied the algorithm to the collaboration network.
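The following Python sketch shows the general idea behind greedy, modularity-based community detection, which is what "fast greedy" usually refers to: start with every project in its own community and repeatedly merge the pair of connected communities that most increases modularity, stopping when no merge helps. This is a naive version written for readability, without the efficient data structures of the published algorithm, and it is not the study's implementation. The function names and the sample graph are hypothetical.

```python
def pair_key(a, b):
    """Order a pair so each undirected link has one canonical key."""
    return (a, b) if a < b else (b, a)


def greedy_modularity_communities(edges):
    """Merge communities greedily while modularity keeps improving."""
    links = {pair_key(a, b) for a, b in edges if a != b}
    m = len(links)
    members, degree, between = {}, {}, {}
    for a, b in links:
        between[(a, b)] = 1  # number of links between two communities
        for node in (a, b):
            members.setdefault(node, {node})
            degree[node] = degree.get(node, 0) + 1

    while True:
        best_gain, best_pair = 0.0, None
        for (a, b), count in sorted(between.items()):
            # Modularity change from merging communities a and b.
            gain = count / m - degree[a] * degree[b] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break  # no merge improves modularity

        keep, absorb = best_pair
        members[keep] |= members.pop(absorb)
        degree[keep] += degree.pop(absorb)
        merged = {}
        for (a, b), count in between.items():
            if (a, b) == best_pair:
                continue  # these links are now internal to one community
            a = keep if a == absorb else a
            b = keep if b == absorb else b
            key = pair_key(a, b)
            merged[key] = merged.get(key, 0) + count
        between = merged

    return sorted(sorted(group) for group in members.values())


# Two tightly knit groups of made-up projects joined by a single link.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
communities = greedy_modularity_communities(edges)
```

On this toy graph the two triangles end up as separate communities, because merging them across the single bridging link would lower modularity.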
To identify factors associated with a project becoming inactive, we implemented two classification models: a logistic regression and a random forest.24 The logistic regression was used to identify meaningful variables, while the random forest was used to identify which projects become idle.
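As a small illustration of how a logistic regression can point to meaningful variables, the Python sketch below fits one by batch gradient descent and reads the sign of each coefficient. The two features (whether the owner is an organization, and a scaled fork count), the labels, and the training settings are all hypothetical; the study's 11 variables are not listed in this text, and the random forest is not shown.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, row):
    """Estimated probability that a project becomes inactive."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


def fit_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            error = predict(weights, bias, row) - label
            grad_b += error
            for j, value in enumerate(row):
                grad_w[j] += error * value
        bias -= learning_rate * grad_b / len(rows)
        weights = [
            w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)
        ]
    return weights, bias


# Made-up rows: [is_organization, scaled_fork_count]; label 1 = became inactive.
rows = [
    [0, 0.0], [0, 0.1], [0, 0.2], [1, 0.1],
    [1, 0.8], [1, 0.9], [0, 0.7], [1, 1.0],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
weights, bias = fit_logistic(rows, labels)
```

In this toy data a negative coefficient on the fork feature indicates that more-copied projects are less likely to go idle; with real data, coefficient signs and magnitudes would need significance checks before drawing conclusions.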