Open discussion: data science and open source
Short Takes...on Analytics
A blog by James Guszcza, US chief data scientist, Deloitte Consulting LLP
As tools like R, Python, and Hadoop continue their march into the mainstream of enterprise-scale data science, it is evident that open source is becoming a feature of today's enterprise computing landscape.
Once viewed as risky and unreliable, open source tools are now recognized to bring tangible value. Today, open source tools afford data scientists and organizations new levels of power and agility, and are sometimes able to meet their demands in ways traditional tools can’t.
Is it an accident that big data, analytics, and open source have matured at the same time? Or are their linkages more fundamental? To address these questions, it helps to consider open source from a variety of angles.
The culture factor
When trying to understand why organizations and networks of people do what they do, it often pays to consider culture. And indeed, aspects of the modern data science community’s culture mesh with those of the open source movement.
D. J. Patil, the US Government’s first Chief Data Scientist, remarked that: “All the top data scientists share an innate sense of curiosity. Their curiosity is broad, and extends well beyond their day-to-day activities. They are interested in understanding many different areas of the company, business, industry, and technology. As a result, they are often able to bring disparate areas together in a novel way.”1
Using and contributing to open source tools goes hand-in-hand with the collaboration, curiosity, associative thinking, and cross-disciplinary orientation characteristic of data scientists.
Exponential growth: Sometimes it’s not just hype
When business people invoke the term “exponential growth,” skepticism is typically in order. But one form of exponential growth that has passed the test of time is Moore’s Law. Moore’s Law states that the number of transistors that can fit on a microchip doubles roughly every two years (a figure often quoted as 18 months). The continually plummeting costs of data storage and computer processing power can be viewed as corollaries of Moore’s Law.
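To make the arithmetic of steady doubling concrete, here is a minimal Python sketch. The function name `growth_factor` and its parameters are illustrative assumptions, not part of any library, and the doubling period is left as an input because quoted figures vary (18 months and two years are both common).

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Return the multiplicative growth after `years` for a quantity
    that doubles once every `doubling_period_years`."""
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    # Each full doubling period multiplies the quantity by 2.
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Illustrative horizons only; 1.5 years corresponds to an 18-month doubling period.
    for years in (3, 15, 30):
        factor = growth_factor(years, doubling_period_years=1.5)
        print(f"After {years:>2} years: about {factor:,.0f}x")
```

Under the 18-month assumption, 15 years amounts to ten doublings, or roughly a thousandfold increase; stretching the period to two years lowers the figure for the same horizon, but the growth remains exponential.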
Less often noticed is the (approximately) exponential growth in the functionality of certain open source data analysis tools. For example, the R statistical computing environment is widely regarded as the lingua franca of the statistics and data science communities. Thanks to the Internet as well as R’s modular design, the number of functions contributed to R has grown exponentially over time. While this fact might strike one as obscure, many of these open source add-on packages implement cutting-edge methodologies (think regularized regression, random forest and boosted tree models, support vector machines, and publication-quality data visualization facilities) that better enable data scientists to tackle tough problems and ever more complex datasets.
When value begets value
Anyone who has used social media or shopped online has experienced the phenomenon of network externalities. Network externalities are arguably one of the defining features of the Internet age. The concept captures the tendency of certain tools or products to become more valuable the more people use them: the network of users is inherent to the value of the tool.
For example, R and Python have become popular in university curricula and business environments alike, both because they are free and because they rapidly incorporate the latest ideas and methodologies. As a result, the communities of R and Python users continue to grow; R- and Python-focused support groups emerge and grow stronger; books and journal articles incorporate R and Python code; and so on. All of this potentially enhances the value of R and Python, likely prompting additional university programs, businesses, and data scientists to adopt them, and the cycle continues.
Given the kaleidoscopically evolving methods and applications of data science in the business, medical, and scientific domains, such commensurately evolving tools are fit for purpose.
This could be the beginning of a beautiful friendship
While the coupling of open source software and data science is relatively new, the above considerations suggest that it will stand the test of time. It is, therefore, likely worth considering future implications.
First, database and analytical software vendors will likely integrate open source functionality with their products. For example, many commercial database and statistical computing environments now support the integration of R code.
Second, open source will likely offer new ways of educating students and training employees. Open source software lends itself to, and enables, such education and training innovations as Massive Open Online Courses (MOOCs), the “flipped classroom” (in which online training replaces university lectures and precious classroom time can be used for interactive computer lab-style learning), local user groups and meetups, and online support networks.
Third, open source could enable new modes of collaboration and innovation. For example, data science crowdsourcing platforms such as Kaggle and Topcoder demonstrate the possibilities of using open source software to bring collective intelligence (“the wisdom of crowds”) to thorny analytical and engineering problems. It is reasonable to expect further innovations in organizing, recognizing, and rewarding data science talent.