Come for the science, stay for the network

When "Science as a Service" products graduate into worldwide networks, science itself will change

April 4, 2016
Jonathan Libov

Here’s the late Ian Murdock, founder of the Debian project, former CTO of the Linux Foundation, back in 2007, on the rise of the package manager:

It used to be that operating systems were big, monolithic products, and applications were big, monolithic products you put on top of them. If you wanted to deploy, say, a web application, you sourced the middleware stack (which itself was probably several big products too), you sourced the operating system, and you (often painfully) had to integrate the two yourself (or pay a big company lots of money to do it for you).

These days, you increasingly just “apt-get install whatever”.

...With a componentized operating system rather than a wad of stuff, it becomes far easier to push new innovations out into the marketplace and generally evolve the OS over time.

Of course, in 2007 and the years prior, it was rarely just “you” building the web application, because writing and maintaining that middleware stack might have been a full-time job unto itself. Murdock was right about how monumental this shift was: by sharing resources via package managers and providing “unbundled” APIs, we vastly increased productivity in programming. Without that shift, it would have been impossible in 2014 for a group of 32 engineers to build an app that reached 450 million monthly active users.

Here in 2016 we are still waiting for package manager- and API-like innovations in science. Today’s scientific laboratories largely work in silos like the engineering organizations of yesteryear, duplicating work and preserving data on local hard drives. When laboratories do share resources, it’s often in on-premise, intra-organization “core facilities”.

A number of startups today facilitate the transition from “intra” to “inter”. A 2013 O’Reilly post, “Science as a Service”, covers several of them, including Science Exchange (USV, the venture capital firm I work for, is an investor), which is a marketplace for researchers to find labs to perform experiments, and Benchling, a collaboration tool for life scientists. Another example is HistoWiz, which enables histopathology as a service and may already have the world’s largest database of cancerous tissue.

As genetics becomes a more translational, applicable science through technologies like CRISPR, it is a market we’ll see many startups try to tackle. Recently I’ve seen Molecula Maxima, an IDE and service for genetic programming; Quilt Data, for sharing, discovery, and manipulation of genetic information; and Stirplate, which promises to drastically reduce and deduplicate work done by geneticists. Perhaps there’s one to be built on top of the programming language that MIT engineers just announced.

I learned from Stirplate’s founder, Keith Gonzales, that when next-generation sequencing machines deliver results, those data are packaged as 500GB+ FASTQ files: massive text files that are useless to researchers without the aid of a handful of data scientists and engineers employed by the lab. This sounds an awful lot like the pre-package manager, pre-API era of programming. The near-term playbook for someone like Stirplate is to de-duplicate all the work done by data scientists and engineers across laboratories, much as Sift Science has done for fraud engineers across e-commerce and other organizations.
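To make the duplicated work concrete, here is a minimal sketch of the kind of boilerplate every lab ends up writing for itself. It relies only on the standard FASTQ layout (four lines per read: an `@` header, the sequence, a `+` separator, and a quality string) and assumes Phred+33 quality encoding. The names `FastqRecord`, `parse_fastq`, and `mean_phred` are made up for illustration and do not come from Stirplate or any other product mentioned here.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ block, streaming line by line."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a read, assuming Phred+33 ASCII encoding."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


# A tiny in-memory stand-in for a multi-hundred-gigabyte file.
sample = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n"
records = list(parse_fastq(io.StringIO(sample)))
for record in records:
    print(record.read_id, len(record.sequence), mean_phred(record.quality))
```

Because the parser streams one record at a time, memory use stays flat regardless of file size. That matters for files this large, and it is exactly the sort of detail that gets re-solved lab by lab.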

The Science as a Service startups identified in the 2013 O’Reilly post may have produced gains in efficiency, as occurred with the advent of package managers in programming, but we have yet to see science’s GitHub, the realization of the network that can happen when disparate organizations share resources. Package managers and APIs ushered in a set of tools (like Git) that engineers would want to collaborate and comment on, a dynamic which democratized and meritocracized engineering itself. In the same way, a Science as a Service business could conceivably follow the “come for the tool, stay for the network” playbook and greatly democratize and meritocracize science itself. In a world where hedge fund quants are leveraging data science techniques to beat cardiologists at their own game, a GitHub for science could greatly increase productivity by expanding the realm of who can practice science. On a less radical tack, one can imagine how HistoWiz’s database of cancer tissue could do a lot of good for a lot of histopathologists in ways that were never possible before. If and when the network effects kick in, they would do wonders for scientific progress, much as those network effects have for programming.

This would also have an enormous impact on science publishing. I spent the first year of my career in a neuroscience lab, running EEG experiments and processing data for a series of experiments that would hopefully get published within a few years. Perhaps the reason I didn’t have the disposition for science research (I only lasted one year) was that I was too impatient for the fruits of my labor to be realized, if at all, at the discretion of a journal’s gatekeeper and on a timescale of years. And it’s not just antsy first-year research technicians that the science publishing system perturbs; it’s also at the root of the industry’s problem with reproducibility: reproducibility studies don’t get published, which reduces the incentive to perform them, because getting published is the way labs get funded.

The demand amongst scientists for reforming the science publishing industry is there. In fact, the #ASAPBio movement, advocating for a “pre-print” model that would enable biologists to share findings prior to publication in a journal, is alive and well. And support for Sci-Hub, which pirated millions of papers and made them freely accessible, speaks to what we’re missing out on by withholding scientific papers behind extremely expensive subscriptions.

As an analyst at USV I’ve looked at a number of models that would seek to disrupt the science publishing industry, and it now seems clear to me that science publishing sits on top of the entire science stack. Much as accreditations in engineering are a thing of the past now that programmers, not organizations like Oracle, own the network in programming, the advent of science and scientific data networks should serve as a wedge to break that apart. Which is a long way of saying that if you want to change the science publishing system that is holding science back, you’d have to change science. And I do believe that a model of science that is collaborative (where data is shared and accessed at rates unseen and unthinkable today, and where science is done incrementally on the network, like pull requests and commits in GitHub, rather than in an isolated lab and through a paper) would render today’s science publishing model superfluous.

For the sake of scientific progress, we’d be best off doing science out in the open. If you’re working on that, I’d love to talk.