Titus recently posted some thoughts on the "UNIX" scripting/pipeline model for analysis. The model is essentially one of discrete components that produce and consume standard (or at least easily parsed) data formats. The components are naturally reusable, and they are chained together to perform more complex analysis. Higher-level facilities such as make facilitate reproducibility.

This model has a lot of nice features. In particular, it makes reuse easy and thus encourages the development of reusable components. In fact, this is exactly the analysis model that Galaxy makes available to non-computational users. In doing so we tackle several of the "deficiencies" of this model that Titus mentions.

Scalability. Is this an inherent issue with pipelined analysis? It certainly is a problem at the unix shell, where each utility must linearly process the output of the previous utility. We, however, relax this restriction. A tool can examine the entirety of the output produced by a previous tool, and more importantly, these intermediate files can be indexed, allowing for more efficient data access.
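To make the indexing point concrete, here is a minimal sketch (not Galaxy's actual index format) of how an intermediate tab-delimited file, assumed to be grouped by chromosome, could be indexed by byte offset so a downstream tool can jump to the region it cares about instead of re-reading the whole stream. The file name is hypothetical.

```python
# Sketch: build a byte-offset index over a tab-delimited interval file
# (assumed grouped by chromosome) so a downstream tool can seek directly
# to the rows it needs rather than scanning the entire file.

def build_index(path):
    """Map each chromosome name to the byte offset of its first line."""
    index = {}
    offset = 0
    with open(path, "rb") as handle:
        for line in handle:
            chrom = line.split(b"\t", 1)[0]
            if chrom not in index:
                index[chrom] = offset
            offset += len(line)
    return index

def lines_for_chrom(path, index, chrom):
    """Yield only the lines belonging to one chromosome."""
    with open(path, "rb") as handle:
        handle.seek(index[chrom])
        for line in handle:
            if line.split(b"\t", 1)[0] != chrom:
                break
            yield line

# index = build_index("intermediate.interval")   # hypothetical file
# for line in lines_for_chrom("intermediate.interval", index, b"chr1"):
#     ...
```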

Testing. Always a tricky issue. I agree that testing modules and libraries is easier than testing scripts; however, most scripts I write are actually just thin wrappers over libraries. And if we want to take advantage of the software that is already out there, we can't mandate that it be written in a certain way. Our compromise is to provide support at the framework level to make it easy to build functional tests for command line tools.
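The idea of a framework-level functional test is simply: run the tool on a known input, compare the result to an expected output. A rough sketch of that pattern, with a hypothetical tool name and test files:

```python
# Sketch of a functional test for an arbitrary command line tool: run it on
# a known input and compare the output file to an expected one, without
# requiring the tool itself to be restructured for testing.

import filecmp
import os
import subprocess
import tempfile

def run_tool_test(command, input_path, expected_path):
    """Run `command <input> <output>` and check the output matches expectations."""
    with tempfile.TemporaryDirectory() as tmp:
        output_path = os.path.join(tmp, "output.dat")
        result = subprocess.run(
            command + [input_path, output_path],
            capture_output=True,
            text=True,
        )
        assert result.returncode == 0, result.stderr
        assert filecmp.cmp(output_path, expected_path, shallow=False)

# run_tool_test(["my_filter_tool"],            # hypothetical tool
#               "test-data/input.bed",
#               "test-data/expected_output.bed")
```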

Hard to reuse. This is only the case if the components of your pipeline violate the initial assumption -- do one well-defined thing. Here I really see no difference between a library and a command line tool: either the program unit is documented and does what the documentation says, or it doesn't. I also have no evidence that libraries tend to be better behaved in this sense than command line tools. On the other hand, the command line is a great abstraction for encouraging reuse, since it is a language- and platform-agnostic way for components to be connected.

Poor metadata support. This is absolutely true for the standard command line model, and something that we attack directly by layering structured metadata on top of not just every dataset, but also every tool. In other words, this is not a problem with the model, but a problem with implementing this model within the framework of the unix shell. A better framework can provide much better support for metadata, and conversely having better metadata makes the framework more powerful.
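What "structured metadata on top of every dataset and every tool" might look like, in a deliberately simplified sketch (these field names are illustrative, not Galaxy's actual schema):

```python
# Sketch: structured metadata attached to a dataset and to a tool, rather
# than relying on file names and conventions alone.

dataset_metadata = {
    "name": "aligned reads",
    "format": "interval",          # declared datatype, not guessed from extension
    "dbkey": "hg18",               # reference genome build
    "columns": {"chrom": 1, "start": 2, "end": 3},
    "created_by": "alignment tool v1.2",
}

tool_metadata = {
    "id": "interval_filter",       # hypothetical tool id
    "inputs": [{"name": "input", "format": "interval"}],
    "outputs": [{"name": "output", "format": "interval"}],
    "parameters": [{"name": "min_length", "type": "integer", "default": 100}],
}

# With both descriptions available, a framework can check that a dataset is a
# valid input for a tool before anything runs, and can record exactly which
# parameters produced which outputs.
```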

Data format abstraction. A problem if you limit yourself to CSV, but if you allow for richer formats this is much less of a problem. If you have a rich way to describe data (both how it is stored and how tools expect to see it), then as long as tools agree on a way to share data, or the framework can adapt between formats for you, all is well.
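One way a framework can "adapt between formats for you" is a registry of datatype converters that it consults whenever a tool expects a different format than a dataset provides. A toy sketch of that idea, with hypothetical format names:

```python
# Sketch: a registry of datatype converters the framework can consult when a
# tool expects a different format than a dataset provides.

converters = {}

def register_converter(src, dst):
    def decorator(func):
        converters[(src, dst)] = func
        return func
    return decorator

@register_converter("interval", "bed")
def interval_to_bed(in_path, out_path):
    # In this toy case the formats are close enough that copying the first
    # three columns suffices; real conversions can be arbitrarily complex.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            chrom, start, end = line.rstrip("\n").split("\t")[:3]
            dst.write(f"{chrom}\t{start}\t{end}\n")

def adapt(dataset_format, tool_format, in_path, out_path):
    """Convert a dataset to the format a tool expects, if a converter exists."""
    if dataset_format == tool_format:
        return in_path
    converters[(dataset_format, tool_format)](in_path, out_path)
    return out_path
```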

Poor user interface. Perhaps not applicable here, since working with Galaxy is by design more user friendly than working with the unix command line. (But of course, what is hiding under most of those Galaxy "tools" is a command line program; in other words, making it easy to put an interface on a script is exactly what we do.)

Poor reproducibility. It is all too true that people fail to keep track of the parameters when they run scripts. Working at the command line interactively, it is all too easy to lose track of exactly what you've done. Here again, a better framework can help. In Galaxy, the framework automatically keeps track of all of the parameters and all of the data flows, so for any dataset you can always see its entire provenance. Not only that, the exact procedure can be extracted as a workflow at any time in the future and be re-run.
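The core of that kind of provenance tracking is small: record, for every job, which tool ran with which parameters on which inputs, then walk the records backwards from any dataset. A minimal sketch under those assumptions (names and record structure are hypothetical):

```python
# Sketch: record a provenance entry for every job, then reconstruct the full
# chain of steps that produced any given dataset.

import time

history = []

def run_and_record(tool_id, parameters, inputs, outputs, run_tool):
    """Run a tool and append a provenance record for its outputs."""
    run_tool(parameters, inputs, outputs)   # the actual analysis step
    history.append({
        "tool": tool_id,
        "parameters": parameters,
        "inputs": list(inputs),
        "outputs": list(outputs),
        "finished": time.strftime("%Y-%m-%dT%H:%M:%S"),
    })

def provenance(dataset, records):
    """Walk the records backwards to find every step that led to a dataset."""
    steps = []
    targets = {dataset}
    for record in reversed(records):
        if targets & set(record["outputs"]):
            steps.append(record)
            targets |= set(record["inputs"])
    return list(reversed(steps))

# provenance("filtered.bed", history)   # hypothetical dataset name
```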

So what's the point? I don't believe these are deficiencies of the pipeline model of analysis, but of the environment in which the analysis is performed. The key piece that we introduce is structured metadata, not just about datasets, but about tools and data formats as well. With this deeper understanding of how each component works and what the data looks like, the framework can actually turn these weaknesses into strengths: creating an easy-to-use environment for interactive analysis that facilitates reuse and reproducibility.

So in a sense Galaxy is one part of a complete solution stack. It provides the top layer, a user interface where users without computational expertise can put components together to perform analysis and create workflows. Those components are themselves command line tools (scripting layer), ideally implemented using libraries and modules (object layer), and sharing common data and index formats (data layer).

However, this solution is clearly targeted at the non-computational user. What is needed is a complementary solution for the bioinformatics power user. Perhaps a command line analysis environment that is much better at keeping track of the provenance of data. Perhaps a scripting environment. But ideally, it would be possible to work with the same analysis from either perspective.

For example, we already have some support in Galaxy for taking a workflow created through the web interface and running it from a command line. The workflow then becomes a common format that allows the power user to provide their analysis to an individual with less computational expertise in a way that they can still inspect, understand, and even modify and rerun the analysis.
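To illustrate the general idea (this is not Galaxy's actual workflow format or command line interface), a workflow serialized as a plain list of steps can be replayed by a very small runner outside the web interface, which is what makes it a useful exchange format between the power user and the less computational user:

```python
# Sketch: replay a workflow saved as a simple JSON list of command line
# steps, e.g. [{"command": ["tool", "--min", "100", "in.bed", "out.bed"]}, ...]

import json
import subprocess
import sys

def run_workflow(workflow_path):
    with open(workflow_path) as handle:
        steps = json.load(handle)
    for step in steps:
        print("running:", " ".join(step["command"]))
        subprocess.run(step["command"], check=True)

if __name__ == "__main__":
    run_workflow(sys.argv[1])   # e.g. python run_workflow.py workflow.json
```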

Of course, the question of data storage and representation is still a very tricky one here. Galaxy uses a variety of special-purpose index formats, along with adapting from standard to tool-specific formats and doing automatic datatype conversion at the framework level, to smooth over these problems. However, all of these things result in either data redundancy or inefficiency in some cases. This is an inherent difficulty when you start from the premise (as we do) that tools don't have to conform to our ideal representation of the data. However, identifying the best way to store data so that it can be efficiently viewed through a variety of different representations is a very interesting problem.