Our lab's research is in genome informatics, the use of computational and statistical approaches to understand genomes. Our ultimate goal is to achieve a complete understanding of the structure and function of genomes: specifically, how information is encoded in genomes and how this encoding allows for precise, reproducible biological processes and developmental programs, yet is harnessed by evolution to generate remarkable diversity. We work toward this goal both through the study of genome function and evolution, and through the development of tools that support the broader genomics community. Our work can be divided into two main areas.

First, we develop software and infrastructure to support data-intensive biomedical research. The transformative power of genomic techniques will not be fully realized until all researchers can take advantage of them. Our group researches, develops, and implements approaches to eliminate the informatics challenges that impede genomic research. The results of our work are provided as part of the Galaxy framework (http://galaxyproject.org), which has become one of the most widely used tools for genome analysis.

Second, we use genomic, comparative genomic, and functional genomic approaches to understand genome structure and gene regulation. Our group studies regulatory element structure and mechanism in evolutionary and functional context through the development of primary analysis, data mining, and integration methods. To do this, we use data generated by a variety of experimental techniques, leverage collaborations with experimentalists working in multiple model systems, and develop new analysis algorithms and models.

Making computational biology more accessible, transparent and reproducible

High-throughput data production technologies are revolutionizing modern biology, but progress is frequently impeded by computational details completely unrelated to the scientific questions being investigated. It is our goal to remove these impediments and make complex computational analysis more accessible. Along with the Nekrutenko Lab at Penn State, we develop Galaxy, which allows computational tools to be trivially integrated into an analysis environment in which experimental biologists can construct complex analyses. Some specific recent projects include:

Developing an approach for transparent and reproducible publication of data-intensive analysis

Reproducibility of published results is fundamental to the scientific process. However, there are serious challenges to ensuring reproducibility as results become increasingly reliant on complex computational methods. We have developed an integrated approach for publishing analyses called Galaxy Pages. Pages are interactive, web-based documents that describe a complete experiment, allowing computational experiments to be documented and published with all analysis processes and results directly connected. Readers can then view the experiment at any level of detail, inspect intermediate data and analysis steps, reproduce some or all of the experiment, and extract methods to be modified and reused. The resulting framework was published in Genome Biology (Goecks et al. 2010).

Enabling scalable genomic analysis using cloud infrastructure

With ARRA support from NHGRI, we developed an approach for composing analysis environments using cloud resources that is extremely flexible while requiring minimal expertise from users. Building on this framework, we developed Galaxy CloudMan, which lets users create their own Galaxy environment within a cloud computing service (e.g. Amazon Web Services). The environment automatically acquires and releases compute resources as needed, resulting in more efficient and cost-effective resource use.
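The idea of acquiring and releasing compute resources as demand changes can be illustrated with a small sketch. This is not CloudMan's actual logic or API: the function names, the queue-length heuristic, and the default parameters below are all assumptions chosen only to show the general shape of such a scaling decision.

```python
import math


def workers_needed(queued_jobs, jobs_per_worker, min_workers, max_workers):
    """Target number of worker nodes for the current queue length.

    Hypothetical heuristic: one worker per `jobs_per_worker` queued jobs,
    clamped to the range [min_workers, max_workers].
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    wanted = math.ceil(queued_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))


def scaling_action(current_workers, queued_jobs,
                   jobs_per_worker=4, min_workers=0, max_workers=20):
    """Decide whether to acquire, release, or hold worker nodes.

    Returns a tuple (action, count); the action names are illustrative.
    """
    target = workers_needed(queued_jobs, jobs_per_worker,
                            min_workers, max_workers)
    delta = target - current_workers
    if delta > 0:
        return ("acquire", delta)
    if delta < 0:
        return ("release", -delta)
    return ("hold", 0)


if __name__ == "__main__":
    # A busy queue grows the cluster; an empty queue shrinks it.
    print(scaling_action(current_workers=2, queued_jobs=17))
    print(scaling_action(current_workers=5, queued_jobs=0))
```

In a sketch like this, the cost saving comes from the release branch: idle nodes are returned to the provider rather than kept running between analyses.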

Developing a visual analytics framework for genomic data

We have built a new, extensible framework for web-based genome browsing within the context of Galaxy. Genome browsers have existed as long as there have been genome assemblies. However, the wide availability of sequencing technologies has created a need not just to browse publicly curated data on genomes; browsers now need to support researchers who are generating their own data. Because the results of sequencing-based experiments often differ greatly depending on how they are analyzed, browsers should support dynamic and interactive reanalysis of data. Since the Galaxy framework already integrates analysis tools, it provides a natural substrate for building interactive visual analysis. The goal of this framework is to explore new ways to visualize, visually analyze, and integrate genomic datasets. The first result of this work is the Galaxy Track Browser, which presents a new paradigm for rapid visual exploration of the complex parameter spaces associated with high-throughput sequence analysis tools.

Supporting de-centralization of analysis resources

The rapid increase in Galaxy use has created new challenges and opportunities. Many Galaxy users are now running their own dedicated instances, either on local or cloud resources, and other labs have started to build suites of analysis tools on top of Galaxy. Galaxy has become an important resource for the genomics community. In addition to continuing our work on best-practice analyses and visual analytics, a major focus of our future work will be building infrastructure to support an increasingly decentralized Galaxy community. The core of this work is the Galaxy Tool Shed, which will allow tools and best-practice workflows to be shared between Galaxy instances in a way that preserves reproducibility.

Genomics and epigenomics of gene regulation

Our lab is currently engaged in several projects to better understand the structure, function, and evolution of the genomic elements that regulate gene expression, called cis-regulatory modules (CRMs), and the epigenomic features associated with gene regulation.

Establishing the genomic and epigenomic determinants of CRM activity

As part of a collaborative project with Ross Hardison at Penn State, we are studying features associated with CRM activity in erythroid differentiation. Using an inducible differentiation model cell line, our collaborators have been mapping the locations of key histone modifications and transcription factors at multiple time points, as well as assaying transcription levels using RNA-seq. A major interest of the lab is the use of machine learning approaches to understand the relationship between these epigenomic features and changes in expression.

Understanding the relationship between evolutionary constraint and function

The extent to which CRMs are under evolutionary constraint remains controversial, and tissue-specific cross-species comparisons show surprisingly little overlap in transcription factor occupancy. As part of the Mouse ENCODE project, we are studying the extent to which TF-bound regions and other putative regulatory elements are conserved between human and mouse.

High-resolution mapping of chromatin structure

In a collaboration with Victor Corces at Emory University, we are developing novel analysis methods to allow high-resolution analysis of chromatin structure data. We have developed methods to model and correct for biases in 5C and Hi-C experiments, resulting in high-resolution measurement of 3D chromatin interactions. We have applied these approaches in both Drosophila and mouse to produce high-resolution, locus-specific interaction maps that have revealed new insights into chromatin structure. In ongoing work, we are continuing to develop methods to incorporate this information with other functional genomic datasets.
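To give a sense of what correcting for biases in a contact map can involve, here is a minimal sketch of iterative matrix balancing, a generic technique often applied to Hi-C data. It is not the method developed in this project, which is not described here. It assumes each observed count is the product of a true contact frequency and one multiplicative bias per locus, and that the matrix is symmetric with positive row sums. The function name and the toy matrix are illustrative only.

```python
def balance_contacts(counts, iterations=50):
    """Estimate per-locus biases so that corrected row sums become equal.

    counts: symmetric square matrix (list of lists) of contact counts,
            assumed to have strictly positive row sums.
    Returns (corrected_matrix, bias_vector).
    """
    n = len(counts)
    bias = [1.0] * n
    for _ in range(iterations):
        # Row sums of the matrix corrected with the current bias estimate.
        row_sums = [
            sum(counts[i][j] / (bias[i] * bias[j]) for j in range(n))
            for i in range(n)
        ]
        mean_sum = sum(row_sums) / n
        # Square-root update: each bias enters both rows and columns,
        # so the damped step avoids overshooting.
        bias = [b * (s / mean_sum) ** 0.5 for b, s in zip(bias, row_sums)]
    corrected = [
        [counts[i][j] / (bias[i] * bias[j]) for j in range(n)]
        for i in range(n)
    ]
    return corrected, bias


if __name__ == "__main__":
    # Toy example: a uniform-coverage "true" map distorted by biases
    # of 1, 2 and 0.5 on the three loci.
    observed = [
        [4.0, 4.0, 1.0],
        [4.0, 16.0, 2.0],
        [1.0, 2.0, 1.0],
    ]
    corrected, bias = balance_contacts(observed)
    for row in corrected:
        print([round(value, 3) for value in row])
```

The corrected matrix is only defined up to an overall scale, so in practice it is the relative contact frequencies, not the absolute values, that carry meaning after a correction of this kind.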