Abstracts



See the GCC2014 Program for the complete schedule.

Talk Abstracts

Contents

  1. Session 1, Tuesday, July 1, 9:15-10:30
    1. Transcriptomes and Exomes: Computational Challenges of NGS Data
    2. The Galaxy framework as a unifying bioinformatics solution for multi-omic data analysis
    3. iReport: HTML Reporting in Galaxy
  2. Session 2, Tuesday, July 1, 11:00-12:15
    1. Galaxy Deployment on Heterogenous Hardware
    2. Connecting Galaxy to tools with alternative storage and compute models
    3. A journal’s experiences of reproducing published data analyses using Galaxy
    4. Enabling Dynamic Science with Flexible Infrastructure
  3. Session 3, Tuesday, July 1, 1:15-2:30
    1. State of the Galaxy
    2. Update on Ion Torrent Sequencing – Accurate, Long Reads
  4. Session 4, Tuesday, July 1, 4:00-5:30
    1. The Galaxy Tool Shed: A Framework for Building Galaxy Tools
    2. Integrating the NCBI BLAST+ suite into Galaxy
    3. deepTools: a flexible platform for exploring deep-sequencing data
  5. Session 5, Wednesday, July 2, 9:10-10:25
    1. The GCC2014 Hackathon
    2. More Options, Less Time: Streamlining Access to Reference Datasets
    3. Building More Powerful Galaxy Workflows with Dataset Collections
    4. An Appliance for Life Science Research: Isilon, Penguin and Galaxy
  6. Session 6, Wednesday, July 2, 10:55-12:15
    1. Lab Specimen Tracking with Galaxy
    2. The Munich NGS-FabLab for medical sequence data
    3. Galaxydx - A Web-server dedicated to diagnosis data analysis
    4. Using Galaxy and Globus to deliver Science as a Service
    5. SGI UV: Harnessing the Big Brain Platform for Galaxy
  7. Session 7, Wednesday, July 2, 1:15-2:35
    1. Building a virtual research environment with Galaxy
    2. The Australian Genomics Virtual Laboratory
    3. Galaxy on the GenomeCloud : Yet another on-demand Galaxy cloud, but only powered by Apache CloudStack
    4. Test-driven Evaluation of Galaxy Scalability on the Cloud
    5. Bioinformatics on AWS: New and Noteworthy Features

Session 1, Tuesday, July 1, 9:15-10:30

Transcriptomes and Exomes: Computational Challenges of NGS Data

Steven Salzberg

Steven Salzberg1

  • 1 Johns Hopkins University

Steven Salzberg is a Professor of Medicine, Biostatistics, and Computer Science at the Johns Hopkins University School of Medicine where he is also Director of the Center for Computational Biology at the McKusick-Nathans Institute of Genetic Medicine. Steven has made many prominent contributions to open source software, including several of the most popular tools used on Galaxy Platforms. Recently he was awarded the 2013 Benjamin Franklin Award for Open Access in the Life Sciences, and the 2012 Balles Prize in Critical Thinking for his science column at Forbes.


The Galaxy framework as a unifying bioinformatics solution for multi-omic data analysis

Pratik Jagtap

Pratik D. Jagtap1,3, James Johnson2, Getiria Onsongo2, Bart Gottschalk2, Timothy J. Griffin1,3

Slides, Video

Integration and correlation of multiple areas of 'omics' datasets (genomic, transcriptomic, proteomic) has the potential to provide novel biological insights. Integrating these datasets is challenging, however, because it involves the sequential use of multiple domain-specific software tools.

We describe extending the use of Galaxy for proteomics software, enabling novel, advanced multi-omic applications in proteogenomics and metaproteomics. Focusing on the perspective of a biological user, we will demonstrate the benefits of Galaxy for these analyses, as well as its value for software developers seeking to publish new software. We will also report on our experience in training non-expert biologists to use Galaxy for these advanced, multi-omic applications.

Working with biological collaborators, multiple proteogenomics and metaproteomics datasets representing a broad array of biological applications were used to develop workflows. Software required for sequential analytical steps such as database generation (RNA-Seq derived and others), database search and genome visualization were deployed, tested and optimized for use in workflows.

Novel proteoforms (proteogenomic workflows, e.g., Galaxy Workflow: Integrated ProteoGenomics Workflow (ProteinPilot)) and microorganisms (metaproteomic workflows, e.g., Workflow for metaproteomics analysis - ProteinPilot) were reliably identified using shareable workflows. Tandem proteogenomic and metaproteomic analysis of datasets will be discussed using modular workflows. Sharing of datasets, workflows and histories on the usegalaxyp.org website and proteomic public repositories will also be discussed.

We demonstrate the use of Galaxy for integrated analysis of multi-omic data, in an accessible, transparent and reproducible manner. Our results and experiences using this framework demonstrate the potential for Galaxy to be a unifying bioinformatics solution for multi-omic data analysis.


iReport: HTML Reporting in Galaxy

Saskia Hiltemann

Saskia Hiltemann1, Youri Hoogstrate1, Hailiang Mei2, Guido Jenster1, Andrew Stubbs1

  • 1 ErasmusMC, Rotterdam, The Netherlands
    2 LUMC, Leiden, The Netherlands

Slides, Video

Galaxy offers a number of great visualisation tools (Trackster, Circster), but currently lacks the ability to easily summarise the various outputs of a workflow into a single view. iReport is a Galaxy tool for the easy creation of HTML reports from Galaxy outputs. Rather than producing a static HTML output, iReport uses JavaScript and jQuery to allow for interactivity in the form of searching and sorting of tables, automatic zooming of image data, tabbed views for organisation of outputs, etc. Users define the number and names of tabs for their report, and can add different types of content items to these tabs (e.g. text, tabular data, image data, PDF files, links to datasets, and more).

We have previously implemented Galaxy-based data processing pipelines for next-generation sequencing (NGS) and for array-based allelic copy number determination, named CGtag (Hiltemann et al. 2014), and developed a web-based fusion gene visualizer, iFUSE (Hiltemann 2013). We used the iReport tool to make iFUSE2, the next-step extension to support fusion gene determination within Galaxy, which runs as the last step of our workflow and combines the outputs of various Galaxy tools into a single view.

iReport is available from the DTL toolshed (toolshed.dtls.nl) and the main Galaxy toolshed.


Session 2, Tuesday, July 1, 11:00-12:15

Galaxy Deployment on Heterogenous Hardware

Carrie Ganote

Carrie Ganote1, Soichi Hayashi1

Slides, Video

Indiana University, like many institutions, houses a heterogeneous mixture of compute resources. In addition to university resources, the National Center for Genome Analysis Support (NCGAS), the Extreme Science and Engineering Discovery Environment (XSEDE), and the Open Science Grid (OSG) all provide resources to biologists with NSF affiliations. Such a diverse mixture of compute power and services could be applied to address the equally diverse set of problems and needs in the bioinformatics field.

Many software suites, such as phylogenetic tree building algorithms, are well suited to large numbers of fast CPUs. De novo assembly problems really crave a machine with lots of RAM to spare. Alignment and mapping problems where each input is a separate invocation lend themselves perfectly to high-throughput, heavily distributed compute systems. Galaxy is a web interface that acts as a mediator between the biologist and the underlying hardware and software. In an ideal setup, Galaxy would be able to delegate work to the best-suited underlying infrastructure.

We present an instance of Galaxy at Indiana University, installed and maintained by NCGAS, that takes advantage of a variety of compute resources to increase utilization and efficiency. The OSG is a distributed grid through which BLAST jobs can be run. IU, NCGAS and XSEDE jointly support Mason, a 512 GB/node system. For IU users, Big Red 2 is the first university-owned petaFLOPS machine. Connecting these resources to Galaxy and using the best tool for the job results in the best performance and utilization, so everyone wins.
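
The delegation idea described in this abstract can be illustrated with a minimal Python sketch. It is not Galaxy's actual job configuration. The resource names mirror those in the abstract, while the `Job` fields, the numeric thresholds, the "local cluster" fallback and the `choose_resource` function are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of what a job needs (not a Galaxy class)."""
    tool: str
    memory_gb: int          # estimated peak memory
    cores: int              # CPU cores the tool can make use of
    independent_tasks: int  # separate invocations that can run in parallel


def choose_resource(job: Job) -> str:
    """Pick a destination using made-up, illustrative thresholds."""
    if job.memory_gb > 64:
        return "Mason"              # large-memory nodes, e.g. de novo assembly
    if job.independent_tasks > 100:
        return "Open Science Grid"  # many small, independent invocations
    if job.cores > 32:
        return "Big Red 2"          # many fast CPUs, e.g. tree building
    return "local cluster"          # assumed default, not named in the abstract
```

For example, under these assumed thresholds a job described as `Job("assembler", memory_gb=400, cores=16, independent_tasks=1)` would be sent to Mason, while a batch of thousands of small BLAST searches would go to the Open Science Grid.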


Connecting Galaxy to tools with alternative storage and compute models

Brad Chapman

Brad Chapman1, Rory Kirchner1, Oliver Hofmann1, Winston Hide1

Slides, Video

The community-developed bcbio-nextgen framework provides implementations of best-practice pipelines for variant calling and RNA-seq analysis. The framework handles computation, data storage and program connectivity in ways that parallel Galaxy's approaches, making it difficult to plug in as a standard tool. We'd like to be able to integrate with Galaxy by sharing the underlying implementation code for accessing data, rather than pushing and pulling large files. This talk will discuss ideas to access shared data on external object stores like S3 or HDFS in a consistent way that does not rely on data copying. It will also incorporate approaches to compartmentalize complex sets of tools inside containers using Docker. The goal is to stimulate discussion about ways to make Galaxy a modular component within complex analysis environments. Our ultimate vision is to have an Amazon-based cloud implementation that uses CloudMan to run a Galaxy front end sending out jobs to tools like bcbio-nextgen.


A journal’s experiences of reproducing published data analyses using Galaxy

Peter Li

Peter Li1, Huayan Gao2, Tin-Lap Lee2 and Scott C. Edmunds1

  • 1 GigaScience, BGI-Hong Kong Co., Ltd, Hong Kong
    2 School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong

Slides, Video

GigaScience is a journal with a focus on the publication of reproducible research. This is facilitated by its GigaDB database, where the data and the tools used for its analysis may be deposited by authors and made publicly available with citable DOIs. In a pilot project based on a previously published paper reporting on SOAPdenovo2, we have investigated the extent to which the results from articles published in GigaScience can be made reproducible using Galaxy. The performance of this de novo genome assembler was compared with SOAPdenovo1 and ALLPATHS-LG by Luo et al. (2012) for its ability to assemble bacterial, insect and human genomes. After integrating the three genome assemblers and their associated tools into Galaxy, workflows were implemented in a way that re-created the genome assembly pipelines used by the authors. However, our aim of reproducing the genome assembly statistics from Luo et al. (2012) with the workflows was met with mixed success. Whilst the results generated by SOAPdenovo2 could be reproduced by our Galaxy workflows, we were less successful with SOAPdenovo1 and ALLPATHS-LG. In this presentation, we will show how Galaxy was used, the problems that were encountered and the results of this reproducibility exercise.

Reference


Enabling Dynamic Science with Flexible Infrastructure

Anushka Brownley, Aaron Gardner

Anushka Brownley1, Aaron Gardner1

Slides, Video

As a trusted industry leader in designing and implementing effective scientific infrastructure for research and other organizations, BioTeam has partnered with the Galaxy Project to build and offer the SlipStream Galaxy Appliance, a commercially supported platform. With the increasing throughput of data generation instruments, the dynamic landscape of computational tools, and the variability in analysis processes, it is challenging for scientists to work within the confines of a static infrastructure. BioTeam will discuss some of these challenges and the technical advances we have been working on to build a more flexible Galaxy appliance to support the changing compute and analysis needs of the scientific researcher.


Session 3, Tuesday, July 1, 1:15-2:30

State of the Galaxy

Anton Nekrutenko1 and James Taylor2

Slides, Video

An overview of where the Galaxy Project is and where it is going.


Update on Ion Torrent Sequencing – Accurate, Long Reads

Mike Lelivelt1

Slides, Video


Session 4, Tuesday, July 1, 4:00-5:30

The Galaxy Tool Shed: A Framework for Building Galaxy Tools

Greg Von Kuster

Greg von Kuster1 and the Galaxy Team

Slides, Video

The Tool Shed has become an integral part of the process for building and deploying Galaxy tools and other utilities. In addition to tools, the Tool Shed supports Galaxy Data Managers, custom data types and exported Galaxy workflows. This list will be extended to support additional utilities when appropriate. The Tool Shed provides the ability to define relationships between repositories, enabling complementary utilities to be installed together.

The Tool Shed assures reproducibility within Galaxy when utilities are installed from the Tool Shed using the streamlined installation process between the two applications. An underlying principle of this assurance is that all versions of utilities available in the Tool Shed will always be accessible to any Galaxy instance. This principle implies that a select development path should be followed to produce repositories that are optimal for sharing.

Here we'll examine the various components and steps that comprise this process. Development begins within a local environment that includes Galaxy and a Tool Shed, where a hierarchy of related repositories can be built. The Tool Shed allows the developer to export the related repositories into a capsule that can be imported into another Tool Shed. This mechanism streamlines the process of deploying utilities from a development environment to the test and main public Galaxy Tool Sheds where an automated install and test framework certifies the repositories for sharing. When installed together into Galaxy after certification, the related repositories provide complementary Galaxy utilities that function together.


Integrating the NCBI BLAST+ suite into Galaxy

Peter Cock

Peter Cock1, John Chilton2, Björn Grüning3, Jim Johnson4, Nicola Soranzo5

  • 1 The James Hutton Institute, Scotland, United Kingdom
    2 Department of Biochemistry and Molecular Biology, Penn State University, United States
    3 Pharmaceutical Bioinformatics, Institute of Pharmaceutical Sciences, Albert-Ludwigs-University, Freiburg, Germany
    4 Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, United States
    5 Bioinformatics Research Program, CRS4, Pula, Italy

Slides, SlideShare, Video

NCBI BLAST is one of the best-known computational tools in modern biology, and a common addition to Galaxy instances. This talk covers the history of the Galaxy wrappers for the NCBI BLAST+ command line tool suite, example use cases and workflows, and in particular our development process as a potential best-practice model for Galaxy tool development, both technically, by showcasing Galaxy functionality, and in terms of community building.

Initially included within the main Galaxy distribution, the BLAST+ wrappers are now run as a separate open source project using a dedicated repository on GitHub, combined with open discussion on the public Galaxy development mailing list.

The BLAST+ wrappers have grown to take advantage of most features offered by Galaxy and the ToolShed, including ToolShed dependencies, custom datatypes (including composite types for BLAST databases), configuration files for local databases, Galaxy tool XML macros to avoid duplication, and functional testing.

Automated testing is an important part of the development model and release process used. Integration with TravisCI provides continuous integration testing where any update to the code is automatically tested on a Virtual Machine. This is reinforced by a policy of staging updates to the Galaxy Test ToolShed for an additional round of automated testing, prior to release on the main Galaxy ToolShed.

Finally, an overview of how BLAST is set up on the Galaxy instances we maintain will cover issues like job parallelization, thread and memory considerations, updating NCBI BLAST databases, and caching BLAST databases on cluster nodes.


deepTools: a flexible platform for exploring deep-sequencing data

Björn Grüning

Fidel Ramírez1, Friederike Dündar1,2, Sarah Diehl1, Björn A. Grüning3, and Thomas Manke1

Slides, Video

We present a Galaxy-based web server for processing and visualizing deeply sequenced data. The web server's core functionality consists of a suite of newly developed tools, called deepTools, that enable users with little bioinformatics background to explore the results of their sequencing experiments in a standardized setting. Users can upload preprocessed files with continuous data in standard formats and generate heatmaps and summary plots in a straightforward, yet highly customizable manner. In addition, we offer several tools for the analysis of files containing aligned reads and enable efficient and reproducible generation of normalized coverage files. As a modular and open-source platform, deepTools can easily be expanded and customized to future demands and developments. The deepTools web server is freely available at http://deeptools.ie-freiburg.mpg.de and is accompanied by extensive documentation and tutorials aimed at conveying the principles of deep-sequencing data analysis. The web server can be used without registration. deepTools is also available from the Galaxy toolshed, which allows an easy automated installation to any Galaxy instance.


Session 5, Wednesday, July 2, 9:10-10:25

The GCC2014 Hackathon

GCC2014 Hackathon Participants

Dannon Baker1, Brad Chapman2, John Chilton3, Kyle Ellrott4, and GCC2014 Hackathon Participants

Slides, Video

This year, for the three days before GCC, we are hosting a Galaxy Hackathon. Hackathons are events at which a group of developers with different backgrounds and skills collaborate hands-on and face-to-face to try to solve problems affecting a particular community, in this case the Galaxy community. Gathering a diverse set of people in a single room where they can focus on code, free of all the distractions that are inevitable back at the office, has proven to be a great mechanism not only for getting interesting things done in a short amount of time, but also for community building. The hackathon goals include growing the Galaxy developer community and connecting existing developers who are interested in similar problems, giving them an in-person opportunity to code together and plan for future post-hackathon collaborations.

In this talk, we’ll very briefly describe our Galaxy Hackathon goals and provide a general overview of progress made at the event. Since hackathons are by definition community driven, most of the talk will showcase the efforts of and be presented by the self-organizing groups that form during the event.


More Options, Less Time: Streamlining Access to Reference Datasets

Dan Blankenberg

Daniel Blankenberg1 and the Galaxy Team2

Slides, Video

Recent enhancements to the Galaxy framework have introduced a new class of Galaxy Utilities, known as Data Managers (doi:10.1093/bioinformatics/btu119). Data Manager tools allow the Galaxy administrator to download, create and install additional datasets for any type of built-in datasets using a web-based GUI in real time.

Despite these advances, populating a Galaxy instance with a set of built-in datasets can be quite time-consuming, especially in cases where data not only needs to be downloaded but also requires additional computation, such as building indexes. While this works quite well, it is wasteful to have each Galaxy installation build these datasets, especially for common resources and genomes. It can take a considerable amount of time to populate a new Galaxy instance with the needed datasets. Although the Galaxy Project provides a public rsync server with all of the built-in datasets that are used on the Main public site, utilizing this resource can be difficult and unwieldy, as there is a large amount of data and it lacks an accessible interface. While the individual location files are made available, they cannot be used as-is by an end user unless the user has the exact same directory structure on the machine that hosts their Galaxy instance.

Here, we describe a new set of resources that aim to rectify this situation. These resources streamline the configuration of built-in datasets for new and existing Galaxy instances and alleviate the technical barriers preventing many users from taking advantage of prebuilt reference datasets.


Building More Powerful Galaxy Workflows with Dataset Collections

John Chilton

John Chilton1 and the Galaxy Team

Slides, Video

Galaxy features the ability to extract sample analysis histories into reusable workflows, as well as the ability to construct such workflows from scratch or by modifying existing workflows. While these have been salient features of Galaxy for some time, the kinds of workflows that could be expressed by Galaxy have had critical limitations. Perhaps the most glaring of these is that Galaxy workflows have required a fixed number of inputs. Many relatively basic biomedical analyses require running a variable number of inputs across identical processing steps (“mapping”) and then combining or collecting these results into a merged output (“reducing”). This talk will present dataset collections, an extension to Galaxy that allows for the expression of these mapping and reducing workflows.
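
The “mapping” and “reducing” pattern described in this paragraph can be sketched in plain Python. This is a conceptual illustration only and does not use or represent the Galaxy API. The helper names `map_collection` and `reduce_collection`, and the toy sample data, are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def map_collection(tool: Callable[[A], B], collection: Dict[str, A]) -> Dict[str, B]:
    """Run the same step on every element, keeping the element identifiers."""
    return {name: tool(dataset) for name, dataset in collection.items()}


def reduce_collection(tool: Callable[[B, B], B], collection: Dict[str, B]) -> B:
    """Combine all elements of a non-empty collection into one merged output."""
    return reduce(tool, collection.values())


# A collection can hold any number of inputs; here, toy "reads" per sample.
samples = {"sample_a": ["ACGT", "AC"], "sample_b": ["GG"], "sample_c": []}

read_counts = map_collection(len, samples)  # one result per input
total_reads = reduce_collection(lambda x, y: x + y, read_counts)  # one merged result
```

The same two helpers work unchanged whether `samples` holds three entries or three hundred, which is the point of removing the fixed-number-of-inputs limitation.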

In particular, the concepts behind dataset collections will be covered, including a brief discussion of implementation details such as data model modifications and API methods. The talk will demonstrate how to “map” existing Galaxy tools across dataset collections to produce new collections and how to “reduce” these collections using other tools. Likewise, modifications to the workflow extraction and editing interfaces to accommodate these new operations will be demonstrated.

Dataset collections are a powerful new feature that greatly enhances the expressivity of Galaxy workflows, but a lot of work remains to be done. The talk will conclude with a potential roadmap and timeline for dataset collection related development, including building UI components for digging into collections, building new collections, visualizing across collections, and tool enhancements allowing tools to create collections.


An Appliance for Life Science Research: Isilon, Penguin and Galaxy

Patrick Combes

Patrick Combes1

Slides, Video

Isilon and Penguin Computing have paired to create a mid-size appliance for Galaxy by leveraging their respective strengths in storage and compute. This session will detail the architecture and projected use cases for the appliance.


Session 6, Wednesday, July 2, 10:55-12:15

Lab Specimen Tracking with Galaxy

Martin Čech

Martin Čech1, Pavel Švéda1, Ondřej Fabián1 and the Galaxy Team

Slides, Video

No experiment begins with sequencing. Instead it commences with a collection of samples followed by DNA isolation (generation of cDNA, immunoprecipitation etc.), preparation of sequencing libraries, sequencing itself, and, finally, data analysis. In other words, during an NGS experiment a biological specimen undergoes transformation into a dataset to be analyzed. When an experiment involves a handful of samples, tracking the specimen-to-dataset metamorphosis is straightforward. However, low cost of sequencing enables individual single-PI laboratories to perform studies involving hundreds and even thousands of samples. At this scale tracking information about individual samples becomes challenging. Yet such tracking is essential for troubleshooting and ensuring a successful study. We have developed an open-source sample tracking system based on mobile devices carried by everyone in their pockets. The mobile application is able to communicate with a variety of sequencing instruments and trigger automated data analyses through the Galaxy system (http://usegalaxy.org).


The Munich NGS-FabLab for medical sequence data

Sebastian Schaaf

Sebastian Schaaf1,2, Aarif Mohamed Nazeer Batcha2, Sandra Fischer2, Guokun Zhang2, Ulrich Mansmann1,2

Slides, Video

Using NGS data in a clinical context comes along with a whole range of challenges, constraints and requirements, affecting all levels of an IT infrastructure dealing with that type of data – and related biomedical metadata. Especially in Germany, the restrictive data security laws play a key role. In 2010, the Munich regional area successfully applied for a grant ('Leading-Edge Cluster Competition') dedicated to ‘personalized medicine’, supporting infrastructures for improving cross-connections between the medical faculties of both universities and associated institutions, their hospitals, independent research institutes (Helmholtz Centre, Max Planck Institutes) and industrial partners.

Aiming for a structured, biomedical metadata-driven organization of clinical NGS data, an interconnected, user-friendly, modular, broad-ranged and self-hosted open source analysis platform turned out to be crucial. Or in a nutshell: a Galaxy instance.

This talk is about the experiences of nearly three years of getting from a blank slate to a conceptual Galaxy-driven NGS infrastructure, dedicated to scientists and clinicians, from basic research up to experimental molecular diagnostics within a university medical center’s environment. Topics will include experiences with core IT, faculty politics, project cooperations, software establishment etc. as well as derived Dos and Don’ts. Furthermore, some small software improvements will be presented, hopefully contributing back to the community. In addition, we would like to draw connections to topics presented, discussed and improved since the last two GCCs in Chicago and Oslo, some of which may have been forgotten. Over time, we had the impression that we faced several of them ourselves, and we are pretty glad not to be in a minority of one.


Galaxydx - A Web-server dedicated to diagnosis data analysis

Vivien Deshaies, Alban Lermine

Vivien DESHAIES1,2,3, Alban LERMINE1,2,3, Séverine LAIR1,2,3 , Nicolas SERVANT1,2,3, Elodie GIRARD1,2,3, Julien TARABEUX4,5, Philippe HUPE1,2,3, Claude HOUDAYER4,5, Emmanuel BARILLOT1,2,3

Slides, Video

Early cancer diagnosis is a challenge, and meeting it can dramatically improve the efficiency of cancer treatment. High-throughput sequencing technology is the most promising route to this goal, but analysing its output is not straightforward and, most of the time, requires running software that is only available through a command-line interface.

Galaxy is a web platform that aims to: (1) make command-line software accessible in an easy-to-use web interface, (2) construct personal workflows, (3) make analyses reproducible over time, (4) share know-how (workflow sharing) as well as data and annotations.

We built Galaxydx, an implementation of Galaxy containing a suite of software tools used for the analysis of diagnosis sequencing data (PGM torrent suite, BWA, GATK, VarScan, Annovar, etc.). Galaxydx allows clinicians as well as biologists to independently perform a complete set of analyses such as: (1) mapping, (2) variant calling, (3) variant filtering, (4) variant annotation, (5) rearrangement calling and (6) visualization through a diagnosis-dedicated genome browser (Alamut).

We also work on data integrity and confidentiality by modifying the way Galaxy writes files. Analyses in Galaxydx are organized by project and user, and output files are owned by the user who generates them. This allows us to systematically check system rights on data before any process (Can the current user read the input data? Can the current user write in this project?).
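
The two checks named in this paragraph (read access to the input, write access to the project) can be sketched as follows. This is an in-memory illustration under stated assumptions, not the Galaxydx implementation, which the abstract says relies on system rights. The `Dataset` and `Project` classes, their fields, the output path and `run_analysis` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Dataset:
    path: str
    owner: str
    readers: Set[str] = field(default_factory=set)  # users granted read access


@dataclass
class Project:
    name: str
    writers: Set[str] = field(default_factory=set)  # users allowed to write output


def can_read(user: str, dataset: Dataset) -> bool:
    """Can the current user read the input data?"""
    return user == dataset.owner or user in dataset.readers


def can_write(user: str, project: Project) -> bool:
    """Can the current user write in this project?"""
    return user in project.writers


def run_analysis(user: str, dataset: Dataset, project: Project) -> Dataset:
    """Refuse to start unless both checks pass; the output belongs to the user."""
    if not can_read(user, dataset):
        raise PermissionError(f"{user} cannot read {dataset.path}")
    if not can_write(user, project):
        raise PermissionError(f"{user} cannot write to project {project.name}")
    # Placeholder for the real processing step; only ownership is modelled here.
    return Dataset(path=f"{project.name}/{user}/output.vcf", owner=user)
```

Performing both checks before any processing starts means a job fails immediately with a clear reason instead of partway through a long analysis.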


Using Galaxy and Globus to deliver Science as a Service

Ravi Madduri

Ravi K Madduri1,2, Paul Dave2, Alex Rodriguez2, Vassily Trubetskoy3, Dinanath Sulakhe2, Lea Davis3, Nancy Cox3 and Ian Foster1,2

Slides, Video

At the Computation Institute, we originally posited the notion of science as a service in 2005 as a means of publishing and accessing scientific data and applications through well-defined and internet accessible services. Our vision of science as a service worked well in a world when computing resources were scarce; when we needed to federate heterogeneous resources and make them accessible to researchers; when different tools and data were provided using different interfaces and representations; and when research problems involved datasets that could be hosted and analyzed on a single computer. In this talk we re-examine our vision of science as a service in a world in which computing resources are now commoditized; a world in which researchers are increasingly facing 'big data' challenges; a world in which Cloud providers, such as Amazon Web Services, have become viable alternatives to purchasing dedicated infrastructure; and a world in which building reliable infrastructure for solving scientific problems is only an API call away.

We will present our efforts on using Galaxy and Globus to create cloud-based services for scientific domains such as Genomics, Climate modeling, Cosmology, ECG Analysis and Material Sciences. We will present lessons learned and the extensions we created to enable these communities' adoption of Galaxy as an analysis engine. We will also present a recent genomics use case, enabled by the Galaxy-based Globus Genomics, of creating and running the Consensus Genotyper for exome sequencing pipeline on a large-scale Tourette's Syndrome data set. (Joint work with Dr. Nancy Cox's group at UChicago.)


SGI UV: Harnessing the Big Brain Platform for Galaxy

James Reaney

James Reaney1

  • 1 Senior Director, Research Markets, SGI

Slides, Video

SGI UV scales to truly extraordinary levels – today up to 2,560 physical cores and 64TB of cache-coherent, globally shared memory in a single system. UV is also a developer’s dream playground: standard Intel x86 architecture, standard Linux distros, support for large numbers of Nvidia GPU and Xeon® PHI®, and all those cores and memory at your disposal in a single OS. Run standard ISV applications or any open-source code just like any Linux instance, no recompiling necessary. The versatility, high performance, and extreme scale of UV make it the ultimate "analysis supernode", but what if we used UV as an enabling platform for Galaxy workflows? How much more extensible might the tools become? What new scales might Galaxy workflows reach? What larger-scale research might be simply enabled in the first place by having a more effective computational architecture underlying the Galaxy workflow?


Session 7, Wednesday, July 2, 1:15-2:35

Building a virtual research environment with Galaxy

Olivier Inizan, Mikael Loaec

Olivier Inizan1, Mikael Loaec1, Eric Rasche2, Hadi Quesneville1

Slides, Video

The democratization of virtualization techniques provides a new opportunity to improve bioinformatics analysis. Storing, sharing and reusing tools dedicated to an analysis is the goal of the Galaxy toolshed project. With virtualization techniques, it is now possible to extend this strategy to all the components required to perform a bioinformatics analysis, such as the operating system, the software, the datasets, the dependencies and the user data.

Integrating these components in a virtual machine provides a virtual research environment (VRE) that can be duplicated and shared. With the growing availability of infrastructures supporting virtualization (such as cloud computing infrastructures), VREs offer a new opportunity to improve the accessibility and reproducibility of bioinformatics analyses.

Accessibility and reproducibility are the building blocks of the Galaxy project and the Galaxy platform could play a significant role in such environments. However, to become accessible and shareable, creating and updating a VRE should be automated as much as possible, from the virtual machine provisioning to tools deployment and tests.

Here we describe our progress towards an automated process for the deployment of a Galaxy instance. The current work is focused on virtual machine provisioning with Cobbler and automatic configuration with Puppet. The opportunities that such an approach provides to developers and biologists will be discussed and illustrated on the future French infrastructures dedicated to cloud computing: the IFB and INRA academic clouds.


The Australian Genomics Virtual Laboratory

Andrew Lonie

Andrew Lonie1, Enis Afgan2,3, Ron Horst4, Simon Gladman5, Clare Sloggett1, Nuwan Goonasekera1, Igor Manukin4, Yousef Kowsar4

Slides, Video

The Australian Genomics Virtual Laboratory (GVL) is a national program aiming to provide the research community with an accessible, scalable genomics analysis platform on national compute infrastructure. The GVL leverages a significant investment in cloud infrastructure by the Australian government, together with existing cloud management tools, to enable researchers to create on-demand genomics analysis environments based on the open source Galaxy workflow platform, linked through high speed networks to very large, reliable data storage and to local instances of visualization engines like the UCSC browser.

This talk will discuss the technical and practical lessons learned during the development of the Genomics Virtual Lab, including considerations in defining and implementing a one-size-fits-all pre-configured Galaxy image, the constraints a cloud environment places on practical 'real data' genomics, identification of and interaction with the user base, and deliberations on the future of the Genomics Virtual Laboratory including architecting for the entire genomics analysis life cycle on the cloud.


Galaxy on the GenomeCloud : Yet another on-demand Galaxy cloud, but only powered by Apache CloudStack

Youngki Kim

Youngki Kim1, CB Hong1, Kjoong Kim1, Daechul Choi1

Slides, Video

Bioinformatics and genome data analysis in South Korea is at an early stage but is getting busier. To keep pace with this research trend, GenomeCloud was created at the end of 2012. GenomeCloud is an integrated platform for analysing, interpreting and storing genome data, based on KT's cloud computing infrastructure, which uses Apache CloudStack software. GenomeCloud consists of g-Analysis (automated genome analysis pipelines at your fingertips), g-Cluster (easy-to-use and cost-effective genome research infrastructure) and g-Storage (a simple way to store and share genome-specific data).

Because of its flexible tool integration architecture and seamless workflow creation functionality, Galaxy was selected to serve multiple goals, such as agile pipeline development and bioinformatics education support. To provide an on-demand, Apache CloudStack-based Galaxy cluster, we have automated virtual machine creation, clustering and the setup of various software, including Galaxy.

Furthermore, seamless integration with GenomeCloud helps researchers not only create and manage Galaxy through a convenient web interface but also fully utilize genome data in g-Storage. g-Storage is powered by OpenStack Swift and a specially designed genome file transfer protocol.

Galaxy on the GenomeCloud uses Grid Engine as a cloud HPC solution, Ganglia as a distributed monitoring system and LVM over NFS as large-volume shared storage, all of which are set up automatically upon request. This talk will cover our experiences integrating Galaxy with GenomeCloud and use cases of Galaxy such as a scalable bioinformatics education system and request fulfillment for RNA-seq analysis.


Test-driven Evaluation of Galaxy Scalability on the Cloud

Nuwan Goonasekera

Enis Afgan1,2, Derek Benson3, and Nuwan Goonasekera1

  • 1 VLSCI, University of Melbourne, Melbourne, Australia
    2 CIR, RBI, Zagreb, Croatia
    3 Research Computing Centre, University of Queensland

Slides, Video

To verify the essential functions of a Galaxy instance are being provided correctly to the end-user, functional testing of typical Galaxy tasks is important. In addition, for groups which intend to deploy their own Galaxy instances (on the cloud or otherwise), knowing the scalability characteristics of the instance with respect to the number of users, machine size, storage solution and cloud provider, is also important. By combining both functional and performance testing into one common testing infrastructure, we assessed both of these aspects with the same underlying test code.

With respect to the first aspect of assessing whether the basic functions of Galaxy are working correctly from an end-user perspective, functional testing was performed via the browser automation tool Selenium, which can mimic the exact actions of an end-user interacting with the application. We then extended these tests to use the Selenium Grid, which converted the functional test into a performance test by running the tests in parallel, thus simulating multiple concurrent users.

This presentation will describe how these two aspects were used to determine the scalability characteristics of Galaxy on the cloud. The presentation will discuss the following:

  • Describe how the same infrastructure is reused for testing the functional and scalability characteristics of Galaxy, using CloudMan;
  • Analyse how a number of variables, such as the number of users, machine size and storage option, affect scalability;
  • Provide insights into how Galaxy scales on the cloud, and what factors to consider when deploying on your own infrastructure;
  • Provide a reusable suite of tests for functionally verifying and benchmarking private GVL/Galaxy instances.

The data and results collected to reach the above conclusions will be made publicly available and can act as reference data points for others reusing the presented system on their own Galaxy instances.
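The idea of reusing one functional test as a performance test by running it in parallel can be sketched as follows. This is only an illustration under stated assumptions, not the authors' test suite: `functional_test` is a hypothetical stand-in for a browser-driven scenario (the abstract uses Selenium and Selenium Grid), and a local thread pool takes the place of the grid. The names `run_as_load_test` and `summarise` are likewise invented for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def functional_test(user_id):
    """Stand-in for one end-user scenario (hypothetical).

    A real test would drive a browser session and assert on what the
    page shows; here a short sleep takes the place of that interaction.
    """
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for the browser interaction
    passed = True     # placeholder for the functional assertion
    return {
        "user": user_id,
        "passed": passed,
        "seconds": time.perf_counter() - start,
    }


def run_as_load_test(test, n_users):
    """Run the same functional test once per simulated concurrent user."""
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        # map() returns results in the order of the inputs.
        return list(pool.map(test, range(n_users)))


def summarise(results):
    """Reduce per-user results to a pass count and the slowest run."""
    return {
        "users": len(results),
        "passed": sum(r["passed"] for r in results),
        "slowest_seconds": max(r["seconds"] for r in results),
    }


if __name__ == "__main__":
    # Increase the number of simulated users and compare the summaries.
    for n_users in (1, 4, 16):
        print(summarise(run_as_load_test(functional_test, n_users)))
```

Keeping the scenario in a single function means that the functional check and the timing come from the same code path, which is the property the abstract relies on.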


Bioinformatics on AWS: New and Noteworthy Features

Angel Pizarro

Angel Pizarro1

In this talk, we will cover recent service and feature releases from Amazon Web Services, and how they apply to bioinformatics and scientific computing.


Poster Abstracts

There will be two poster sessions:

First Poster Session: Tuesday, July 1, 2:30-4:00

  • Odd numbered posters will be presented during poster session 1.

Second Poster Session: Wednesday, July 2, 2:30-4:00

  • Even numbered posters will be presented during poster session 2.

Poster dimensions are a maximum of 48" x 48" (122cm x 122cm).

Contents

  1. P1: Lifeportal - web portal to high performance computing resources at University of Oslo
  2. P2: Building a scalable Galaxy cluster for biomedical research in The Netherlands
  3. P3: Practical experiences from the Munich NGS-FabLab - Tools, compatibility and pitfalls off the standard tracks
  4. P4: e-Science in France, a Life science Western story
  5. P5: drylab.nl.enabling.translational.research
  6. P6: Mississippi: a galaxy server centered on small RNA analysis
  7. P7: Bacterial and viral NGS analysis in a public health agency using Galaxy
  8. P8: iReport: HTML Reporting in Galaxy
  9. P9: workflow4metabolomics.org : Galaxy and the metabolomics analysis Universe
  10. P10: The Munich NGS-FabLab - A glimpse on an IT infrastructure for medical sequence data
  11. P11: Oqtans: Online quantitative transcriptome analysis
  12. P12: Locally managed Galaxy instances with access to external resources in a supercomputing environment
  13. P13: Argument Parsing Libraries for Automatic Galaxy XML Generation
  14. P14: Advantages and Challenges of Using the Galaxy API within an Integrated Data Analysis and Visualization Platform
  15. P15: Resistance to Toxic Compounds in Metagenomic Fosmid Library from Mangrove Sediment in São Paulo State, Brazil
  16. P16: BlockClust: efficient clustering and classification of non-coding RNAs from short read profiles
  17. P17: A Galaxy-Based framework for online streaming data analytics in Heart Rate Variability Analysis
  18. P18: Implementing qDNAseq in Galaxy: a whole genome sequencing copy number analysis tool
  19. P19: Integrating Integrated Genome Browser with Galaxy
  20. P20: An approach for detecting structural variations from NGS paired end reads using Split Reads, Discordant Read Pairs and Local Alignment
  21. P21: Synapse: Software infrastructure for collaborative reproducible research
  22. P22: Integration of Galaxy with IRIDA, a Genomic Epidemiology Platform
  23. P23: Galaxy on the GenomeCloud : Yet another on-demand Galaxy cloud, but only powered by Apache CloudStack
  24. P24: GenomeSpace: An Environment for Frictionless Bioinformatics
  25. P25: Less talking, more doing: crowd-sourcing the integration of Galaxy with a high-performance computing cluster
  26. P26: Galaxy Training Network
  27. P27: Integrating new visualization tool in Galaxy
  28. P28: Integrating GALAXY workflows in a metadata management environment
  29. P29: Genocloud : the GenOuest private cloud for Galaxy
  30. P30: Integrating Galaxy with UCSC Cancer Genomics



P1: Lifeportal - web portal to high performance computing resources at University of Oslo

Nikolay Vazov1, Katerina Michalickova1

Poster

One of the main goals of the HPC (High Performance Computing) services at the University of Oslo, Norway, is to make complex HPC resources accessible to a wide audience with varied degrees of experience. The Lifeportal (lifeportal.uio.no) is currently geared towards biomedical research, with a special emphasis on next generation sequencing data processing, while a text mining instance is being finalized.

In addition to the existing Galaxy core facilities, the Lifeportal has a set of newly developed features that are essential for the Galaxy - HPC functionality. Our poster will discuss:

  • user authentication and authorization with the integration of the National Academic IDP based on SAML technology
  • integrated user/project management module for project applications, authorization and management
  • project accounting module based on an external resource allocation manager (GOLD)
  • module for project reporting and providing feedback to the funding agency
  • big file upload based on Filesender technology, allowing files of up to 250 GB to be uploaded into Galaxy
  • details of cluster deployment via SLURM DRMAA including:
    • Galaxy code modification allowing for user-selected cluster job parameters such as queue, time, memory, number of nodes and cores
    • export of Galaxy libraries for deployment of the core Galaxy tools on the cluster
    • general changes to tool wrappers needed for cluster implementation


P2: Building a scalable Galaxy cluster for biomedical research in The Netherlands

David van Enckevort1, Anthony Potappel2, Niek Bosch3, Jeroen Beliën4, Rita Azevedo5, Rob Hooft5, Sander Ruiter2, Sanne Abeln6, Irene Nooren3, Jan-Willem Boiten7

  • 1 University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
    2 Vancis, Amsterdam
    3 SURFsara, Amsterdam, The Netherlands
    4 VU university medical center, Amsterdam, The Netherlands
    5 Netherlands eScience Center, Amsterdam, The Netherlands
    6 VU university, Amsterdam, The Netherlands
    7 Center for Translational Molecular Medicine, Eindhoven, The Netherlands

Poster

Introduction

For the national translational IT project CTMM/TraIT, Galaxy has been selected as one of the tools in the experimental domain. The TraIT partners (among others NBIC and SURFsara) have developed a vision of how to make Galaxy available to the research community in The Netherlands. The scalable Galaxy cluster on the SURFsara HPC Cloud will be transferred to Vancis to provide a sustainable production-level Galaxy cluster. In designing this environment, Vancis has made use of the knowledge and experience NBIC and SURFsara gained from hosting the public NBIC instance on the SURFsara HPC Cloud.

Material & Methods

To assess the minimal requirements for the infrastructure we used metrics collected while running the NBIC Galaxy on the HPC Cloud. Next we drafted a set of use cases the infrastructure should be able to fulfil, such as the ability to run Omics-pipelines and the ability to scale to handle peak demand. We identified I/O performance as a major bottleneck, since many Galaxy tools are I/O intensive, while Galaxy has a shared data design. Memory was also recognized as a critical factor, since typical datasets are on the order of tens of gigabytes. We also built upon the experiences of SURFsara in operating the HPC Cloud and other HPC systems. To accommodate a full set of development, testing, acceptance and production environments, as well as private installations, the infrastructure should support multiple Galaxy clusters. The chosen architecture will use a Linux High Availability environment with OpenStack, which will run on two large-size blades. Storage is split into multiple tiers with different characteristics to support both high I/O workloads and reliable large storage. The chosen setup is horizontally scalable in a cost-efficient manner.

Results

From May to September 2014 we will pilot the new architecture within the TraIT project. For this pilot we have selected a few TraIT NGS tools and pipelines to stress test the system under different workload scenarios. Furthermore we have established a process to ensure the quality of the tools required for a stable production environment.


P3: Practical experiences from the Munich NGS-FabLab - Tools, compatibility and pitfalls off the standard tracks

Aarif Mohamed Nazeer Batcha1, Sebastian Schaaf1,2, Guokun Zhang1, Sandra Fischer1, Ashok Varadharajan1, Ulrich Mansmann1,2

Poster

Over three years, the Munich NGS-FabLab was built up, first as a concept and later as a running IT system, based on an assessment of requirements, constraints and given structural conditions. It has been in active use for some months, although it is still under intense development.

As every developer knows, complex and broad open source software like Galaxy does not come error-free. Expected issues were due to non-standard elements like the operating system (SLES 11), hardware (x86 server not supported by the clinic's IT, FPGA hybrid-computer, network load, …), computational requests (projects with special needs or proprietary software) and, not to forget, financing and politics. Apart from that, Galaxy itself and the associated software packages and/or their respective wrappers surprisingly often turned out to be in need of corrections, although we assumed we were using standard input data and performing simple jobs. Finally, for those tools or computations which were needed but are not yet supported by the Galaxy framework, most of the work invested deals with troubleshooting, bug-hunting and code analysis.

Experiences, fixes, improvements and new integrations are the subject of this poster, which may appear more like a collage of loosely connected sub-topics. While we have not yet returned those code snippets to the community, we hope to get into the process of submitting content for public use and discussing it, in order to improve the framework as a whole.


P4: e-Science in France, a Life science Western story

Yvan LE BRAS1, Aurélien ROULT1, Cyril MONJEAUD1, Mathieu BAHIN2, Olivier QUENEZ3,4, Claudia HERIVEAU1, Olivier SALLOU1, Anthony BRETAUDEAU1,5 and Olivier COLLIN1

  • 1 GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes1, Campus de Beaulieu, 35042, Rennes Cedex, France
    2 IGDR, UMR 6290-CNRS Université de Rennes1, 2 avenue Professeur Léon Bernard, Campus de Villejean, 35065, Rennes Cedex, France
    3 Inserm U1079, Institut de Recherche et d'Innovation Biomédicale (IRIB), Université de Rouen, France
    4 Centre National de Référence pour les Malades Alzheimer Jeune, CHU de Rouen, Lille et Paris-Salpêtrière, Rouen, France
    5 INRA IGEPP, UMR1349 Agrocampus-Ouest INRA Université Rennes1, domaine de la motte, 35653, Le Rheu, Cedex 35327, France

Poster

Research processes are evolving at a rapid pace. This evolution, mainly due to technological advances, offers powerful equipment and generalizes the digital aspect of research data. While facing the current data deluge represents a challenge, it also offers an opportunity to change and enhance the way we tackle research tasks and disseminate science. In Life Sciences, as in other domains, we are noting a sharp increase in storage and computing needs. Regularly adding hardware resources to the bioinformatics core facilities is no longer sustainable. Scientific data management and analysis have to be enhanced in order to offer services and developments matching the new uses.

For the past two years, the Galaxy platform has been used in combination with ISAtools and HUBzero to build a Life Sciences Virtual Research Environment. Each tool offers complementary functionalities: the ISAtools software suite for metadata management, HUBzero for scientific collaboration and Galaxy for computation. The resulting combination allows scientists to manage their project from collaboration to data management and analysis. This Virtual Research Environment (VRE) is being tested in partnership with the scientific communities in Western France. The evaluation will give us insights into the usage and acceptance of new tools in a scientific field characterized by profound modification of its traditional processes.

Although the deployment of this kind of environment is challenging, it represents an opportunity to pave the way towards better research processes through enhanced collaboration, data management, analysis practices and resource optimization.


P5: drylab.nl.enabling.translational.research

Christian Rausch1, Daoud Sie1, Jeroen Galle2, Jeroen Crappe2, Gerben Menschaert2, Bauke Ylstra1, Wim Van Criekinge1,2

  • 1 Cancer Center Amsterdam, VU University Medical Center, Amsterdam, The Netherlands
    2 Biobix, Lab of Bioinformatics and Computational Genomics, Ghent University, Ghent, Belgium

Poster

The Cancer Center Amsterdam (CCA) of VU University Medical Center is a research center that performs internationally recognized research in the area of oncogenetics, immunopathogenesis, disease profiling, innovative therapy and quality of life.

We are currently establishing a Drylab that empowers both researchers and clinicians with state-of-the-art bioinformatics solutions. The Drylab is expected to contribute scientifically, which we want to make possible by building a team with diverse interdisciplinary backgrounds: biology, statistics, experimental design, bioinformatics, etc.

Establishing an organizational context with continued funding is an ongoing challenge. First, we have built a scalable infrastructure. We established Drylab.nl as a custom WordPress instance, expanded with a helpdesk and ticketing system and linked to a Galaxy-based workflow system using a tool shed to (re)use and share internal and external workflows. In external collaborations (e.g. with Biobix in Ghent, Belgium) we are building and exchanging pipelines and workflows for RNAseq, proteogenomics (riboSeq) and methylome analysis (methylcapSeq). We are also implementing a workflow validation procedure using test data. In order to close the loop to the end user we are planning to visualize genomic data on different platforms.

Our initial measure for success will be the actual consolidation and integration of bioinformatics efforts in addition to (re)use of these workflows by non-experts. We do recognize that in order to mature we have to avoid getting caught in a "firefighting mode". Given the shared vision amongst all stakeholders and the embedded organizational context we hope to mature and become an innovation engine within translational medicine.


P6: Mississippi: a galaxy server centered on small RNA analysis

Marius van den Beek1, Christophe Antoniewski1

Poster

Non-coding small RNAs (miRNA, siRNA, piRNA, …) are involved in the regulation of genes and transposable elements as well as in the defense against viral infections. Their discovery and their functional characterization rely heavily on high throughput RNA sequencing. The ~20-30 nt length of small RNAs raises specific challenges for meaningful read mapping and analysis, so that standard RNAseq analysis methods have to be adapted. We provide an integrated set of Galaxy tools that should streamline the most frequent small RNA analysis needs. This includes a modified bowtie wrapper and workflows that allow users to quickly and reproducibly interrogate various aspects of small RNA biology. We provide tools for the discovery and differential expression analysis of miRNAs and a way to visualize miRNA precursors genome-wide that complements Trackster. Furthermore, we provide tools to detect the “ping-pong” biogenesis signature of piRNAs, to detect piRNA-producing loci in the genome and to study and visualize the impact of piRNAs and siRNAs on transposable elements.


P7: Bacterial and viral NGS analysis in a public health agency using Galaxy

Ulf Schaefer1, Anthony Underwood1, and Jonathan Green1

Poster

Public Health England is home to the United Kingdom's national microbiology reference laboratories and deals with the surveillance and control of infectious disease. Assays for the investigation of selected pathogenic bacteria and viruses are being migrated from traditional wet lab based methodologies such as Multiple Loci VNTR Analysis to methods based on Next Generation Sequencing (NGS) data. Apart from the setup of an NGS service and automated analysis of a small number of priority organisms, one of the key challenges in the management of this paradigm shift in public health is to enable microbiologists and epidemiologists with little to no bioinformatics knowledge and training to interact with and derive scientific value from NGS data. We maintain a local installation of Galaxy in an attempt to address this challenge. This local installation houses all specialised software required for public health microbiology and phylogenetics. Furthermore, it provides bespoke workflows for standard analyses regularly employed in outbreak investigations, such as the creation of a SNP tree from multiple viral or bacterial NGS samples. In addition to an overview of our hardware and software setup, this presentation will highlight 1) an example of a public health specific workflow that can be used for routine reference microbiology services and 2) some of the soft issues around employing Galaxy in this context, such as user acceptance, training, and support.


P8: iReport: HTML Reporting in Galaxy

Saskia Hiltemann1, Youri Hoogstrate1, Hailiang Mei2, Guido Jenster1, Andrew Stubbs1

  • 1 ErasmusMC, Rotterdam, The Netherlands
    2 LUMC, Leiden, The Netherlands

Poster

Galaxy offers a number of great visualisation tools (Trackster, Circster), but currently lacks the ability to easily summarise the various outputs of a workflow into a single view. iReport is a Galaxy tool for the easy creation of HTML reports from Galaxy outputs. Rather than having a static HTML output, iReport uses JavaScript and jQuery to allow for interactivity in the form of searching and sorting of tables, automatic zooming of image data, tabbed view for organisation of outputs, etc. Users define the number and names of tabs for their report, and can add different types of content items to these tabs (e.g. text, tabular data, image data, PDF files, links to datasets, and more).

We have previously implemented Galaxy-based data processing pipelines for next-generation sequencing (NGS) and for array based allelic copy number determination named CGtag (Hiltemann et al. 2014) and developed a web based fusion gene visualizer, iFUSE (Hiltemann 2013). We used the iReport tool to make iFUSE2, the next-step extension to support fusion gene determination within Galaxy, which runs as the last step of our workflow and combines the outputs of various Galaxy tools into a single view.

iReport is available from the DTL toolshed (toolshed.dtls.nl) and the main Galaxy toolshed.


P9: workflow4metabolomics.org : Galaxy and the metabolomics analysis Universe

Misharl MONSOOR1, Gildas LE CORGUILLE1, Marion LANDI2, Mélanie PETERA2, Pierre PERICARD1, Christophe DUPERIER2, Marie TREMBLAY-FRANCO3, Jean-François MARTIN3, Sophie GOULITQUER1, Etienne THEVENOT4, Franck GIACOMONI2, Christophe CARON1

  • 1 ABiMS, FR2424 CNRS-UPMC, Station Biologique, Place Georges Teissier, 29680, Roscoff, France
    2 PFEM, UMR1019 INRA, Centre Clermont-Ferrand-Theix, 63122, Saint Genes Champanelle, France
    3 PF MetaToul-AXIOM, UMR 1331 Toxalim INRA, 180 chemin de Tournefeuille, F-31027, Toulouse, France
    4 DRT/LIST/DM2I/LADIS, Saclay Center CEA, F-91191, Gif-sur-Yvette, France

Poster

With the emergence of new technologies in the field of metabolomics, the treatment solutions adopted so far (XCMS, R scripts, etc.) clearly show their limits. Bottlenecks affect unified access to core applications as well as computing infrastructure and storage. In the context of a collaboration between metabolomics (MetaboHUB French infrastructure) and bioinformatics platforms (IFB: Institut Français de Bioinformatique), we have developed a full pipeline using the Galaxy framework for data analysis, including preprocessing, normalization, quality control, statistical analysis and annotation steps. This modular and extensible workflow is composed of existing components (XCMS and CAMERA functions, etc.) but also a whole suite of complementary statistical tools. This implementation is accessible through a web interface, which guarantees parameter completeness. The advanced features of Galaxy have made possible the integration of components from different sources and of different types. Finally, an extensible environment is offered to metabolomics communities (platforms, end users, etc.), which enables the sharing of preconfigured workflows with new users as well as experts in the field.


P10: The Munich NGS-FabLab - A glimpse on an IT infrastructure for medical sequence data

Sebastian Schaaf1,2, Aarif Mohamed Nazeer Batcha2, Guokun Zhang2, Sandra Fischer2, Ashok Varadharajan2, Ulrich Mansmann1,2

Poster

While NGS data is becoming increasingly important in basic medical research and molecular diagnostics, dealing with it is a challenge in multiple respects. Apart from ‘classical’ issues like high demands on hardware, the interconnectivity to resources of biomedical meta information for enriching sequence data is a central task. Users from various fields of study have to be enabled to work with a variety of bioinformatic tools off the command line (which currently do not offer any gold standard analyses), concentrating on contents instead of technical elements. On top of that, patient-related data is subject to strong restrictions under German data security law, which also affects IT infrastructures on all levels. For medical genome informatics in Munich, the NGS-FabLab (including its admin round-table "NGS-ART") is the central hub for clinicians, researchers and developers, serving as data center, knowledge core, teaching unit and technical template for further instances. During development, the standard Galaxy distribution setup has been equipped with some features that we would like to present with this poster.

Apart from the operating system layer (VMWare, SLES 11), key features are fully automated scripts for proper development cycles and quick setups, distributed computing resources (SGE queue, Convey FPGA hybrid-core computer), highly integrated network structures and access controls. Furthermore, the scientific breadth has been enhanced (e.g. via the qiime toolbox, pathway analyses, additional and improved tools). Last but not least, archiving and sophisticated analysis are being improved by using Bii as a searchable and Galaxy-interconnected database, relying on biomedical ontologies.


P11: Oqtans: Online quantitative transcriptome analysis

Vipin T. Sreedharan1, Yi Zhong and Gunnar Rätsch

  • 1 Memorial Sloan Kettering Cancer Center, New York City, NY-10065 USA

Poster

Powerful algorithmic techniques lead to software applications that analyze massive and complex genomic data sets and can answer important biomedical questions. Since 2009, oqtans has served the biological research community with state-of-the-art machine learning tools for sequence analysis and high-throughput experimental technologies like RNA sequencing.

We have been leveraging the oqtans codebase to support different directions in downstream RNA-seq analysis. In particular, it has been utilized recently in translational research to understand the effect of anticancer therapeutics. To measure the change in translational efficiency for protein coding genes across multiple samples (treated vs nontreated), we used sequencing-based transcriptome-scale ribosome footprinting and RNA-seq data. Our approach allowed us to detect significant changes in the ribosome binding profile of mRNA transcripts between two conditions using a non-parametric testing strategy.

Moving the Galaxy framework from academic to clinical research introduces a myriad of informatics challenges concerning the security of the data sets. In addition to developing new methods for oqtans components, it is equally important to handle the informatics complexities that come with scaling oqtans for clinical use. We have deployed our instance under ModSecurity and encrypted user authentication and subsequent session transmissions using Secure Sockets Layer (SSL). We have applied patches to the core codebase of the Galaxy framework to address vulnerable redirection via URL injection and reflected and stored cross-site scripting (XSS), and to properly sanitize and encode all potential user input and output.

Availability


P12: Locally managed Galaxy instances with access to external resources in a supercomputing environment

Nuria Lozano1,2, Oscar Lozano2, Beatriz Jorrin1, Juan Imperial3, Vicente Martin2

Poster

For a research lab, accessing shared resources like those available in supercomputer centers is a welcome addition to Galaxy capabilities. However, privacy or flexibility requirements might impose the need for a locally managed Galaxy installation. In these cases, a way for a local Galaxy instance to communicate with the supercomputer would be a solution.

The Center for Biotechnology and Genomics of Plants (CBGP) and the Madrid Supercomputing and Visualization Center (CeSViMa) are located at the Technical University of Madrid (UPM) Montegancedo Campus. CeSViMa manages the large heterogeneous Magerit cluster, with about 4,000 Power7 and 1,000 Intel cores, accessed in batch mode. The resource manager used is SLURM and the scheduler is MOAB. Standard job runs in Magerit involve logging into one of the interactive nodes, preparing a job command file and then submitting it to one of the batch queues. The challenge was to be able to seamlessly use this system through a Galaxy front-end. The solution adopted was to set up a Virtual Private Server (VPS) that runs Galaxy. The Galaxy instance has been installed in a filesystem shared between the VPS and Magerit, which is under the control of the Magerit GPFS filesystem.

Galaxy jobs are sent to Magerit through the command-line interface. A job plug-in has been coded that creates the needed job files, which are then transparently submitted to the queuing system.
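As a rough illustration of what generating such a job file might look like, the sketch below builds a SLURM batch script as a string. It is not the plug-in described in this abstract: the function name `build_job_file`, its parameters and the default values are all hypothetical, and only a few common `#SBATCH` directives are shown.

```python
import shlex


def build_job_file(job_name, command, ntasks=1, time_limit="01:00:00",
                   mem_mb=4096):
    """Return the text of a SLURM batch script for one tool invocation.

    `command` is a list of arguments; each one is shell-quoted so that
    values containing spaces survive when the script runs.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --mem={mem_mb}",       # megabytes by default
        "#SBATCH --output=galaxy_%j.out",  # %j expands to the job ID
        "",
        " ".join(shlex.quote(part) for part in command),
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical tool invocation, printed instead of being submitted.
    print(build_job_file("galaxy_42", ["echo", "hello world"]))
```

A file written this way would typically be handed to `sbatch` on a node that can reach the queue; how the real plug-in performs that submission is not described here.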

Using this approach, research group members are fully responsible for deploying and maintaining their own Galaxy Local Instance, while heavy work is offloaded to external computing resources.


P13: Argument Parsing Libraries for Automatic Galaxy XML Generation

Eric Rasche1 and Dr. Ryland F. Young1

Poster

Adding new software to Galaxy is currently a non-trivial task. Galaxy tools consist of many interdependent parts: packaged executables or scripts, tool data, and tool configuration in the form of XML files. This presents a problem in the form of a large codebase to maintain, especially for groups that regularly produce tools to add to Galaxy.

With the goals of code deduplication, simplification of the deployment workflow, and improved accessibility of the Galaxy platform for new developers, we have developed Python and Perl libraries that replace traditional methods of obtaining command line arguments like GetOpt and argparse. Our libraries are capable of automatically generating valid Galaxy XML tool description files that represent the full set of a tool's command line options. This removes the need to maintain the Python/Perl script and the XML file separately, as the XML files can be regenerated at any time from the Python/Perl script. We believe this will lead to significant reductions in time spent on maintenance of codebases and decreased turnaround times for shipping new releases. These libraries will benefit anyone adding new custom tools to Galaxy by providing a convenient method to specify command line parameters, an easy way to access that data in their tools, and automatic Galaxy integration.
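The general idea of deriving a tool description from an argument parser can be sketched with the standard library alone. The example below is not the library presented in this poster: it walks a plain `argparse` parser (through its private `_actions` list, which is an implementation detail) and emits only the `<inputs>` section with `<param>` elements. The function names and the type mapping are assumptions made for illustration.

```python
import argparse
import xml.etree.ElementTree as ET

# Assumed mapping from Python argument types to Galaxy param types.
GALAXY_TYPES = {int: "integer", float: "float", str: "text"}


def parser_to_tool_xml(parser, tool_id, tool_name):
    """Build a partial Galaxy tool description from an argparse parser."""
    tool = ET.Element("tool", {"id": tool_id, "name": tool_name})
    inputs = ET.SubElement(tool, "inputs")
    for action in parser._actions:  # private attribute of argparse
        if action.dest == "help":
            continue  # the built-in -h/--help option is not a tool input
        attrs = {"name": action.dest, "label": action.help or action.dest}
        if action.nargs == 0 and action.const is True:
            # store_true flags become boolean parameters
            attrs["type"] = "boolean"
            attrs["checked"] = "false"
        else:
            attrs["type"] = GALAXY_TYPES.get(action.type, "text")
            if action.default is not None:
                attrs["value"] = str(action.default)
        ET.SubElement(inputs, "param", attrs)
    return ET.tostring(tool, encoding="unicode")


def build_example_parser():
    """A small hypothetical tool interface used for demonstration."""
    parser = argparse.ArgumentParser(description="Example tool")
    parser.add_argument("--min-length", type=int, default=20,
                        help="Minimum read length")
    parser.add_argument("--label", help="Sample label")
    parser.add_argument("--verbose", action="store_true",
                        help="Verbose logging")
    return parser


if __name__ == "__main__":
    print(parser_to_tool_xml(build_example_parser(), "example", "Example"))
```

A complete tool file would also need at least a `<command>` and an `<outputs>` section, which this sketch leaves out.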


P14: Advantages and Challenges of Using the Galaxy API within an Integrated Data Analysis and Visualization Platform

Ilya Sytchev1, Nils Gehlenborg2, Shannan Ho Sui1, Richard Park2,3, Psalm Haseley2, Winston Hide1, Peter Park2

  • 1 Center for Stem Cell Bioinformatics, Harvard Stem Cell Institute
    2 Center for Biomedical Informatics of Harvard Medical School
    3 Boston University Bioinformatics Program

Poster

The Refinery Platform (http://refinery-platform.org) is an integrated web-based data visualization and analysis system powered by an ISA-Tab-compatible data repository. Analyses are implemented as Galaxy workflows. As a result, Refinery makes extensive use of the Galaxy API to automate analyses, including such features as uploading datasets into Galaxy libraries, importing "workflow templates", exporting workflows back into Galaxy after initialization with user-selected inputs, running workflows, and downloading workflow results from Galaxy histories back into Refinery. Some of these features were implemented through custom extensions to the Galaxy API. We directly benefit by using key Galaxy features such as cluster deployment, progress monitoring, a large selection of tools, and the workflow editor.

The recent development of the BioBlend library (http://bioblend.readthedocs.org) motivated us to replace our existing custom Galaxy API client code with BioBlend library components. BioBlend encapsulates the underlying REST API of Galaxy in a way that is more suitable for programming and makes it easier to automate end-to-end large-data analyses. It has a more robust implementation and is maintained by the community to keep up-to-date with the changes in the Galaxy API. Extensions to the BioBlend library and the Galaxy API to enable the use of Galaxy in fully automated fashion will be contributed back to this community effort. We hope to use this opportunity to gain feedback and suggestions for improvements from the Galaxy developer community.


P15: Resistance to Toxic Compounds in Metagenomic Fosmid Library from Mangrove Sediment in São Paulo State, Brazil

Lucélia Cabral1, Sanderson Tarciso Pereira de Sousa1, Gileno Vieira Lacerda Júnior1, Júlia Ronzella Ottoni1, Daniela Ferreira Domingos1, Valéria Maia de Oliveira1.

  • 1 Divisão de Recursos Microbianos, Research Center for Chemistry, Biology and Agriculture (CPQBA), Campinas University - UNICAMP. Mailbox: 6171. CEP: 13081-970. Campinas. São Paulo. Brazil

Poster

The mangrove is a typically tropical ecosystem, located between land and sea, and very rich in biodiversity, including aquatic animals, birds, reptiles, mammals and microorganisms. Despite this, mangroves have been highly exposed to anthropogenic activities, including oil spills and industrial waste disposal that carries heavy metals. Microorganisms found in the environment can adapt to the presence of pollutants, thus developing survival mechanisms. However, traditional cultivation methods are not efficient for cultivation of most microorganisms present in nature. In this context, the aim of this study was to assess the presence of heavy metal resistance in a fosmid library constructed using metagenomic DNA from sediment samples collected from a mangrove area located in Bertioga, State of São Paulo, Brazil. The fosmid library comprised 13,000 clones and the sampling site was affected by an oil spill. Next generation sequencing was performed using the 454 sequencing platform. Sequences associated with toxic compound resistance were analyzed using MG-RAST V3.3.8. The annotations used were: Functional abundance, Hierarchical classification, level 1 (Virulence, Disease and Defense), level 2 (Resistance to antibiotics and toxic compounds), level 3 (Resistance). The most abundant sequences involved in metal resistance in the dataset were for cobalt-zinc-cadmium resistance, detected by the presence of Cobalt-zinc-cadmium resistance protein and Cobalt-zinc-cadmium resistance protein CzcA (489 and 346 hits, respectively). Sequences related to copper and silver resistance were detected by the presence of cation efflux system protein CusA (330 hits). Functional screening of the fosmid library will be performed, and positive clones will be selected for further studies on metal tolerance and degradation.


P16: BlockClust: efficient clustering and classification of non-coding RNAs from short read profiles

Pavankumar Videm1, Dominic Rose1,5, Fabrizio Costa1, Rolf Backofen1-4

Presented by Björn Grüning1

  • 1 Bioinformatics Group, Department of Computer Science, University of Freiburg, Germany
    2 Centre for Biological Signalling Studies (BIOSS), University of Freiburg, Germany
    3 Centre for Biological Systems Analysis (ZBSA), University of Freiburg, Germany
    4 Centre for Non-coding RNA in Technology and Health, Bagsvaerd, Denmark
    5 Munich Leukemia Laboratory (MLL), Munich, Germany

Poster

Non-coding RNAs play a vital role in many cellular processes such as RNA splicing, translation and gene regulation. However, the vast majority of ncRNAs still have no functional annotation. One prominent approach for putative function assignment is clustering of transcripts according to sequence and secondary structure. However, sequence information is changed by post-transcriptional modifications, and secondary structure is only a proxy for the true three-dimensional conformation of the RNA polymer. A different type of information that does not suffer from these issues, and that can be used for the detection of RNA classes, is the pattern of processing and its traces in small RNA-seq read data.

Here we introduce BlockClust, an efficient approach to detect transcripts with similar processing patterns. We propose a novel way to encode expression profiles in compact discrete structures, which can then be processed using fast graph kernel techniques. We both perform unsupervised clustering and develop family-specific discriminative models. Finally, we show that the proposed approach is scalable, accurate and robust across different organisms, tissues and cell lines.

BlockClust was tested and works with small RNA-seq data of eukaryotic organisms. It is the first tool of its kind that is easily installable and usable in the Galaxy framework. To run BlockClust, all you need is an alignment file of short reads in Sequence Alignment/Map (SAM/BAM) format. A complete BlockClust workflow and its tool dependencies are now available in the Galaxy Tool Shed.


P17: A Galaxy-Based framework for online streaming data analytics in Heart Rate Variability Analysis

Calogero Zarbo

Calogero Zarbo1, Andrea Bizzego1,2,3, Marco Mina1, Gianluca Esposito2,4, Cesare Furlanello1

  • 1 FBK - Fondazione Bruno Kessler, Trento, Italy
    2 University of Trento, Italy
    3 SKIL Telecom Italia, Trento, Italy
    4 RIKEN BSI, Wako-Shi, Japan

Poster

The emerging applications in physiological data processing, encouraged by the availability of wearable sensors for continuous self-monitoring and quantified self, require new platforms for time series analysis supporting real-time processing and fast prototyping capabilities. We recently proposed Physiolyze, a Galaxy-based web framework to support complex workflows for Heart Rate Variability (HRV) analysis. Here we extend Physiolyze by introducing scalable online processing capabilities.

The enhanced version still relies on Galaxy as the core platform to design and manage the pipelines. In order to incrementally analyze the streams, a set of Python routines based on the BioBlend library works as middleware to trigger the pipelines as new data become available. A web interface based on the Django Python framework allows the user to control the execution of the pipelines, running them on new data streams.

We tested our system on the task of predicting infant behavioral state from HRV patterns. We simulated a real-time scenario of 100 asynchronous data streams from data for 24 infants previously collected with a Light WP Holter ECG recorder (GE Healthcare). The system incrementally extracts 37 HRV indicators from each data stream and predicts the infant state (e.g. wake, sleep, cry) with a Random Forest regression model. The pipeline is modular and fully managed as a Galaxy workflow.
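
The incremental extraction of indicators can be sketched as follows. The abstract does not list the 37 indicators, so this example computes only three standard time-domain HRV measures (mean RR interval, SDNN and RMSSD) over a sliding window of RR intervals in milliseconds. The class name and window size are assumptions for illustration, not part of Physiolyze.

```python
import math
import statistics
from collections import deque


class StreamingHRV:
    """Keep the most recent RR intervals and recompute indicators on arrival."""

    def __init__(self, window=5):
        # Old intervals are dropped automatically once the window is full.
        self.rr = deque(maxlen=window)

    def update(self, rr_ms):
        """Add one RR interval (ms); return indicators, or None if too few."""
        self.rr.append(rr_ms)
        if len(self.rr) < 2:
            return None
        values = list(self.rr)
        # Successive differences between neighbouring RR intervals.
        diffs = [b - a for a, b in zip(values, values[1:])]
        return {
            "mean_rr": statistics.fmean(values),
            # SDNN: sample standard deviation of the intervals.
            "sdnn": statistics.stdev(values),
            # RMSSD: root mean square of successive differences.
            "rmssd": math.sqrt(statistics.fmean([d * d for d in diffs])),
        }


if __name__ == "__main__":
    hrv = StreamingHRV(window=5)
    for rr in [800, 810, 790, 800]:
        print(hrv.update(rr))
```

Each call to `update` stands in for a new chunk of a data stream arriving; in the system described above, that event is what triggers the corresponding Galaxy pipeline.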

Our system can easily be adapted to other online streaming analytics applications, such as for the parallelized analysis of multiple data streams acquired from physiological sensors and wearable devices.


P18: Implementing qDNAseq in Galaxy: a whole genome sequencing copy number analysis tool

Stef van Lieshout1, Ilari Scheinin1, Daoud Sie1, Remond J.A. Fijneman1, Bauke Ylstra1

  • 1 Department of Pathology, VU University Medical Center, Amsterdam, The Netherlands

Poster

DNA copy number aberrations are a hallmark of cancer and can be quantified by shallow whole-genome sequencing (WGS). A robust method has been developed1 that detects copy number aberrations by binning and counting sequence reads in non-overlapping windows (usually 15 kb). A combined LOESS correction for mappability and GC content is then applied, followed by exclusion of genomic regions that appear in the ENCODE project blacklists or in a novel blacklist based on the sequence depth of 38 individuals from the 1000 Genomes project.
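
The binning step can be sketched in a few lines. This is only an illustration of counting reads in fixed, non-overlapping windows; it is not the QDNAseq implementation, it omits the LOESS correction and blacklist filtering, and the function name and input layout (chromosome plus 0-based start position) are assumptions.

```python
from collections import Counter


def count_reads_per_bin(read_starts, bin_size=15_000):
    """Count reads per (chromosome, bin index) for fixed-width bins.

    read_starts: iterable of (chromosome, 0-based start position) pairs.
    """
    counts = Counter()
    for chrom, pos in read_starts:
        # Integer division assigns each read to exactly one window.
        counts[(chrom, pos // bin_size)] += 1
    return counts


if __name__ == "__main__":
    reads = [("chr1", 0), ("chr1", 14_999), ("chr1", 15_000), ("chr2", 30_001)]
    for (chrom, index), n in sorted(count_reads_per_bin(reads).items()):
        start = index * 15_000
        print(f"{chrom}:{start}-{start + 15_000}\t{n}")
```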

The procedure is available as a Bioconductor package, QDNAseq2. The accompanying Galaxy tool uses the popular BAM format as input and reports results in a clear and concise HTML based view within Galaxy itself. Various output formats can be downloaded, including an R data structure file for downstream analysis and a Zipped archive with all the output together.

Because bin annotations are precalculated, current limitations include support for only one genome build (GRCh37/hg19) and one sequencing type (50bp single read). Additional dedicated tools will handle these challenges, and future plans include the addition of different strategies for segmenting and calling the copy number data.

This work was funded by the Center for Translational Molecular Medicine, Translational research IT project (CTMM TraIT).


P19: Integrating Integrated Genome Browser with Galaxy

Ann Loraine1, David Norris1, Kyle Suttlemyre1, Tarun Kanaparthi1

  • 1 University of North Carolina - Charlotte

Poster

Integrated Genome Browser (IGB) is a fast, flexible and free Java-based desktop software tool that enables interactive exploration of genomic data sets. To accommodate large data sets, IGB features a simple ReST-style interface that triggers incremental loading of data from local files or URLs. We used this ReST-style interface and the Galaxy viewers API to enable IGB visualization for Galaxy users. When Galaxy users create compatible data files, they now see a link labeled “View in IGB” upon clicking data file links in their Galaxy History. Clicking this link triggers delivery of data to IGB for display. This is a simple interaction from the user’s perspective, but from an engineering point of view, it highlights a key extension point for Galaxy that enables integration with IGB or other visualization tools. By enabling access to data sets in a user-friendly, web-based interface, Galaxy offers many possibilities to enhance user interactions for data analysis and data sharing.


P20: An approach for detecting structural variations from NGS paired end reads using Split Reads, Discordant Read Pairs and Local Alignment

Michael Ta1, Philip D. Cotter1, Mathew W. Moore1

  • 1 Bioinformatics Department, ResearchDx, Irvine CA, USA

Poster

A major challenge in Next Generation Sequencing is the development of efficient algorithms to detect structural variants present in the genome. Several different approaches for the detection of structural variants have been identified. BreakDancer searches for clusters of anomalous read pairs for sites to investigate. Similarly, another analysis tool, SoftSearch, uses the soft clipped read data from the aligner to determine sites of interest and heuristically report potential structural variants around them. Our algorithm, HardSearch, expands on the approach of SoftSearch to further identify the exact break points that support chromosomal structural variations. Paired-end reads from DNA-seq with an unmapped mate are collected around each potential fusion site; the unmapped mates are realigned to the reference genome using a local aligner. The segment of each read that aligns with the highest alignment score without gaps is subtracted from the original, and the remainder is realigned, allowing for the identification of the breakpoint and breakpoint partners.
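
The subtract-and-realign idea can be illustrated with a toy example. This is not HardSearch: it replaces local alignment with exact string matching, assumes the best ungapped segment is a prefix of the read, and uses hypothetical function names and sequences.

```python
def longest_exact_prefix(read, reference):
    """Return (length, ref_pos) of the longest read prefix found in reference."""
    for length in range(len(read), 0, -1):
        pos = reference.find(read[:length])
        if pos != -1:
            return length, pos
    return 0, -1


def split_read_breakpoint(read, reference, min_len=4):
    """Locate a breakpoint from one read spanning two reference segments.

    Exact matching stands in for the local aligner: the matched prefix is
    removed from the read and the remainder is searched for separately.
    """
    length, left_pos = longest_exact_prefix(read, reference)
    remainder = read[length:]
    if length < min_len or len(remainder) < min_len:
        return None
    right_pos = reference.find(remainder)
    if right_pos == -1:
        return None
    # left_end: first reference base after the matched prefix.
    # right_start: reference position where the remainder resumes.
    return {"left_end": left_pos + length, "right_start": right_pos}


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAGCCATGGATCCTAGA"
    # Read joining reference[0:8] to reference[20:28], as after a deletion.
    read = reference[0:8] + reference[20:28]
    print(split_read_breakpoint(read, reference))
```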


P21: Synapse: Software infrastructure for collaborative reproducible research

J Christopher Bare1, Synapse Platform Team1, Michael R Kellen1, Stephen H Friend1

Poster

Synapse (http://www.synapse.org) is a free and open source informatics platform for data-driven collaborative research. Built from the ground up for a rich data sharing experience, Synapse provides tools for versioning, annotating and citing data combined with provenance tracking and fine grained access control. Synapse operates under a complete governance process developed, approved and monitored by an independent ethics advisory board and the Western Institutional Review Board.

Synapse is designed to support Sage Bionetworks' mission to promote a scientific culture founded on broad and open collaboration. Sage Bionetworks develops and operates Synapse as a public resource for the scientific community. For example, the Cancer Genome Atlas pan-cancer group published a total of 18 papers in Nature Publishing Group journals (http://www.nature.com/tcga/), using Synapse as a single point for sharing data, results and methods among 250 collaborators spread across 30 institutions. In partnership with DREAM, Synapse hosts predictive modeling challenges on a diverse array of topics including disease prognosis, drug response, toxicology and genetic variant analysis.

Galaxy and Synapse share many goals including transparency and reproducibility. Both enable sharing and reuse of research code and are cloud-native applications with similar models of computation including provenance, workflows, data sets and pages.

Bridging these two complementary services would benefit users of both. Synapse could act as a data source for Galaxy workflows and/or as a shared workspace for results and intermediate products. Other options to explore include exchanging workflows, provenance, and narrative pages. Integration between Synapse and Galaxy could enrich the ways in which data and analysis code can be presented, shared and reused.


P22: Integration of Galaxy with IRIDA, a Genomic Epidemiology Platform

Aaron Petkau1, Franklin Bristow1, Thomas Matthews1, Josh Adam1, Damion Dooley2, Emma Griffiths4, Geoff Winsor4, Matthew Laird4, Melanie Courtot2,4, William Hsiao2,3, Gary Van Domselaar1, Fiona Brinkman4

  • 1 National Microbiology Laboratory, Public Health Agency of Canada, Canada
    2 BC Public Health Microbiology and Reference Laboratory, Canada
    3 University of British Columbia, Canada
    4 Simon Fraser University, Canada

Poster

The continuing decrease in the cost of genomic sequencing and the development of new data analysis methods have led to the increasing usage of whole genome sequencing as an epidemiological tool. Whole genome sequencing can provide a high-resolution snapshot of the relationship among pathogens and lead to a greater ability to identify and track infectious disease outbreaks. Initiatives such as the Global Microbial Identifier have already started the discussion on developing a system and standards for genomic epidemiology. In our project, IRIDA (Integrated Rapid Infectious Disease Analysis), we propose a platform for genomic epidemiology which provides secure storage of whole genome sequence data, epidemiological metadata, data analysis pipelines, visualization of results, a RESTful API, and a federated data sharing model. Galaxy has already proven to be a useful application for integration of common bioinformatics tools and data, execution of data analysis pipelines, collection of results, and data sharing. In addition, Galaxy provides a RESTful API for programmatic access to running instances of Galaxy. We intend to leverage Galaxy as much as possible by interacting with locally installed Galaxy instances via the API to execute pre-defined data analysis pipelines, store data results and Galaxy histories, and manage installed bioinformatics tools. Direct export of whole genome sequencing data to instances of Galaxy will be provided for more complicated analysis. IRIDA will be released as free and open-source software and make use of common data standards to facilitate sharing with other genomic epidemiology platforms. More information will be made available at http://irida.ca.


P23: Galaxy on the GenomeCloud : Yet another on-demand Galaxy cloud, but only powered by Apache CloudStack

Youngki Kim1, CB Hong1, Kjoong Kim1, Daechul Choi1

Poster

Bioinformatics and genome data analysis in South Korea are at an early stage but getting busier. To keep pace with this research trend, GenomeCloud was created at the end of 2012. GenomeCloud is an integrated platform for analysing, interpreting and storing genome data, based on KT's cloud computing infrastructure, which uses Apache CloudStack software. GenomeCloud consists of g-Analysis (automated genome analysis pipelines at your fingertips), g-Cluster (easy-to-use and cost-effective genome research infrastructure) and g-Storage (a simple way to store and share genome-specific data).

Because of its flexible tool integration architecture and seamless workflow creation functionality, Galaxy was selected to achieve multi-purpose goals such as agile pipeline development and bioinformatics education support. To provide an on-demand, Apache CloudStack-based Galaxy cluster, we have automated virtual machine creation, clustering and the setup of various software, including Galaxy.

Furthermore, seamless integration with GenomeCloud helps researchers not only create and manage Galaxy through a convenient web interface but also fully utilize genome data in g-Storage. g-Storage is powered by OpenStack Swift and a specially designed genome file transfer protocol.

Galaxy on the GenomeCloud uses Grid Engine as a cloud HPC solution, Ganglia as a distributed monitoring system and LVM over NFS as large-volume shared storage, all of which are set up automatically upon request. This talk will be about our experiences while integrating Galaxy with GenomeCloud and use cases of Galaxy such as a scalable bioinformatics education system and request fulfillment of RNA-seq analysis.


P24: GenomeSpace: An Environment for Frictionless Bioinformatics

Michael Reich1, John Liefeld1, Marco Ocana1, Donkeung Jang1, James Robinson1, Peter Carr1, Barbara Hill1, Thorin Tabor1, Helga Thorvaldsdottir1, Aviv Regev1, Jill P. Mesirov1

  • 1 Broad Institute, Cambridge, MA

Poster

Over the past several years, initiatives such as The Cancer Genome Atlas and 1000 Genomes Project have produced an explosion of genomic data. These efforts offer a new era of potential for the understanding of basic mechanisms of disease and identification of novel treatments. Comprehensive analysis of these datasets requires coordinated use of Web-based applications, data repositories, and desktop analysis tools. However, the effort required to transfer data between tools, convert between formats, and manage results often prevents researchers from utilizing the wealth of methods available. Many "bench to bedside" discoveries are possible with combinations of existing tools, but the necessary transitions between them put them out of the reach of most researchers.

GenomeSpace is an environment that brings together diverse computational tools, enabling non-programmer scientists to easily combine their capabilities. It provides a space to create, manipulate and share a growing collection of genomic analysis tools. GenomeSpace features support for cloud-based data storage and analysis, automatic conversion of data formats, and ease of connecting new tools to the environment via a RESTful API.

The Galaxy main server and the Galaxy-based Cistrome epigenetic analysis platform are among the first GenomeSpace-enabled tools. These and the other GenomeSpace-enabled tools, including Cytoscape, GenePattern, Genomica, IGV, ArrayExpress, and others, form a comprehensive environment for analysis of genomic data, with new resources being released regularly. We show how researchers can use GenomeSpace to combine the capabilities of these tools and how developers can add their tools to the GenomeSpace environment.


P25: Less talking, more doing: crowd-sourcing the integration of Galaxy with a high-performance computing cluster

Dirk Colbry1, Michael R. Crusoe2, Andy Keen1, Greg Mason1, Jason Muffett1, Matthew Scholz1, Tracy K. Teal2

  • 1 Michigan State University, Institute for Cyber-Enabled Research
    2 Michigan State University, Department of Microbiology and Molecular Genetics

Poster

On March 5th, 2014, a team of system administrators and bioinformaticians conducted a hack-a-thon to integrate Galaxy on top of the high-performance computing cluster at Michigan State University, complete with single sign-on and the ability to run jobs as the submitting user. They elicited and received strong community support during the hack-a-thon and engaged Galaxy developers and users through IRC and Twitter. In eight hours, the hack-a-thon navigated the various integration hurdles with real-time assistance from the Galaxy community. The entire deployment was done as openly as possible, with coordination of the various efforts via a separate public chat channel. While there were a couple of person-days of preparation and follow-up, scheduling a single day to do the bulk of the installation proved to be critical in getting the job done and was far more effective than the many hours previously spent talking about the idea of deploying Galaxy. The format allowed for rapid progress, as communication time was reduced and developers could modify or add components, receive prompt feedback and continue to build on the growing infrastructure. We advocate a similar recipe of using virtual machines, the Puppet configuration management system, and agile development enabled by the built-in implementations of various components of Galaxy to enable forward progress.


P26: Galaxy Training Network

Dave Clements1,2, Vicky Schneider3, Nikhil Joshi4, Joseph Fass4, Monica Britton4, Andrew Lonie5,6, Simon Gladman5,7, Mark Crowe8

Poster

Scalability is a recurring challenge in all aspects of high-throughput biology, including training. There is far more demand for training than can be met by just in-person training by the core Galaxy Team.

This poster will highlight training resources that are available for teaching bioinformatics in Galaxy and for teaching how to use and administer Galaxy itself. This includes information about the new Galaxy Training Network. The Galaxy Training Network unifies core project and community training efforts under one umbrella so that existing training resources become more easily available, and it makes it easier for new arrivals to get up to speed with training in their locations and communities. We will also highlight directories of tutorials and worked exercises, including sample data, slide sets, videos, and computational resources such as shared virtual machine images and Amazon Web Services Machine Images (AMIs).


P27: Integrating new visualization tool in Galaxy

Alexan ANDRIEUX1, Pierre PETERLONGO1, Yvan LE BRAS2, Cyril MONJEAUD2, Charles DELTEL3

  • 1 Genscale, INRIA, Campus de Beaulieu, 35042, Rennes Cedex, France
    2 GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes1, Campus de Beaulieu, 35042, Rennes Cedex, France
    3 SED, INRIA, Campus de Beaulieu, 35042, Rennes Cedex, France

Poster

Galaxy supports adding tools, constructing workflows and analyzing diverse and large datasets. Galaxy offers some visualization tools, like Trackster and Phyloviz, but users can have difficulties finding the right visualizer to see the output of their own tools. To avoid the use of external tools, users may also want to integrate their own visualization tools.

In earlier versions of Galaxy, implementing a new visualizer was complex because it required 1) putting each file type (JavaScript, CSS, Mako, Python …) of the new visualizer in the right place in the directory tree and 2) editing several Galaxy source files. Recent Galaxy versions make it easier to add visualizations: you only have to give the new visualizer the right structure and copy it into place. This is a good start, even though some tasks remain difficult, such as adding the Galaxy save function to the new visualizer.

The new visualization framework was tested to facilitate interpretation of Mapsembler 2 outputs. This tool extends reference sequences from each side with one or more sets of reads. Sometimes several extensions are possible, and Mapsembler 2 constructs a graph containing each possible extension. To view the output graph, we have developed a graph viewer in JavaScript and jQuery. At the moment, this visualizer is compatible only with Mapsembler 2 outputs, but further work will make it compatible with semantic web or systems biology tools to visualize, for example, RDF files or biological networks. Finally, this work represents an important step towards visualization of data in the Life Sciences Virtual Research Environment (introduced in poster n°4).


P28: Integrating GALAXY workflows in a metadata management environment

Francois MOREEWS1, Yvan LE BRAS2, Olivier Dameron3, Cyril MONJEAUD2 and Olivier COLLIN2

  • 1 Genscale team, Irisa / INRA, Campus de Beaulieu, 35042 Rennes Cedex, France
    2 GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes1, Campus de Beaulieu, 35042, Rennes Cedex, France
    3 Dyliss team, Irisa / Inria Rennes-Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes Cedex, France

Poster

The Galaxy platform offers repositories of user data and related analysis processes (data histories and workflows). These repositories enable traceability and reproducibility of the processes within the platform. At a larger scale, to answer questions like "What protocol was used to analyze my data?" or "How were these data generated?", we could consider any protocol as a metadata set that annotates inputs and results.

We present a preliminary approach for integrating the GALAXY workflows in an extensible meta-data management environment.

Using ISA-tools, we have developed a formalism to describe an abstraction of data processing workflows. This specification, in the ISA-TAB format, is named ISA-DATAFLOW.

A conversion tool extracts a structured dataflow representation in GRAPHML, a generic XML graph format, from GALAXY workflows. This intermediary format can then be normalized using controlled vocabularies and converted into ISA-TAB following our ISA-DATAFLOW specification.
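
A minimal sketch of the first conversion step is shown below. It is not the authors' tool: it assumes the exported Galaxy workflow JSON layout in which `steps` maps step numbers to entries carrying `id`, `name` and `input_connections`, and it emits only nodes and directed edges. The function name and sample workflow are hypothetical.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def workflow_to_graphml(workflow):
    """Convert a Galaxy workflow dict into a GraphML element tree."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare a string attribute "name" that can be attached to nodes.
    ET.SubElement(root, "key", {"id": "name", "for": "node",
                                "attr.name": "name", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    steps = workflow["steps"]
    ordered = [steps[key] for key in sorted(steps, key=int)]
    for step in ordered:
        node = ET.SubElement(graph, "node", id=f"n{step['id']}")
        ET.SubElement(node, "data", key="name").text = step.get("name", "")
    for step in ordered:
        for conn in step.get("input_connections", {}).values():
            # A connection may be a single source or a list of sources.
            sources = conn if isinstance(conn, list) else [conn]
            for source in sources:
                ET.SubElement(graph, "edge", source=f"n{source['id']}",
                              target=f"n{step['id']}")
    return root


if __name__ == "__main__":
    sample = {"steps": {
        "0": {"id": 0, "name": "Input dataset", "input_connections": {}},
        "1": {"id": 1, "name": "Filter", "input_connections": {
            "input": {"id": 0, "output_name": "output"}}},
    }}
    print(ET.tostring(workflow_to_graphml(sample), encoding="unicode"))
```

Normalization against controlled vocabularies and the conversion to ISA-TAB would operate on this intermediate graph and are not shown.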

We plan to integrate this work to propose advanced research functionalities within a virtual research environment (VRE) deployed on a geographically and thematically distributed infrastructure already using multiple Galaxy instances. Future developments will concern workflow meta-analysis and workflow composition assistance.


P29: Genocloud : the GenOuest private cloud for Galaxy

Cyril Monjeaud1, Olivier Sallou1 and Olivier Collin1

Presented by Yvan Le Bras1

  • 1 GenOuest Core Facility, UMR6074 IRISA CNRS/INRIA/Université de Rennes1, Campus de Beaulieu, 35042, Rennes Cedex, France

Poster

The GenOuest bioinformatics core facility has recently deployed a private cloud named Genocloud. In addition to providing images for different biology domains (Bio-imaging, Next Generation Sequencing, collaboration, etc.), Genocloud offers solutions to deploy a Galaxy instance.

There are two ways you can set up your own Galaxy server. The first one is through the provision of a template inside the Genocloud interface. The second solution is through a Galaxy cookbook created for CHEF, an infrastructure for automatic applications deployment. This cookbook can be directly installed inside any virtual machine already running on Genocloud via our Xgrid solution.

Xgrid is an internal open-source web application integrated inside our images. This application allows users to load cookbooks from a CHEF server via a web interface (click operations) and install them dynamically on the virtual machine. Furthermore, Xgrid can launch EC2 command lines and configure an entire Sun Grid Engine (SGE) cluster. We also provide an extra template to deploy a Galaxy server already configured for SGE cluster submissions.


P30: Integrating Galaxy with UCSC Cancer Genomics

Melissa Cline1, Teresa Swatloski1, Brian Craft1, Mary Goldman1, David Haussler1, Jingchun Zhu1

  • 1 University of California Santa Cruz

Poster

The UCSC Cancer Genomics Browser is a powerful tool for visual analysis of published cancer genomics datasets. Its combined visualization of genomic and clinical data allows users to dynamically sort and cluster cohorts of genomic data by clinical features, cancer subtypes and genomic signatures. It displays data from a large library of over 580 cancer genomics datasets, including TCGA, LINCS and CCLE. The Cancer Browser is currently available in the Galaxy test toolshed, providing direct export into Galaxy datasets, and will soon be available in the main toolshed.

Building on the success of the Cancer Browser, we are now developing the Xena platform to enable users to host, visualize, analyze and share their own data from within a secure virtual machine (XenaVM). Users will be able to visualize their datasets separately under the Xena Browser, or integrate their data with published datasets to form larger “virtual cohorts”. Data will be hosted by a network of Xena servers, with the UCSC server hosting the UCSC cancer genomics library. To provide greater analysis capabilities, Xena will be tightly integrated with Galaxy. Users will be able to export data from Xena into Galaxy datasets, analyze those datasets under Galaxy, and import the analysis results directly into Xena, facilitating cycles of analysis and visualization.

Lightning Talks

Accepted Talks, Session 4, Tuesday, July 1

These talks have been accepted for the first lightning talks storm on Tuesday.

Visualising Proteomics Data in Galaxy

Ira Cooke1

  • 1 Latrobe University

Slides, Video

...

Building a scalable Galaxy cluster for biomedical research in The Netherlands

David van Enckevort1, Anthony Potappel2, Niek Bosch3, Jeroen Beliën4, Rita Azevedo5, Rob Hooft5, Sander Ruiter2, Sanne Abeln6, Irene Nooren3, Jan-Willem Boiten7

  • 1 University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
    2 Vancis, Amsterdam
    3 SURFsara, Amsterdam, The Netherlands
    4 VU university medical center, Amsterdam, The Netherlands
    5 Netherlands eScience Center, Amsterdam, The Netherlands
    6 VU university, Amsterdam, The Netherlands
    7 Center for Translational Molecular Medicine, Eindhoven, The Netherlands

Slides, Video

For the national translational IT project CTMM/TraIT, Galaxy has been selected as one of the tools in the experimental domain. The TraIT partners (among others NBIC and SURFsara) have developed a vision of how to make Galaxy available to the research community in The Netherlands. The scalable Galaxy cluster on the SURFsara HPC Cloud will be transferred to Vancis to provide a sustainable production-level Galaxy cluster. In the design of this environment, Vancis has made use of the knowledge and experience of NBIC and SURFsara in hosting the public NBIC instance on the SURFsara HPC Cloud.

To assess the minimal requirements for the infrastructure, we used metrics collected while running the NBIC Galaxy on the HPC Cloud. Next, we drafted a set of use cases the infrastructure should be able to fulfil, such as the ability to run Omics-pipelines and the ability to scale to handle peak demand. We identified I/O performance as a major bottleneck, since many Galaxy tools are I/O intensive, while Galaxy has a shared data design. Memory was also recognized as a critical factor, since typical datasets are on the order of tens of gigabytes. We also built upon the experiences of SURFsara in operating the HPC Cloud and other HPC systems. To accommodate a full set of development, testing, acceptance and production environments, as well as private installations, the infrastructure should support multiple Galaxy clusters. The chosen architecture will use a Linux High Availability environment with OpenStack, which will run on two large-size blades. Storage is split into multiple tiers with different characteristics to support both high I/O workloads and reliable large storage. The chosen setup is horizontally scalable in a cost-efficient manner.

From May to September 2014 we will pilot the new architecture within the TraIT project. For this pilot we have selected a few TraIT NGS tools and pipelines to stress test the system under different workload scenarios. Furthermore we have established a process to ensure the quality of the tools required for a stable production environment.


Mississippi: a galaxy server centered on small RNA analysis

Marius van den Beek1, Christophe Antoniewski1

Slides, Video

Non-coding small RNAs (miRNA, siRNA, piRNA, …) are involved in the regulation of genes and transposable elements as well as in the defense against viral infections. Their discovery and their functional characterization rely heavily on high-throughput RNA sequencing. The ~20-30 nt length of small RNAs raises specific challenges for meaningful read mapping and analysis, so standard RNAseq analysis methods have to be adapted. We provide an integrated set of Galaxy tools that should streamline the most frequent small RNA analysis needs. This includes a modified bowtie wrapper and workflows that allow users to quickly and reproducibly interrogate various aspects of small RNA biology. We provide tools for the discovery and differential expression analysis of miRNAs and a means of genome-wide visualization of miRNA precursors that complements Trackster. Furthermore, we provide tools to detect the “ping-pong” biogenesis signature of piRNAs, to detect piRNA-producing loci in the genome and to study and visualize the impact of piRNAs and siRNAs on transposable elements.


A Galaxy-Based framework for online streaming data analytics in Heart Rate Variability Analysis

C Zarbo1, A Bizzego1,2,3, M Mina1, G Esposito2,4, C Furlanello1

  • 1 Predictive Models for Biomedicine & Environment - Fondazione Bruno Kessler, Trento, Italy
    2 University of Trento, Italy
    3 SKIL Telecom Italia, Trento, Italy
    4 RIKEN BSI, Wako-Shi, Japan

Slides, Video

The emerging applications in physiological data processing, encouraged by the availability of wearable sensors for continuous self-monitoring and quantified self, require new platforms for time series analysis supporting real-time processing and fast prototyping capabilities. We recently proposed Physiolyze, a Galaxy-based web framework to support complex workflows for Heart Rate Variability (HRV) analysis. Here we extend Physiolyze by introducing scalable online processing capabilities.

The enhanced version still relies on Galaxy as the core platform to design and manage the pipelines. To analyze the streams incrementally, a set of Python routines based on the BioBlend library works as middleware to trigger the pipelines as new data become available. A web interface based on the Django Python framework allows the user to control the execution of the pipelines, running them on new data streams.
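The middleware idea can be illustrated with a minimal Python sketch. It is not the Physiolyze code: the class name `StreamTrigger`, the fixed window size, and the `run_pipeline` callback are all assumptions made for illustration. In a real deployment the callback would hand the window to Galaxy (for example through BioBlend); here it is a stand-in that only records what would be submitted.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class StreamTrigger:
    """Hypothetical middleware: buffers samples per stream and fires a
    callback each time a full window of new data is available."""

    def __init__(self, window_size: int,
                 run_pipeline: Callable[[str, List[float]], None]):
        self.window_size = window_size
        self.run_pipeline = run_pipeline  # stand-in for a Galaxy workflow call
        self.buffers: Dict[str, Deque[float]] = {}

    def push(self, stream_id: str, samples: List[float]) -> int:
        """Add new samples to a stream; return how many runs were triggered."""
        buf = self.buffers.setdefault(stream_id, deque())
        buf.extend(samples)
        runs = 0
        # Consume complete, non-overlapping windows; leftovers stay buffered.
        while len(buf) >= self.window_size:
            window = [buf.popleft() for _ in range(self.window_size)]
            self.run_pipeline(stream_id, window)
            runs += 1
        return runs


# Usage with a stand-in callback that just records the submitted windows.
triggered = []
trigger = StreamTrigger(window_size=4,
                        run_pipeline=lambda sid, w: triggered.append((sid, w)))
trigger.push("infant-01", [0.51, 0.49, 0.50])   # not enough data yet
trigger.push("infant-01", [0.52, 0.48])         # completes one window
```

Keeping the Galaxy call behind a callback means the buffering logic can be exercised without a running Galaxy server.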

We tested our system on the task of predicting infant behavioral state from HRV patterns. We simulated a real-time scenario of 100 asynchronous data streams from data for 24 infants previously collected with a Light WP Holter ECG recorder (GE Healthcare). The system incrementally extracts 37 HRV indicators from each data stream and predicts the infant state (e.g. wake, sleep, cry) with a Random Forest regression model. The pipeline is modular and fully managed as a Galaxy workflow.
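The abstract does not list the 37 HRV indicators. As an illustration only, the sketch below computes a few standard time-domain indicators (mean RR, SDNN, RMSSD) from a window of RR intervals in milliseconds. The function name `hrv_indicators` and the choice of indicators are assumptions, not the authors' feature set.

```python
import math
from typing import Dict, List


def hrv_indicators(rr_ms: List[float]) -> Dict[str, float]:
    """Compute a few standard time-domain HRV indicators.

    rr_ms: successive RR intervals in milliseconds (at least two values).
    """
    n = len(rr_ms)
    if n < 2:
        raise ValueError("need at least two RR intervals")

    mean_rr = sum(rr_ms) / n
    # SDNN: standard deviation of the intervals (sample form, n - 1;
    # some tools use the population form instead).
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_ms) / (n - 1))
    # RMSSD: root mean square of successive differences.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    return {
        "mean_rr_ms": mean_rr,
        "sdnn_ms": sdnn,
        "rmssd_ms": rmssd,
        # Heart rate corresponding to the mean RR interval.
        "hr_bpm": 60000.0 / mean_rr,
    }


features = hrv_indicators([400.0, 420.0, 410.0, 430.0])
```

A feature dictionary of this kind, computed per window, is what a downstream model such as the Random Forest mentioned above would take as input.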

Our system can easily be adapted to other online streaming analytics applications, such as the parallelized analysis of multiple data streams acquired from physiological sensors and wearable devices.


Ebiogenouest régional initiative : a use case for the structuration of the biologists community

Yvan Le Bras1 and Olivier Collin1

  • 1 CNRS UMR 6074 IRISA-INRIA, Rennes, France

Slides, Video

Two years after the start of a western France e-Science project, we highlight some of its results and outline future prospects.


Intergalactic travel: Sending usegalaxy.org through the wormhole

Nate Coraor1, Dannon Baker2 and John Chilton1

  • 1 Galaxy Team, Penn State University, University Park, Pennsylvania
    2 Galaxy Team, Johns Hopkins University, Baltimore, Maryland

Slides, Video

Due to resource constraints, the main public Galaxy server run by the Galaxy Team, usegalaxy.org, moved from Penn State to the Texas Advanced Computing Center, with backups at the Pittsburgh Supercomputing Center. In addition to these resources, Galaxy has been awarded an XSEDE grant of over 400,000 SUs, which we will use to further extend usegalaxy.org's computing capacity.

This talk provides an overview of the work that was done to move the site, what challenges we faced, and some of the work that is going on right now and in the near future.

Accepted Talks, Session 8, Wednesday, July 2

These talks have been accepted for the second lightning talks storm on Wednesday.

Plan for Galaxy based Chip-exo Analysis platform

Bongsoo Park1

  • 1 Center for Eukaryotic Gene Regulation, The Pennsylvania State University

Slides, Video

BeeGFS: Accelerating the access to BLAST and Galaxy Indices

Franz-Josef Pfreundt1, Björn Grüning2

  • 1 Fraunhofer ITWM
    2 Bioinformatics Uni Freiburg

Slides, Video

Less talking, more doing: Crowd-sourcing the integration of Galaxy with a high-performance computing cluster

Michael Crusoe1

  • 1 Michigan State University

Slides, Video

Running and maintaining a reliable production Galaxy server

Shane Sturrock1

  • 1 New Zealand Genomics Ltd

Slides, Video

Private BLAST: Using Galaxy

Emma Prudent1

Presented by Gilda Le Corguillé1

  • 1 Abims

Slides, Video

SNPedia

Michael Cariaso1

  • 1 KeyGene

Slides, Video

Galaxy: Farm to Federation

Kyle Ellrott1, Dannon Baker2

  • 1 UC Santa Cruz
    2 Johns Hopkins University

Slides, Video

Galaxy Docker Containers: Docker, Docker, Docker

Björn Grüning1

  • 1 Bioinformatics Uni Freiburg

This talk was entirely a live demo.

Video

Galaxy Search API

Kyle Ellrott1

  • 1 UC Santa Cruz

Slides, Video


Abstract Submission

The deadlines for oral and poster presentation abstracts have already passed. However, you can still submit late abstracts. Late oral abstracts will be considered to fill cancellations, but not in the initial selection of abstracts.

Abstracts are submitted electronically. Abstracts should be 250 words or fewer of plain text. Talks and posters on any topic of interest to the Galaxy community are welcome. Areas of interest include, but are not limited to:

  • Compelling or novel uses of Galaxy for biomedical analysis
  • Best practices for local Galaxy installation and management
  • Integrating tools and/or data sources into the Galaxy framework
  • Deploying Galaxy on different infrastructures

See the GCC2013 Abstracts list to see the broad range of topics presented in 2013.

There will also be an opportunity for lightning talks, which will be solicited at the meeting.

Please Note: By submitting an abstract you:

  • Agree to make your slides/posters freely available on this web site no later than August 1, 2014.
  • Agree, if you are giving an oral presentation, to have your presentation videotaped and made publicly available during and after the conference.

  Submit a Late Abstract  


Special GCC and Galaxy series in GigaScience

GigaScience Journal

The GigaScience "Galaxy: Data Intensive and Reproducible Research" series announced for the 2013 conference has published its first papers, and is continuing to take submissions for this year's meeting and beyond. BGI is also continuing to cover the article processing charges until the end of the year; for more information, see their latest update. Accepted talks and selected posters from GCC2014 are eligible for consideration to appear in this series.

GigaScience is co-published by BGI Shenzhen and BioMed Central and focuses on studies utilizing large-scale datasets and workflows.

Timeline

Date          Event/Deadline
February 14   Talk and Poster Abstract submission opens; rolling poster evaluation begins
April 4       Talk Abstract submission closed
April 18      Authors notified of Talk abstract acceptance status
April 25 26   Poster Abstract submission closes
May 2         All authors notified of Poster abstract acceptance status
August 1      All conference material made available on the conference web pages.

Questions? Contact the Organizers.