Development News Brief

February 18, 2011

Get Galaxy!

new: % hg clone http://www.bx.psu.edu/hg/galaxy galaxy-dist

upgrade: % hg pull -u -r ca0c4ad2bb39

What's New

galaxyproject at Twitter

Join us on Twitter for the latest updates on development projects, conferences and training, and all things Galaxy!

#usegalaxy

http://twitter.com/#!/search/galaxyproject

New Ways to Search Data Libraries

In addition to searching on data library names and descriptions, we have introduced the ability to search on attributes of library datasets.

Two different approaches to searching library datasets are supported:

Searching with whoosh
Searching with Apache's Lucene. (Contributed by Brad Chapman)

The public Galaxy instances hosted at Penn State University use the whoosh implementation, while Brad Chapman's instance uses the Lucene implementation. You can choose either of these implementations for your local Galaxy instance, but not both.

There are two new config settings in the "Data Libraries" section of universe_wsgi.ini.sample for these implementations (again, uncomment only one of these if you choose to use this feature):

# Search data libraries with whoosh
#enable_whoosh_library_search = True
# Whoosh indexes are stored in this directory.
#whoosh_index_dir = database/whoosh_indexes

# Search data libraries with lucene
#enable_lucene_library_search = False
# maximum file size to index for searching, in MB
#fulltext_max_size = 500
#fulltext_noindex_filetypes=bam,sam,wig,bigwig,fasta,fastq,fastqsolexa,fastqillumina,fastqsanger
# base URL of server providing search functionality using lucene
#fulltext_url = http://localhost:8081
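
Since only one of the two implementations should be enabled, a small check along the following lines can catch a misconfigured file before Galaxy starts. This is a minimal sketch and not part of the distribution: the function name enabled_search_backends is hypothetical, and the code assumes the settings live in an [app:main] section of your config file.

```python
import configparser


def enabled_search_backends(ini_text, section="app:main"):
    """Return the names of the library search backends switched on in ini_text.

    Hypothetical helper; the section name is an assumption about your config.
    """
    # interpolation=None so that '%' characters in other settings are harmless
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    backends = []
    for name, option in (("whoosh", "enable_whoosh_library_search"),
                         ("lucene", "enable_lucene_library_search")):
        # Commented-out or missing settings count as disabled
        if parser.getboolean(section, option, fallback=False):
            backends.append(name)
    return backends


SAMPLE = """
[app:main]
enable_whoosh_library_search = True
whoosh_index_dir = database/whoosh_indexes
#enable_lucene_library_search = False
"""

backends = enabled_search_backends(SAMPLE)
if len(backends) > 1:
    raise SystemExit("Enable only one of: " + ", ".join(backends))
print(backends)
```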

Details for searching with whoosh

You will need to rebuild the indexes regularly so that searches reflect your current library datasets. You can do this by running the following script included in the distribution:

% <your galaxy install dir>/scripts/data_libraries/build_whoosh_index.sh

Running this script will build whoosh indexes in the directory named by the config setting labeled whoosh_index_dir shown above.

We recommend adding an entry like the following to your Galaxy server's crontab to keep the indexes current:

0,30 * * * * cd /var/opt/galaxy/g2test/galaxy_test ; bash ./scripts/data_libraries/build_whoosh_index.sh

The above setting rebuilds the indexes every 30 minutes. You may want to build them more often because search results will include only those library datasets that were indexed the last time the script was executed. In other words, if a library dataset was uploaded after the whoosh indexes were created, it will not be returned in search results even if it meets the search criteria.
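
The staleness described above can be illustrated with a toy in-memory index. This is only a sketch of the idea, not the whoosh implementation: build_index and search are hypothetical stand-ins, and the dataset records are made up for the example.

```python
SEARCHED_FIELDS = ("name", "info", "message", "dbkey")


def build_index(datasets):
    """Snapshot the searchable attributes of the datasets that exist right now."""
    return [dict(d) for d in datasets]


def search(index, term):
    """Return names of indexed datasets with `term` in any searched attribute."""
    term = term.lower()
    return [d["name"] for d in index
            if any(term in str(d.get(f, "")).lower() for f in SEARCHED_FIELDS)]


library = [{"name": "reads_1.fastq", "info": "", "message": "run A", "dbkey": "hg19"}]
index = build_index(library)  # what the scheduled script would do

# Uploaded after the index was built: in the library, but not in the index.
library.append({"name": "reads_2.fastq", "info": "", "message": "run B", "dbkey": "hg19"})

print(search(index, "hg19"))                 # the stale index misses reads_2.fastq
print(search(build_index(library), "hg19"))  # a rebuild picks it up
```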

Using whoosh, you can search on the "name", "info", "message" and "dbkey" attributes of library datasets using the new search box displayed at the top of the "Data Libraries" page.

Clicking the "Advanced search" link below the search box allows you to search data library names and descriptions as you did in the past.

Details for searching with Lucene

If you use the Lucene implementation instead of whoosh, there are a few differences. You'll need to build Lucene indexes using the following script:

<your galaxy install dir>/scripts/data_libraries/build_lucene_index.sh

In addition to the "name", "info", "message" and "dbkey" attributes, this implementation will also index the contents of your library datasets. You should decide on the optimal values for the following config settings for your Galaxy instance:

# maximum file size to index for searching, in MB
#fulltext_max_size = 500
#fulltext_noindex_filetypes=bam,sam,wig,bigwig,fasta,fastq,fastqsolexa,fastqillumina,fastqsanger

You'll also need to decide on a value for the following config setting:

# base URL of server providing search functionality using lucene
#fulltext_url = http://localhost:8081
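
To help pick values for fulltext_max_size and fulltext_noindex_filetypes, it can be useful to see which of your datasets would have their contents indexed. The sketch below encodes one plausible reading of the two settings (skip listed datatypes, skip files over the size limit). How the Lucene indexer actually applies them is not described here, so treat the function should_index_contents and its 1 MB = 1024 * 1024 bytes convention as assumptions.

```python
NOINDEX_FILETYPES = frozenset(
    "bam,sam,wig,bigwig,fasta,fastq,fastqsolexa,fastqillumina,fastqsanger".split(",")
)


def should_index_contents(size_bytes, extension, max_size_mb=500,
                          noindex_filetypes=NOINDEX_FILETYPES):
    """Assumed rule: index contents unless the datatype is excluded or the file is too big."""
    if extension.lower() in noindex_filetypes:
        return False
    # Assumption: the limit is expressed in binary megabytes
    return size_bytes <= max_size_mb * 1024 * 1024


print(should_index_contents(10 * 1024 * 1024, "tabular"))
print(should_index_contents(10, "bam"))
print(should_index_contents(600 * 1024 * 1024, "txt"))
```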

Managing disk space with Data Libraries

The cleanup_datasets.py script has been adjusted to handle the lifecycle of library datasets more correctly.

Managing library datasets is a bit complex, so here is a scenario that should help clarify it. The complexity of handling library datasets is mostly contained in the delete_datasets() method in the cleanup_datasets.py script.

Example of usage:

Assume we have 1 library dataset with

LibraryDatasetDatasetAssociation -> LibraryDataset and Dataset

This dataset would have the following database column values:

    LibraryDatasetDatasetAssociation deleted: False
    LibraryDataset deleted: False, purged: False
    Dataset deleted: False, purged: False

Now a user deletes the associated dataset from a data library via a UI menu option.

This action results in the following database column values:

    LibraryDatasetDatasetAssociation deleted: False
    LibraryDataset deleted: True, purged: False
    Dataset deleted: False, purged: False

After the number of days configured for the delete_datasets() method (option -6 in the cleanup_datasets.py script) has passed, execution of the delete_datasets() method results in the following database column values:

    LibraryDatasetDatasetAssociation deleted: True
    LibraryDataset deleted: True, purged: True
    Dataset deleted: True, purged: False

After the number of days configured for the purge_datasets() method (option -3 in the cleanup_datasets.py script) has passed, execution of the purge_datasets() method results in the following database column values:

    LibraryDatasetDatasetAssociation deleted: True
    LibraryDataset deleted: True, purged: True
    Dataset deleted: True, purged: True
    (dataset file removed from disk if -r flag is used)

This scenario is about as simple as it gets. Keep in mind that a Dataset object can have many HistoryDatasetAssociations and many LibraryDatasetDatasetAssociations, and a LibraryDataset can have many LibraryDatasetDatasetAssociations.

Another way of stating it is: LibraryDatasetDatasetAssociation objects map LibraryDataset objects to Dataset objects, and Dataset objects may be mapped to History objects via HistoryDatasetAssociation.
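
The single-association scenario above can be traced with a toy model. This is a sketch only: the classes below are simplified stand-ins rather than Galaxy's model classes, the functions mimic just the flag changes listed in the scenario, and the day thresholds and multi-association cases are ignored.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    deleted: bool = False
    purged: bool = False


@dataclass
class LibraryDataset:
    deleted: bool = False
    purged: bool = False


@dataclass
class LibraryDatasetDatasetAssociation:
    library_dataset: LibraryDataset
    dataset: Dataset
    deleted: bool = False


def user_delete(ldda):
    """A user deletes the library dataset through the UI."""
    ldda.library_dataset.deleted = True


def delete_datasets(ldda):
    """Toy version of the -6 step for the one-association case."""
    if ldda.library_dataset.deleted:
        ldda.deleted = True
        ldda.library_dataset.purged = True
        ldda.dataset.deleted = True


def purge_datasets(ldda):
    """Toy version of the -3 step."""
    if ldda.dataset.deleted:
        ldda.dataset.purged = True


def snapshot(ldda):
    """(LDDA deleted, LD deleted, LD purged, Dataset deleted, Dataset purged)"""
    return (ldda.deleted,
            ldda.library_dataset.deleted, ldda.library_dataset.purged,
            ldda.dataset.deleted, ldda.dataset.purged)


ldda = LibraryDatasetDatasetAssociation(LibraryDataset(), Dataset())
for step in (user_delete, delete_datasets, purge_datasets):
    step(ldda)
    print(step.__name__, snapshot(ldda))
```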

Updated & Improved

Current Tools

Update FASTX tool wrappers to handle fastqsanger formats.
The BLAST+ tools are now uncommented in the sample tool configuration.
Fix an Rpy syntax problem in the Histogram tool that was causing the tool to fail in some local installations.
Enable Cuffcompare to take an arbitrary number of input GTF files.
Add GTF sniffer to datatypes config sample file.

New Tools

Added a Line/Word/Character counter to Text Manipulation tool group.
Extended Extract Genomic DNA tool to support GFF/GTF features and custom genomes:
- This is especially useful for extracting genomic data that correspond to transcripts in GTF format, as is produced by Cufflinks.
- Tool also now accepts sequence data from a history item, enabling the extraction of data from custom genomes.

New Community Tools Added (Tool Shed)

Velvet, ABySS, Minimus2, Phrap, and Newbler. http://community.g2.bx.psu.edu/

Data Libraries

Fix a bug in which a regular (non-admin) user was offered the option to upload a directory of files when the config setting existed but the user's directory did not. Instead of throwing an exception, an attempt to create the directory will now be made. If the attempt fails, the option to "Upload a directory of files" will not be included in the select list on the upload form for that particular user.
When browsing a data library, only display the check boxes and the actions to perform on multiple selected datasets if library datasets are displayed.
Clean up the behavior when performing an action on multiple library datasets. Instead of throwing an error and not allowing the action on any of the datasets if the user is not authorized to perform the action on one of them, all "unauthorized" datasets are now left alone and the action is performed on all datasets for which the user is authorized to perform the action. Appropriate messages detail what occurred.
Allow regular users to delete multiple datasets returned from library dataset searches based on the behavior described above.
Do not include the "Select datasets for import into selected histories" option in the actions popup menu at the data library level if the library's root folder doesn't contain any accessible datasets. Make behavior the same for folder popup menus.
Fix a bug where a regular user could never see deleted library items when toggling show / hide deleted items in a data library.
Add a purged column to the LibraryDataset table. A LibraryDataset is marked "purged" when all associated LibraryDatasetDatasetAssociations are marked "deleted".
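
The new behavior for actions on multiple library datasets (act on the authorized ones, leave the rest alone, and report both) can be sketched as follows. The function act_on_selected, its arguments, and the message wording are hypothetical and are not taken from the Galaxy source.

```python
def act_on_selected(datasets, is_authorized, action):
    """Apply `action` to every dataset the user may act on; skip the others."""
    done, skipped = [], []
    for dataset in datasets:
        if is_authorized(dataset):
            action(dataset)
            done.append(dataset)
        else:
            skipped.append(dataset)
    messages = []
    if done:
        messages.append("Action performed on %d dataset(s)." % len(done))
    if skipped:
        messages.append("Not authorized for %d dataset(s); left unchanged." % len(skipped))
    return done, skipped, messages


deleted = []
done, skipped, messages = act_on_selected(
    ["a.bed", "b.bed", "restricted.bam"],
    is_authorized=lambda name: not name.startswith("restricted"),
    action=deleted.append,
)
print(messages)
```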

Workflows

Workflow steps on the "Run workflow" page are now collapsible to just the title bar with the tool name and annotation. Steps without any inputs to be set will be collapsed by default.
When running a workflow, possible values for Input Dataset steps are filtered to datatypes that are valid for immediately subsequent steps, instead of just using 'data'.
jQuery 1.5 fixes for workflows, covering quotations and flagging of outputs.

Workflows containing tools that are not loaded will now open in the editor instead of failing on the dictionary lookup. The offending nodes are marked as having an error state, and the workflow cannot be saved. This will enable users to at least look at and try to recover a workflow by removing (and potentially replacing) the offending steps, as well as providing a better view of what's actually wrong with the workflow.
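
The idea of flagging steps instead of failing on a lookup can be sketched like this. It is an illustration only: the step dictionaries, the tool ids, and the functions mark_missing_tools and can_save are hypothetical and do not mirror the workflow editor's code.

```python
def mark_missing_tools(steps, loaded_tool_ids):
    """Return copies of the steps, flagging those whose tool is not loaded."""
    marked = []
    for step in steps:
        tool_id = step.get("tool_id")  # None for input dataset steps
        missing = tool_id is not None and tool_id not in loaded_tool_ids
        marked.append(dict(step, error=missing))
    return marked


def can_save(marked_steps):
    """A workflow with any step in an error state cannot be saved."""
    return not any(step["error"] for step in marked_steps)


steps = [
    {"label": "Input dataset", "tool_id": None},
    {"label": "Filter", "tool_id": "Filter1"},
    {"label": "Assemble", "tool_id": "velvet_not_installed"},
]
marked = mark_missing_tools(steps, loaded_tool_ids={"Filter1"})
print([s["label"] for s in marked if s["error"]])
print(can_save(marked))
```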

Workflows will no longer try to run when required tools are not found. An error message is displayed with a link to the editor view (to fix the workflow).
Interval datatype now uses line estimates for large files. This fixes potential "'?' lines" seen by some users.
Removed "Annotation:" labels for a cleaner look on the run workflow page.
New History functionality now shows only the checkbox by default, displaying the input for a history name only when checked.
Tool text has been standardized to refer to datasets as such, instead of the older descriptor 'query'.
We now use the inflector utility to pluralize groupings. Fixes "Querys", etc.

Trackster

Enhancements:
- Support large (10,000+) sets of chromosomes/contigs, as is often the case for low-coverage or non-model genomes.
- Improve speed by streamlining data fetching and caching in Trackster.
- Add support for child tracks.
- UI hints for zooming out/in.
- Extend CIGAR string parsing to handle all operations (insertions are still not fully supported, however).
- History dataset selection: show history name, show and order by hid in display.
- Modified interval index squish mode to display blocks and treat intervals as half-open.
Bug fixes:
- Fix CIGAR string parsing bug that prevented sequence data from being displayed correctly.
- BED files without score column or with scores that are floats can now be displayed.
- Better handling of error messages so that messages and data do not overlap.
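
For readers unfamiliar with CIGAR strings and half-open intervals, here is a small sketch of both ideas together. It is not Trackster's code: parse_cigar and reference_interval are hypothetical helpers. The operation letters (M, I, D, N, S, H, P, =, X) and the set of operations that consume reference bases follow the SAM format specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = frozenset("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # If re-joining the pairs does not reproduce the input, something was skipped
    if "".join("%d%s" % pair for pair in ops) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return ops


def reference_interval(start, cigar):
    """Half-open [start, end) span covered on the reference, for a 0-based start."""
    length = sum(n for n, op in parse_cigar(cigar) if op in REFERENCE_CONSUMING)
    return start, start + length


print(parse_cigar("5S10M2I3D20M"))
print(reference_interval(100, "5S10M2I3D20M"))
```

With half-open intervals, two features that merely touch, such as [100, 133) and [133, 150), do not overlap.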

User Interface (UI)

Search improvements:
- Size of search box increases to show help text (column label).
- Help text is highlighted rather than hidden when user clicks on search box.
- Standard and advanced search divs are toggled correctly when moving between searches.

Galaxy Reports

Clarify in the reports config that Galaxy reports do not support SQLite.
Add two new sample tracking reports and two new workflow reports, each of which provides the total number of items per month and the total number of items per user (per month).

Source

Better egg version conflict resolution. Our eggs should always override dependencies installed elsewhere on sys.path.
Job runners are now loaded dynamically; changes to the source are no longer required. Existing 3rd-party runners will need to add the class to instantiate to the runner module's global '__all__' list. See the provided runners for examples.
The Galaxy-side code for the LWR job runner by John Chilton is now included. Documentation in the wiki is forthcoming; to learn more, see John's documentation at: http://bitbucket.org/jmchilton/lwr/
"LWR allows you to create a "cluster" out of any systems without installing specialized tightly integrated DRM software like PBS/SGE."

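The dynamic loading of job runners described above can be sketched roughly as follows, assuming the runner-global list is the module's __all__. This is an illustration, not the Galaxy loader: load_runner_classes, the demo_runner module, and DemoJobRunner are all hypothetical, and the provided runners remain the authoritative examples.

```python
import importlib
import sys
import types


def load_runner_classes(module_name):
    """Import a runner module by name and return the classes it exports in __all__."""
    module = importlib.import_module(module_name)
    exported = getattr(module, "__all__", None)
    if not exported:
        raise ImportError("%s does not list its runner classes in __all__" % module_name)
    return [getattr(module, name) for name in exported]


class DemoJobRunner:
    """Hypothetical third-party runner."""

    def __init__(self, app):
        self.app = app


# Stand-in for a runner module on sys.path, registered by hand so the
# example is self-contained.
demo = types.ModuleType("demo_runner")
demo.DemoJobRunner = DemoJobRunner
demo.__all__ = ["DemoJobRunner"]
sys.modules["demo_runner"] = demo

runners = [cls(app=None) for cls in load_runner_classes("demo_runner")]
print([type(runner).__name__ for runner in runners])
```
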
Bug Fixes

Gracefully fall back to English if a client requests a locale not installed on the server.
Don't attempt to check for the python-daemon egg on Python 2.4 (it's used by an optional experimental component and is not compatible with Python 2.4).
The drmaa job runner would not always detect job completion on some systems; this has been fixed. The fix could also cause job failures to be considered successful job completions, but we have no way to test for these conditions. Please report if you discover this to be the case.
Migrating to database version 62 no longer fails for MySQL. Thanks Leandro Hermida.
Downloading eggs for offline systems works again.
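
As a general illustration of the locale fallback mentioned above, Python's standard gettext module can return the untranslated (here, English) string when no catalog exists for the requested locale. This sketch shows that stdlib behavior only; the domain and directory names are assumptions, and it is not a copy of Galaxy's own locale handling.

```python
import gettext


def get_translator(locale_name, localedir="locale", domain="messages"):
    """Return a translator that falls back to the original strings if no catalog is found."""
    # fallback=True yields a NullTranslations object instead of raising an error
    return gettext.translation(domain, localedir=localedir,
                               languages=[locale_name], fallback=True)


_ = get_translator("xx_XX", localedir="/nonexistent").gettext
print(_("Data Libraries"))
```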

About Galaxy

Galaxy is supported in part by NSF, NHGRI, the Huck Institutes of the Life Sciences, and The Institute for CyberScience at Penn State.