BLAST on OSG provides a timesaving alternative for large-scale analysis

~Greg Moore

The Basic Local Alignment Search Tool (BLAST), an algorithm for comparing primary biological sequence information, is one of the most widely used tools in bioinformatics. The National Center for Genome Analysis Support (NCGAS) and the Indiana University (IU) High Throughput Computing group have been experimenting with using the Galaxy web-based user interface to submit BLAST jobs on the Open Science Grid (OSG).

Rob Quick, manager of IU’s High Throughput Computing group, has been the OSG operations area coordinator since 2006. His BLAST project collaborators include Le-Shin Wu, NCGAS bioinformatics support lead; Soichi Hayashi, a software developer with IU’s High Throughput Computing group and OSG Operations; and Carrie Ganote, NCGAS bioinformatics analyst.

About Galaxy

Galaxy at IU provides a web-based platform for data-intensive genome analysis research. It employs IU’s Mason cluster for compute services and the IU Data Capacitor for project storage, and is hosted on IU’s Quarry Gateway Web Services Hosting System. Galaxy is a scientific workflow platform that makes computational biology easier for research scientists who do not know computer programming. NCGAS has created Galaxy portals for IU investigators and NSF-funded life science researchers nationally. These provide ready access to the full suite of genome assembly, annotation, alignment, and other applications — as well as the file transfer and transformation utilities necessary to build genome science workflows.

A way to run BLAST in parallel

Le-Shin Wu points out that today’s technologies for genome sequencing are faster and cheaper, and create more sequence data than ever before. With limited local computing resources, analyzing, understanding, and using these vast amounts of genomic information become challenging in terms of efficiency. One solution is to split a single, sizeable analysis task into many independent, smaller tasks and then distribute them to multiple computing resources in parallel. The OSG, which can support large amounts of central processing unit (CPU) hours simultaneously, provides a means to accomplish this.

Soichi Hayashi has been researching a way to run BLAST in parallel by splitting up the target database into many chunks and making it run in a distributed high-throughput computing (DHTC) environment, namely the OSG. In turn, Carrie Ganote has enabled OSG BLAST on IU’s Galaxy interface. Ganote says that the interface for running BLAST on OSG will provide an alternative to the National Center for Biotechnology Information BLAST servers, which are wonderful for small jobs and parameter tinkering, but prohibitively slow for large jobs (and require an active browser for the duration of the run).
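As a rough illustration of the database-splitting idea, the sketch below divides a FASTA-formatted database into a fixed number of chunks of roughly equal total sequence length, so that each chunk could be searched as an independent job. It is not the team's prototype: the function names `read_fasta` and `split_fasta` are hypothetical, and the longest-first balancing rule is simply one reasonable choice.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    seq_lines: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(seq_lines)
            header = line[1:]
            seq_lines = []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        yield header, "".join(seq_lines)


def split_fasta(text: str, n_chunks: int) -> List[List[Record]]:
    """Assign records to n_chunks, keeping total residues per chunk balanced.

    Hypothetical helper: records are placed longest first, each into the
    chunk that currently holds the fewest residues.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n_chunks)]
    sizes = [0] * n_chunks
    records = sorted(read_fasta(text), key=lambda r: len(r[1]), reverse=True)
    for header, seq in records:
        smallest = sizes.index(min(sizes))
        chunks[smallest].append((header, seq))
        sizes[smallest] += len(seq)
    return chunks


example = ">a\nACGT\nAC\n>b\nGG\n>c\nTTTT\n"
chunks = split_fasta(example, 2)
```

Each chunk would then be formatted and searched separately. Because BLAST's statistical scores depend on database size, results from separate chunks generally need to be normalized before they are compared or merged.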

“A fast alternative on stable resources would allow the public to increase their productivity without needing to install BLAST themselves or run it from the command line,” Ganote says. “Our current implementation is on IU resources, but any NSF-funded project may have access to NCGAS support by creating an account with us.”

More efficiency for researchers

Ganote’s work focuses mainly on genome and transcriptome assembly and downstream analysis. “When dealing with biology,” she says, “nothing is 100% predictable. This is certainly true with genomics. Every organism is different, and even with very closely related species, it’s not always safe to guess that one will be completely like another.”

“When attempting analysis and assessment of a new organism, it’s difficult to tell whether the path one is taking is the correct one,” continues Ganote, “and often trial and error is the only way forward. This may mean running software many times with different settings, looking for the ‘best’ result. When each software run takes days (maybe even weeks), this can make for a large time sink for researchers.”

Ganote and her collaborators are seeking to cut execution time from weeks to days, running searches on a scale that is simply impossible using traditional methods. BLAST searches are a key tool for bioinformatics researchers; however, Hayashi estimates that the sequence databases they search are growing faster than Moore’s law. Running BLAST on the OSG may therefore provide a good alternative to traditional methods of executing BLAST.

“We obviously want to reduce the time it takes to run BLAST searches,” says Hayashi. “But we also don’t want to waste computational resources by running BLAST jobs on environments like Mason, which provides a large amount of memory. BLAST jobs primarily use CPU hours and not memory, so running BLAST on OSG allows researchers to free up those expensive computational resources and run other applications that cannot be run on OSG.”

Making BLAST work

The team’s first phase was making sure they could run BLAST on the OSG, and they have shown that this can be done effectively. Their next step is tackling huge search queries and breaking them into smaller jobs that take full advantage of the OSG. Hayashi is currently updating their prototype, which is still in a local testing environment and will not be released publicly until they are confident of its production quality.
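One way to picture the query-splitting step is the minimal sketch below, which groups query identifiers into fixed-size batches and later merges per-batch results back into the original query order, even if jobs finish out of order. The names `batch_queries` and `reassemble`, and the dictionary layout of the results, are assumptions made for illustration; they do not describe the team's prototype.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batch_queries(query_ids: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size query IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(query_ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def reassemble(query_ids: List[str],
               batch_results: Iterable[Dict[str, List[str]]]) -> List[str]:
    """Merge per-batch hit lines back into the original query order.

    Each element of batch_results is assumed (hypothetically) to map a
    query ID to the list of result lines produced for that query.
    """
    merged: Dict[str, List[str]] = {}
    for result in batch_results:
        for qid, hits in result.items():
            merged.setdefault(qid, []).extend(hits)
    return [line for qid in query_ids for line in merged.get(qid, [])]


ids = ["q1", "q2", "q3", "q4", "q5"]
batches = list(batch_queries(ids, 2))

# Results arriving out of order, as they might from independent grid jobs.
results = [
    {"q3": ["q3\thitC"]},
    {"q1": ["q1\thitA", "q1\thitB"], "q2": []},
    {"q5": ["q5\thitD"]},
]
merged_lines = reassemble(ids, results)
```

Keying results by query ID rather than by arrival order means a slow or retried job does not disturb the final ordering.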

Queries can be quite different in size and nodes may not know how to handle them. “BLAST is unpredictable,” says Hayashi. “This makes it difficult to reassemble the queries. Sometimes it’s difficult to tell if the fault is the site or the job, because sometimes the site is just being hammered by our jobs.”

Ganote points out that with jobs such as BLAST, the question is not really whether the problem can be solved: each BLAST job is easy enough to run on a personal computer. However, when thousands of these jobs need to be run, the researcher is often resigned to waiting weeks for all of the parts to finish. By breaking the problem across many computers in a grid such as the OSG, the wait time for the project is radically diminished. “This will be an immense benefit to researchers,” she says.

The IU-NCGAS team, in particular, believes that integrating complementary technologies can result in a solution that is better than each technology on its own. This is the case with BLAST and OSG.

“Underused compute resources lead to waste — waste of power to run idle machines, waste of funds that supported the purchase of the machines, waste of useful lifetime on the machines,” notes Ganote. “By allowing small individual jobs to percolate and opportunistically fill idle cycles, the value of the grid resource is increased and useful work is being accomplished.”

Getting the word out

Hayashi would like to get the word out to researchers about how developers like him can help. He is actively making himself known at IU, so researchers will know to call on him and other IU research technologies staff like him. He says, “It can take a lot of brain power to figure out a problem. We came in and said you don’t have to do it this way: We can run the billion queries on the OSG with just brute force. This means biologists don’t have to engineer a solution. They can focus on the biology. Developers can focus on the problem.”

To learn more, listen to a recording of Rob Quick’s OSG Campus Infrastructure Communities (CIC) Series webcast on Galaxy-based BLAST submission to OSG, available here.