Linked grids uncover genetic mutations
Knowing which genetic mutation within a family causes a particular disease can lead to recommended lifestyle changes that may help avoid symptoms, or to medication that treats the disease. But the computing analysis needed to identify such mutations can take the equivalent of years to complete on a single computer.
Geneticists use a statistical method called genetic linkage analysis to determine the location of disease-causing mutations on the chromosome. Given a family’s genealogy and its members’ genetic makeup, the analysis is mapped onto a probabilistic graphical model that represents the likelihood of a genetic marker being linked to the disease. For large families with many genetic markers, these computations are extremely demanding.
Superlink-online, a distributed system developed at the Technion-Israel Institute of Technology, helps researchers perform their analyses in a matter of days by distributing these computations over thousands of computers worldwide. Geneticists submit their data through the web portal with a single click and get their results, ready to use. Behind the scenes, the system splits the computations into hundreds of thousands of independent jobs, invokes them on the available resources, and assembles the results back into a single data set.
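The split, run, and assemble pattern described above can be illustrated with a small sketch. This is not Superlink-online's code: the `weight` function is a hypothetical toy stand-in for a probabilistic model, all function names are invented for illustration, and a local thread pool stands in for the grid. The idea shown is only that a large sum over variable assignments can be cut into independent jobs by fixing the values of a few variables, with the partial results added back together at the end.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def weight(assignment):
    """Toy stand-in (hypothetical) for the probability of one full assignment.

    Neighbouring binary variables agree with weight 0.9 and differ with 0.1.
    """
    w = 1.0
    for a, b in zip(assignment, assignment[1:]):
        w *= 0.9 if a == b else 0.1
    return w


def make_jobs(n_vars, n_fixed):
    """Split the sum over 2**n_vars assignments into 2**n_fixed independent jobs.

    Each job fixes the first n_fixed variables to one combination of values.
    """
    return [(prefix, n_vars - n_fixed) for prefix in product((0, 1), repeat=n_fixed)]


def run_job(job):
    """Sum the weights of all assignments that start with the job's fixed prefix."""
    prefix, n_free = job
    return sum(weight(prefix + rest) for rest in product((0, 1), repeat=n_free))


def total_serial(n_vars):
    """Reference answer computed in one piece, as on a single computer."""
    return sum(weight(a) for a in product((0, 1), repeat=n_vars))


def total_distributed(n_vars, n_fixed, workers=4):
    """Run the jobs on a pool of workers and assemble the partial sums."""
    jobs = make_jobs(n_vars, n_fixed)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(run_job, jobs))
    return sum(partials)


if __name__ == "__main__":
    print(len(make_jobs(10, 3)), "jobs")
    print(total_serial(10), total_distributed(10, 3))
```

Because the jobs share no state, they can run in any order and on any machine, which is the property that makes opportunistic, non-dedicated resources usable for this kind of work.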
Superlink ran on a single computer in 2002 when it was first released by Professor Dan Geiger and his students at the Technion. By 2005, as more computer power was needed to perform increasingly complex analyses, then-doctoral student Mark Silberstein began working on a distributed version. Silberstein and his adviser, Professor Assaf Schuster, realized that the only way to satisfy exceedingly high computing demands was to enable opportunistic use of non-dedicated computers.
An example of a complex consanguineous pedigree, or graphic map of a family tree. Squares represent males and circles represent females. Individuals affected by a genetic mutation are shown as solid squares or circles. Image courtesy Kwanghyuk (Danny) Lee, Baylor College of Medicine.
“[The data] was too complex to analyze on one CPU...,” says Silberstein. “It was impossible to provide ‘service’ with this quality of service.” The opportunistic model was chosen because “with literally zero budget for purchasing and maintaining dedicated hardware, and with the actual resource demand reaching thousands of CPUs, we could not afford any other model.”
In early 2006, thanks to close collaboration with the Condor team at the University of Wisconsin-Madison, the first version of Superlink-online was released using Wisconsin’s Condor pool and the Technion’s own home-brewed Condor pool with about 100 CPUs. Eventually, additional resources came from the Open Science Grid, EGEE, and the Superlink@Technion community grid, which uses the idle cycles on participants’ home computers.
Since then, the system has enabled hundreds of geneticists worldwide to analyze much larger data sets, producing results two orders of magnitude faster than the serial version. “The analysis of complicated pedigrees is always painful and challenging,” says researcher Kwanghyuk (Danny) Lee of Baylor College of Medicine. “With the help of Superlink-online, however, the large complicated families can be analyzed very fast and very accurately.” In fact, several mutations that cause rare diseases have been found, including those responsible for Hereditary Motor and Sensory Neuropathy, “Uncomplicated” Hereditary Spastic Paraplegia, and Ichthyosis.
During one three-month period, more than 25,000 non-dedicated hosts from all grids actively participated in the computations, reaching a maximum effective throughput roughly equal to that of a dedicated cluster of up to 8,000 cores.
Silberstein and others are finalizing a version that will significantly extend Superlink-online’s power by accessing resources over all the aforementioned grids and Tokyo Institute of Technology’s Tsubame supercomputer. A new resource management system, GridBot, unifies these into a single scheduling framework.
Open Science Grid, for its part, has been essential to Superlink-online.
“Without OSG, we would not be close to where we stand now,” says Silberstein. “Recently we managed to complete one genetic analysis task in 5 days. This task comprised about 3.5 million jobs of approximately 10 minutes each, roughly 55 years of CPU time. One-third of this workforce was from OSG.”
And this particular analysis was especially important: it confirmed the suspected location of a mutation that causes age-related cortical cataracts.
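The figures quoted for this run can be put in perspective with a little arithmetic. The sketch below is only a back-of-the-envelope check with hypothetical helper names, assuming a 365-day year. The quoted numbers are rounded: at exactly 10 minutes per job, 3.5 million jobs would come to nearer 67 CPU-years, so 55 CPU-years implies an average of a little over 8 minutes per job. Either way, finishing in 5 days means roughly 4,000 CPUs were busy on average.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # assumes a 365-day year


def cpu_years(n_jobs, minutes_per_job):
    """Total CPU time, in years, for n_jobs jobs of the given average length."""
    return n_jobs * minutes_per_job / MINUTES_PER_YEAR


def average_busy_cpus(cpu_years_total, wall_clock_days):
    """Average number of CPUs that must be busy to finish in wall_clock_days."""
    return cpu_years_total * 365 / wall_clock_days


if __name__ == "__main__":
    # Job count and length as quoted (both approximate).
    print(round(cpu_years(3_500_000, 10), 1), "CPU-years at exactly 10 minutes per job")
    # Quoted total of 55 CPU-years completed in 5 days.
    print(round(average_busy_cpus(55, 5)), "CPUs busy on average")
```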
~Marcia Teckenbrock, Open Science Grid, Fermilab
August, 2009



