GlueX team nears needed throughput on OSG
Richard Jones, an associate professor of physics at the University of Connecticut, is a member of the GlueX Collaboration, a group of scientists pursuing a diverse program of physics using photon beams. Based at the Thomas Jefferson National Accelerator Facility in Virginia, GlueX is scheduled to begin producing data in 2014. In December 2012, the GlueX team did a production run on the Open Science Grid to simulate what might happen when the facility is running and producing real results. Jones recently spoke to the OSG communications team about their experiences.
The goal of the GlueX experiment is to understand the confinement of quarks and gluons in quantum chromodynamics. GlueX will ultimately produce a linearly polarized photon beam, and a detector will collect data on meson production and decays. After the first year of running, the statistics will exceed the current photo-production data by several orders of magnitude.
From planning to production, the lifecycle of big experiments like this can take a decade. It has turned out to be even longer for GlueX, in large part due to changes in computing. During initial planning in 1998, collaborators relied on Moore’s Law (over the history of computing hardware, the number of transistors on integrated circuits has doubled approximately every two years). They also assumed that by 2008 they would be able to afford the 10,000 cores needed for Monte Carlo simulations.
By 2004, Moore’s Law seemed to be losing momentum. After seeing how grid computing benefitted high energy physics colleagues at the LHC, the GlueX team successfully pursued a Physics at the Information Frontier grant from the National Science Foundation. By 2009, they had formed a new OSG virtual organization with the University of Connecticut as its home site. By bringing in an existing cluster, they gained experience running their software stack on top of the grid platform.
Manpower was a significant concern. The lab's batch system, storage management, and staging are unique to it, leaving little staff effort to spare for outside resources. On the other hand, the grid could bring institutional resources to bear on the research, in order to get from raw data to the publication of results. In the end, the grid won out: “In the last four months,” Jones said, “we’ve shown how well our software runs on a grid platform, and demonstrated that a modest amount of manpower can make it work.”
Jones describes the December run as a data challenge. The main point was to see the pros and cons (and tradeoffs) between typical lab computing methods and grid methods. With that in mind, the GlueX team developed a stack of software applications that takes events from a Monte Carlo generator through detector simulation and resolution smearing, and on to reconstruction and the production of a data summary tape. After formalizing the workflow, they ran it through OSG production to show net throughput efficiency. It met their expectations. “It was an interesting experiment,” Jones said. “On the OSG, we received higher throughput than I had ever seen by a factor of four.” The team had previously been able to manage simultaneous execution on 2,000 cores on the OSG. When the OSG production reached its peak, they had about 8,000 jobs running simultaneously. In fact, the OSG run marked the first time they had come close to the projected capacity they would need (10,000 cores) – and it was the first time they had managed that much throughput from simultaneous processes.
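The workflow above chains one stage's output into the next. The following Python sketch shows that staging pattern in miniature; the stage names and the event records are purely illustrative stand-ins for the collaboration's actual tools, not GlueX software.

```python
# Hypothetical sketch of a generator-to-DST chain like the one described
# above. Each stage consumes the previous stage's output.

def generate(n_events):
    # Monte Carlo generator: produce raw simulated events
    return [{"id": i, "stage": "generated"} for i in range(n_events)]

def simulate(events):
    # Detector simulation: propagate events through a detector model
    return [dict(e, stage="simulated") for e in events]

def smear(events):
    # Apply detector resolution (smearing) to the simulated output
    return [dict(e, stage="smeared") for e in events]

def reconstruct(events):
    # Reconstruction: recover physics objects from detector-level data
    return [dict(e, stage="reconstructed") for e in events]

def write_dst(events):
    # Produce the data summary tape (DST): the compact analysis sample
    return [dict(e, stage="dst") for e in events]

def run_pipeline(n_events):
    events = generate(n_events)
    for stage in (simulate, smear, reconstruct, write_dst):
        events = stage(events)
    return events
```

Formalizing a chain like this is what lets a workflow system fan thousands of independent jobs out across grid sites, since each event sample can run the full chain on its own.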
The secondary goal of the run was to produce a large enough minimum bias trigger sample to evaluate realistic backgrounds for studies of exclusive multi-hadron final states. Now GlueX researchers can do detailed studies of particle identification strategies, and also do higher-level analysis that will allow them to extract mesonic resonances from the data. It requires significant processing to be able to analyze the event distribution, in large part due to the difficulty of making cuts to eliminate background (which is necessary to extract a clean signal).
“The success of our analysis technique relies on identifying an exclusive final state – that there are no missing particles, that nothing escapes our detector,” says Jones. “Being able to isolate an exclusive sample in any number of final states is a challenge.”
“When we ramp up to full intensity, our data rate is about 20,000 events per second,” adds Jones. “During the first year or two, we’ll be running at a tenth of that, just learning how the detector works. Our goal in this data challenge was to be able to simulate a large fraction of the first year’s running. What we achieved was about eight weeks. Given the limitations of how many months a year you can run, we think that will work out to about a half year.”
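A quick back-of-the-envelope check of the figures Jones quotes: the rates come from the article, but the number of beam weeks per year is our own assumption, inserted only to illustrate the "about a half year" estimate.

```python
# Rates quoted in the article
full_rate = 20_000                  # events per second at full intensity
first_year_rate = full_rate / 10    # first year or two runs at a tenth of that

seconds_per_week = 7 * 24 * 3600
# The data challenge simulated about eight weeks of first-year running
events_simulated = first_year_rate * 8 * seconds_per_week

# ASSUMPTION (not from the article): roughly 16 weeks of beam delivered
# in the first year, so eight simulated weeks is about half a year's data.
assumed_beam_weeks = 16
fraction_of_year = 8 / assumed_beam_weeks
print(f"{events_simulated:.2e} events, {fraction_of_year:.0%} of first-year data")
```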
The GlueX team learned a lot in December, gaining a realistic sense of how their data production would run on the OSG. They paid considerable attention to inefficiencies, and to anticipating how jobs might fail to complete. Under certain conditions some jobs would hang until researchers noticed they were incomplete, so the team has been identifying and reproducing glitches in order to resolve them before the next run. “Some things only show up when you try to produce a large sample,” Jones said. “Those problems won’t happen next time.”
In doing the data challenge, they are following the example of their Large Hadron Collider (LHC) colleagues, who started doing data challenges as far back as 2000. An external review committee recommended this approach to the GlueX team. Jones offers this advice: “If you’re going to use the grid, prepare ahead of time. You can’t anticipate the issues that will appear when you move from one scale of data volume or compute parallelism to a new one. Computing and storage advances increase the capacity of what science can be done, but along with them come new challenges that must be understood and overcome.” Jones praised the amount of operational expertise in the OSG community, and he is grateful that the community is set up to bring newcomers on board and get them going; that help, he says, has been crucial.
Jones had this advice for other scientists looking into using the OSG: “Does grid computing make sense for your problem? Some science problems may run very efficiently on 10,000 cores, and some may require much more specialized hardware. OSG resources are typical Ethernet environments – you’re not looking for enormous speed in interprocess communication. If the problem does work there, you’ll find that the OSG is a highly collaborative environment, a place where scientists will find other colleagues that are ready to help. The OSG has a strong commitment to scientific productivity that I find very attractive.”
2013 Grid School
Applications are now open for the 2013 OSG User School. The deadline is Friday, March 29, 2013.
Once again, the OSG will offer a week-long “summer school” on high-throughput computing. This is a great opportunity for students to learn about HTC from the experts. The School is primarily aimed at graduate students at US institutions in any branch of science or research that can use large-scale computing. The goal is to help students learn the basics of high-throughput computing and its applications, so that they can begin to use HTC tools to transform their research.
Applications are open during March and acceptance notifications will go out mid-April. The 2013 OSG User School will take place from June 24-27 at the University of Wisconsin-Madison. Also, accepted students will attend the XSEDE13 Conference as part of a broader summer program in research computing. Local expenses are paid for all students, and there is some budget for travel expenses.
Announcement emails will be distributed soon.
Link to Online application form: http://vdt.cs.wisc.edu/osgus-2013/
Link to 2013 OSG User School twiki page: https://www.opensciencegrid.org/bin/view/Education/OSGUserSchool2013
For questions about the school, the application process or anything else, please send an email to: firstname.lastname@example.org
Network Diagnosis in Grids
The Open Science Grid has had a new focus area in networking since last summer. Work is underway to enable OSG to provide networking information for OSG sites and applications, and part of this work is focused on providing easier ways to identify and diagnose network problems across the OSG infrastructure. The work capitalizes on the perfSONAR pS-Performance Toolkit (http://psps.perfsonar.net/toolkit/) and the Modular Dashboard. The pS-Performance Toolkit provides a suite of tools and a standardized infrastructure to measure network performance between instances; tests can be scheduled or run on demand. Specialized applications (NDT and NPAD) deliver a detailed analysis of the network path from a client host to a perfSONAR-PS instance. The Modular Dashboard gathers network metrics from scheduled tests between perfSONAR-PS instances and summarizes and displays the results. Together, the Toolkit and Dashboard make it much easier to identify, localize, and fix network problems in the distributed OSG infrastructure.
Link to the OSG Campus Infrastructures Community webcast, January 25, 2013: http://www.youtube.com/watch?v=rND2Wyc0Oaw
Patch Monitoring with Pakiti
Pakiti is a web-based application you can set up for your site that summarizes the patching status of your machines. Pakiti also knows about security-specific updates: it can show which systems need security updates as opposed to other software updates, and it links to the relevant CVEs so you can easily see which vulnerabilities apply to your systems and how critical they are. CVE (Common Vulnerabilities and Exposures) is a dictionary of publicly known information security vulnerabilities and exposures maintained by MITRE (mitre.org). Pakiti does not install any updates itself.
Pakiti was developed at CERN, and is now available in the OSG v3 software release. The OSG security team has been running a central Pakiti server to monitor a few different hosts at various sites, and now any OSG site can set up its own Pakiti server without making the site’s vulnerability information available off site.
The Pakiti client that is installed on monitored systems is a simple bash script that should not interfere with normal operations. The data sent to your site’s Pakiti server is essentially the output of ‘rpm -qa’, as well as the operating system release version.
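To make the report contents concrete, here is a minimal Python sketch of what a Pakiti-style client gathers per the description above: the installed-package list plus the OS release string. The real client is a bash script, and the command and file path here are parameterized, illustrative defaults, not Pakiti's actual implementation.

```python
import subprocess

def collect_report(pkg_cmd=("rpm", "-qa"), release_file="/etc/redhat-release"):
    """Gather the two pieces of data the article says the client sends:
    the installed-package list and the OS release version."""
    # Run the package-listing command and split its output into one
    # entry per installed package
    result = subprocess.run(list(pkg_cmd), capture_output=True,
                            text=True, check=True)
    packages = result.stdout.splitlines()
    # Read the OS release string (path is an illustrative default)
    with open(release_file) as f:
        release = f.read().strip()
    return {"packages": packages, "os_release": release}
```

In the real deployment this report would be posted to your site's Pakiti server, which compares the package versions against known security updates.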
The Pakiti homepage is http://pakiti.sourceforge.net
OSG-specific installation instructions are available at: https://twiki.grid.iu.edu/bin/view/Documentation/Release3/PakitiInstallation
~Kevin Hill, OSG Security Team