The advantages of cloud computing were dramatically illustrated last week by researchers working on the STAR nuclear physics experiment at Brookhaven National Laboratory's Relativistic Heavy Ion Collider. New simulation results were needed for presentation at the Quark Matter physics conference, but all the computational resources were either committed to other tasks or did not support the environment needed for STAR computations.
Fortunately, working with technology developed by the Nimbus team at the U.S. Department of Energy's (DOE) Argonne National Laboratory, the STAR researchers were able to dynamically provision virtual clusters on commercial cloud computers and run the additional computations just in time.
Nimbus is an open source cloud computing infrastructure. It provides tools that let users deploy virtual machines on resources, in a manner similar to Amazon's EC2, as well as user-level tools such as the Nimbus Context Broker, which combines several deployed virtual machines into “turnkey” virtual clusters.
The Nimbus team at Argonne has been collaborating with STAR researchers at Brookhaven's Relativistic Heavy Ion Collider for a few years. Both research groups are supported by DOE's Office of Science.
“The benefits of virtualization were clear to us early on,” said Jerome Lauret, software and computing project leader for the STAR experiment. “We can configure the virtual machine image exactly to our needs and have a fully validated experimental software stack ready for use.” The image can then be overlaid on top of remote resources using infrastructure such as Nimbus.
With cloud computing, Lauret said, a 100-node STAR cluster can be online in minutes. In contrast, Grid resources available at sites not expressly dedicated to STAR can take months to configure.
The STAR scientists initially developed and deployed their virtual machines on a small Nimbus cloud configured at the University of Chicago. Then they used the Nimbus Context Broker to configure the customized cloud into Grid clusters, which served as a platform for remote job submission using existing Grid tools. However, these resources soon proved insufficient to support STAR production runs.
“A typical production run will require on the order of 100 nodes for a week or more,” said Lauret.
To meet these needs, the Argonne Nimbus team turned to Amazon EC2. A Nimbus gateway was developed to allow scientists to easily move between the small Nimbus cloud and Amazon EC2.
“In the early days, the gateway served as a protocol adapter as well,” said Kate Keahey, the lead of the Nimbus project. “But eventually we found it easier to simply adapt Nimbus to be protocol-interoperable with EC2 so that the scientists could move their virtual machines between the University of Chicago cloud and Amazon easily.”
Over the past year, the STAR experiment, in collaboration with the Nimbus team, successfully conducted a few noncritical runs and performance evaluations on EC2. The results were encouraging. When the last-minute production request came for new simulations, the STAR researchers had virtual machine images ready to go.
“It was a textbook case of EC2 usage,” said Keahey. “The overloaded STAR resources were elastically ‘extended’ by additional virtual clusters deployed on EC2.” The run used more than 300 virtual nodes at a time, using the default EC2 instances at first and moving on to the high-CPU medium EC2 instances later to speed the calculations.
Generating last-minute results for the Quark Matter conference on cloud resources demonstrates that cloud computing for science has moved beyond “testing the waters” and into real production. According to Keahey, virtualization and cloud computing provide an ideal platform for resource sharing.
“One day a provider could be running STAR images, and the next day it could be climate calculations in entirely different images, with little or no effort,” said Keahey. “With Nimbus, a virtual cluster can be online in minutes.”