Featured Research

from universities, journals, and other organizations

Bug repellent for supercomputers proves effective

Date: November 14, 2012
Source: DOE/Lawrence Livermore National Laboratory
Summary: Researchers have used the Stack Trace Analysis Tool, a highly scalable, lightweight tool, to debug a program running more than one million MPI processes on the IBM Blue Gene/Q-based Sequoia supercomputer.

Lawrence Livermore National Laboratory (LLNL) researchers have used the Stack Trace Analysis Tool (STAT), a highly scalable, lightweight tool, to debug a program running more than one million MPI processes on the IBM Blue Gene/Q (BGQ)-based Sequoia supercomputer.

The debugging run is a significant milestone in LLNL's multi-year collaboration with the University of Wisconsin-Madison (UW) and the University of New Mexico (UNM) to ensure supercomputers run more efficiently.

Playing a significant role in scaling up the Sequoia supercomputer, STAT, a 2011 R&D 100 Award winner, has helped both early access users and system integrators quickly isolate a wide range of errors, including particularly perplexing issues that only manifested at extremely large scales of up to 1,179,648 compute cores. During the Sequoia scale-up, bugs in applications as well as defects in system software and hardware have manifested themselves as failures in applications. It is important to quickly diagnose errors so they can be reported to experts who can analyze them in detail and ultimately solve the problem.

"STAT has been indispensable in this capacity, helping the multi-disciplined integration team keep pace with the aggressive system scale-up schedule," said LLNL computer scientist Greg Lee.

"While testing a subsystem of Blue Gene/Q, my test program consistently failed only when scaled to 1,179,648 MPI processes. Although the test program was simple, the sheer scale at which this program ran made debugging efforts highly challenging. But when I applied STAT, it quickly revealed that one particular rank process was consistently stuck in a system call," said Dong Ahn, a computer scientist in Livermore Computing.

Based on this finding, a system expert took a close look at the compute core on which this rank process was running and discovered a hardware defect. "Replacing the component suddenly got the entire Sequoia system back to life," Ahn said. "Putting this exercise into perspective, this error was due to a defect in a tiny hardware unit, the decrementor, of a single hardware thread out of a total of 4.7 million hardware threads. I felt it was like finding a needle in a haystack over a coffee break."
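
To illustrate the general idea behind this kind of diagnosis, the short Python sketch below groups processes by the call stack each one reports and then picks out the smallest group, which is where a lone stuck rank would show up. It is only a toy model under stated assumptions, not STAT's code or interface: the function names, the frame names and the eight-rank sample data are all hypothetical, and a real tool has to gather and merge traces from more than a million processes.

```python
from collections import defaultdict


def group_ranks_by_trace(traces):
    """Map each distinct call stack to the sorted list of ranks that report it."""
    groups = defaultdict(list)
    for rank, frames in traces.items():
        groups[tuple(frames)].append(rank)
    return {stack: sorted(ranks) for stack, ranks in groups.items()}


def smallest_groups(groups, limit=1):
    """Return the `limit` smallest groups; small groups are the likeliest outliers."""
    ranked = sorted(groups.items(), key=lambda item: (len(item[1]), item[1]))
    return ranked[:limit]


# Hypothetical sample: eight ranks wait in a barrier, but rank 5 sits in a system call.
traces = {rank: ["main", "run_test", "MPI_Barrier"] for rank in range(8)}
traces[5] = ["main", "run_test", "write", "syscall"]

groups = group_ranks_by_trace(traces)
outlier_stack, outlier_ranks = smallest_groups(groups)[0]
print(" > ".join(outlier_stack), outlier_ranks)
```

With this sample data the eight ranks collapse into just two groups, so a person only has to inspect two call stacks rather than eight, and the one-member group points straight at rank 5.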

Sequoia delivers 20 petaflops of peak performance and was ranked No. 1 on the TOP500 list in June of this year. It is currently ranked No. 2, behind Oak Ridge National Laboratory's Titan.

LLNL plans to use Sequoia's impressive computational capability to advance understanding of fundamental physics and engineering questions that arise in the National Nuclear Security Administration's (NNSA) program to ensure the safety, security and effectiveness of the United States' nuclear deterrent without testing. Sequoia also will support NNSA/DOE programs at LLNL that focus on nonproliferation, counterterrorism, energy, security, health and climate change.

As LLNL takes delivery of the Sequoia system and works to move it into production, computer scientists will migrate applications that have been running on earlier systems to this newer architecture. This is a period of intense activity for LLNL's application teams as they gain experience with the new hardware and software environment.

"Having a highly effective debugging tool that scales to the full system is vital to the installation and acceptance process for Sequoia. It is critical that our development teams have a comprehensive parallel debugging tool set as they iron out the inevitable issues that come up with running on a new system like Sequoia," said Kim Cupps, leader of the Livermore Computing Division at LLNL.

STAT is particularly important for LLNL because supercomputer simulations are essential in virtually every mission area of the Laboratory. The tool has also been used at other sites and has proved effective on a wide range of supercomputer platforms, including Linux clusters and Cray systems.

The team is actively pursuing further optimization of STAT technologies and is exploring commercialization strategies. More information about STAT, including a link to the source code, is available on the Web.


Story Source:

The above story is based on materials provided by DOE/Lawrence Livermore National Laboratory. Note: Materials may be edited for content and length.


Cite This Page:

DOE/Lawrence Livermore National Laboratory. "Bug repellent for supercomputers proves effective." ScienceDaily. ScienceDaily, 14 November 2012. <www.sciencedaily.com/releases/2012/11/121114134713.htm>.
DOE/Lawrence Livermore National Laboratory. (2012, November 14). Bug repellent for supercomputers proves effective. ScienceDaily. Retrieved July 25, 2014 from www.sciencedaily.com/releases/2012/11/121114134713.htm
DOE/Lawrence Livermore National Laboratory. "Bug repellent for supercomputers proves effective." ScienceDaily. www.sciencedaily.com/releases/2012/11/121114134713.htm (accessed July 25, 2014).
