Jacobi Exercise 5

Exercise 5: Swimming Out to the Middle of the Ocean: Performance Analysis and Enhancements

Objectives

  1. Trying everything possible to get the best performance out of the Jacobi iteration algorithm on 4,096 cores.
  2. Gaining confidence with the HPC architecture, and understanding that petascale computing is not only about programming.
  3. Gaining confidence in using performance tools to analyze petascale codes.

You can move on when?

You think you have done everything within your power to get the best performance out of your 2D MPI Jacobi iteration algorithm.

Description

In the spirit of G.I. Joe, programming is only half the battle. Getting the best performance out of an HPC architecture, especially at the petascale, is a highly iterative and architecture-specific affair. Thankfully, we have performance tools and experts to help us with such things; otherwise there would be no hope.

Now that you have a working MPI implementation of the 2D decomposed algorithm, you will use performance tools to see whether there are any improvements that can be made.  In addition, each of our HPC architectures has specific optimizations that can improve the performance of parallel codes run on it.  These optimizations are generally explored and documented by the vendor and by the center that operates the machine, so you will be expected to look into things like compiler optimizations, process placement, and other environment settings to see if you can get better performance.  Some of the architectures even offer more than one compiler.  The sky is the limit; try anything you think will get better performance.
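Whichever of these knobs you end up turning, you will want a consistent way of measuring the 100-iteration run so that different compilers, flags, and placements can be compared fairly. Below is a minimal C sketch of such a timing harness, not part of the exercise code itself; jacobi_iterate() is a hypothetical stand-in for your own per-iteration routine (halo exchange plus stencil update), and all other setup is omitted.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical placeholder for your own single-iteration routine
       (halo exchange + stencil update); supply your real function here. */
    void jacobi_iterate(void);

    void time_100_iterations(void)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Synchronize so every rank starts the clock together. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t_start = MPI_Wtime();

        for (int iter = 0; iter < 100; iter++)
            jacobi_iterate();

        double t_local = MPI_Wtime() - t_start;

        /* The slowest rank determines the wall-clock time you would post. */
        double t_max;
        MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("100 Jacobi iterations: %f seconds (max over ranks)\n", t_max);
    }

Reporting the maximum over ranks (rather than rank 0's own time) avoids understating the cost when load imbalance or placement effects slow down a subset of the processes.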

Instructions

  1. Using the performance tools you have learned about in the workshop and the documentation for your HPC architecture, make your code as fast as possible on 4,096 cores for a matrix size of 262,144 x 262,144 (this satisfies the processor and matrix constraints of the 2D algorithm; see the sketch after the links below).
  2. Post your best time for 100 iterations at: Time Postings.
Information on optimizing runs for the HPC architectures can be found at:
  1. For Ranger: http://services.tacc.utexas.edu/index.php/ranger-user-guide
  2. For Athena: http://www.nics.tennessee.edu/computing-resources/kraken/
  3. For Bluefire: http://www.cisl.ucar.edu/computers/bluefire/
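As a quick check of the constraints mentioned in step 1, the sketch below (assuming a square 64 x 64 process grid, which is what MPI_Dims_create produces for 4,096 ranks) shows that the 262,144 x 262,144 matrix divides evenly into 4,096 x 4,096 local blocks; it is illustrative arithmetic, not part of the required code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* expected: 4096 */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Let MPI factor the ranks into a 2D grid; 4096 -> 64 x 64. */
        int dims[2] = {0, 0};
        MPI_Dims_create(nprocs, 2, dims);

        const long N = 262144;            /* global matrix edge length  */
        long local_rows = N / dims[0];    /* 262144 / 64 = 4096 rows    */
        long local_cols = N / dims[1];    /* 262144 / 64 = 4096 columns */

        if (rank == 0)
            printf("Process grid %d x %d, local block %ld x %ld\n",
                   dims[0], dims[1], local_rows, local_cols);

        MPI_Finalize();
        return 0;
    }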

Questions

  1. Did you find anything in particular that either helped or hindered the performance of your code? Why do you think that happened?