Jacobi Exercise 5

Exercise 5: Swimming Out to the Middle of the Ocean: Performance Analysis and Enhancements

Objectives

  1. Trying everything possible to get the best performance out of the Jacobi iteration algorithm on 4,096 cores.
  2. Gaining confidence with the HPC architecture, and understanding that petascale computing is not only about programming.
  3. Gaining confidence in using performance tools to analyze petascale codes.

You can move on when?

You think you have done everything within your power to get the best performance out of your 2D MPI Jacobi iteration algorithm.

Description

In the spirit of G.I. Joe, programming is only half the battle. Getting the best performance out of an HPC architecture, especially at the petascale, is a highly iterative and architecture-specific affair. Thankfully, we have performance tools and experts to help us with such things; otherwise there would be no hope.

Now that you have a working MPI implementation of the 2D decomposed algorithm, you will use performance tools to see whether there are any improvements that can be made.  In addition, each of our HPC architectures has specific optimizations that can improve the performance of parallel codes run on it.  These optimizations are generally explored and documented by the vendor and by the center that operates the machine, so you will be expected to look into things like compiler optimizations, process placement, and other environment settings to see if you can get better performance.  Some of the architectures even offer more than one compiler.  The sky is the limit; try anything you think will get better performance.
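Whichever of these knobs you end up turning, you will want a consistent way of measuring the 100-iteration run so that different compilers, flags, and placements can be compared fairly. Below is a minimal C sketch of such a timing harness, not part of the exercise code itself; jacobi_iterate() is a hypothetical stand-in for your own per-iteration routine (halo exchange plus stencil update), and all other setup is omitted.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical placeholder for your own single-iteration routine
       (halo exchange + stencil update); supply your real function here. */
    void jacobi_iterate(void);

    void time_100_iterations(void)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Synchronize so every rank starts the clock together. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t_start = MPI_Wtime();

        for (int iter = 0; iter < 100; iter++)
            jacobi_iterate();

        double t_local = MPI_Wtime() - t_start;

        /* The slowest rank determines the wall-clock time you would post. */
        double t_max;
        MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("100 Jacobi iterations: %f seconds (max over ranks)\n", t_max);
    }

Reporting the maximum over ranks (rather than rank 0's own time) avoids understating the cost when load imbalance or placement effects slow down a subset of the processes.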

Instructions

  1. Using the performance tools you have learned about in the workshop and the documentation for your HPC architecture, make your code as fast as possible on 4,096 cores for a matrix size of 262,144 x 262,144 (this satisfies the processor and matrix constraints of the 2D algorithm; see the sketch after the links below).
  2. Post your best time for 100 iterations at: Time Postings.
Information on optimizing runs for the HPC architectures can be found at:
  1. For Ranger: http://services.tacc.utexas.edu/index.php/ranger-user-guide
  2. For Athena: http://www.nics.tennessee.edu/computing-resources/kraken/
  3. For Bluefire: http://www.cisl.ucar.edu/computers/bluefire/
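As a quick check of the constraints mentioned in step 1, the sketch below (assuming a square 64 x 64 process grid, which is what MPI_Dims_create produces for 4,096 ranks) shows that the 262,144 x 262,144 matrix divides evenly into 4,096 x 4,096 local blocks; it is illustrative arithmetic, not part of the required code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* expected: 4096 */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Let MPI factor the ranks into a 2D grid; 4096 -> 64 x 64. */
        int dims[2] = {0, 0};
        MPI_Dims_create(nprocs, 2, dims);

        const long N = 262144;            /* global matrix edge length  */
        long local_rows = N / dims[0];    /* 262144 / 64 = 4096 rows    */
        long local_cols = N / dims[1];    /* 262144 / 64 = 4096 columns */

        if (rank == 0)
            printf("Process grid %d x %d, local block %ld x %ld\n",
                   dims[0], dims[1], local_rows, local_cols);

        MPI_Finalize();
        return 0;
    }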

Questions

  1. Did you find anything in particular that either helped or hindered the performance of your code? Why do you think that happened?