Molecular Dynamics Exercise 4

Exercise 4: The Waves are Getting Huge Out Here: Load Imbalance

Objectives

  1. Gain experience getting around one of the biggest impediments to parallel scalability, load imbalance.
  2. Think more about data layout and distribution.

You can move on when:

You have implemented a data-distribution strategy that at least partially alleviates the load imbalance.

Description

Now that you have made it through implementing this algorithm for a fairly uniform set of data, we are going to shake things up a little. In this exercise, we introduce a procedure that artificially imbalances the number of atoms in each cell. Throughout our workshop you have heard about load imbalance and how, especially at large processor counts, it can greatly limit the parallel efficiency of a program. After introducing this procedure, you will need to do some analysis that divides the work as evenly as possible across the set of processors. The division will not be perfect, but do the best you can.

Instructions

  1. To set up the load-imbalanced version of this code, replace the ComputeAtomsPerShell function (or subroutine) with the code located here for C/C++ and FORTRAN. This code takes the total number of atoms and scatters them randomly across the whole space, so that the atoms are no longer distributed evenly among the cells (a hypothetical sketch of the idea appears after this list). The code is written for the serial version, so you will need to incorporate it into the parallel code you have created.

  2. With this change incorporated, rerun the scaling runs from above and see how your new results differ from the previous runs.

  3. Using the information from the profiling and performance section of the workshop, profile a parallel run of the code. There should be a significant difference in the run time of each processor.

  4. Create a scheme that, instead of just assigning work to processors in the order it arrives, estimates the amount of work each cell will do, and balance the workload using this estimate. See if you can come up with something that improves the performance.
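
A minimal, hypothetical sketch of what the replacement procedure in step 1 does: it scatters the total atom count over the cells with a deliberate bias so that some cells end up far fuller than others. The workshop's actual replacement code is at the links in step 1; all names here (MakeImbalancedCellCounts, nCells, totalAtoms, atomsPerCell) and the specific biasing scheme are illustrative assumptions, not the workshop's code.

    // Illustration only: the real replacement for ComputeAtomsPerShell is
    // provided by the workshop; names and the bias scheme here are made up.
    #include <random>
    #include <vector>

    // Scatter totalAtoms over nCells so the distribution is deliberately uneven.
    std::vector<int> MakeImbalancedCellCounts(int nCells, int totalAtoms,
                                              unsigned seed = 12345)
    {
        std::vector<int> atomsPerCell(nCells, 0);
        std::mt19937 gen(seed);
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        for (int a = 0; a < totalAtoms; ++a) {
            // Squaring the uniform draw biases atoms toward low cell indices,
            // which is what produces the load imbalance.
            double r = uni(gen);
            int cell = static_cast<int>(r * r * nCells);
            if (cell >= nCells) cell = nCells - 1;
            ++atomsPerCell[cell];
        }
        return atomsPerCell;
    }

With a bias like this, cells near index 0 receive many times more atoms than cells near the end, which is exactly the kind of imbalance the rest of the exercise asks you to repair.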

Tips

  1. You will need to be able to assign an arbitrary cell to a processor. The basic global map will not be sufficient for this; you will need to devise a local map scheme in order to build up the work you need to do.
  2. One method for dividing up the work is as follows:
    1. Run through and assign each processor at least one cell (you must enforce that the number of cells is greater than the number of MPI tasks), keeping track of how many atoms end up on each processor.
    2. Sort the processors in order of the number of atoms on each.
    3. If adding the cell's particles keeps the current processor below a certain threshold (in this case NParticles/NCPUS, the ideal number of particles per processor), assign the cell to that processor; otherwise, move on to the next MPI task.
    This is the method implemented in the solution; here is some code from the solution that accomplishes this, as well as a function implementing a QuickSort algorithm. A simplified greedy sketch also appears after this list.
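
As a complement to that description, below is a simplified sketch of a greedy cell-to-processor assignment. It is not the workshop's sort-and-threshold solution code, and the names (AssignCellsToRanks, cellAtoms, cellToRank, nRanks) are hypothetical. It sorts the cells by atom count and repeatedly hands the heaviest unassigned cell to the currently least-loaded rank (a longest-processing-time style heuristic), which pursues the same goal of keeping every rank near the NParticles/NCPUS target.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Hypothetical helper: build a cell-to-rank map from per-cell atom counts.
    // A longest-processing-time style greedy, offered as a simpler stand-in for
    // the sort-and-threshold loop described in the tip above.
    std::vector<int> AssignCellsToRanks(const std::vector<int>& cellAtoms, int nRanks)
    {
        const int nCells = static_cast<int>(cellAtoms.size());
        std::vector<int>  cellToRank(nCells, -1);
        std::vector<long> rankLoad(nRanks, 0);

        // Visit the heaviest cells first so no single large cell is left to
        // unbalance the final assignments.
        std::vector<int> order(nCells);
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return cellAtoms[a] > cellAtoms[b]; });

        for (int c : order) {
            // Hand the cell to the currently least-loaded rank.
            int lightest = static_cast<int>(
                std::min_element(rankLoad.begin(), rankLoad.end()) - rankLoad.begin());
            cellToRank[c] = lightest;
            rankLoad[lightest] += cellAtoms[c];
        }
        return cellToRank;
    }

As with the tip above, this assumes there are at least as many non-empty cells as MPI tasks, so that every rank receives at least one cell.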

Questions

  1. One fairly effective strategy for balancing load across a parallel job is the so-called "master-worker" or "bag of tasks" method, which dynamically assigns work to processors as each one finishes its currently assigned work. Why do you think this is not a good strategy for scaling this algorithm to the petascale?

Hints

  1. If you made the communication above independent of which processor owns which cell, redistributing the work here should be pretty straightforward.

  2. A full solution to the exercise is available in C++.