Exercise 4: 2D MPI Jacobi Iterator
- Continue to gain experience with MPI and scalable parallel programming.
- Understand scalability and algorithm design for large-scale applications.
- Improve the overall scalability of your algorithm.
You can move on when: you have a working asynchronous 2-D MPI implementation of the Jacobi iteration algorithm and have completed your scaling runs.
Description
This exercise is a follow-up to the previous one: we will be adding an additional dimension of parallel decomposition (and complication) to the algorithm. The 2-D decomposition divides the matrix into blocks, one per processor, with communication across all four edges of each block. For large matrices and large core counts, this algorithm is superior.
- Increases the granularity of the parallelism for this algorithm, which gives the potential for increased scalability.
- Smaller data sizes on each processor can be advantageous for good caching performance.
- At a given core count, better communication to computation ratio.
- With more parallel granularity come greater communication requirements.
- More complicated algorithm.
- Communication will involve many more, small messages vs. the 1-D decomposition. If the communication layer on your HPC architecture is sensitive to latency, then this could actually hurt performance.
Instructions
The parameters of the algorithm are as follows:
- The grid matrix must be completely distributed, no replicating the matrix on all processors.
- Adding the constraints that the core count must be a perfect square (4, 16, 64, 256, etc.) and that the dimension of the matrix must be divisible by the square root of the core count makes the algorithm much simpler and is permitted.
- The whole process must be parallel, that includes initialization of the grid and boundary conditions.
- Only asynchronous MPI_Isend and MPI_Irecv can be used for communication between processors.
- In this exercise, you must use a 2-D decomposition, that is, one parallelized in both dimensions of the matrix.
- Study the serial algorithm and see where parallelism can be exploited. Also think about how the data can be divided. The best way to do this is on a piece of paper: draw the layout out conceptually before you even touch the code.
- Still on paper, figure out how this conceptualization moves to being expressed in the parallel programming language you want to use. What MPI calls do you need to use? Which processors will be doing what work? STILL ON PAPER.
- Now begin programming the algorithm up in MPI.
- Test the program on a small matrix and processor count to make sure it is doing what you expect it to do. Utilize the small debug queues on the HPC machine, or if you have MPI on your laptop or desktop develop there.
- Once you are satisfied it works, scale it up.
When you are satisfied your code is working, submit scaling runs to the batch queue of your HPC machine for 100 iterations of the following sizes:
- Matrix Dimensions: 1024, 4096, 65536
- Core Counts: 4, 16, 64, 256, 1024, 4096
- The tips from the previous exercise apply here as well.
- The new dimension adds a nice wrinkle to this implementation exercise: at least two of the four messages per block will cover non-contiguous sets of data. You will need to pack this data into a buffer to send the message and then unpack it into the proper place once it arrives.
- Once again, no MPI_Barriers should be needed to complete this algorithm.
- Knowing which section of the grid you are on is very important in this algorithm, as the border sections are handled differently from the inner sections. You should first come up with a scheme of conditionals equivalent to "is this block on the top border?" and so forth.
- Reiterate: Be careful with the MPI_Requests, this is the most difficult part of this algorithm.
- If your runs hit message-buffering limits at scale (a risk with many small messages), MPICH environment variables such as these may help on Cray systems: MPICH_PTL_SEND_CREDITS=-1, MPICH_MAX_SHORT_MSG_SIZE=8000, MPICH_PTL_UNEX_EVENTS=80000, MPICH_UNEX_BUFFER_SIZE=100M
- As with the previous exercise, can you write an analytic expression for the scaling of the communication with matrix size and core count? How does this compare to the previous algorithm?
- Does the program scale? Are the results what you expected? If not, can you figure out why?
- Can you add an OpenMP directive to this and make it a hybrid algorithm? Does this scale better than your straight MPI algorithm? (Hint: you can essentially put it in the same place as in the previous exercise.)
- A pseudo-code version of this implementation can be found <link> here if you are having problems getting started.
- If X is the square root of the number of cores, MY_PE is the current core, and each core is assigned a section of the grid in order by rows, then the following conditionals will tell you whether you are on a border:
- Top Border (MY_PE < X)
- Bottom Border (MY_PE >= X*(X-1))
- Right Border ((MY_PE % X) == (X-1))
- Left Border ((MY_PE % X) == 0)