Exercise 4: 2D MPI Jacobi Iterator
- Continue to gain experience with MPI and scalable parallel programming.
- Understand scalability and algorithm design for large-scale applications.
- Improve the overall scalability of your algorithm.
You can move on when: you have a working asynchronous 2-D MPI implementation of the Jacobi iteration algorithm and have completed your scaling runs.
Description
This exercise is a follow-up to the previous one: we will be adding an additional dimension of parallel decomposition (and complication) to the algorithm. The 2-D decomposition divides the matrix into blocks, one per processor, with communication across all four edges of each block. For large matrices and large core counts, this algorithm is superior.
- Increases the granularity of the parallelism for this algorithm, which gives the potential for increased scalability.
- Smaller data sizes on each processor can be advantageous for good caching performance.
- At a given core count, better communication to computation ratio.
- With more parallel granularity come greater communication requirements.
- More complicated algorithm.
- Communication will involve many more, small messages vs. the 1-D decomposition. If the communication layer on your HPC architecture is sensitive to latency, then this could actually hurt performance.
Instructions
The parameters of the algorithm are as follows:
- The grid matrix must be completely distributed, no replicating the matrix on all processors.
- Adding the constraints that the core count must be a perfect square (4, 16, 64, 256, etc.) and that the dimension of the matrix must be divisible by the square root of the core count makes the algorithm much simpler and is permitted.
- The whole process must be parallel, that includes initialization of the grid and boundary conditions.
- Only asynchronous MPI_Isend and MPI_Irecv can be used for communication between processors.
- In this exercise, you must use a 2-D decomposition, that is, one parallelized in both dimensions of the matrix.
- Study the serial algorithm and see where parallelism can be exploited. Also think about how the data can be divided. The best way to do this is on a piece of paper: draw the layout out conceptually before you even touch the code.
- Still on paper, figure out how this conceptualization moves to being expressed in the parallel programming language you want to use. What MPI calls do you need to use? Which processors will be doing what work? STILL ON PAPER.
- Now begin programming the algorithm up in MPI.
- Test the program on a small matrix and processor count to make sure it is doing what you expect it to do. Utilize the small debug queues on the HPC machine, or if you have MPI on your laptop or desktop develop there.
- Once you are satisfied it works, scale it up.
When you are satisfied your code is working, submit scaling runs to the batch queue of your HPC machine for 100 iterations of the following sizes:
- Matrix Dimensions: 1024, 4096, 65536
- Core Counts: 4, 16, 64, 256, 1024, 4096
- The tips from the previous exercise apply here as well.
- The new dimension adds a nice wrinkle to this implementation exercise: at least two of the four messages per block will cover non-contiguous sets of data. You will need to pack this data into a buffer to send the message and then unpack it into the proper place once it arrives.
- Once again, no MPI_Barriers should be needed to complete this algorithm.
- Knowing which section of the grid you are on is very important in this algorithm, as the border sections are handled differently from the inner sections. You should first come up with a scheme of conditionals equivalent to "is this block on the top border?" and so forth.
- Reiterate: Be careful with the MPI_Requests, this is the most difficult part of this algorithm.
- If your runs hit message-buffering limits at scale (a risk with many small messages), MPICH environment variables such as these may help on Cray systems: MPICH_PTL_SEND_CREDITS=-1, MPICH_MAX_SHORT_MSG_SIZE=8000, MPICH_PTL_UNEX_EVENTS=80000, MPICH_UNEX_BUFFER_SIZE=100M
- As with the previous exercise, can you write an analytic expression for the scaling of the communication with matrix size and core count? How does this compare to the previous algorithm?
- Does the program scale? Are the results what you expected? If not, can you figure out why?
- Can you add an OpenMP directive to this and make it a hybrid algorithm? Does this scale better than your straight MPI algorithm? (Hint: you can essentially put it in the same place as in the previous exercise.)
- A pseudo-code version of this implementation can be found <link> here if you are having problems getting started.
- If X is the square root of the number of cores, MY_PE is the current core, and each core is assigned a section of the grid in order by rows, then the following conditionals will tell you whether you are on a border:
- Top Border (MY_PE < X)
- Bottom Border (MY_PE >= X*(X-1))
- Right Border ((MY_PE % X) == (X-1))
- Left Border ((MY_PE % X) == 0)