Jacobi Exercise 2
Exercise 2: Let's Get Our Feet a Little Wet - OpenMP
- Gaining proficiency with multi-threaded parallelism through OpenMP.
- Understanding performance considerations of multi-threaded programs.
- Learning how to run multi-threaded programs on the HPC architecture.
You can move on when...
You have a working OpenMP parallel version of the Jacobi Iteration program, and have measured its performance over the specified scenarios.
As our first attempt at parallelizing the Jacobi iteration algorithm, we will use the OpenMP programming model. As you learned in the lectures, OpenMP is a quick and easy way to get parallelism out of a predominantly serial code by adding multi-threaded capabilities. Because OpenMP threads share memory, the program will not run across multiple nodes of the HPC machine, but it will run across the cores of a single node. How many cores that is depends on which architecture you were assigned (e.g. Ranger has 16 cores per node).
- The most time-consuming part of this code is obvious: the update of the matrix at each iteration. Look at the loop nest and decide which loop variable is best to parallelize over.
- Insert an OpenMP pragma at the appropriate spot to parallelize the loop.
- Find the part of your HPC architecture's documentation that explains how to compile OpenMP code on the machine. Ask for help if needed. Links to the documentation are:
- For Ranger: http://services.tacc.utexas.edu/index.php/ranger-user-guide
- For Athena: http://www.nics.tennessee.edu/computing-resources/kraken
- For BlueFire: http://www.cisl.ucar.edu/computers/bluefire/
- Submit a small test job to the queue with 1 thread and then with 4 threads, and make sure you get the same answers in both cases. (Look at the documentation to figure out how to run multi-threaded code.)
- Test and plot the performance of the code over 1, 2, 4, 8 and 16 threads, with matrix sizes of 128, 256, 1024, and 4096.
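As a rough illustration of the loop-parallelization steps above, here is a minimal sketch in C. The course's actual Jacobi source is not reproduced here, so the function name `jacobi_sweep`, the row-major array layout, and the four-point averaging stencil are all assumptions made for illustration; your code may be organized differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the course's code): one Jacobi sweep over the
 * interior of an n x n row-major grid. Reads u_old, writes u_new. */
void jacobi_sweep(size_t n, const double *u_old, double *u_new)
{
    assert(n >= 3); /* need at least one interior point */

    /* Parallelize the outer (row) loop: each thread gets a block of rows.
     * There is no race because every thread only reads u_old and writes
     * its own rows of u_new. i and j are declared inside the loops, so
     * each thread has its own copies. */
    #pragma omp parallel for
    for (long i = 1; i < (long)n - 1; i++) {
        for (long j = 1; j < (long)n - 1; j++) {
            u_new[i * n + j] = 0.25 * (u_old[(i - 1) * n + j] +
                                       u_old[(i + 1) * n + j] +
                                       u_old[i * n + (j - 1)] +
                                       u_old[i * n + (j + 1)]);
        }
    }
}
```

With GCC the pragma is enabled by compiling with `-fopenmp`, and the thread count is taken from the `OMP_NUM_THREADS` environment variable. Other compilers use different flags, so check your machine's documentation. Without the flag the pragma is ignored and the loop runs serially, which is a convenient way to compare answers.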
- Are you now comfortable with running multi-threaded programs on your HPC architecture? It gets more complicated from here.
- Was the scaling of the algorithm what you expected? How far could you take the multi-threaded versions?
- Has this parallelization increased the size of the matrix you can handle?
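To answer the scaling questions above, it helps to turn raw timings into speedup and parallel efficiency. The sketch below is one possible way to do that in C. The helper names `wall_seconds`, `speedup`, and `efficiency` are hypothetical and not part of the exercise code; `omp_get_wtime()` is the standard OpenMP wall-clock timer.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: seconds on a wall clock when built with OpenMP. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    /* Serial fallback: CPU time, which only approximates wall time
     * and is only meaningful for a single-threaded run. */
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Speedup on p threads: S(p) = T(1) / T(p). */
double speedup(double t1, double tp)
{
    assert(tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency: E(p) = S(p) / p, where 1.0 is ideal scaling. */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / p;
}
```

A typical use would be to call `wall_seconds()` immediately before and after the iteration loop, record the difference for each thread count and matrix size, and then plot `speedup` or `efficiency` against the number of threads.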
- Try taking the OpenMP parallelism further; see if including nested parallelism improves performance.
- Try different scheduling methods to see if any provide better performance. Whether or not they do, why do you think that is?
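One convenient way to experiment with scheduling, sketched below under the same assumptions as before (a hypothetical row-major n x n grid and four-point stencil, not the course's actual code), is the `schedule(runtime)` clause. It defers the choice of schedule to the `OMP_SCHEDULE` environment variable, so you can compare schedules without recompiling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of a Jacobi sweep whose loop schedule is chosen
 * at run time via the OMP_SCHEDULE environment variable. */
void jacobi_sweep_sched(size_t n, const double *u_old, double *u_new)
{
    assert(n >= 3); /* need at least one interior point */

    /* schedule(runtime) reads the schedule kind and chunk size from
     * OMP_SCHEDULE, e.g. "static", "dynamic,4", or "guided". */
    #pragma omp parallel for schedule(runtime)
    for (long i = 1; i < (long)n - 1; i++) {
        for (long j = 1; j < (long)n - 1; j++) {
            u_new[i * n + j] = 0.25 * (u_old[(i - 1) * n + j] +
                                       u_old[(i + 1) * n + j] +
                                       u_old[i * n + (j - 1)] +
                                       u_old[i * n + (j + 1)]);
        }
    }
}
```

For example, setting `OMP_SCHEDULE="dynamic,4"` in the job script hands out rows four at a time on demand. Since every row of this stencil costs about the same, a static schedule is likely to be hard to beat here, but measuring is the only way to be sure.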
- There should be only one OpenMP statement needed to complete this assignment.
- A solution to this example can be found here for C/C++ or FORTRAN.