Jacobi Exercise 2

Exercise 2: Let's Get Our Feet a Little Wet - OpenMP

Objectives

Gaining proficiency with multi-threaded parallelism through OpenMP.
Understanding performance considerations of multi-threaded programs.
Learning how to run multi-threaded programs on the HPC architecture.

You can move on when...

You have a working OpenMP parallel version of the Jacobi Iteration program, and have measured its performance over the specified scenarios.

Description

As our first attempt at parallelizing the Jacobi iteration algorithm, we will use the OpenMP programming model. As you learned in the lectures, OpenMP is a quick and easy way to get parallelism out of a predominantly serial code by adding multi-threaded capabilities. The program will not run across multiple nodes of the HPC machine, but it will run across multiple cores within a node. How many cores are available depends on which architecture you were assigned (e.g., Ranger has 16 cores per node).
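A quick way to confirm how many threads a job will actually get is to ask the OpenMP runtime. The sketch below is not part of the course code; the function names are hypothetical, and the `_OPENMP` guard is only there so the file still builds when OpenMP is not enabled.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: number of threads a parallel region would use by
 * default (1 when the code is built without OpenMP support). */
int default_thread_count(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Hypothetical helper: number of processors the OpenMP runtime sees on
 * this node (1 when built without OpenMP support). */
int visible_core_count(void)
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}

/* Print both values, e.g. at the start of a test job. */
void report_threads(void)
{
    printf("cores visible: %d, default threads: %d\n",
           visible_core_count(), default_thread_count());
}
```

Calling `report_threads()` at the top of `main` makes it easy to see whether the thread count requested in the job script was picked up.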

Instructions

The most time-consuming part of this code is the update of the matrix at each iteration. Look at the loop nest and decide which loop variable is best to parallelize over.
Insert an OpenMP pragma at the appropriate spot to parallelize the loop.
Look in the documentation for your HPC architecture to learn how to compile OpenMP code on the machine. Ask for help if needed. Links to the documentation are:

For Ranger: http://services.tacc.utexas.edu/index.php/ranger-user-guide
For Athena: http://www.nics.tennessee.edu/computing-resources/kraken
For BlueFire: http://www.cisl.ucar.edu/computers/bluefire/

Submit a small test job to the queue with 1 thread and with 4 threads, and make sure you still get the same answers. (Look at the documentation to figure out how to run multi-threaded code.)
Test and plot the performance of the code over 1, 2, 4, 8 and 16 threads, with matrix sizes of 128, 256, 1024, and 4096.
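To illustrate the general shape of the steps above without giving away the exercise, here is a minimal sketch on a made-up one-dimensional loop. It is not the course's Jacobi source: the function `smooth_step` and its arguments are hypothetical. The `num_threads` clause is used only so that a 1-thread and a 4-thread run can be compared inside one program.

```c
#include <assert.h>

/* Hypothetical stand-in for an update loop (NOT the course's Jacobi code):
 * each interior point of `next` becomes the average of its two neighbours
 * in `prev`. Iterations are independent because the loop reads only `prev`
 * and writes only `next`. */
void smooth_step(const double *prev, double *next, int n, int nthreads)
{
    assert(n >= 3 && nthreads >= 1);

    /* Boundary values are copied through unchanged. */
    next[0] = prev[0];
    next[n - 1] = prev[n - 1];

    /* One directive splits the iterations of i across threads. If the
     * compiler is run without OpenMP enabled, the pragma is ignored and
     * the loop runs serially with the same result. */
    #pragma omp parallel for num_threads(nthreads)
    for (int i = 1; i < n - 1; i++) {
        next[i] = 0.5 * (prev[i - 1] + prev[i + 1]);
    }
}
```

With GCC, OpenMP is enabled with the `-fopenmp` flag; the compilers on your assigned machine may use a different flag, so check the documentation linked above. In a real job the thread count is usually set through the `OMP_NUM_THREADS` environment variable rather than a `num_threads` clause.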

Questions

Are you now comfortable with running multi-threaded programs on your HPC architecture? It gets more complicated from here.
Was the scaling of the algorithm what you expected? How far could you take the multi-threaded versions?
Have you increased the size of the matrix you can handle with this parallelization?
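When judging the scaling, it can help to turn raw timings into speedup and parallel efficiency. The sketch below assumes the usual definitions (speedup is the 1-thread time divided by the p-thread time, and efficiency is speedup divided by p); the function names are hypothetical and not part of the course code.

```c
#include <assert.h>

/* Speedup: how many times faster the run on p threads (time tp) is than
 * the run on 1 thread (time t1). */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency: speedup divided by the thread count p.
 * A value of 1.0 corresponds to ideal (linear) scaling. */
double efficiency(double t1, double tp, int p)
{
    assert(p >= 1);
    return speedup(t1, tp) / p;
}
```

For example, with made-up timings of 8.0 s on 1 thread and 2.0 s on 4 threads, `speedup(8.0, 2.0)` is 4.0 and `efficiency(8.0, 2.0, 4)` is 1.0; the same 2.0 s on 8 threads would give an efficiency of 0.5.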

Extra Credit

Try taking the OpenMP parallelism further; see if including nested parallelism improves performance.
Try different scheduling methods to see if any provide better performance. Whether or not they do, why do you think that is?
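One convenient way to compare scheduling methods is the `schedule(runtime)` clause, which defers the choice to the `OMP_SCHEDULE` environment variable. The sketch below applies it to a made-up loop, not to the course's Jacobi code; the function `scale_in_place` is hypothetical.

```c
#include <assert.h>

/* Hypothetical loop (NOT the course's Jacobi code): multiply every element
 * of x by factor. */
void scale_in_place(double *x, int n, double factor)
{
    assert(n >= 0);

    /* schedule(runtime) reads the scheduling method from the OMP_SCHEDULE
     * environment variable when the program runs, so static, dynamic and
     * guided can be compared without recompiling. */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++) {
        x[i] *= factor;
    }
}
```

In a job script, setting for instance `OMP_SCHEDULE="dynamic,4"` or `OMP_SCHEDULE="static"` before launching the program selects the method and, optionally, the chunk size.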

Hints

There should be only one OpenMP statement needed to complete this assignment.
A solution to this example can be found here for C/C++ or FORTRAN.