Molecular Dynamics Exercise 1

Exercise 1: Starting Out

Objectives:

Getting familiar with the high-performance computing platform you will be using for the workshop.
Getting familiar with the Molecular Dynamics algorithm used in all of these exercises.

You can move on when:

You have successfully compiled, submitted, and completed a run of the Molecular Dynamics program, and have produced a plot of how the algorithm's run time scales with the number of atoms.

Description

In Exercise 1, you will become familiar with the serial version of the algorithm described in the Background section. A reference implementation is provided. Your task is to examine it and make sure you understand it, compile it on your HPC architecture, and then submit several runs with differing atom counts to observe the performance characteristics of the code and of the processors in your machine.

The program can be downloaded at:

Since the code is a single source file, compilation is straightforward:

C/C++
- For Kraken: CC md.cpp -o md
- For Ranger: pgCC md.cpp -o md
- For Bluefire: xlC md.cpp -o md
FORTRAN
- For Kraken: ftn md.F -o md
- For Ranger: pgf90 md.F -o md
- For Bluefire: xlf md.F -o md

For further help on compiling codes on these HPC architectures:

The program has the following syntax (the executable is shown here as moldyn; if you compiled with -o md as above, run ./md instead):

moldyn <NumberOfParticles> <NumIterations>

NumberOfParticles - The total number of particles in the system.

NumIterations - The fixed number of iterations to run.

For example:

sbrown@kraken-pwd4(XT5): ./moldyn 1000 10

The Total Number of Cells is 144 With 7 particles per cell, and
 1000 particles total in system

Iteration         1 with Total Energy   0.6493868661E+05 Per Particle

Iteration         2 with Total Energy   0.6493862883E+05 Per Particle

Iteration         3 with Total Energy   0.6493849175E+05 Per Particle

Iteration         4 with Total Energy   0.6493827537E+05 Per Particle

Iteration         5 with Total Energy   0.6493797969E+05 Per Particle

Iteration         6 with Total Energy   0.6493760472E+05 Per Particle

Iteration         7 with Total Energy   0.6493715046E+05 Per Particle

Iteration         8 with Total Energy   0.6493661691E+05 Per Particle

Iteration         9 with Total Energy   0.6493600409E+05 Per Particle

Iteration        10 with Total Energy   0.6493531200E+05 Per Particle

The Iteration Time is        0.0599999987

Instructions

Download the serial version of the code in your language of choice.
Spend some time looking over the code. If there is something you don't understand, please ask an instructor for help.
Compile the code with optimization level -O3.
Test the code on a small number of atoms. Your output may not match the example above exactly, but it should be similar.
Submit runs of 100 iterations to the queue for the following system sizes: 1,000, 10,000, and 100,000 atoms.
Make a plot of atoms vs. time reported to determine the scaling of the algorithm.

Questions to Ponder...

Naively, this algorithm should scale as the square of the number of atoms, because adding one more atom requires computing the force contribution from every other atom in the system. By using cells and computing only the contributions from adjacent cells, we have changed this. How does your run time scale with the number of atoms? Is it linear, quadratic, or something in between?
What may limit the size of the system you can simulate with this serial algorithm?

Extra Credit

Are there any compiler flags beyond -O3 that enhance the serial performance of the code?
Are there any programmatic enhancements that could be made to improve performance?
One could analyze this algorithm with in-depth performance tools to understand why it performs the way it does at certain sizes.

Hints

The queue submission script for this exercise should be fairly similar to the one you used for the example hello_world at the beginning of the workshop.