OMP Tutorial


Sunu Engineer
Anand Sengupta

Resources
OMP Tutorial Online
Examples of Real Life OpenMP Fortran Code

Yet another OMP Tutorial

OpenMP C/C++ Specifications
OpenMP Fortran Specification


Navigation
Introduction
What is OpenMP
A Compiler Directive Example
How to Log In, Write, and Compile Your OMP Programs on Hercules
Parallel Do / Parallel For
Assignments


Introduction

In high-performance computing, a number of tools assist programmers with multi-threaded parallel processing on distributed-memory and shared-memory multiprocessor platforms.

  • On distributed-memory multiprocessor platforms, each processor has its own memory whose content is not readily available to other processors. Sharing of information among processors is customarily facilitated by message passing using routines from standard message passing libraries such as MPI.

  • On shared-memory multiprocessors, all processors share a common memory. Message-passing libraries such as MPI can be, and are, used for parallel processing tasks. However, the directive-based OpenMP Application Program Interface (API) was developed specifically for shared-memory parallel processing.

OpenMP has broad support from many major computer hardware and software manufacturers. Just as MPI has become the standard for distributed-memory parallel processing, OpenMP has emerged as the standard for shared-memory parallel computing. Both standards can be used with Fortran 77, Fortran 90, C or C++ for parallel computing applications. It is worth noting that, on a cluster of single-processor and shared-memory multiprocessor computers, it is possible to use both paradigms in the same application program to increase the aggregate processing power: MPI connects all machines within the cluster to form one virtual machine, while OpenMP exploits the shared-memory parallelism on individual shared-memory machines within the cluster. This approach is commonly referred to as Multi-Level Parallel Programming (MLP).
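
A minimal sketch of this hybrid approach in C is shown below (it assumes an MPI installation and an OpenMP-capable compiler; the printed output format is purely illustrative):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);                 /* MPI connects the machines in the cluster */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* OpenMP exploits shared-memory parallelism within each machine */
    #pragma omp parallel
    {
        printf("MPI process %d of %d, OpenMP thread %d of %d\n",
               rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}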


What is OpenMP?

OpenMP comprises three complementary components (a small C illustration follows the list):

  1. a set of compiler directives used by the programmer to communicate parallelism to the compiler;

  2. a runtime library that enables setting and querying parallel parameters such as the number of participating threads and the thread number;

  3. a limited number of environment variables that can be used to define runtime system parallel parameters such as the number of threads.
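
As a rough C illustration of all three components working together (compile with an OpenMP-enabled compiler; the thread count of 4 is just an example):

/* component 3: environment variable, set before running, e.g.
   setenv OMP_NUM_THREADS 4                                      */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel                    /* component 1: compiler directive */
    {
        int tid = omp_get_thread_num();     /* component 2: runtime library    */
        int nth = omp_get_num_threads();
        printf("thread %d of %d\n", tid, nth);
    }
    return 0;
}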


A Compiler Directive Example

Code segments that consume substantial CPU cycles frequently involve do loops (or for loops in C). For loops that are parallelizable, OpenMP provides a rather simple directive instructing the compiler to parallelize the loop that immediately follows. Let's take a look at a Fortran example:

call omp_set_num_threads(nthread)    !requests "nthread" threads
!$OMP PARALLEL DO
DO i = 1, N
   DO j = 1, M
   .
   .
   .
   END DO
END DO
!$OMP END PARALLEL DO

In this Fortran code fragment, the OpenMP library routine omp_set_num_threads is called to set the number of threads to "nthread". Next, the !$OMP PARALLEL DO directive tells the compiler to parallelize the (outer) do loop that follows. The current, or master, thread is responsible for spawning "nthread-1" child threads. The matching !$OMP END PARALLEL DO directive makes the extent of the PARALLEL DO directive clear to the compiler (and the programmer). It also provides a barrier, ensuring that all threads complete their tasks and that all child threads are subsequently released.

Typically, the execution stream starts in a serial region, followed by a parallel region using four threads. Upon completion of the parallel region, the child threads are released and serial execution continues until the next parallel region, this time with two threads, comes into effect.
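
A minimal C sketch of this execution pattern follows (the thread counts of four and two are only illustrative; the second region uses the num_threads clause to request a different team size):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("serial region: 1 thread\n");

    omp_set_num_threads(4);                 /* request 4 threads for the next region */
    #pragma omp parallel
    {
        printf("region 1: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                       /* implicit barrier; child threads released */

    printf("serial region again\n");

    #pragma omp parallel num_threads(2)     /* second parallel region with 2 threads */
    {
        printf("region 2: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    return 0;
}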

For C programs, a similar set of rules applies:

omp_set_num_threads(nthread);   /* requests nthread threads */
#pragma omp parallel for
for (i = 0; i < n; i++) {
   for (j = 0; j < m; j++) {
   .
   .
   .
   }
}

Note that, unlike Fortran, there is no closing directive in C: the scope of the parallel for directive is the for loop that immediately follows it.


How to Log In, Write, and Compile Your OMP Programs on Hercules


  • ssh username@hercules
    • For the workshop, 5 accounts have been created : ws1, ws2, ws3, ws4 and ws5. Get in touch with Tarun/Sarah/Anand for passwords.
  • use the vi editor to write/edit your code
  • For C codes,
    • do not forget to #include <omp.h>
    • cc prog.c  -omp
  • For Fortran codes
    • do not forget to include `omp_lib.h'
    • f77 prog.f -omp
  • To run
    • setenv OMP_NUM_THREADS 4
    • ./a.out
  • You may want to set the following environment variables in your .cshrc file
    • setenv OMP_SCHEDULE "STATIC"
    • setenv OMP_NUM_THREADS 4
    • setenv OMP_DYNAMIC FALSE
    • setenv OMP_NESTED FALSE
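
As a quick check that everything works, you can compile and run a minimal OpenMP "hello world" program such as the sketch below (the file name hello_omp.c is just an example; on compilers other than the one on Hercules the OpenMP switch differs, e.g. -fopenmp for gcc):

/* hello_omp.c -- minimal OpenMP test program */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* each thread reports its own id and the team size */
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}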


Parallel Do / Parallel For


In OpenMP, the primary means of parallelization is through the use of directives inserted in the source code. One of the most fundamental and most powerful of these directives is parallel do (Fortran) or parallel for (C). Here are examples of these directives:

Fortran
   !$omp parallel do
   do i = 1, n
      a(i) = b(i) + c(i)
   enddo

C/C++
   #pragma omp parallel for
   for (i = 1; i <= n; i++)
      a[i] = b[i] + c[i];

The !$omp parallel do or #pragma omp parallel for directive indicates that the following loop is to be executed in parallel. There are several ways to specify the number of threads to be used. One of these is to set the environment variable OMP_NUM_THREADS to the required number of threads.

When a !$omp parallel do or #pragma omp parallel for directive is encountered, the loop indices are distributed among the specified number of threads. The way in which the loop indices are distributed is known as the schedule. In the default schedule, each thread will get a "chunk" of indices of approximately equal size. For example, if the loop goes from 1 to 100 and there are 3 threads, the first thread will process i=1 through i=34, the second thread will process i=35 through i=67, and the third thread will process i=68 through i=100. Other schedule options will be introduced in a later section.
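
A simple way to see how the indices are distributed on a given system is to print which thread handles which index, as in the sketch below (run with, say, OMP_NUM_THREADS set to 3; the loop bound of 100 is arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i;
    /* with 3 threads and the default schedule, each thread should
       receive one contiguous chunk of roughly 100/3 indices        */
    #pragma omp parallel for
    for (i = 1; i <= 100; i++)
        printf("i = %3d handled by thread %d\n", i, omp_get_thread_num());
    return 0;
}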

There may not be any branches into or out of a parallel loop. This precludes the use of the break statement in C. There is an implied barrier at the end of the loop - each thread will wait at the end of the loop until all threads have reached that point before they continue.

A system-dependent overhead is incurred any time parallel threads are spawned, as happens with a parallel do or parallel for directive. Therefore, when a short loop such as the example above is parallelized, it will probably take longer to execute on multiple threads than on a single thread, since the overhead exceeds the time saved by parallelization. By "short" we mean that there are not many operations within the loop. The next logical question is "How long is long enough?", and unfortunately the answer depends on the system and the loop under consideration. As a very rough estimate, several thousand operations (total over all loop iterations, not per iteration) may be required. There is only one way to know for sure: parallelize the loop, time it, and see whether it runs faster.
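
One convenient way to carry out such a timing test is with the OpenMP wall-clock timer omp_get_wtime, as in the sketch below (the array length N is arbitrary; for a fair comparison, repeat the measurement with the directive removed or with OMP_NUM_THREADS set to 1):

#include <stdio.h>
#include <omp.h>

#define N 100000

static double a[N], b[N], c[N];

int main(void)
{
    int i;
    double t0, t1;

    /* fill the input arrays */
    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    t0 = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
    t1 = omp_get_wtime();

    printf("parallel loop took %f seconds\n", t1 - t0);
    return 0;
}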

A Few Notes on Sentinels

The sentinel is the first part of the directive, !$omp or #pragma omp in the above examples. In Fortran fixed format (f77), the !$omp sentinel must start in column 1. One may also use the sentinels c$omp or *$omp. In Fortran free format (f90), only !$omp is allowed, and it may appear in any column. In C/C++, #pragma may appear in any column. All OpenMP directives follow these sentinel rules. All Fortran examples in this tutorial will be free-format.


Assignments

1. Write a C or Fortran code using OMP directives to multiply large matrices. Compare the improvement in timing over the serial version.
2. Try nested do/for loops.
3. Write a C/Fortran program that passes a number between the threads; in each pass the number is multiplied by 2. The final value should be printed outside the parallel part of the program.
4. Use the OMP SECTIONS / OMP SECTION directives to read multiple files in parallel.
5. Get the maximum number of threads and divide the addition of the elements of a longish array among them. Use the REDUCTION(+:...) clause to obtain the final answer.