Posts Tagged ‘Parallel Algorithms


Prefix Sum Using OpenMP

OpenMP is an API for shared memory parallel programming. As in the various threads running on various cores or processors have memory space that is accessible by every thread, in addition to each thread’s own private memory space. It is available for C/C++ and Fortran.

Arguably, it is easier for parallel development than MPI, Pthreads etc. (And moreover, it is the one with which we’ll start our Parallel and Distributed Computing lab).

Some Basic Constructs

  • parallel : most basic construct. Creates a team of threads to execute the code that follows this directive, and all threads execute the code.
  • for : used just before a ‘for’ loop. Runs the iterations of the loop in parallel by splitting them amongst available threads.
  • barrier : used to provide barrier synchronization. Threads wait here until all threads in the team have finished execution.
  • single : only one thread executes the following block of code. Rest of the threads wait at the end of the construct.
And yeah, just in case this is your first time, don’t try to run these programs as you normally would. Compile from the command-line with -fopenmp flag.

First Non-Trivial Program

Formal Statement:

Given an array A[1..n], return an array S[1..n] such that S[i] = A[1] + A[2] + … + A[i] for all 0<i<n+1

For example:

Given the array [5,8,7,2,3], output will be [5,13,20,22,25]

Basic Parallel Algorithm:

  1. for j in (1 to log2(n))
  2.     for i in (j to n)
  3.         S[i] = A[i] + A[i – j]
  4.         i = i + 1
  5.     j = 2 * j

First attempt:

This code tries to simply use the most obvious parallelism in the algorithm. Steps 2 and 3 of the algorithm are shared between threads, each thread working on a different subset of the array. And yeah, to prevent race conditions, we use an auxiliary array which stores writes from a thread while other threads are reading from the other array.

#include <stdio.h>
#include <omp.h>

int main()
  int n, ar[2][100], *t, i, toread, size, j;
  printf("Enter length:");
  scanf("%d", &n);
  printf("Enter numbers:\n", n);
  for(i = 0; i < n; i++)
    scanf("%d", &ar[0][i]);
  /*set up complement to array that holds values*/
  toread = 1;
  /*copy first value, since it is not copied by the code*/
  ar[1][0] = ar[0][0];
  size = 0;
  /*following loop aims to get log2 of size, but can be avoided as in 2nd program*/
  while(i) {
    i >>= 1;
  /*following code implements algorithm*/
  for(j = 0; j < size; j++) {
    toread = !toread;
    if(toread) t = ar[0];
    else t = ar[1];
#pragma omp parallel for default(none) private(i) shared(n, j, t, ar, toread)
    for(i = 1; i < n; i++) {
      if(i - (1 << j) >= 0)
	t[i] = ar[toread][i] + ar[toread][i - (1 << j)];
      else t[i] = ar[toread][i];
  toread = !toread;
  for(i = 0; i < n; i++)
    printf("%d\n", *(*(ar + toread) + i));
  return 0;

Second attempt:

Here, we’ll divide the original array into sub-arrays, the number of which will be equal to the number of threads provided by the environment. Each thread will, linearly, calculate the prefix-sum for its assigned sub-array. These prefix-sums will be less than the actual sums, since elements before the start of a particular sub-array are ignored.

Now, the last elements of each of these sub-arrays is stored in another array. For this array, the prefix-sum array is calculated, and the corresponding values in original array are updated. This is done by adding to each element the requisite amount that was missing earlier.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <string.h>

int main()
  int *arr, *partial, *temp;
  int num_threads, work, n;
  int i, mynum, last;
  printf("Enter length:");
  scanf("%d", &n);
  if(!(arr = (int *) malloc (sizeof (int) * n))) return -1;
  printf("Enter numbers:\n");
  for(i = 0; i < n; i++)
    scanf("%d", arr + i);
#pragma omp parallel default(none) private(i, mynum, last) shared(arr, partial, temp, num_threads, work, n)
#pragma omp single
      num_threads = omp_get_num_threads();
      if(!(partial = (int *) malloc (sizeof (int) * num_threads))) exit(-1);
      if(!(temp = (int *) malloc (sizeof (int) * num_threads))) exit(-1);
      work = n / num_threads + 1; /*sets length of sub-arrays*/
    mynum = omp_get_thread_num();
    /*calculate prefix-sum for each subarray*/
	for(i = work * mynum + 1; i < work * mynum + work && i < n; i++)
      arr[i] += arr[i - 1];
    partial[mynum] = arr[i - 1];
#pragma omp barrier
    /*calculate prefix sum for the array that was made from last elements of each of the previous sub-arrays*/
	for(i = 1; i < num_threads; i <<= 1) {
      if(mynum >= i)
		temp[mynum] = partial[mynum] + partial[mynum - i];
#pragma omp barrier
#pragma omp single
      memcpy(partial + 1, temp + 1, sizeof(int) * (num_threads - 1));
    /*update original array*/
	for(i = work * mynum; i < (last = work * mynum + work < n ? work * mynum + work : n); i++)
      arr[i] += partial[mynum] - arr[last - 1];
  for(i = 0; i < n; i++)
    printf("%d\n", arr[i]);
  return 0;

Top Clicks

  • None