How to implement argmax with OpenMP?

Question 1

Your solution is completely standard conformant. Anyhow, if you are willing to add a bit of syntactic sugar, you may try something like the following:

#include<iostream>

using namespace std;

double toOptimize(int arg) {
  return arg * (arg%100);
}

class MaximumEntryPair {
public:

  MaximumEntryPair(size_t index = 0, double value = 0.0) : index_(index), value_(value){}

  void update(size_t arg) {
    double v = toOptimize(arg);
    if( v > value_ ) {
      value_ = v;
      index_ = arg;
    }
  }

  bool operator<(const MaximumEntryPair& other) const {
    if( value_ < other.value_ ) return true;
    return false;
  }  

  size_t index_;
  double value_;
};



int main() {
  MaximumEntryPair best;
#pragma omp parallel 
  {
    MaximumEntryPair thread_local;
    #pragma omp for
    for(size_t ii = 0 ; ii < 1050 ; ++ii) {
      thread_local.update(ii);
    } // implicit barrier
#pragma omp critical
    {
      if ( best < thread_local ) best = thread_local;
    }

  } // implicit barries
  cout << "The maximum is " << best.value_ << " obtained at index " << best.index_ << std::endl;
  cout << "\t toOptimize(" << best.index_ << ") = " << toOptimize(best.index_) << std::endl;
  return 0;
}

Question 2

I would just create a separate buffer for each thread to store a val and idx and then select the max out of the buffer afterwards.

    std::vector<double> thread_maxes(omp_get_max_threads());
    std::vector<int>    thread_max_ids(omp_get_max_threads());

    #pragma omp for reduction(max: best_value)
    for(size_t n = 2 ; n <= MAX ; ++n)
    {
      int thread_num = omp_get_num_threads();
      double v = toOptimize(n);
      if(v > thread_maxes[thread_num])
      {
        thread_maxes[thread_num] = v;
        thread_max_ids[thread_num] = i;
      }
    }

    std::vector<double>::iterator max =
      std::max_element(thread_maxes.begin(), thread_maxes.end());
    best.val = *max;
    best.idx = thread_max_ids[max - thread_maxes.begin()];

Question 3

Your solution is fine. It has O(nthreads) convergence with the critical section. However, it's possible to do this with O(Log(nthreads)) convergence.

For example imagine there were 32 threads. You would first find the local max for the 32 threads. Then you could combine pairs with 16 threads, then 8, then 4, then 2, then 1. In five steps you could merge the local max values without a critical section and free threads in the process. But your method would merge the local max values in 32 steps in a critical section and uses all threads.

The same logic goes for a reduction. That's why it's best to let OpenMP do the reduction rather than do it manually with an atomic section. But at least in the C/C++ implementation of OpenMP there is no easy way to get the max/min in O(Log(nthreads)). It might be possible using tasks but I have not tried that.

In practice this might not make a difference since the time to merge the local values even with a critical section is probably negligible compared the time to do the parallel loop. It probably makes more of a difference on the GPU though where the number of "threads" is much larger.