C - pthreads appear to only be utilizing one core

https://stackoverflow.com/questions/21768470

11-10-2022

Question

Let me first say that this is for school, but I don't really need help with the assignment; I'm just confused by some results I'm getting.

I have a simple program that approximates pi using Simpson's rule. In one assignment we had to do this by spawning 4 child processes, and now in this assignment we have to use 4 kernel-level threads. I've done this, but when I time the programs, the one using child processes runs faster (I get the impression I should be seeing the opposite result).

Here is the program using pthreads:

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>

// This complicated ternary statement does the bulk of our work.
// Basically depending on whether or not we're at an even number in our
// sequence we'll call the function with x/32000 multiplied by 2 or 4.
#define TERN_STMT(x) (((int)x%2==0)?2*func(x/32000):4*func(x/32000)) 

// Set to 0 for no 100,000 runs
#define SPEED_TEST 1

struct func_range {
  double start;
  double end;
};

// The function defined in the assignment
double func(double x)
{
  return 4 / (1 + x*x);
}

void *partial_sum(void *r) 
{
  double *ret = (double *)malloc(sizeof(double));
  struct func_range *range = r;
#if SPEED_TEST
  int k;
  double begin = range->start;
  for (k = 0; k < 25000; k++)
  {
    range->start = begin;
    *ret = 0;
#endif
    for (; range->start <= range->end; ++range->start)
      *ret += TERN_STMT(range->start);
#if SPEED_TEST
  }
#endif

  return ret;
}

int main()
{
  // An array for our threads.
  pthread_t threads[4];
  double total_sum = func(0);
  void *temp;
  struct func_range our_range;
  int i;

  for (i = 0; i < 4; i++)
  {
    our_range.start = (i == 0) ? 1 : (i == 1) ? 8000 : (i == 2) ? 16000 : 24000;
    our_range.end = (i == 0) ? 7999 : (i == 1) ? 15999 : (i == 2) ? 23999 : 31999;
    pthread_create(&threads[i], NULL, &partial_sum, &our_range);
    pthread_join(threads[i], &temp);
    total_sum += *(double *)temp;
    free(temp);
  }

  total_sum += func(1);

  // Final calculations
  total_sum /= 3.0;
  total_sum *= (1.0/32000.0);

  // Print our result
  printf("%f\n", total_sum);

  return EXIT_SUCCESS;
}

Here is using child processes:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>

// This complicated ternary statement does the bulk of our work.
// Basically depending on whether or not we're at an even number in our
// sequence we'll call the function with x/32000 multiplied by 2 or 4.
#define TERN_STMT(x) (((int)x%2==0)?2*func(x/32000):4*func(x/32000)) 

// Set to 0 for no 100,000 runs
#define SPEED_TEST 1

// The function defined in the assignment
double func(double x)
{
  return 4 / (1 + x*x);
}

int main()
{
  // An array for our subprocesses.
  pid_t pids[4];
  // The pipe to pass-through information
  int mypipe[2];
  // Counter for subproccess loops
  double j;
  // Counter for outer loop
  int i;
  // Number of PIDs
  int n = 4;
  // The final sum
  double total_sum = 0;
  // Temporary variable holding the result from a subproccess
  double temp;
  // The partial sum tallied by a subproccess.
  double sum = 0;
  int k;

  if (pipe(mypipe))
  {
    perror("pipe");
    return EXIT_FAILURE;
  }

  // Create the PIDs
  for (i = 0; i < 4; i++)
  {
    // Abort if something went wrong
    if ((pids[i] = fork()) < 0)
    {   
      perror("fork");
      abort();
    }   
    else if (pids[i] == 0)
    {
      // Depending on what PID number we are we'll only calculate
      // 1/4 the total.
#if SPEED_TEST
      for (k = 0; k < 25000; ++k)
      {
        sum = 0;
#endif
        switch (i)
        {
          case 0:
            sum += func(0);
            for (j = 1; j <= 7999; ++j)
              sum += TERN_STMT(j);
            break;
          case 1:
            for (j = 8000; j <= 15999; ++j)
              sum += TERN_STMT(j);
            break;
          case 2:
            for (j = 16000; j <= 23999; ++j)
              sum += TERN_STMT(j);
            break;
          case 3:
            for (j = 24000; j < 32000; ++j)
              sum += TERN_STMT(j);
            sum += func(1);
            break;
        }
#if SPEED_TEST
      }
#endif
      // Write the data to the pipe
      write(mypipe[1], &sum, sizeof(sum));
      exit(0);
    }
  }

  int status;
  pid_t pid;
  while (n > 0)
  {
    // Wait for the calculations to finish
    pid = wait(&status);
    // Read from the pipe
    read(mypipe[0], &temp, sizeof(temp));
    // Add to the total
    total_sum += temp;
    n--;
  }

  // Final calculations
  total_sum /= 3.0;
  total_sum *= (1.0/32000.0);

  // Print our result
  printf("%f\n", total_sum);

  return EXIT_SUCCESS;
}

Here is a time result from the pthreads version running 100,000 times:

real 11.15
user 11.15
sys 0.00

And here is the child process version:

real 5.99
user 23.81
sys 0.00

A user time of 23.81 against a real time of 5.99 implies that the user time is the sum of the time each core spent executing the code. In the pthreads run the real and user times are the same, which implies that only one core is being used. Why isn't it using all 4 cores? I thought that by default threads might do this better than child processes.

Hopefully this question makes sense, this is my first time programming with pthreads and I'm pretty new to OS-level programming in general.

Thanks for taking the time to read this lengthy question.

Solution

When you call pthread_join immediately after pthread_create, each thread has to finish before the next one is even created, so you're effectively serializing all the threads. Don't join threads until after you've created all the threads and done all the other work that doesn't need the result from the threaded computations.
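A minimal sketch of that restructuring, assuming the same Simpson's rule setup as in the question (32000 intervals, 4 threads). It is not the asker's exact program: `struct work`, `simpson_pi_threaded`, `NUM_THREADS` and `INTERVALS` are names made up for this illustration, and the 25000-iteration timing loop is left out. The sketch also gives each thread its own range struct. Once the threads really run concurrently, sharing the single `our_range` from the question would let the loop in `main` overwrite a range that a thread is still reading.

```c
#include <pthread.h>

#define NUM_THREADS 4      /* hypothetical name: number of worker threads */
#define INTERVALS   32000  /* hypothetical name: Simpson intervals, as in the question */

/* Hypothetical per-thread work item. Each thread gets its own copy,
   so no thread can see its range changed by the creating loop. */
struct work {
  int start;    /* first index, inclusive */
  int end;      /* last index, inclusive */
  double sum;   /* partial sum, written by the worker */
};

/* The function defined in the assignment. */
static double func(double x)
{
  return 4.0 / (1.0 + x * x);
}

/* Thread entry point: sums the weighted interior points of one range. */
static void *partial_sum(void *arg)
{
  struct work *w = arg;
  double sum = 0.0;
  for (int j = w->start; j <= w->end; j++) {
    double weight = (j % 2 == 0) ? 2.0 : 4.0;
    sum += weight * func((double)j / INTERVALS);
  }
  w->sum = sum;
  return NULL;
}

/* Returns the approximation of pi, or -1.0 if a thread could not be created. */
double simpson_pi_threaded(void)
{
  pthread_t threads[NUM_THREADS];
  struct work items[NUM_THREADS];
  int chunk = INTERVALS / NUM_THREADS;
  int started = 0;
  double total = func(0.0) + func(1.0);  /* the two endpoints */

  /* Phase 1: create every thread before waiting on any of them. */
  for (int i = 0; i < NUM_THREADS; i++) {
    items[i].start = (i == 0) ? 1 : i * chunk;
    items[i].end = (i + 1) * chunk - 1;
    items[i].sum = 0.0;
    if (pthread_create(&threads[i], NULL, partial_sum, &items[i]) != 0)
      break;
    started++;
  }

  /* Phase 2: only now join, collecting each partial sum. */
  for (int i = 0; i < started; i++) {
    pthread_join(threads[i], NULL);
    total += items[i].sum;
  }

  if (started < NUM_THREADS)
    return -1.0;

  return total / 3.0 * (1.0 / INTERVALS);
}
```

Calling `simpson_pi_threaded()` from `main` and printing the result with `printf("%f\n", ...)` should give the same value as the original programs. Build with something like `gcc -pthread`. With the joins moved into their own loop, the four workers can be scheduled on separate cores, so in a timing run like the one in the question the user time would be expected to exceed the real time.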

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow