Why is a single thread faster than multiple threads even though they essentially have the same overhead?

StackOverflow https://stackoverflow.com/questions/15673524

Question

I am running 64-bit Windows 7 on an 8-core processor. I ran the following:

    #include "stdafx.h"
    #include <iostream>
    #include <Windows.h>
    #include <process.h>
    #include <ctime>

    using namespace std;

    int count = 0;
    int t = time(NULL);

    //poop() loops incrementing count until it is 300 million.
    void poop(void* params) {
        while(count < 300000000) {
            count++;
        }


        cout<< time(NULL) - t <<" \n";
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        //_beginthread(poop, 0, NULL);      
        //_beginthread(poop, 0, NULL);
        poop(NULL);

        cout<<"done"<<endl;

        while(1);

        return 0;
    }

I compared the result to when I uncommented the _beginthread calls. It turns out the single-threaded version accomplishes this the fastest! Actually, adding more threads makes the process take even longer. Making the count 300 million made the process take 8+ seconds, which I figured was good enough to rule out the _beginthread function calls and other minor overhead.

I did a little research, and the usual explanation for a multithreaded process being slower is overhead. But in this case, whether I run multiple threads or a single one, the variable count (which lives in the data segment because it's a preallocated variable, AFAIK) is accessed the same number of times. So the overhead (if it is an overhead issue) is not coming from the fact that it costs more to access a global variable than a local one.

Looking at my Task Manager, the process with a single thread is using 13% CPU (about 1/8 of the cores), and adding threads increases the CPU usage in increments of roughly 1/8. So in terms of CPU power, assuming Task Manager is depicting this accurately, adding threads uses more CPU. Which further confuses me: how am I using more CPU overall, on separate cores, yet taking longer to accomplish the task?

TLDR: WHY DOES THIS HAPPEN

Was it helpful?

Solution

Your code is inherently wrong.

count++ is a three-step operation that reads the value, increments it, then stores it back into the variable.
If two threads run count++ at once on the same variable, one of them will overwrite the other one's changes.
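
Spelled out, the increment looks roughly like this (a sketch of the effective steps, not the exact code the compiler emits):

    int tmp = count;   // 1. read the shared value
    tmp = tmp + 1;     // 2. increment the local copy
    count = tmp;       // 3. write it back; any write another thread made in between is lost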

Therefore, the multi-threaded version will end up doing extra work as each thread clobbers the other threads' progress.

If you make count a local variable, the timing should look much more normal.

Alternatively, you could use an interlocked increment, which is thread-safe, but has extra overhead to synchronize across threads.
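
For example, a minimal sketch using the Win32 interlocked API (this assumes count stays a shared global and is declared as a volatile LONG, which is what InterlockedIncrement expects):

    #include <Windows.h>
    #include <iostream>
    #include <ctime>

    volatile LONG count = 0;
    int t = time(NULL);

    void poop(void* params) {
        while(count < 300000000) {
            InterlockedIncrement(&count);   // atomic read-modify-write, no lost updates
        }
        std::cout << time(NULL) - t << " \n";
    }

Each increment is now atomic, but every thread still contends for the same cache line, so this will typically still be slower than giving each thread its own local counter.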

Other tips

As some of the commenters on your original question have pointed out, you have both a correctness and a performance issue. Firstly, all your threads are accessing count concurrently. This means there is no guarantee that the threads will actually all count to 300 million. You can solve this correctness bug by declaring count within your poop function:

    void poop(void* params) {
        int count = 0;
        while(count < 300000000) {
            count++;
        }
        cout<< time(NULL) - t <<" \n";
    }

Note that this is not a problem for t because it is only read, not written, by the threads. However, it is a problem with cout, as you're also writing to that from multiple threads.
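
If you do want each thread to print, one option is to serialize the output, for example with a Win32 critical section (a sketch; g_printLock and print_elapsed are illustrative names, not part of the original code):

    #include <Windows.h>
    #include <iostream>

    CRITICAL_SECTION g_printLock;   // call InitializeCriticalSection(&g_printLock) once at startup

    void print_elapsed(int seconds) {
        EnterCriticalSection(&g_printLock);
        std::cout << seconds << " \n";   // only one thread writes to cout at a time
        LeaveCriticalSection(&g_printLock);
    }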

In addition, as pointed out in the comments, all your threads are accessing a single memory location. This means that whenever a thread updates count, the cache line that holds it must be invalidated on the other cores and reloaded, which is very inefficient memory access. This typically comes up when threads access consecutive elements of an array rather than a single shared variable (which is a bad idea anyway, see above). One solution is to pad the array so that each entry is an exact multiple of the L1 cache line size; this is obviously somewhat specific to your target processor. Another option is to restructure the algorithm so that either each thread processes a large block of consecutive elements, or each thread accesses elements in such a way that threads never touch adjacent locations.
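
As a sketch of the padding idea (assuming a 64-byte cache line, which is common but processor-specific):

    // One counter per thread, each padded out to its own cache line so that
    // updates by one thread do not invalidate the line used by another.
    struct PaddedCounter {
        int value;
        char padding[64 - sizeof(int)];
    };

    PaddedCounter counters[8];   // e.g. one slot per core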

As you're using Windows, you might want to consider using a higher level of abstraction for your code rather than the raw Win32 threading functions. The Parallel Patterns Library fits the bill here (as does Intel's Threading Building Blocks).

    #include <ppl.h>   // needed for concurrency::parallel_invoke

    concurrency::parallel_invoke(
        [=] { poop(nullptr); },
        [=] { poop(nullptr); }
    );

This allows the PPL to schedule your tasks on a thread pool, rather than your application having to explicitly create threads.

You might also consider that for really small tasks the overhead of starting additional threads may outweigh the gains of running in parallel.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow