Question

I'm profiling a multithreaded program running with different numbers of allowed threads. Here are the performance results of three runs of the same input work.

1 thread:
  Total thread time: 60 minutes.
  Total wall clock time: 60 minutes.

10 threads:
  Total thread time: 80 minutes. (Worked 33% longer)
  Total wall clock time: 18 minutes.  3.3 times speed up

20 threads:
  Total thread time: 120 minutes. (Worked 100% longer)
  Total wall clock time: 12 minutes.  5 times speed up

Since it takes more thread time to do the same work, I feel the threads must be contending for resources.

I've already examined the four pillars (cpu, memory, diskIO, network) on both the app machine and the database server. Memory was the original contended resource, but that's fixed now (more than 1G free at all times). CPU hovers between 30% and 70% on the 20 thread test, so plenty there. diskIO is practically none on the app machine, and minimal on the database server. The network is really great.

I've also code-profiled with redgate and see no methods waiting on locks. It helps that the threads are not sharing instances. Now I'm checking more nuanced items like database connection establishing/pooling (if 20 threads attempt to connect to the same database, do they have to wait on each other?).

I'm trying to identify and address the resource contention, so that the 20-thread run would look like this:

20 threads:
  Total thread time: 60 minutes. (Worked 0% longer)
  Total wall clock time: 6 minutes.  10 times speed up

What are the most likely sources (other than the big 4) that I should be looking at to find that contention?


The code that each thread performs is approximately:

Run ~50 compiled LinqToSql queries
Run ILOG Rules
Call WCF Service which runs ~50 compiled LinqToSql queries, returns some data
Run more ILOG Rules
Call another WCF service which uses devexpress to render a pdf, returns as binary data
Store pdf to network
Use LinqToSql to update/insert. DTC is involved: multiple databases, one server.

The WCF Services are running on the same machine and are stateless and able to handle multiple simultaneous requests.


Machine has 8 CPUs.

Solution

What you describe is that you want 100% scalability, i.e. a 1:1 relation between the increase in threads and the decrease in wall-clock time. That is usually the goal, but it is hard to reach.

For example, you write that there is no memory contention because more than 1 GB is free. That is, IMHO, a wrong assumption: memory contention also means that if two threads try to allocate memory, one may have to wait for the other. Another point to keep in mind are the interruptions caused by the GC, which temporarily freezes all threads. The GC can be customized a bit via configuration (gcServer) - see http://blogs.msdn.com/b/clyon/archive/2004/09/08/226981.aspx
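For reference, server GC is switched on in the application's .config file. A minimal sketch (the element goes under `<runtime>`; all other settings here are whatever your app already has):

```xml
<configuration>
  <runtime>
    <!-- Server GC: one GC heap and GC thread per CPU, which reduces
         allocation contention and pause impact on multi-CPU boxes. -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```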

Another point is the WCF services you call. If one of them can't scale up - the PDF rendering, for example - then that too is a form of contention.

The list of possible contention points is practically endless, and it is rarely limited to the obvious areas you mentioned.

EDIT - as per comments:

Some points to check:

  • connection pooling
    Which provider do you use, and how is it configured?
  • PDF rendering
    Possible contention would have to be measured inside the rendering library you use.
  • Linq2SQL
    Check the execution plans for all these queries; some may take locks and thus create contention on the database-server side.
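On the connection-pooling point: with ADO.NET's SqlClient, pooling behavior is controlled through the connection string. A hedged sketch (server, database, and pool-size values here are purely illustrative, not recommendations):

```
Server=dbserver;Database=AppDb;Integrated Security=true;
Pooling=true;Min Pool Size=20;Max Pool Size=100;
```

With 20 worker threads, a `Max Pool Size` below 20 would force threads to queue for connections - exactly the kind of hidden contention you are hunting.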

EDIT 2:

Threads
Are these threads from the ThreadPool? If so, then you won't scale :-(

EDIT 3:

ThreadPool threads are a bad fit for long-running tasks, which is what you have in your scenario. For details, see

From http://www.yoda.arachsys.com/csharp/threads/printable.shtml

Long-running operations should use newly created threads; short-running operations can take advantage of the thread pool.
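The advice above can be sketched language-agnostically; here is a minimal Python illustration of the "dedicated threads for long-running work" pattern (the sleep is a hypothetical stand-in for your queries, rules, and WCF calls):

```python
import threading
import time

def long_running_job(job_id, results):
    # Placeholder for a long-running, I/O-bound unit of work
    # (the queries, rules, and service calls in the original scenario).
    time.sleep(0.05)
    results[job_id] = f"job-{job_id} done"

# Create dedicated threads instead of borrowing from a shared pool,
# so long-running jobs cannot starve the pool for other work.
results = {}
threads = [
    threading.Thread(target=long_running_job, args=(i, results))
    for i in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # all 20 jobs completed on dedicated threads
```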

If you want extreme performance, then it could be worth checking out CQRS and the real-world example described as LMAX.

OTHER TIPS

Instead of measuring the total thread time, measure the time taken by each operation that does I/O of some sort (database, disk, network, etc.).

I suspect you are going to find that these operations are the ones that take longer when you have more threads, and this is because the contention is on the other end of that I/O. For example, your database might be serializing requests for data consistency.
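A simple way to get those per-operation numbers is to wrap each step in a timer and accumulate totals per operation name. A minimal sketch (the operation names and sleeps are hypothetical stand-ins for your real steps):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per named operation; compare these
# totals between the 1-thread and 20-thread runs to see which
# step grows as concurrency rises.
op_totals = defaultdict(float)

@contextmanager
def timed(op_name):
    start = time.perf_counter()
    try:
        yield
    finally:
        op_totals[op_name] += time.perf_counter() - start

# Hypothetical stand-ins for the real steps:
with timed("db_queries"):
    time.sleep(0.01)   # placeholder for the ~50 LinqToSql queries
with timed("pdf_render"):
    time.sleep(0.02)   # placeholder for the PDF-rendering WCF call

for op, total in op_totals.items():
    print(f"{op}: {total:.3f}s")
```

The step whose per-call time inflates the most between runs is where the contention lives.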

Yes, there's resource contention. All the threads have to read/write data over the same memory bus, directed at the same RAM modules, for example. It doesn't matter how much RAM is free; what matters is that the reads/writes are carried out by the same memory controller on the same RAM modules, and that the data travels over the same bus.

If there's any kind of synchronization anywhere, then that too is a contended resource. If there's any I/O, that's a contended resource.

You're never going to see an N-times speedup when going from 1 to N threads. It's not possible because, ultimately, everything in the CPU is a shared resource on which there will be some degree of contention.

There are plenty of factors preventing you from getting the full linear speedup. You're assuming that the database, the server the database is running on, the network connecting it to the client, the client computer, the OS and drivers on both ends, the memory subsystem, disk I/O, and everything in between are capable of simply going 20 times faster when you go from 1 to 20 threads.

Two words: dream on.

Each of these bottlenecks only has to slow you down by a few percent for the overall result to look like what you're seeing.
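You can quantify this with Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N), where s is the serial (non-parallelizable) fraction of the work. Working backwards from your own numbers:

```python
# Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N),
# where s is the serial fraction of the work.
def serial_fraction(speedup, n_threads):
    # Solve Amdahl's law for s, given an observed speedup at N threads.
    return (1 / speedup - 1 / n_threads) / (1 - 1 / n_threads)

# Observed: 60 min on 1 thread, 12 min on 20 threads -> 5x speedup.
s = serial_fraction(speedup=5.0, n_threads=20)
print(f"serial fraction ~ {s:.1%}")       # roughly 16%

# With that fraction, even infinitely many threads cap out at 1/s:
print(f"max speedup ~ {1 / s:.1f}x")      # roughly 6.3x
```

In other words, your measurements are consistent with roughly 16% of the work being effectively serialized; finding and shrinking that fraction is the whole game.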

I'm sure you can tweak it to scale a bit better, but don't expect miracles.

But one thing you might look for is cache-line sharing (false sharing). Do threads access data that sits very close to data used by other threads? How often can you avoid that?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow