Question

I have an application consisting of different modules written in C++.
One of the modules handles distributed tasks on Sun Grid Engine. It uses the DRMAA API for submitting and monitoring grid jobs. If the client doesn't support the grid, the local machine should be used.

The API's shared object, libdrmaa.so, is linked at build time and loaded at runtime.
If the client using my application has this ".so", everything is fine, but if the client doesn't have it, the application exits because it fails to load the shared library.
To avoid this, I have replaced the API calls with function pointers obtained using dlopen() and dlsym(). Now I can use the local machine instead of the grid if the call to dlopen() doesn't succeed, and my objective is achieved.
The problem now is that the application runs successfully for small test cases, but with larger test cases it throws a segmentation fault, while the same code using ordinary dynamic linking (libdrmaa.so linked at build time) works correctly.

Am I missing something while using dlsym() and dlopen()?
Is there any other way to achieve the same goal?

Any help would be appreciated.

Thanks,

Solution

It is very unlikely to be a direct problem with the code loaded via dlsym() - in the sense that the dynamic loading makes it seg-fault.

What it may be doing is exposing a separate problem, probably by moving stuff around in memory. This probably means a stray (uninitialized) pointer that points somewhere 'legitimate' when the library is linked directly but somewhere else when it is loaded with dlopen() - and the somewhere else triggers the seg-fault. Indeed, that is a benefit to you in the long run - it shows that there is a problem that otherwise might remain undetected for a long time.

I regard this as particularly likely since you mention that it occurs with larger tests and not with small ones.
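On the "am I missing something" part: one thing worth ruling out is an unchecked dlopen()/dlsym() result or a function pointer whose type doesn't match the real function. Below is a minimal sketch of how I would structure the loading, not your actual code. GridBackend, find_symbol and load_grid_backend are names I made up, and the drmaa_init signature is written from the DRMAA 1.0 C binding as I remember it, so check it against your drmaa.h.

```cpp
#include <dlfcn.h>   // dlopen, dlsym, dlerror, dlclose
#include <cstddef>
#include <string>

// Signature of drmaa_init as I recall it from the DRMAA 1.0 C binding;
// this is an assumption, verify it against the drmaa.h you build with.
using drmaa_init_fn = int (*)(const char* contact, char* error_diagnosis,
                              std::size_t error_diag_len);

// Hypothetical holder for the optional grid backend.
struct GridBackend {
    void* handle = nullptr;        // every member starts in a known state
    drmaa_init_fn init = nullptr;
    std::string error;             // last loader error, if any

    bool available() const { return handle != nullptr && init != nullptr; }
};

// Look up one symbol; dlerror() is what reports a failed lookup.
inline void* find_symbol(void* handle, const char* name, std::string& error) {
    dlerror();                            // clear any stale error first
    void* sym = dlsym(handle, name);
    if (const char* msg = dlerror()) {    // non-null means the lookup failed
        error = msg;
        return nullptr;
    }
    return sym;
}

inline GridBackend load_grid_backend(const char* library = "libdrmaa.so") {
    GridBackend backend;
    backend.handle = dlopen(library, RTLD_NOW | RTLD_LOCAL);
    if (!backend.handle) {
        const char* msg = dlerror();
        backend.error = msg ? msg : "dlopen failed";
        return backend;                   // caller falls back to the local machine
    }
    backend.init = reinterpret_cast<drmaa_init_fn>(
        find_symbol(backend.handle, "drmaa_init", backend.error));
    if (!backend.init) {                  // library present but not usable
        dlclose(backend.handle);
        backend.handle = nullptr;
    }
    return backend;
}
```

The caller then branches on backend.available() and never calls through a pointer that was not resolved. On older glibc versions you may need to link with -ldl. If every pointer is checked like this and the types match the header, the crash is more likely to come from elsewhere in the program, as described above.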

OTHER TIPS

As Jonathan Leffler says, the problem very likely exists in the case where you are using the API directly; it just hasn't caused a crash yet.

Your very first step when you get a SIGSEGV should be analyzing the resulting core dump (or just running the app directly under a debugger) and looking at where it crashed. I'll bet $0.02 that it's crashing somewhere inside malloc or free, in which case the problem is plain old heap corruption, and there are many heap-checker tools available to help you catch it. Solaris provides watchmalloc, which is a good start.

If you are throwing an exception across an extern "C" function, the application will most likely terminate. This is because the C ABI has no facilities to propagate exceptions, so unwinding through C frames is not guaranteed to work.
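One common way to keep exceptions from crossing that boundary is to catch everything inside the extern "C" function and report failure through a return code. This is only a sketch: submit_job, submit_job_impl and the error-buffer convention are invented for illustration and are not part of DRMAA.

```cpp
#include <cstddef>
#include <cstring>
#include <exception>
#include <stdexcept>

// Hypothetical C++ routine that may throw.
static int submit_job_impl(int job_count) {
    if (job_count < 0) throw std::invalid_argument("job_count must be >= 0");
    return job_count * 2;
}

// Copy a message into a caller-supplied buffer, always NUL-terminated.
static void copy_error(char* err, std::size_t err_len, const char* msg) {
    if (err && err_len > 0) {
        std::strncpy(err, msg, err_len - 1);
        err[err_len - 1] = '\0';
    }
}

// C-callable wrapper: no exception is allowed to leave this function.
// Returns 0 on success, non-zero on failure (message in err).
extern "C" int submit_job(int job_count, int* result,
                          char* err, std::size_t err_len) noexcept {
    try {
        *result = submit_job_impl(job_count);
        return 0;
    } catch (const std::exception& e) {
        copy_error(err, err_len, e.what());
        return 1;
    } catch (...) {
        copy_error(err, err_len, "unknown error");
        return 2;
    }
}
```

The caller on the other side of the boundary only ever sees an int and a C string, so it does not matter whether it was compiled as C or C++.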

To counter this when using DLLs (or shared libs), you normally have one C function that returns a C++ object. The remaining interaction is then with the C++ object that was returned from the DLL.

This pattern suggests (and I stress suggests) a factory-like object: your DLL should have a single extern "C" function that returns a void*, which you can cast (static_cast<> is sufficient for a void*) back into a C++ factory object.
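A rough sketch of that factory idea, with all names (JobRunner, LocalRunner, create_runner, destroy_runner) invented for illustration. In a real setup the two extern "C" functions would live in the shared library and be looked up with dlsym(); here everything sits in one file so the example stands alone.

```cpp
#include <string>

// Abstract interface shared by the application and the library (hypothetical).
class JobRunner {
public:
    virtual ~JobRunner() = default;
    virtual std::string name() const = 0;
    virtual int run(int task_id) = 0;
};

// --- would live inside the shared library ---
namespace {
class LocalRunner : public JobRunner {
public:
    std::string name() const override { return "local"; }
    int run(int task_id) override { return task_id + 1; }  // placeholder work
};
}  // namespace

// The only symbols the application needs to look up by name.
extern "C" void* create_runner() noexcept {
    try {
        JobRunner* runner = new LocalRunner();  // convert to the base type first
        return runner;                          // then hand out as void*
    } catch (...) {
        return nullptr;                         // never let an exception escape
    }
}

extern "C" void destroy_runner(void* p) noexcept {
    // Cast back to exactly the type that was converted to void* above.
    delete static_cast<JobRunner*>(p);
}

// --- would live in the application ---
inline int run_one_task(int task_id) {
    void* raw = create_runner();
    if (!raw) return -1;
    JobRunner* runner = static_cast<JobRunner*>(raw);
    int result = runner->run(task_id);
    destroy_runner(raw);   // free in the same module that allocated
    return result;
}
```

Two caveats with this design: the application and the library need a compatible C++ ABI (same compiler and standard library, roughly) for the virtual calls and std::string to work across the boundary, and the object should be destroyed by the library that created it, which is why destroy_runner exists.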

Licensed under: CC-BY-SA with attribution