Question

I am reading from an SSD and have submitted 20 async jobs. io_getevents returned 7, which indicates that it timed out. The timeout is set to 10 seconds, as shown below, but the call actually took only 4.89e-05 seconds, i.e. almost all of the 10 seconds were still left. Has anyone run into this? If so, did you find a solution?

Here is part of the code:

struct timespec ts = { 10, 0 } ; /* ten seconds delay */
const long ec = io_getevents( ctx, num_jobs, num_jobs, &events[ 0 ], &ts ) ;

When ec comes back as 7, ts.tv_sec = 10 and ts.tv_nsec = 0.

Linux kernel:

Linux VTL80-G-1J4-823-21 2.6.18-274.18.1.el5 #1 SMP Thu Feb 9 12:20:03 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

Your help is greatly appreciated! BTW. I will not be able to check the post earlier than in a few hours.


Solution

By adding extra steps and debugging output, we figured out that there is a problem with the AIO driver on our Linux (5.3 Carthage, 2.6.18-128.el5).

The solution we applied (I'm posting it in case someone runs into the same problem) is this:
(We count the elapsed seconds for a call ourselves.)

1) If io_getevents() returns an error, we report it. DONE.

2) If 0 jobs finished and the elapsed time we count ourselves exceeds the expected timeout, we report an error. DONE. Otherwise we continue (we do not change the timeout on io_getevents()).

3) If some jobs finished, we check each one's res for an error (a negative value), and if any job failed we report it. DONE.

4) If some jobs remain, we reset the timer (yes, we will again wait the 'expected' time) and continue.

With this method we report an error if io_getevents() reported an error, if any of the jobs reported an error, or if no job finished within the expected time. In the worst case, when each job returns OK after a wait of T-epsilon, the whole process takes N * T to complete.
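
The four rules above can be sketched as a single decision function that is independent of libaio, which makes them easy to check in isolation. This is only a minimal sketch: the names Action and decide() are hypothetical and not part of libaio, and the caller is assumed to pass in the return value of io_getevents(), the res values of the returned events, and the elapsed and expected seconds it measured itself.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; not part of libaio.
enum class Action {
    ReportError,       // rule 1: io_getevents() itself failed
    ReportTimeout,     // rule 2: nothing finished and our own clock says time is up
    KeepWaiting,       // rule 2: nothing finished yet, but time is left
    ReportJobFailure,  // rule 3: a finished job has a negative res
    ResetTimer,        // rule 4: some jobs remain, wait the expected time again
    Done               // every job finished successfully
};

// ec        : value returned by io_getevents() (negative on error, else event count)
// results   : res of each returned event, as signed values
// elapsed   : seconds measured by the caller since the timer was (re)started
// expected  : the timeout the caller is willing to wait, in seconds
// remaining : number of jobs outstanding before this call
Action decide(long ec, const std::vector<std::int64_t>& results,
              double elapsed, double expected, long remaining)
{
    assert(expected > 0.0 && remaining > 0);

    if (ec < 0)
        return Action::ReportError;
    if (ec == 0)
        return elapsed > expected ? Action::ReportTimeout : Action::KeepWaiting;
    for (std::int64_t r : results)
        if (r < 0)
            return Action::ReportJobFailure;
    return (remaining - ec > 0) ? Action::ResetTimer : Action::Done;
}
```

For the case in the question (7 of 20 jobs returned almost immediately, all with non-negative res), decide() yields ResetTimer, so the remaining 13 jobs get a fresh wait of the expected time.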

I hope someone will find it useful.
Blessings,
Greg.

Example:

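// Fragment: needs <libaio.h>, <time.h>, <cerrno>, <cstdint>, <stdexcept>, <vector>.
// Assumes ctx, num_remaining_jobs and SECONDS_DELAY are defined by the caller.
struct timespec       tmCountStart ;
unsigned              seconds_delay = SECONDS_DELAY ;

// CLOCK_MONOTONIC is not affected by wall-clock adjustments.
clock_gettime( CLOCK_MONOTONIC, &tmCountStart ) ;
while ( num_remaining_jobs > 0 )
{
    struct timespec ts = { static_cast< time_t >( seconds_delay ), 0 } ;
    std::vector< io_event > events( num_remaining_jobs ) ;
    long ec ;

    do // retry if the wait was interrupted by a signal
    {
        ec = io_getevents( ctx, num_remaining_jobs, num_remaining_jobs, &events[ 0 ], &ts ) ;
    }
    while( ec == -EINTR ) ;

    if ( ec < 0 ) // the caller should cancel all remaining jobs
        throw std::runtime_error( "io_getevents() failed" ) ;
    else if ( ec == 0 )
    {
        // count the elapsed seconds ourselves instead of trusting the early return
        struct timespec now ;
        clock_gettime( CLOCK_MONOTONIC, &now ) ;
        const double elapsed = ( now.tv_sec - tmCountStart.tv_sec )
                             + ( now.tv_nsec - tmCountStart.tv_nsec ) / 1e9 ;
        if ( elapsed >= SECONDS_DELAY ) // the caller should cancel all remaining jobs
            throw std::runtime_error( "timed out waiting for AIO jobs" ) ;
        // shrink the next wait to the time that is left (always at least 1 second here)
        seconds_delay = SECONDS_DELAY - static_cast< unsigned >( elapsed ) ;
    }
    else // we got some jobs back, possibly not all of them
    {
        for ( long i = 0 ; i < ec ; i++ )
            if ( static_cast< int64_t >( events[ i ].res ) < 0 ) // the caller should cancel all remaining jobs
                throw std::runtime_error( "an AIO job failed" ) ;

        num_remaining_jobs -= ec ;
        if ( num_remaining_jobs > 0 )
        {
            clock_gettime( CLOCK_MONOTONIC, &tmCountStart ) ; // reset timer.
            seconds_delay = SECONDS_DELAY ;
        }
    }
}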
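
The elapsed-time measurement that the loop relies on can be factored into a small helper. The following is a minimal sketch under a few assumptions: seconds_between() and elapsed_since() are hypothetical names, and the helper reads CLOCK_MONOTONIC, so the start time passed to it must have been taken from the same clock.

```cpp
#include <time.h>

// Difference between two timespec values, in seconds.
double seconds_between(const timespec& start, const timespec& end)
{
    return static_cast<double>(end.tv_sec - start.tv_sec)
         + static_cast<double>(end.tv_nsec - start.tv_nsec) / 1e9;
}

// Seconds elapsed since `start`, which must come from CLOCK_MONOTONIC.
double elapsed_since(const timespec& start)
{
    timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return seconds_between(start, now);
}
```

Mixing clocks (for example a start time from CLOCK_REALTIME with this helper) would give meaningless results, so the timer start and the elapsed check should use the same clock id.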
Licensed under: CC-BY-SA with attribution