Question

I'm getting an error I really don't understand when reading or writing files using a PCIe block device driver. I seem to be hitting an issue in swiotlb_unmap_sg_attrs(), which appears to be doing a NULL dereference of the sg pointer, but I don't know where this is coming from, as the only scatterlist I use myself is allocated as part of the device info structure and persists as long as the driver does.

There is a stacktrace to go with the problem. It tends to vary a bit in exact details, but it always crashes in swiotlb_unmap_sq_attrs().

I think it's likely I have a locking issue, as I am not sure how to handle the locks around the IO functions. The lock is already held when the request function is called, I release it before the IO functions themselves are called, as they need an (MSI) IRQ to complete. The IRQ handler updates a "status" value, which the IO function is waiting for. When the IO function returns, I then take the lock back up and return to request queue handling.

The crash happens in blk_fetch_request() during the following:

if (!__blk_end_request(req, res, bytes)){
    printk(KERN_ERR "%s next request\n", DRIVER_NAME);

    req = blk_fetch_request(q);
} else {
    printk(KERN_ERR "%s same request\n", DRIVER_NAME);
}

where bytes is updated by the request handler to be the total length of IO (summed length of each scatter-gather segment).

Was it helpful?

Solution

It turned out this was due to re-entrancy of the request function. Because I was unlocking in the middle to allow IRQs to come in, the request function could be called again, would take the lock (while the original request handler was waiting on IO) and then the wrong handler would get the IRQ and everything went south with stacks of failed IO.

The way I solved this was to set a "busy" flag at the start of the request function, clear it at the end and return immediately at the start of the function if this is set:

static void mydev_submit_req(struct request_queue *q){
    struct mydevice *dev = q->queuedata;

    // We are already processing a request
    // so reentrant calls can take a hike
    // They'll be back
    if (dev->has_request)
        return;

    // We own the IO now, new requests need to wait
    // Queue lock is held when this function is called
    // so no need for an atomic set
    dev->has_request = 1; 

    // Access request queue here, while queue lock is held

    spin_unlock_irq(q->queue_lock);

    // Perform IO here, with IRQs enabled
    // You can't access the queue or request here, make sure 
    // you got the info you need out before you release the lock

    spin_lock_irq(q->queue_lock);

    // you can end the requests as needed here, with the lock held

    // allow new requests to be processed after we return
    dev->has_request = 0;

    // lock is held when the function returns
}

I am still not sure why I consistently got the stacktrace from swiotlb_unmap_sq_attrs(), however.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top