What's the suggested way of storing a resource ETag?

https://stackoverflow.com/questions/12049642

27-06-2021

Question

Where should I store the ETag for a given resource?

Approach A: compute on the fly

Get the resource and compute the ETag on the fly upon each request:

$resource = $repository->findByPK($id); // query

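// Compute the ETag from the update time.
// md5() needs a string, so the DateTime is formatted as a Unix timestamp first.
$etag = md5($resource->getUpdatedAt()->format('U'));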

$response = new Response();
$response->setETag($etag);
$response->setLastModified($resource->getUpdatedAt());

if ($response->isNotModified($this->getRequest())) {
    return $response; // 304
}

Approach B: storing at database level

Saving a bit of CPU time while making INSERT and UPDATE statements a bit slower (we use triggers to keep the ETag updated):

$resource = $repository->findByPK($id); // query

$response = new Response();
$response->setETag($resource->getETag());
$response->setLastModified($resource->getUpdatedAt());

if ($response->isNotModified($this->getRequest())) {
    return $response;
}

Approach C: caching the ETag

This is like approach B, but the ETag is stored in some cache middleware.

Solution

I suppose it depends on the cost of obtaining the items that go into the ETag itself.

I mean, the user sends along a request for a given resource; this should trigger a retrieval operation on the database (or some other operation).

If the retrieval is something simple such as fetching a file, then inquiring on the file stats is fast, and there's no need to store anything anywhere: an MD5 of the file path plus its update time is enough.
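A minimal sketch of that file-based tag, assuming the resource maps to a single file on disk. The helper name `fileETag()` and the `|` separator are illustrative choices, not part of any framework:

```php
<?php
// Hypothetical helper: builds an ETag from the file path and its modification time.
function fileETag(string $path): string
{
    // PHP caches stat results within a request; drop the entry for this file first.
    clearstatcache(true, $path);

    $mtime = filemtime($path);
    if ($mtime === false) {
        throw new RuntimeException("Cannot read modification time of $path");
    }

    // Entity tags are sent as quoted strings.
    return '"' . md5($path . '|' . $mtime) . '"';
}
```

Since modification times have one-second resolution, two writes within the same second would yield the same tag; whether that matters depends on how the files are updated.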

If the retrieval implies querying a database, then it depends on whether you can decompose the query without losing performance. For example, the user requests an article by ID, and you might first retrieve only the relevant data from the article table. A cache "hit" then entails a single SELECT on a primary key, but a cache "miss" means you have to query the database again, wasting the first query (or not, depending on your model).

If the query (or sequence of queries) is well-decomposable (and the resulting code maintainable), then I'd go with the dynamic ETag again.

If it is not, then it mostly depends on the query cost and the overall maintenance cost of a stored-ETag solution. If the query is costly (or the output is bulky) and INSERT/UPDATEs are few, then (and, I think, only then) it will be advantageous to store a secondary column (or table) with the ETag.

As for the caching middleware, I don't know. If I had a framework keeping track of everything for me, I might say 'go for it': the middleware is supposed to take care of, and implement, the points above. Should the middleware be implementation-agnostic (unlikely, unless it's a cut-and-paste slap-on, which is not unheard of), then there would be either the risk of it "screening" updates to the resource, or excessive awkwardness in invoking some cache-clearing API upon updates. Both factors would need to be evaluated against the load improvement offered by ETag support.

I don't think that in this case a 'silver bullet' exists.

Edit: in your case there is little or no difference between cases A and B. To be able to implement getUpdatedAt(), you would need to store the update time in the model anyway.

In this specific case I think that the dynamic, explicit calculation of the ETag (case A) would be simpler and more maintainable. The retrieval cost is incurred in any case, and the explicit calculation cost is that of an MD5 calculation, which is really fast and completely CPU-bound. The advantages in maintainability and simplicity are, in my opinion, overwhelming.
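To make case A concrete without any framework, here is a rough sketch of the two steps involved: deriving the tag from the update time, and comparing it with the client's If-None-Match header. Both function names are hypothetical, the comparison is the weak one (a leading `W/` is ignored), and splitting the header on commas is a simplification that assumes the tags themselves contain no commas:

```php
<?php
// Hypothetical helper: derives a quoted ETag from the resource's update time.
function etagFromTimestamp(DateTimeInterface $updatedAt): string
{
    return '"' . md5($updatedAt->format('U')) . '"';
}

// Hypothetical helper: true when the If-None-Match value covers the current ETag.
function etagMatches(string $etag, ?string $ifNoneMatch): bool
{
    if ($ifNoneMatch === null || trim($ifNoneMatch) === '') {
        return false; // unconditional request
    }
    if (trim($ifNoneMatch) === '*') {
        return true; // "*" matches any current representation
    }

    // Weak comparison: ignore a leading W/ on either side.
    $strip = fn(string $tag): string => preg_replace('/^W\//', '', trim($tag));

    foreach (explode(',', $ifNoneMatch) as $candidate) {
        if ($strip($candidate) === $strip($etag)) {
            return true;
        }
    }
    return false;
}
```

A controller would call `etagMatches($etag, $_SERVER['HTTP_IF_NONE_MATCH'] ?? null)` right after loading the resource and answer with a 304 and no body when it returns true.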

On a semi-related note, it occurs to me that in some cases (infrequent updates to the database and much more frequent queries to the same) it might be advantageous and almost transparent to implement a global Last-Modified time for the whole database. If the database has not changed, then there is no way that any query to the database can return changed resources, no matter what the query is. In such a situation, one would only need to store the global Last-Modified flag in some place that is easy and quick to retrieve (not necessarily the database). For example, one could call

function dbModified() {
    touch('.last-update'); // creates the file, or updates its modification time
}

in any INSERT/UPDATE/DELETE code. The resource would then add a header

function sendModified() {
    $tsstring = gmdate('D, d M Y H:i:s ', filemtime('.last-update')) . 'GMT';
    header("Last-Modified: " . $tsstring);
}

to inform the browser of that resource's modification time.

Then, any request for a resource that includes If-Modified-Since could be bounced back with a 304 without ever accessing the persistence layer (or at least while saving all persistent resource access). No update time at record level would be needed:

function ifNotModified() {
    // Mind the timezone settings. The explicit GMT helps, but it's not always enough.
    $ims = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
        ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE'])
        : -1; // This ensures the test below will FAIL

    // If the marker file does not exist yet, never answer 304
    // (filemtime() would return false, and false <= $ims could be true).
    $mtime = file_exists('.last-update') ? filemtime('.last-update') : PHP_INT_MAX;

    if ($mtime <= $ims) {
        // The database was never updated after the resource retrieval.
        // There's no way the resource may have changed.
        header('HTTP/1.1 304 Not Modified');
        exit;
    }
}

One would put the ifNotModified() call as early as possible in the resource supply route, the sendModified() call as early as possible in the resource output code, and the dbModified() call wherever the database gets modified significantly as far as resources are concerned (i.e., you can and probably should avoid it when logging access statistics to the database, as long as they do not influence the resources' content).

Other tips

In my opinion, persisting ETags is a BAD IDEA unless your business logic is ABOUT persisting ETags, as when you write an application that tracks users based on ETags and this is a business feature :).

Potential savings in execution time will be small or non-existent. The downsides of this solution are certain, and they grow as your application grows.

According to the specification, the same version of a resource may yield different ETags depending on the endpoint from which it was obtained.

From http://en.wikipedia.org/wiki/HTTP_ETag:

"Comparing ETags only makes sense with respect to one URL—ETags for resources obtained from different URLs may or may not be equal, so no meaning can be inferred from their comparison."

From this you may conclude that you should persist not just the ETag but also its endpoint, and store as many ETags as you have endpoints. Sounds crazy?

Even if you ignore the HTTP specification and provide just one ETag per Entity, without any metadata about its endpoints, you are still binding together at least two layers (caching and business logic) that ideally should not be mixed. The idea behind having an Entity (versus some loose data) is to keep separated, uncoupled business logic in it, and not to pollute it with stuff about networking, view layer data or... caching.
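One way to respect both points is to compute the tag in the HTTP layer from the request URI and the entity's update time, so nothing about endpoints or caching is stored on the entity. This is only a sketch under that assumption; `endpointETag()` is a hypothetical helper:

```php
<?php
// Hypothetical helper that lives in the HTTP layer, not in the entity.
function endpointETag(string $requestUri, DateTimeInterface $updatedAt): string
{
    // Mixing the URI in makes the tag specific to one endpoint.
    return '"' . md5($requestUri . '|' . $updatedAt->format('U')) . '"';
}
```

The same entity served from two URLs then gets two different tags, and neither of them needs to be persisted.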

IMHO, this depends on how often resources are updated vs. how often resources are read.

If each ETag is read 1 or 2 times between modifications, then just calculate them on the fly.
If your resources are read far more often than they're updated, then you'd better cache them, calculating the ETag every time the resource is modified (so you don't have to bother with out-of-date cached ETags).

If ETags are modified almost as often as they're read, then I'd still cache them, especially since it seems your resources are stored on a database.
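The "calculate on every modification" idea could look roughly like the sketch below. `ETagCache` is a hypothetical in-memory stand-in for whatever cache is actually used, and the tag recipe (key plus update timestamp) is just one possible choice:

```php
<?php
// Hypothetical in-memory stand-in for a real cache backend.
final class ETagCache
{
    /** @var array<string, string> resource key => ETag */
    private array $tags = [];

    // Call on every write, so a cached tag can never be out of date.
    public function refresh(string $key, DateTimeInterface $updatedAt): string
    {
        return $this->tags[$key] = '"' . md5($key . '|' . $updatedAt->format('U')) . '"';
    }

    // Returns null when no tag is cached for this key.
    public function get(string $key): ?string
    {
        return $this->tags[$key] ?? null;
    }

    // Call when the resource is deleted.
    public function forget(string $key): void
    {
        unset($this->tags[$key]);
    }
}
```

Reads then only call `get()`, and the save and delete code paths are the only places that touch `refresh()` and `forget()`.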

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow