Will this protect me from ETag tracking?

https://stackoverflow.com/questions/20335365

07-08-2022

Question

Background: ETag tracking is well explained here and also mentioned on Wikipedia.

An answer I wrote in response to "How can I prevent tracking by ETags?" prompted me to ask this question.

I have a browser-side solution which prevents ETag tracking. It works without modifying the current HTTP protocol. Is this a viable solution to ETag tracking?

Instead of telling the server our ETag we ASK the server about its ETag, and we compare it to the one we already have.

Pseudo code:

if (file_not_in_cache)
{
    page=http_get_request();     
    page.display();
    page.put_in_cache();
}
else
{
    page=load_from_cache();
    client_etag=page.extract_etag();
    server_etag=http_HEAD_request().extract_etag();

    //Instead of saying "my etag is xyz",
    //the client says: "what is YOUR etag, server?"

    if (server_etag==client_etag)
    {
        page.display();
    }
    else
    {
        page.remove_from_cache();
        page=http_get_request();     
        page.display();
        page.put_in_cache();
    }
}
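
The pseudocode above could be sketched in Python roughly as follows. This is a minimal sketch, not a browser implementation: the names `cache`, `head_etag`, `get_page` and `load` are hypothetical, the cache is a plain in-memory dict, and a missing ETag is treated as "do not cache". The point it illustrates is that the client never sends its stored ETag (no `If-None-Match` header); it only asks for the server's.

```python
import urllib.request

# Hypothetical in-memory cache: url -> (etag, body)
cache = {}

def head_etag(url):
    """Ask the server for its current ETag without sending ours."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("ETag")

def get_page(url):
    """Plain GET with no conditional headers; returns (etag, body)."""
    with urllib.request.urlopen(url) as resp:
        return resp.headers.get("ETag"), resp.read()

def load(url, head=head_etag, get=get_page):
    """Return the page body, reusing the cache only if the ETags match."""
    if url in cache:
        cached_etag, body = cache[url]
        server_etag = head(url)
        if server_etag is not None and server_etag == cached_etag:
            return body          # Case 1: identical ETag, serve from cache
        del cache[url]           # Case 2: mismatch, drop the stale copy
    etag, body = get(url)
    if etag is not None:         # assumption: only cache responses with an ETag
        cache[url] = (etag, body)
    return body
```

The `head` and `get` parameters exist only so the logic can be exercised without a network connection.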

HTTP conversation example with my solution:

Client:

HEAD /posts/46328 HTTP/1.1
Host: security.stackexchange.com

Server:

HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
ETag: "EVIL_UNIQUE_TRACKING_ETAG"
Content-Type: text/html
Content-Length: 131

Case 1, Client has an identical ETag:

Connection closes, client loads page from cache.

Case 2, client has a mismatching ETag:

GET...... // and a normal HTTP conversation begins.

Extras that do require modifying the HTTP specification

Think of the following as theoretical material, the HTTP spec probably won't change any time soon.

1. Removing HEAD overhead

Note that there is a minor overhead: the server has to send the HTTP headers twice, once in response to the HEAD and once in response to the GET. One theoretical workaround is to modify the HTTP protocol and add a new method that requests the content without headers. The client would then send the HEAD request first and, only if the ETags mismatch, request the content alone.

2. Preventing cache based tracking (or at least making it a lot harder)

Although the workaround suggested by Sneftel is not an ETag tracking technique, it does track people even when they're using the "HEAD, GET" sequence I suggested. The solution would be to restrict the possible values of ETags: instead of being an arbitrary string, the ETag has to be a checksum of the content. The client verifies this by computing the checksum itself, and if the computed value does not match the value sent by the server, the cache is not used.

Side note: fix 2 would also eliminate the following evercookie tracking techniques: pngData, etagData, cacheData. Combining that with Chrome's "Keep local data only until I quit my browser" eliminates all evercookie tracking techniques except Flash and Silverlight cookies.
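
The checksum rule in fix 2 could look something like the sketch below. The question does not name a checksum algorithm or an encoding, so SHA-256 in quoted hex form is purely an assumption here, and `expected_etag` and `may_use_cache` are hypothetical names.

```python
import hashlib

def expected_etag(body: bytes) -> str:
    """Assumed format: the quoted SHA-256 hex digest of the content."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'

def may_use_cache(etag_header, body: bytes) -> bool:
    """Allow caching only if the server's ETag is the checksum of the body."""
    return etag_header is not None and etag_header == expected_etag(body)
```

With this rule the server can no longer choose the ETag freely, so a value such as `"EVIL_UNIQUE_TRACKING_ETAG"` would simply cause the response not to be cached.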

Solution

It sounds reasonable, but workarounds exist. Suppose the front page was always given the same ETag (so that returning visitors would always load it from cache), but the page itself referenced a differently-named image each time it was loaded. Your GET or HEAD request for this image would then uniquely identify you. Arguably this isn't an ETag-based attack, but it still uses your cache to identify you.
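
A toy simulation of that workaround, to make the mechanism concrete. Everything here is hypothetical (the `TrackingServer` class, the `/img/<n>.png` naming, the `visit` helper); it only models a server that reports a constant ETag for the front page while embedding a per-download image name, and a client that follows the HEAD-then-GET scheme from the question.

```python
import itertools
import re

class TrackingServer:
    """Hypothetical server: constant page ETag, unique image name per GET."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.image_owner = {}  # image path -> visitor id

    def head_front_page(self):
        return '"same-for-everyone"'

    def get_front_page(self):
        visitor_id = next(self._counter)
        image = f"/img/{visitor_id}.png"
        self.image_owner[image] = visitor_id
        return '"same-for-everyone"', f'<img src="{image}">'

    def identify(self, image_path):
        # Any HEAD or GET for the image path reveals who cached the page.
        return self.image_owner.get(image_path)

def visit(server, cache):
    """Client using the HEAD-then-GET scheme; returns the id the server sees."""
    if "page" in cache and server.head_front_page() == cache["page"][0]:
        html = cache["page"][1]            # ETag matches: reuse cached page
    else:
        cache["page"] = server.get_front_page()
        html = cache["page"][1]
    image = re.search(r'src="([^"]+)"', html).group(1)
    return server.identify(image)
```

A returning client reuses its cached copy of the page, so it asks for the same image name as on its first visit, and the server recognizes it without ever seeing a client-supplied ETag.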

OTHER TIPS

As long as any caching is used there's a potential exploit, even with the HTTP changes. Suppose the main page includes 100 images, each one randomly drawn from a potential pool of 2 images.

When a user returns to the site, her browser reloads the page (since the checksum doesn't match). On average, 50 of the 100 images will be cached from before, since each image has a 1-in-2 chance of being the same one drawn on the first visit. This combination can (almost certainly) be used to individually fingerprint the user.

Interestingly, this is almost exactly how DNA paternity testing works.
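
The image-pool idea above can be simulated in a few lines. This is a sketch under stated assumptions: each of the 100 image slots has its own pool of two variants (numbered 0 and 1), the user has visited exactly once before, and the server can see which images are actually downloaded on the return visit. The function names are made up for illustration.

```python
import random

SLOTS = 100  # images on the page, each with its own pool of 2 variants

def draw_page(rng):
    """Server side: pick variant 0 or 1 for every image slot."""
    return [rng.randrange(2) for _ in range(SLOTS)]

def observed_fetches(cached, served):
    """Client side: an image is downloaded only if that variant is not cached."""
    return [c != s for c, s in zip(cached, served)]

def recover_first_visit(served, fetched):
    """Server side: infer which variant the client had cached in each slot."""
    # A download means the client held the *other* variant; no download
    # means it already held the one just served.
    return [1 - s if f else s for s, f in zip(served, fetched)]
```

Under these assumptions the pattern of downloads lets the server reconstruct the full 100-slot draw from the first visit, which is one of 2^100 possibilities and therefore effectively unique to that user.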

The server could detect that, for a number of resources, you send a HEAD request that is not followed by a GET for the same resource. In poker terms, that's a tell.
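
As a rough illustration of how a server might spot that tell, here is a sketch that scans one client's request log. The log format (an ordered list of `(method, path)` tuples) and the function name are assumptions made for the example.

```python
def heads_without_get(log):
    """Return paths that received a HEAD with no later GET, in log order.

    Each such path is a resource the client evidently already has cached.
    """
    revealed = []
    for i, (method, path) in enumerate(log):
        if method != "HEAD" or path in revealed:
            continue
        followed = any(m == "GET" and p == path for m, p in log[i + 1:])
        if not followed:
            revealed.append(path)
    return revealed
```

For example, a log of HEAD requests for `/`, `/a.png` and `/b.png` with a GET only for `/b.png` tells the server that `/` and `/a.png` were already in the client's cache.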

Just by having some resources cached, you are storing information. That information can be deduced by the server any time you do not re-request a resource named on the page.

Protecting your privacy in this manner comes at the cost of having to download every resource on the page with every visit. If you ever cache anything then you are storing information that can be inferred from your requests to the server.

Especially on mobile, where your bandwidth is more expensive and often slower, downloading all page resources on every visit could be impractical. I think at some level you have to accept that there are patterns in your interaction with the website which could be detected and profiled to identify you.

Licensed under: CC-BY-SA with attribution
