Question

I am struggling with an ill-structured web-server log file, which I want to summarize in order to analyse attendance of the hosted site. Unfortunately for me, the architecture of the site is messy, so there is no index of the hosted objects (HTML pages, JPG images, PDF documents, etc.), while several URIs can refer to the same page. For example:

  • http://www.site.fr/main.asp?page=foo.htm
  • http://www.site.fr/storage-tree/foo.htm
  • http://www.site.fr/specific.asp?id=200
  • http://www.site.fr/specific.asp?path=/storage-tree/foo.htm

etc., without any obvious regularity between the duplicate URIs.

How, conceptually and practically, can I efficiently identify the pages? As I see the problem, the idea is to construct an index linking the log's URIs to a unique object identifier built from HTTP requests (a rough sketch of what I have in mind follows the constraint list below). There are three loose constraints:

  • I use R for the statistical part, and would therefore prefer to use it for the HTTP processing too
  • logs consist of hundreds of thousands of distinct URIs (including forms, searches and database queries), so speed may be a concern
  • If I want to be able to tell, even three days or a month later, that a new URI is a previously identified page, I have to store the features I use to decide that two URIs refer to the same page. Storage space is therefore also a concern.
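
To make this concrete, here is a rough sketch (in R, with made-up column and file names) of the kind of index I have in mind: each observed URI maps to a page identifier, and the table is persisted so that future log batches can be matched against it.

# Illustrative shape of the index; "page_id" and "uri_index.rds" are made up
uri_index <- data.frame(
  uri     = c("http://www.site.fr/main.asp?page=foo.htm",
              "http://www.site.fr/storage-tree/foo.htm"),
  page_id = c("foo.htm", "foo.htm"),   # same page reached through two URIs
  stringsAsFactors = FALSE
)
saveRDS(uri_index, "uri_index.rds")    # keep it between runs
uri_index <- readRDS("uri_index.rds")  # reload days or months later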

Solution

This is pretty easy with httr:

library(httr)
HEAD("http://gmail.com")$url

You will probably also want to check the status_code returned by HEAD, as failures often won't be redirected.
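
For example (with httr already loaded as above, and using one of the URIs from the question just for illustration), you might only keep the resolved URL when the request succeeds:

r <- HEAD("http://www.site.fr/specific.asp?id=200")
if (status_code(r) < 400) r$url else NA_character_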

(One advantage of using httr over RCurl here is that it automatically preserves the connection across multiple HTTP calls to the same site, which makes things quite a bit faster.)
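
Putting it together, here is a sketch of how you could resolve each distinct URI once and keep the mapping on disk, so new log batches only trigger requests for URIs you have not seen before. It assumes your parsed log sits in a data frame called log_data with a uri column; the file name uri_index.rds is just illustrative.

library(httr)

resolve_uri <- function(uri) {
  # HEAD the URI and use the final (post-redirect) URL as the page identifier;
  # return NA on network errors or HTTP failures
  r <- tryCatch(HEAD(uri), error = function(e) NULL)
  if (is.null(r) || status_code(r) >= 400) return(NA_character_)
  r$url
}

uris  <- unique(log_data$uri)          # one request per distinct URI
index <- data.frame(uri     = uris,
                    page_id = vapply(uris, resolve_uri, character(1)),
                    stringsAsFactors = FALSE)
saveRDS(index, "uri_index.rds")        # reuse when the next batch of logs arrives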
