Question

I am struggling with an ill-structured web-server log file, which I want to summarize in order to analyse attendance of the hosted site. Unfortunately for me, the architecture of the site is messy, so there is no index of the hosted objects (HTML pages, JPG images, PDF documents, etc.), while several URIs can refer to the same page. For example:

  • http://www.site.fr/main.asp?page=foo.htm
  • http://www.site.fr/storage-tree/foo.htm
  • http://www.site.fr/specific.asp?id=200
  • http://www.site.fr/specific.asp?path=/storage-tree/foo.htm

etc., without any obvious regularity between the duplicate URIs.

How, conceptually and practically, can I efficiently identify the pages? As I see the problem, the idea is to construct an index linking the log's URIs to a unique object identifier built from HTTP requests (a rough sketch of what I have in mind follows the constraint list below). There are three loose constraints:

  • I use R for the statistical part, and would therefore prefer to use it for the HTTP processing too
  • logs consist of hundreds of thousands of distinct URIs (including forms, searches and database queries), so speed may be a concern
  • If I want to be able to tell, even three days or a month later, that a new URI is a previously identified page, I have to store the features I use to decide that two URIs refer to the same page. Storage space is therefore also a concern.
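
To make this concrete, here is a rough sketch (in R, with made-up column and file names) of the kind of index I have in mind: each observed URI maps to a page identifier, and the table is persisted so that future log batches can be matched against it.

# Illustrative shape of the index; "page_id" and "uri_index.rds" are made up
uri_index <- data.frame(
  uri     = c("http://www.site.fr/main.asp?page=foo.htm",
              "http://www.site.fr/storage-tree/foo.htm"),
  page_id = c("foo.htm", "foo.htm"),   # same page reached through two URIs
  stringsAsFactors = FALSE
)
saveRDS(uri_index, "uri_index.rds")    # keep it between runs
uri_index <- readRDS("uri_index.rds")  # reload days or months later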

Solution

This is pretty easy with httr:

library(httr)
HEAD("http://gmail.com")$url

You will probably also want to check the status_code returned by HEAD, as failures often won't be redirected.
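
For example (with httr already loaded as above, and using one of the URIs from the question just for illustration), you might only keep the resolved URL when the request succeeds:

r <- HEAD("http://www.site.fr/specific.asp?id=200")
if (status_code(r) < 400) r$url else NA_character_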

(One advantage of using httr over RCurl here is that it automatically preserves the connection across multiple HTTP calls to the same site, which makes things quite a bit faster.)
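
Putting it together, here is a sketch of how you could resolve each distinct URI once and keep the mapping on disk, so new log batches only trigger requests for URIs you have not seen before. It assumes your parsed log sits in a data frame called log_data with a uri column; the file name uri_index.rds is just illustrative.

library(httr)

resolve_uri <- function(uri) {
  # HEAD the URI and use the final (post-redirect) URL as the page identifier;
  # return NA on network errors or HTTP failures
  r <- tryCatch(HEAD(uri), error = function(e) NULL)
  if (is.null(r) || status_code(r) >= 400) return(NA_character_)
  r$url
}

uris  <- unique(log_data$uri)          # one request per distinct URI
index <- data.frame(uri     = uris,
                    page_id = vapply(uris, resolve_uri, character(1)),
                    stringsAsFactors = FALSE)
saveRDS(index, "uri_index.rds")        # reuse when the next batch of logs arrives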
