Question

I have developed a web crawler and now I want to respect the robots.txt files of the websites that I am crawling.

I see that a robots.txt file has this structure:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

I can read it line by line and then use explode() with the space character as the delimiter to find the data.
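
Something like this is what I have in mind (just a rough sketch, the local filename is made up):

<?php
// Rough sketch of the explode() approach: split each line on the first
// space and treat the first piece as the directive, the rest as the value.
foreach (file('robots.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $parts = explode(' ', trim($line), 2);
    if (count($parts) === 2) {
        $directive = rtrim($parts[0], ':');   // e.g. "Disallow"
        $value     = $parts[1];               // e.g. "/~joe/junk.html"
        echo "$directive => $value\n";
    }
}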

Is there any other way to load the entire data?

Does this kind of file have a query language, like XPath?

Or do I have to interpret the entire file myself?

Any help is welcome, even links or duplicate questions if you find any.


Solution

The structure is very simple, so the best thing you can do is probably parse the file on your own. I would read it line by line and, as you said, look for keywords like User-agent and Disallow.
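
For example, a minimal parser along those lines might look like this (a rough sketch in PHP, assuming the file has already been downloaded to a local robots.txt, and ignoring extensions such as Allow, Crawl-delay, Sitemap and wildcard patterns):

<?php
// Sketch of a robots.txt parser: collects Disallow rules per user-agent
// and checks whether a given path may be fetched by our crawler.
function parseRobotsTxt(string $filename): array
{
    $rules   = [];      // user-agent => list of disallowed path prefixes
    $agents  = [];      // user-agents of the group currently being read
    $inRules = false;   // true once we have seen a rule line for this group

    foreach (file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));   // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($inRules) {            // a new group starts after rule lines
                $agents  = [];
                $inRules = false;
            }
            $agents[] = strtolower($value);
        } elseif ($field === 'disallow') {
            $inRules = true;
            foreach ($agents as $agent) {
                $rules[$agent] = $rules[$agent] ?? [];
                if ($value !== '') {   // an empty Disallow allows everything
                    $rules[$agent][] = $value;
                }
            }
        }
    }
    return $rules;
}

// Check whether $path may be fetched for the given user-agent,
// using simple prefix matching only.
function isAllowed(array $rules, string $userAgent, string $path): bool
{
    $disallowed = $rules[strtolower($userAgent)] ?? $rules['*'] ?? [];
    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

// Example usage with the file from the question:
$rules = parseRobotsTxt('robots.txt');
var_dump(isAllowed($rules, 'MyCrawler', '/~joe/junk.html'));  // false
var_dump(isAllowed($rules, 'MyCrawler', '/index.html'));      // true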

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow