Make a PHP Web Crawler Respect the robots.txt File of Any Website
-
26-06-2021 - |
Question
I have developed a web crawler, and now I want to respect the robots.txt file of the websites I am crawling.
This is the structure of a robots.txt file:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
I could read it line by line and then use explode with the space character as a delimiter to extract the data.
Is there another way to load the entire data?
Do these files have a query language, like XPath?
Or do I have to interpret the entire file myself?
Any help is welcome, even links or duplicates if found.
Solution
The structure is very simple, so the best thing you can do is probably parse the file yourself. I would read it line by line and, as you said, look for keywords like User-agent, Disallow, etc.
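A minimal sketch of that approach in PHP might look like the following. The function names (`parseRobots`, `isAllowed`) are illustrative, not any standard API, and the parser is deliberately simplified: it only honors Disallow prefix rules for the matching User-agent (or the `*` wildcard), and it does not handle grouped User-agent lines, Allow directives, or wildcard patterns from the full Robots Exclusion Protocol.

```php
<?php
// Parse a robots.txt body and return the Disallow path prefixes that
// apply to the given user agent. Illustrative sketch, not a full parser.
function parseRobots(string $content, string $userAgent = '*'): array
{
    $rules = [];
    $applies = false;
    foreach (preg_split('/\R/', $content) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            // Subsequent rules apply if this record names our agent
            // or the wildcard agent.
            $applies = ($value === '*' || strcasecmp($value, $userAgent) === 0);
        } elseif ($applies && $field === 'disallow' && $value !== '') {
            $rules[] = $value;
        }
    }
    return $rules;
}

// Check a URL path against the collected Disallow prefixes.
function isAllowed(array $disallowed, string $path): bool
{
    foreach ($disallowed as $prefix) {
        // Disallow rules match as path prefixes.
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
```

Before crawling a page, you would fetch `https://example.com/robots.txt` (for instance with `file_get_contents` or cURL), run it through `parseRobots` with your crawler's user-agent string, and skip any URL for which `isAllowed` returns false.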