How can I gather all links on a site without content?
25-09-2019
Question
I would like to get all URLs a site links to (on the same domain) without downloading all of the content with something like wget. Is there a way to tell wget to just list the links it WOULD download?
For a little background on what I'm using this for, in case someone can come up with a better solution: I'm trying to build a robots.txt file that excludes all files ending in p[4-9].html, but robots.txt doesn't support regular expressions. So I want to gather all the links, run a regular expression against them, and put the matches into robots.txt. Any ideas?
Solution
My recommendation: combine wget and gawk in a (very) small shell script.
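A minimal sketch of that idea. The crawl itself would use wget's `--spider` mode with `-r` (recursive), which checks links without keeping any content; each fetched URL shows up in wget's log on a line beginning with `--` followed by a timestamp and the URL. The functions below filter such a log. `example.com` and the output filenames are placeholders, not anything from the original question.

```shell
# Hypothetical crawl step (writes the log we filter below):
#   wget --spider -r -l inf -nd "https://example.com/" 2>&1 > wget.log
# --spider fetches pages only to follow their links; nothing is saved to disk.
#
# Each fetched URL appears in the log on a line like:
#   --2019-09-25 10:00:00--  https://example.com/p5.html
# so the URL is the third whitespace-separated field.

# Extract every crawled URL from a wget log on stdin, deduplicated.
list_urls() {
  awk '/^--/ { print $3 }' | sort -u
}

# Keep only the pages robots.txt should exclude (p4.html .. p9.html).
robots_candidates() {
  list_urls | grep -E 'p[4-9]\.html$'
}

# Turn matching URLs into robots.txt Disallow lines by stripping
# the scheme and host (robots.txt paths are host-relative).
to_disallow() {
  sed -E 's|^https?://[^/]+|Disallow: |'
}
```

Usage would be something like `robots_candidates < wget.log | to_disallow >> robots.txt`. Note that `--spider -r` still downloads each HTML page temporarily in order to parse its links, so it saves disk space but not bandwidth.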
There's a good overview of AWK on Wikipedia: http://en.wikipedia.org/wiki/AWK
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow