How can I gather all links on a site without content?
25-09-2019
Question
I would like to get all URLs a site links to (on the same domain) without downloading all of the content with something like wget. Is there a way to tell wget to just list the links it WOULD download?
For a little background on what I'm using this for, in case someone can come up with a better solution: I'm trying to build a robots.txt file that excludes all files ending in p[4-9].html, but robots.txt doesn't support regular expressions. So I want to gather all the links, run a regular expression against them, and put the matches into robots.txt. Any ideas?
Solution
My recommendation: combine wget and gawk in a (very) small shell script.
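A minimal sketch of that idea. The crawl itself would use wget's `--spider` mode with `-r` (recursive), which checks links without keeping any content; each fetched URL shows up in wget's log on a line beginning with `--` followed by a timestamp and the URL. The functions below filter such a log. `example.com` and the output filenames are placeholders, not anything from the original question.

```shell
# Hypothetical crawl step (writes the log we filter below):
#   wget --spider -r -l inf -nd "https://example.com/" 2>&1 > wget.log
# --spider fetches pages only to follow their links; nothing is saved to disk.
#
# Each fetched URL appears in the log on a line like:
#   --2019-09-25 10:00:00--  https://example.com/p5.html
# so the URL is the third whitespace-separated field.

# Extract every crawled URL from a wget log on stdin, deduplicated.
list_urls() {
  awk '/^--/ { print $3 }' | sort -u
}

# Keep only the pages robots.txt should exclude (p4.html .. p9.html).
robots_candidates() {
  list_urls | grep -E 'p[4-9]\.html$'
}

# Turn matching URLs into robots.txt Disallow lines by stripping
# the scheme and host (robots.txt paths are host-relative).
to_disallow() {
  sed -E 's|^https?://[^/]+|Disallow: |'
}
```

Usage would be something like `robots_candidates < wget.log | to_disallow >> robots.txt`. Note that `--spider -r` still downloads each HTML page temporarily in order to parse its links, so it saves disk space but not bandwidth.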
There's a good overview of AWK on Wikipedia: http://en.wikipedia.org/wiki/AWK
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow