Extract .co.uk URLs from an HTML file
Question
I need to extract the .co.uk URLs from a file with lots of entries, some .com, .us, etc. I only need the .co.uk ones. Is there any way to do that? P.S.: I'm learning bash.
edit:
code sample:
<a href="http://www.mysite.co.uk/" target="_blank">32</a>
<tr><td id="Table_td" align="center"><a href="http://www.ultraguia.co.uk/motets.php?pg=2" target="_blank">23</a><a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>
Note that some URLs repeat.
Important: I need all links, broken or 404 ones too.
I found this code somewhere on the net:
cat file.html | tr " " "\n" | grep .co.uk
output:
href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"
I think I'm close.
Thanks!
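As an aside on the attempt above: the pattern .co.uk is unquoted, so each dot matches any character. A variant using grep -o (print only the matched text) with a quoted pattern trims each line down to just the URL. This is a sketch; the file name file.html is hypothetical and the sample content is taken from the question:

```shell
# Recreate the question's sample input (hypothetical file name).
cat > file.html <<'EOF'
<a href="http://www.mysite.co.uk/" target="_blank">32</a>
<tr><td><a href="http://www.ultraguia.co.uk/motets.php?pg=2" target="_blank">23</a></td><td><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2"></td>
EOF

# -o prints only the matched part of each line; -E enables extended regexes.
# Escaped dots match literal periods; sort -u collapses the repeated URL.
grep -oE 'https?://[^"]*\.co\.uk[^"]*' file.html | sort -u
```

Since attribute values cannot contain unescaped double quotes, the character class [^"] keeps a match from running past the end of one URL into the next.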
Solution
The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable faced with CDATA sections or other syntax which is hard to parse:
links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
| tac \
| sed -e '/^Links:/,$ d' \
-e 's/[0-9]\+.[[:space:]]//' \
| grep '^https\?://[^/]\+[.]co[.]uk'
It works as follows:
- links (a text-based web browser) actually retrieves the site.
- Using -dump causes the rendered page to be emitted to stdout.
- Using -html-numbered-links requests a numbered table of links.
- Using -anonymous tweaks defaults for added security.
- Using tac reverses the output from Links into a line-ordered list.
- sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed.
- sed -e 's/[0-9]\+.[[:space:]]//' removes the numbered headings from the individual links.
- grep '^https\?://[^/]\+[.]co[.]uk' finds only those links whose host parts end in .co.uk.
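To see what the post-processing stages do on their own, the sketch below feeds them a fabricated fragment of links -dump output; the page text, the numbering, and the URLs are made up for illustration:

```shell
# Fabricated tail of a `links -dump ... -html-numbered-links 1` run:
# rendered page text first, then a "Links:" heading, then the link table.
cat > dump.txt <<'EOF'
Some rendered page text that must not be misparsed as a link.
Links:
1. http://www.mysite.co.uk/
2. http://www.example.com/
3. http://www.ultraguia.co.uk/motets.php?pg=2
EOF

tac dump.txt \
  | sed -e '/^Links:/,$ d' \
        -e 's/[0-9]\+.[[:space:]]//' \
  | grep '^https\?://[^/]\+[.]co[.]uk'
```

After tac, the link table sits at the top and "Links:" below it, so the range delete removes the page text; the two .co.uk URLs survive the final grep while example.com does not.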
Other tips
One way using awk:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html
output:
http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2
If you are only interested in unique URLs, pipe the output into sort -u.
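Putting the awk one-liner and sort -u together, here is a self-contained sketch using the question's sample input (the file name file.html is assumed):

```shell
# Recreate the question's sample input (hypothetical file name).
cat > file.html <<'EOF'
<a href="http://www.mysite.co.uk/" target="_blank">32</a>
<tr><td><a href="http://www.ultraguia.co.uk/motets.php?pg=2" target="_blank">23</a></td><td><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2"></td>
EOF

# Split each line on spaces and double quotes, print every field that
# contains .co.uk, then let sort -u collapse the repeated URL.
awk -F '[ "]' '{ for (i = 1; i <= NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html | sort -u
```

Splitting on both the space and the quote means each href or value attribute yields the bare URL as its own field, so no further trimming is needed.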
HTH
Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from those URLs, but by denying wget the time to perform its DNS lookup, it will not resolve anything and will just print the URLs. You can then grep for those URLs that have .co.uk in them. The whole story becomes:
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"
If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
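To see what that sed step does in isolation, here it is applied to a line of the shape wget prints for each URL it tries. The timestamp is made up, and the pattern is widened slightly with [[:space:]]* so it also swallows the extra space wget inserts after the double dash:

```shell
# A line of the shape wget emits per attempted URL (timestamp illustrative).
echo '--2013-04-01 12:00:00--  http://www.mysite.co.uk/' \
  | sed 's/.*--[[:space:]]*//'
```

The greedy .* runs up to the last occurrence of --, so both the leading dashes and the timestamp are stripped, leaving only the URL.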
If you do not have wget, then you can get it here.