Question

I need to extract .co.uk URLs from a file with lots of entries, some .com, .us, etc. I need only the .co.uk ones. Any way to do that? PS: I'm learning bash.

edit:

code sample:

<a href="http://www.mysite.co.uk/" target="_blank">32</a>
<tr><td id="Table_td" align="center"><a href="http://www.ultraguia.co.uk/motets.php?pg=2" target="_blank">23</a><a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>

Note that some of them repeat.

Important: I need all links, broken or 404 ones too.

I found this code somewhere on the net:

cat file.html | tr " " "\n" | grep .co.uk

output:

href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"

I think I'm close.

Thanks!


Solution

The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable when faced with CDATA sections or other syntax that is hard to parse:

links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
  | tac \
  | sed -e '/^Links:/,$ d' \
        -e 's/[0-9]\+.[[:space:]]//' \
  | grep '^https\?://[^/]\+[.]co[.]uk'

It works as follows:

  • links (a text-based web browser) actually retrieves the site.
    • Using -dump causes the rendered page to be emitted to stdout.
    • Using -html-numbered-links requests a numbered table of links.
    • Using -anonymous tweaks defaults for added security.
  • tac reverses the output from links line by line, so the numbered table of links comes before the page content
  • sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed
  • sed -e 's/[0-9]\+.[[:space:]]//' removes the numeric prefixes (the link numbers) from the individual links.
  • grep '^https\?://[^/]\+[.]co[.]uk' finds only those links with their host parts ending in .co.uk.
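
If the URLs are in a local file rather than on a live site, as in your case, the same pipeline can read the file directly. A minimal sketch, assuming the file from your question is named file.html:

links -dump file.html -html-numbered-links 1 -anonymous \
  | tac \
  | sed -e '/^Links:/,$ d' \
        -e 's/[0-9]\+.[[:space:]]//' \
  | grep '^https\?://[^/]\+[.]co[.]uk'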

OTHER TIPS

One way using awk:

awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html

output:

http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2

If you are only interested in unique URLs, pipe the output into sort -u.
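
For example, a minimal sketch assuming the input file is named file.html as in the awk command above:

awk -F "[ \"]" '{ for (i = 1; i <= NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html | sort -u

For the sample output shown above, this would leave:

http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2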

HTH

Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from those URLs, but by denying wget the time for its DNS lookup, it will not resolve anything and will just print the URLs. You can then grep for those URLs that have .co.uk in them. The whole story becomes:

wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"

If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
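
Put together, a sketch of the full pipeline, assuming your input file is named yourFile.html as above:

wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 \
  | grep -e "^\-\-.*\\.co\\.uk/.*" \
  | sed 's/.*-- //'

Appending | sort -u would again drop the duplicate entries.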

If you do not have wget, you can get it from the GNU wget project.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow