How do I extract a path from html with sed?

https://stackoverflow.com/questions/22728253

23-06-2023

Question

This has been driving me crazy. I'm trying to extract a path from some HTML using sed and some regex. My raw text is a file, sample.txt, which looks like this:

<tr><td valign="top"><img src="/icon/file.ico" alt="[FILE]"></td><td><a href="/namespace/media/cloud-sync.xml">cloud&#x2d;sync&#x2e;xml</a></td><td align="right">Sat,&nbsp;29&nbsp;Mar&nbsp;2014&nbsp;06:08:13&nbsp;GMT</td><td align="right">8210</td></tr>
<tr><td valign="top"><img src="/icon/file.ico" alt="[FILE]"></td><td><a href="/namespace/media/levels-sync.xml">levels&#x2d;sync&#x2e;xml</a></td><td align="right">Sat,&nbsp;29&nbsp;Mar&nbsp;2014&nbsp;06:08:47&nbsp;GMT</td><td align="right">2203</td></tr>

First I tried:

cat sample.txt | sed -n 's/(\/namespace\/media\/.*-sync.xml)/\1/p'

But that gives me:

sed: -e expression #1, char 40: invalid reference \1 on `s' command's RHS

Then I did:

cat sample.txt | sed -n 's/\(\/namespace\/media\/.*-sync.xml\)/\1/p'

But that just seems to return the entire file back to me.

My desired result is to get back

/namespace/media/cloud-sync.xml
/namespace/media/levels-sync.xml

Any sed ninjas out there that can help me out?

Solution

Here is the correct sed command based on your particular input:

cat sample.txt | sed 's/.*\(\/namespace\/media\/.*-sync.xml\).*/\1/g'

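In sed's default (basic) regex syntax, capture groups are written \(...\). Your first attempt used plain (...), which matches literal parentheses, so \1 had no group to refer to.

Your second attempt had a valid group, but it replaced the match with itself, so every line came back unchanged. I have added .* at both ends of your original regex so that the pattern matches the whole line and all the other text is discarded.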
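To see the effect of the surrounding .*, here is a minimal sketch that runs both substitutions on a shortened, hypothetical copy of the sample rows. The variable names are made up for illustration, | is used as the delimiter to avoid escaping every slash, and [^"]* is a small tightening of my own (not part of the command above) that keeps the match inside the quoted attribute:

```shell
# Shortened, hypothetical stand-in for sample.txt from the question.
sample='<tr><td><a href="/namespace/media/cloud-sync.xml">cloud&#x2d;sync&#x2e;xml</a></td><td align="right">8210</td></tr>
<tr><td><a href="/namespace/media/levels-sync.xml">levels&#x2d;sync&#x2e;xml</a></td><td align="right">2203</td></tr>'

# No surrounding .*: the matched text is replaced by itself, so each line prints unchanged.
unchanged=$(printf '%s\n' "$sample" | sed -n 's|\(/namespace/media/.*-sync\.xml\)|\1|p')

# With .* on both sides the regex spans the whole line, which is replaced by the group alone.
# [^"]* keeps the match inside the quoted attribute value.
paths=$(printf '%s\n' "$sample" | sed -n 's|.*\(/namespace/media/[^"]*-sync\.xml\).*|\1|p')

printf '%s\n' "$paths"
```

Here $unchanged is identical to the input, while $paths holds only the two /namespace/media/...-sync.xml paths, one per line.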

OTHER TIPS

This GNU awk command finds the path wherever it appears on the line (it relies on a multi-character RS, which not every awk supports).
It's not sed, but awk may be a better fit for this, or at least simpler to understand.

awk -v RS='href="' -F\" 'NR>1 {print $1}' file
/namespace/media/cloud-sync.xml
/namespace/media/levels-sync.xml

This awk should work on any system:

awk -F\" '{for(i=1;i<=NF;i++) if ($i~"href=") print $(i+1)}' file
/namespace/media/cloud-sync.xml
/namespace/media/levels-sync.xml
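As a rough illustration of why the field loop is independent of position, the sketch below feeds it a hypothetical line with two links (this input is not from the question). It anchors the test with /href=$/ and stops at NF - 1, which are small assumptions of mine rather than part of the command above:

```shell
# Hypothetical input: one line carrying two links (not from the question).
line='<td><a href="/namespace/media/a-sync.xml">a</a> <a href="/namespace/media/b-sync.xml">b</a></td>'

# Split on double quotes; whenever a field ends in href=, the next field is the URL.
links=$(printf '%s\n' "$line" |
  awk -F'"' '{ for (i = 1; i < NF; i++) if ($i ~ /href=$/) print $(i + 1) }')

printf '%s\n' "$links"
```

Both paths are printed, one per line, because every field that ends in href= triggers a print of the field after it.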

This might work for you (GNU sed):

sed 's/.*href="\([^"]*\)".*/\1/' file

Look for href and extract the string between the next pair of double quotes.
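One caveat: without -n and the p flag, lines that contain no href are printed unchanged. A minimal sketch of that variant, assuming a hypothetical listing that also contains a header row without a link:

```shell
# Hypothetical listing: a header row without a link, then one row in the question's format.
rows='<tr><th>Name</th><th>Last modified</th></tr>
<tr><td><a href="/namespace/media/cloud-sync.xml">cloud&#x2d;sync&#x2e;xml</a></td></tr>'

# -n suppresses default output; the p flag prints only lines where the substitution succeeded.
hrefs=$(printf '%s\n' "$rows" | sed -n 's/.*href="\([^"]*\)".*/\1/p')

printf '%s\n' "$hrefs"
```

Only /namespace/media/cloud-sync.xml is printed; the header row is dropped.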

I recommend using GNU grep:

grep -Po 'href="\K[^"]*' file

/namespace/media/cloud-sync.xml
/namespace/media/levels-sync.xml
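The -P option (Perl-compatible regex, needed for \K) is not available in every grep build. As a fallback sketch, assuming only that grep supports -o (GNU and BSD grep do), the same result can be had by matching the whole attribute and cutting out the quoted value. The input below is a shortened, hypothetical copy of the sample rows:

```shell
# Shortened, hypothetical stand-in for the question's sample rows.
rows='<tr><td><a href="/namespace/media/cloud-sync.xml">cloud&#x2d;sync&#x2e;xml</a></td></tr>
<tr><td><a href="/namespace/media/levels-sync.xml">levels&#x2d;sync&#x2e;xml</a></td></tr>'

# grep -o prints only the matching part, e.g. href="/path";
# cut then keeps the second quote-delimited field, i.e. the text between the quotes.
hrefs=$(printf '%s\n' "$rows" | grep -o 'href="[^"]*"' | cut -d'"' -f2)

printf '%s\n' "$hrefs"
```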

This will do for the moment, as long as the path stays the eighth quote-delimited field on each line:

cat sample.txt | awk -F'["]' '{print $8}'

I am not very familiar with sed, so I am posting an awk answer.

Licensed under: CC-BY-SA with attribution
