How do I extract a path from html with sed?

https://stackoverflow.com/questions/22728253

23-06-2023

Question

This has been driving me crazy. I'm trying to extract a path from some HTML using sed and a regex. My raw text is a file, sample.txt, which looks like this:

<tr><td valign="top"><img src="/icon/file.ico" alt="[FILE]"></td><td><a href="/namespace/media/cloud-sync.xml">cloud&#x2d;sync&#x2e;xml</a></td><td align="right">Sat,&nbsp;29&nbsp;Mar&nbsp;2014&nbsp;06:08:13&nbsp;GMT</td><td align="right">8210</td></tr>
<tr><td valign="top"><img src="/icon/file.ico" alt="[FILE]"></td><td><a href="/namespace/media/levels-sync.xml">levels&#x2d;sync&#x2e;xml</a></td><td align="right">Sat,&nbsp;29&nbsp;Mar&nbsp;2014&nbsp;06:08:47&nbsp;GMT</td><td align="right">2203</td></tr>

First I tried:

cat sample.txt | sed -n 's/(\/namespace\/media\/.*-sync.xml)/\1/p'

but that gives me:

sed: -e expression #1, char 40: invalid reference \1 on `s' command's RHS

Then I did:

cat sample.txt | sed -n 's/\(\/namespace\/media\/.*-sync.xml\)/\1/p'

But that just seems to return the entire file back to me.

My desired result is to get back

/namespace/media/cloud-sync.xml
/namespace/media/levels-sync.xml

Any sed ninjas out there that can help me out?

Answer

Here is the correct sed command based on your particular input:

cat sample.txt | sed 's/.*\(\/namespace\/media\/.*-sync.xml\).*/\1/g'

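In sed's default basic regular expressions (BRE), a capture group is written \(...\). A bare (...) matches literal parentheses, so no group was defined and \1 was an invalid reference in your first attempt.

I also added .* at both ends of your original regex so that the whole line is matched and everything outside the group is discarded. Your second attempt replaced the path with itself, which is why every matching line came back unchanged.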
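
As a minimal sketch of the same idea, assuming a sed that accepts -E (GNU sed and BSD sed both do), extended regex syntax lets the group be written as plain (...). Using | as the delimiter avoids escaping every slash, [^"]* stops the match at the closing quote, and -n with the p flag prints only lines where a substitution happened. The sample.txt written here is a trimmed, made-up stand-in for your file, not the real data.

```shell
# Create a trimmed stand-in for sample.txt (hypothetical data modelled on the question).
cat > sample.txt <<'EOF'
<tr><td><a href="/namespace/media/cloud-sync.xml">cloud-sync.xml</a></td></tr>
<tr><td>no link on this line</td></tr>
<tr><td><a href="/namespace/media/levels-sync.xml">levels-sync.xml</a></td></tr>
EOF

# -E: extended regex, so ( ) forms a group without backslashes.
# | as delimiter: the slashes in the path need no escaping.
# [^"]* instead of .*: stop at the closing quote; \. matches a literal dot.
# -n plus the p flag: print only lines where a substitution was made.
result=$(sed -n -E 's|.*(/namespace/media/[^"]*-sync\.xml).*|\1|p' sample.txt)
printf '%s\n' "$result"
```

The line without a link produces no output, because -n suppresses the default printing and p only fires on a successful substitution.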

Other tips

This GNU awk command finds the path wherever it appears on the line. It relies on a multi-character record separator (RS), which POSIX awk does not guarantee.
It's not sed, but for this task awk may be better, or simpler to understand.

awk -v RS='href="' -F\" 'NR>1 {print $1}' file
/namespace/media/cloud-sync.xml
/namespace/media/levels-sync.xml

This awk should work on any system:

awk -F\" '{for(i=1;i<=NF;i++) if ($i~"href=") print $(i+1)}' file
/namespace/media/cloud-sync.xml
/namespace/media/levels-sync.xml

This might work for you (GNU sed):

sed 's/.*href="\([^"]*\)".*/\1/' file

Look for href and extract the string between the next pair of double quotes.
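
One caveat, sketched here under the assumption that some input lines contain no href: without -n and the p flag, sed passes those lines through unchanged. The input below is made up for illustration.

```shell
# Hypothetical input: one line with a link, one without.
input='<td><a href="/namespace/media/cloud-sync.xml">cloud-sync.xml</a></td>
<td align="right">8210</td>'

# Without -n, the line that has no href is passed through unchanged.
all_lines=$(printf '%s\n' "$input" | sed 's/.*href="\([^"]*\)".*/\1/')

# With -n and the p flag, only lines where the substitution succeeded are printed.
only_paths=$(printf '%s\n' "$input" | sed -n 's/.*href="\([^"]*\)".*/\1/p')

printf '%s\n' "$only_paths"
```

Here all_lines still holds two lines, while only_paths holds just the extracted path.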

I recommend using GNU grep (the -P option needs a grep built with PCRE support):

grep -Po 'href="\K[^"]*' file

/namespace/media/cloud-sync.xml
/namespace/media/levels-sync.xml
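
If your grep lacks -P, a rough alternative is to let grep -o isolate the whole href="..." attribute and then split on the double quote with cut. This is only a sketch: -o is not in POSIX, although GNU and BSD grep both provide it, and the input line is a made-up sample modelled on the question.

```shell
# Hypothetical one-line input modelled on the question's sample.
input='<td><img src="/icon/file.ico"></td><td><a href="/namespace/media/levels-sync.xml">levels</a></td>'

# grep -o prints only the matched part: href="...".
# cut then splits on the double quote and keeps the second field, the path.
paths=$(printf '%s\n' "$input" | grep -o 'href="[^"]*"' | cut -d'"' -f2)
printf '%s\n' "$paths"
```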

This will do for now, as long as the path is always the eighth quote-delimited field on the line:

cat sample.txt | awk -F'["]' '{print $8}'

I am not very familiar with sed, so I am posting an awk answer.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow