Question

I want to grep the URL out of a .asx file. The file would normally look like this.

<ASX VERSION="3.0">
<ENTRY>
<TITLE>Blah Blah</TITLE>
<AUTHOR>Someone</AUTHOR>
<COPYRIGHT>(C)2014 Someone Else</COPYRIGHT>
<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>
</ENTRY>
</ASX>

I want to get the URL without the quotes and with the mms:// stripped off.

I came up with a regex that uses lookarounds that does this successfully:

((?<=\/\/).*?).(?=\")

but of course I can't use this with grep. So what is another approach, flexible enough to capture whatever comes between the mms:// and the ", that I could put into a grep -o command?


Solution

but of course I can't use this with grep.

Why not? Modern versions of GNU grep support the -P switch, which enables PCRE regexes (including lookarounds).

Try this:

grep -oP '((?<=//).*?).(?=")' file
www.example.com/video/FilmName/FilmName.wmv
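If your grep has -P, the \K escape is another way to get the same result without a lookbehind: it discards everything matched so far from the reported match. A minimal sketch, assuming GNU grep built with PCRE support; the temporary file and the variable names are only for illustration:

```shell
# Build a throwaway sample file (mktemp path is illustrative, not from the question).
asx=$(mktemp)
cat > "$asx" <<'EOF'
<ASX VERSION="3.0">
<ENTRY>
<TITLE>Blah Blah</TITLE>
<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>
</ENTRY>
</ASX>
EOF

# "://" must be present, but \K drops it from the output;
# [^"]+ then consumes everything up to (not including) the closing quote.
url=$(grep -oP '://\K[^"]+' "$asx")
echo "$url"

rm -f "$asx"
```

Because [^"]+ cannot run past the closing quote, this does not need the lazy quantifier or the trailing lookahead.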

OTHER TIPS

With BASH, you can use the left/right pattern matching:

url='<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>'
url=${url#<REF HREF=\"}
url=${url%\"/>}
echo "URL is '$url'"   # Prints URL is 'mms://www.example.com/video/FilmName/FilmName.wmv'

${VAR#pattern} strips off of $VAR the shortest left hand side glob that matches pattern. ${VAR##pattern} strips off of $VAR the longest left hand side glob that matches pattern. And ${VAR%pattern} and ${VAR%%pattern} do the same for the right hand side of $VAR.

An easy way to remember is that # is to the left of % on the keyboard. David Korn taught me that.
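To see all four operators side by side, here is a small sketch on the same REF line. The variable names (line, rest, scheme, file) are made up for the example, and it also strips the mms:// that the question asked to drop:

```shell
line='<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>'

url=${line#*\"}       # '#'  shortest prefix matching *"  -> removes <REF HREF="
url=${url%\"*}        # '%'  shortest suffix matching "*  -> removes "/>
rest=${url#*://}      # '#'  shortest prefix matching *:// -> removes mms://
scheme=${url%%:*}     # '%%' longest suffix matching :*   -> keeps only the scheme
file=${url##*/}       # '##' longest prefix matching */   -> keeps only the file name

echo "url:    $url"
echo "rest:   $rest"
echo "scheme: $scheme"
echo "file:   $file"
```

Using globs such as *\" instead of the literal <REF HREF=\" means the line does not have to start with exactly that text.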

Solution for OSX users, where grep (as of OSX 10.9) doesn't support -P and look-arounds are therefore not an option:

grep -Eo '"[a-z]+://[^"]+' file | cut -d '/' -f 3-
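Another option that avoids -P is a single sed substitution with a capture group. This is only a sketch: it assumes a sed that accepts -E for extended regexes (both the BSD and GNU versions do), and the temporary file stands in for your .asx file:

```shell
# Throwaway sample input (illustrative only).
asx=$(mktemp)
printf '%s\n' \
  '<ENTRY>' \
  '<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>' \
  '</ENTRY>' > "$asx"

# -n suppresses default output; the trailing p prints only lines where the
# substitution succeeded. The group ([^"]+) captures what follows scheme://
# up to the closing quote, and \1 replaces the whole line with it.
url=$(sed -nE 's#.*HREF="[a-z]+://([^"]+)".*#\1#p' "$asx")
echo "$url"

rm -f "$asx"
```

The # delimiter is used so the slashes in :// do not need escaping.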

With awk, like this:

awk -F '[:"]' '/REF HREF/ {print substr($3,3)}' file
www.example.com/video/FilmName/FilmName.wmv
Licensed under: CC-BY-SA with attribution