Question
I want to grep the URL out of a .asx file. The file would normally look like this.
<ASX VERSION="3.0">
<ENTRY>
<TITLE>Blah Blah</TITLE>
<AUTHOR>Someone</AUTHOR>
<COPYRIGHT>(C)2014 Someone Else</COPYRIGHT>
<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>
</ENTRY>
</ASX>
I want to get the URL without the quotes and with the mms:// stripped off.
I came up with a regex that uses lookarounds that does this successfully:
((?<=\/\/).*?).(?=\")
but of course I can't use this with grep. So what is another approach that would be flexible to capture whatever comes between the mms:// and the " that I could put into a grep -o command?
Solution
but of course I can't use this with grep.
Why not? Modern versions of GNU grep support the -P switch, which enables PCRE (Perl-compatible) regular expressions, including lookarounds.
Try this:
grep -oP '((?<=//).*?).(?=")' file
www.example.com/video/FilmName/FilmName.wmv
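If the lookbehind feels awkward, PCRE's \K escape is another way to get the same result: everything matched before \K is required but left out of the reported match. The following is only a sketch, assuming GNU grep built with -P support; the temporary file and the variable names asx_file and result are made up for illustration.

```shell
# Recreate the relevant line of the sample .asx file in a temporary file.
asx_file=$(mktemp)
printf '%s\n' '<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>' > "$asx_file"

# [a-z]+:// must be present, but \K drops it from the reported match;
# [^"]+ then consumes everything up to (not including) the closing quote.
result=$(grep -oP '[a-z]+://\K[^"]+' "$asx_file")
echo "$result"

rm -f "$asx_file"
```

Because [^"]+ stops at the first double quote by itself, no lookahead is needed at the end of the pattern.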
OTHER TIPS
With Bash, you can use left/right pattern matching (parameter expansion):
url='<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>'
url=${url#<REF HREF=\"}
url=${url%\"/>}
echo "URL is '$url'" # Prints URL is 'mms://www.example.com/video/FilmName/FilmName.wmv'
${VAR#pattern} strips from $VAR the shortest left-hand-side match of the glob pattern. ${VAR##pattern} strips the longest left-hand-side match. ${VAR%pattern} and ${VAR%%pattern} do the same for the right-hand side of $VAR.
An easy way to remember is that # is to the left of % on the keyboard. David Korn taught me that.
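To make the difference between the four operators concrete, here is a small sketch applied to the URL from the question. The variable names (no_scheme, file_name, parent, host) are illustrative only.

```shell
url='mms://www.example.com/video/FilmName/FilmName.wmv'

# '#'  removes the shortest prefix matching the glob: here just "mms://".
no_scheme=${url#*://}

# '##' removes the longest prefix matching the glob: everything up to the last "/".
file_name=${url##*/}

# '%'  removes the shortest suffix matching the glob: the last "/" and what follows.
parent=${no_scheme%/*}

# '%%' removes the longest suffix matching the glob: the first "/" and what follows.
host=${no_scheme%%/*}

echo "no_scheme: $no_scheme"
echo "file_name: $file_name"
echo "parent:    $parent"
echo "host:      $host"
```

Note that ${url#*://} also gives the asker what they wanted, the URL with the mms:// removed, without calling any external tool.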
Solution for OS X users, where grep (as of OS X 10.9) doesn't support -P and lookarounds are therefore not an option:
egrep -o '"[a-z]+://[^"]+' file | cut -d '/' -f 3-
Another option is awk:
awk -F '[:"]' '/REF HREF/ {print substr($3,3)}' file
www.example.com/video/FilmName/FilmName.wmv
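To see why the awk command prints the third field from its third character onward, it may help to dump the fields awk produces when both : and " act as separators. This is only an illustrative sketch; the variables line, fields, and url are not part of the original answer.

```shell
line='<REF HREF="mms://www.example.com/video/FilmName/FilmName.wmv"/>'

# Print every field, numbered, using the same separator set as the answer.
fields=$(printf '%s\n' "$line" |
  awk -F '[:"]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
echo "$fields"

# Field 3 still starts with "//", so substr($3, 3) skips those two characters.
url=$(printf '%s\n' "$line" | awk -F '[:"]' '/REF HREF/ { print substr($3, 3) }')
echo "$url"
```

One caveat of this separator choice: because : is a field separator, a URL that contains another colon (for example a host followed by a port number) would be cut off at that colon.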