Question

I have a line of output, extracted with a regex, that looks like this:

<a href="google.com">"test link"</a><br>

How do I capture google.com, without the quotes, into a variable? The URL could contain many '/' characters, e.g. (made-up gibberish below):

http://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi

EDIT: I want the entire URL string, not just www.google.com, in the above case.

Note: I don't want to download third-party libraries etc. in order to do this.


Solution

Try this pure-bash regex solution:

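shopt -s nocasematch    # Match case-insensitively, so <A HREF=...> also works
text='<a href="http://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi">"test link"</a><br>'
# Capture everything between href=" and the next double quote
regex='<a +href="([^"]+)"'
# BASH_REMATCH[1] holds the text matched by the first parenthesized group
[[ $text =~ $regex ]] && url=${BASH_REMATCH[1]} && echo "$url"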
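
To get the URL into a variable from several places in a script, the same match can be wrapped in a function. This is only a sketch: the name extract_href is made up for illustration, and it assumes the href value is double-quoted and that bash (not plain sh) runs the script. The function body runs in a subshell, so nocasematch does not leak into the rest of the script.

```shell
#!/usr/bin/env bash

# extract_href (hypothetical name): print the value of the first
# double-quoted href attribute in $1; return non-zero if there is none.
# The ( ... ) body runs in a subshell, so shopt stays local to the call.
extract_href() (
    shopt -s nocasematch
    regex='<a +href="([^"]+)"'
    [[ $1 =~ $regex ]] && printf '%s\n' "${BASH_REMATCH[1]}"
)

line='<a href="http://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi">"test link"</a><br>'

if url=$(extract_href "$line"); then
    echo "$url"
else
    echo "no href found" >&2
fi
```

Checking the exit status this way avoids using a stale or empty url when a line contains no link.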

OTHER TIPS

shopt -s nocasematch

TEXT='<a href="http://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi">"test link"</a><br>'

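TEXT=${TEXT##*href=\"}    # strip everything up to and including href="
TEXT=${TEXT%%\"*}         # strip the closing quote and everything after it
# The next two lines reduce the URL to its host name (www.google.com);
# omit them to keep the entire URL.
TEXT=${TEXT##*//}
TEXT=${TEXT%%/*}

echo "$TEXT"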
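
Since the question's EDIT asks for the whole URL rather than the host name, here is a sketch of the same parameter-expansion approach that stops after the first two steps. The function name href_value is made up for illustration. It assumes a lowercase href with a double-quoted value, because these expansions match case-sensitively, and it uses # instead of ## so that the first href on the line is taken.

```shell
# href_value (hypothetical name): print the value of the first
# double-quoted, lowercase href attribute in $1; return 1 if none.
href_value() {
    case $1 in
        *href=\"*)
            url=${1#*href=\"}   # drop everything through the first href="
            url=${url%%\"*}     # drop the closing quote and what follows
            printf '%s\n' "$url"
            ;;
        *)
            return 1            # no href="..." on this line
            ;;
    esac
}

TEXT='<a href="http://www.google.com/search/something/lulz/here2;i=!mfo1iu489fn1o2jlk21m4098mdoi">"test link"</a><br>'
URL=$(href_value "$TEXT")
echo "$URL"
```

The case guard matters because a line without href=" would otherwise pass through the expansions unchanged and be mistaken for a URL.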
Licensed under: CC-BY-SA with attribution