Question

I have dynamically-generated strings which contain HTML links. For example, the string might contain:

You can visit <a class="hyperlink" href="https://www.stackoverflow.com">Stack Overflow</a> or <a class="hyperlink" href="https://www.reddit.com">Reddit.com</a> for help, or you can visit <a href="hyperlink" href="https://www.google.com">https://www.google.com</a>

I'm trying to find all hrefs and instead generate new URLs for them, then insert them back into the string. So the example above, if run successfully, would look like:

You can visit <a class="hyperlink" href="https://mydomain.com/www.stackoverflow.com">Stack Overflow</a> or <a class="hyperlink" href="https://mydomain.com/reddit.com">Reddit.com</a> for help, or you can visit <a href="hyperlink" href="https://mydomain.com/www.google.com">https://www.google.com</a>

I'm very new to Python and haven't been successful at figuring out how to find an href and prepend its value with my "custom" domain. Any insight would be highly appreciated!

Was it helpful?

Solution

This is the perfect place to use a regular expression.

import re

text = 'You can visit <a class="hyperlink" href="https://www.stackoverflow.com">Stack Overflow</a>'
new_text = re.sub(r'href="http(s)?:\/\/(.+?)"', r'href="https://mydomain.com/\2"', text)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top