How to parse a string containing URLs and turn them into proper links

https://stackoverflow.com/questions/4346895

30-09-2019
Question

Let's say I have the following string from Twitter:

"This is my sample test blah blah http://t.co/pE6JSwG, hello all"

How can I parse this string and change the link to <a href="link">link</a>? Here's the code that parses user tags:

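    import re

    tweet = s.text  # s is the tweet object from the surrounding code
    user_regex = re.compile(r'@[0-9a-zA-Z+_]*', re.IGNORECASE)

    # Wrap every @mention in a link to the user's Twitter page
    for tt in user_regex.finditer(tweet):
        url_tweet = tt.group(0).replace('@', '')  # user name without the @
        tweet = tweet.replace(tt.group(0),
            '<a href="http://twitter.com/' +
            url_tweet + '" title="' +
            tt.group(0) + '">' +
            tt.group(0) + '</a>')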

And my current regex for URLs:

    http_regex = re.compile(r'[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]*', re.IGNORECASE)

Solution

    >>> import re
    >>> test = "This is my sample test blah blah http://t.co/pE6JSwG, hello all"
    >>> re.sub('http://[^ ,]*', lambda t: "<a href='%s'>%s</a>" % (t.group(0), t.group(0)), test)
    "This is my sample test blah blah <a href='http://t.co/pE6JSwG'>http://t.co/pE6JSwG</a>, hello all"

This only works if you consider characters like the comma and the space to be valid stopping points for your URL.
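To make the stopping-point assumption more explicit, here is a minimal sketch (Python 3) that widens it a little: it accepts both http and https, stops at whitespace, quotes, or angle brackets, and peels trailing sentence punctuation off the match. The names `URL_RE`, `TRAILING`, and `linkify` are made up for illustration, and the choice of stopping characters is an assumption, not a general rule for where URLs end.

```python
import re
from html import escape

# Assumption: a URL runs from http:// or https:// up to the next
# whitespace character, quote, or angle bracket.
URL_RE = re.compile(r'https?://[^\s<>"\']+')

# Assumption: these characters, when they end a match, are sentence
# punctuation rather than part of the URL.
TRAILING = '.,;:!?)'

def linkify(text):
    """Wrap each URL found in text in an <a> tag."""
    def repl(match):
        url = match.group(0)
        stripped = url.rstrip(TRAILING)   # URL without trailing punctuation
        tail = url[len(stripped):]        # punctuation to put back after the link
        safe = escape(stripped, quote=True)
        return '<a href="%s">%s</a>%s' % (safe, safe, tail)
    return URL_RE.sub(repl, text)

test = "This is my sample test blah blah http://t.co/pE6JSwG, hello all"
print(linkify(test))
```

Only the URL itself is HTML-escaped here; the surrounding text is passed through unchanged, so it would need its own escaping if it can contain markup.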

In general, you should probably not rely on regexes for URL matching, since there is often no reliable way to tell where a URL ends. If the string is guaranteed to have the same format every time, this solution will work. If the URLs are also always the same length, you can instead look for the http and take the substring of that length starting there.
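The fixed-length idea could look roughly like the sketch below. The function name `linkify_fixed` and its parameters are hypothetical, and the length of 19 is taken only from the sample URL in the question; it is an assumption that every URL in the input has that length.

```python
def linkify_fixed(text, url_length, prefix='http://'):
    """Link every URL, assuming each one is exactly url_length characters long."""
    parts = []
    pos = 0
    while True:
        start = text.find(prefix, pos)
        if start == -1:
            parts.append(text[pos:])      # no more URLs: keep the rest as-is
            break
        end = start + url_length
        url = text[start:end]
        parts.append(text[pos:start])     # text before the URL
        parts.append("<a href='%s'>%s</a>" % (url, url))
        pos = end                         # continue scanning after the URL
    return ''.join(parts)

test = "This is my sample test blah blah http://t.co/pE6JSwG, hello all"
# Assumption: len("http://t.co/pE6JSwG") == 19 holds for every URL in the input.
print(linkify_fixed(test, 19))
```

If a URL turns out to be shorter or longer than the assumed length, this silently produces a wrong link, which is why it only suits input with a guaranteed format.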

OTHER TIPS

Perhaps you could get inspiration from the source code of the django-oembed project.

Licensed under: CC-BY-SA with attribution
