Question

I'm trying to mark all external URLs with the rel="nofollow" attribute.

I wrote this simple middleware:

import re

NOFOLLOW_RE = re.compile(u'<a (?![^>]*rel=["\']nofollow[\'"])'\
                         u'(?![^>]*href=["\']mysite\.com[\'"])',
                         re.UNICODE|re.IGNORECASE)

class NofollowLinkMiddleware(object):

    def process_response(self, request, response):
        if ("text" in response['Content-Type']):

            response.content = re.sub(NOFOLLOW_RE, u'<a rel="nofollow" ', response.content.decode('UTF8') )
            return response
        else:
            return response

It works, but it marks all links, internal and external alike. I also don't know how to add a <noindex></noindex> tag around the link.


Solution

First, you forgot the 'http://' prefix and the URL path, so your regexp should be:

NOFOLLOW_RE = re.compile(u'<a (?![^>]*rel=["\']nofollow[\'"])'\
                         u'(?![^>]*href=["\']http://mysite\.com(/[^\'"]*)?[\'"])',
                         re.U|re.I)

Then you also need to treat hrefs starting with "/" and "#" as internal links:

NOFOLLOW_RE = re.compile(u'<a (?![^>]*rel=["\']nofollow[\'"])'\
                         u'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"])',
                         re.U|re.I)

You may also want to take third-level domains (such as www.mysite.com) into account; the pattern above already accepts both "http://" and "https://".
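
A minimal sketch of how the subdomain case might be handled, assuming every host under mysite.com counts as internal. The `(?:[\w-]+\.)*` part and the helper name `add_nofollow` are my own additions for illustration, not part of the original middleware:

```python
import re

# Assumption: any subdomain of mysite.com (www., blog., ...) is internal.
# The path group is optional so that the bare domain also counts as internal.
NOFOLLOW_RE = re.compile(
    r'<a (?![^>]*rel=["\']nofollow[\'"])'
    r'(?![^>]*href=["\'](?:https?://(?:[\w-]+\.)*mysite\.com(?:/[^\'"]*)?'
    r'|/[^\'"]*|#[^\'"]*)[\'"])',
    re.U | re.I)


def add_nofollow(html):
    """Hypothetical helper: add rel="nofollow" to <a> tags the pattern treats as external."""
    return NOFOLLOW_RE.sub('<a rel="nofollow" ', html)
```

Note that a protocol-relative href such as //other.org starts with "/" and is therefore still treated as internal by this pattern.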

For the <noindex> tag you can use groups; see re.sub() in the Python docs:

NOFOLLOW_RE = re.compile(u'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'\
                         u'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
                         re.U|re.I)
...
response.content = NOFOLLOW_RE.sub(u'<noindex><a rel="nofollow" \g<link></noindex>', your_html)
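
In case the `\g<link>` syntax is unfamiliar: it inserts whatever the named group matched into the replacement string. A tiny standalone illustration, where the `<b>`/`<strong>` pattern is a made-up example unrelated to the middleware:

```python
import re

# (?P<text>...) names the group; \g<text> refers to it in the replacement.
pattern = re.compile(r'<b>(?P<text>.*?)</b>')
result = pattern.sub(r'<strong>\g<text></strong>', 'a <b>bold</b> word')
```

Writing the replacement as a raw string (`r'...'`) keeps Python from interpreting `\g` as a string escape.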

This regexp is quirky. I strongly suggest you write a test for it, covering every combination of <a> tags and their attributes you can think of. If you find an issue in this code later, the test will help you fix it without breaking everything else.
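
Such a test could start out roughly like this. It is only a sketch: the pattern is the one above rewritten with raw strings, `wrap_external` is a hypothetical helper, and the cases are a starting point rather than a complete list:

```python
import re
import unittest

# Pattern from the answer above, written as raw strings.
NOFOLLOW_RE = re.compile(
    r'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'
    r'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
    re.U | re.I)


def wrap_external(html):
    # Hypothetical helper applying the substitution shown above.
    return NOFOLLOW_RE.sub(r'<noindex><a rel="nofollow" \g<link></noindex>', html)


class NofollowRegexTest(unittest.TestCase):

    def test_external_link_is_wrapped(self):
        self.assertEqual(
            wrap_external('<a href="http://other.com/x">t</a>'),
            '<noindex><a rel="nofollow" href="http://other.com/x">t</a></noindex>')

    def test_internal_links_are_untouched(self):
        for html in ('<a href="http://mysite.com/about">t</a>',
                     '<a href="/about">t</a>',
                     '<a href="#top">t</a>'):
            self.assertEqual(wrap_external(html), html)

    def test_existing_nofollow_is_untouched(self):
        html = '<a rel="nofollow" href="http://other.com/">t</a>'
        self.assertEqual(wrap_external(html), html)
```

Run it with `python -m unittest` and add a case for every new tag layout you come across (attributes in a different order, single quotes, links spanning several lines, and so on).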

OTHER TIPS

I know I am very late, but I am leaving this answer for others. @HighCat's answer is right for every case except one: the regex above will also add nofollow to a link to the site's own bare domain, such as http://example.com with no trailing slash or path.

So the regex in this case should be:

import re

NOFOLLOW_RE = re.compile(u'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'\
                         u'(?![^>]*href=["\'](?:https?://example\.com/?(?:[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
                         re.U|re.I)

class NofollowLinkMiddleware(object):

    def process_response(self, request, response):
        if ("text" in response['Content-Type']):

            response.content = NOFOLLOW_RE.sub(u'<a rel="nofollow" target="_blank" \g<link>', response.content.decode('UTF8') )
            return response
        else:
            return response
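
To show the difference, here is a small comparison of the two patterns on a bare-domain link. This is only a sketch: `OLD_RE` is the earlier pattern with the domain swapped to example.com, and both names are made up for the comparison:

```python
import re

# Earlier pattern (domain swapped to example.com): a path starting with "/" is required.
OLD_RE = re.compile(
    r'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'
    r'(?![^>]*href=["\'](?:https?://example\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
    re.U | re.I)

# Pattern from this answer: the slash and the path are both optional.
NEW_RE = re.compile(
    r'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'
    r'(?![^>]*href=["\'](?:https?://example\.com/?(?:[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
    re.U | re.I)

bare = '<a href="http://example.com">home</a>'
old_treats_as_external = OLD_RE.search(bare) is not None  # True: link would get nofollow
new_treats_as_external = NEW_RE.search(bare) is not None  # False: link is left alone
```

One caveat: because the slash is optional and followed by `[^'"]*`, an href such as http://example.com.evil.org/ is also accepted as internal by the new pattern. Making the whole path group optional instead, `(?:/[^'"]*)?`, would avoid that.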

It is a minor change. I would have commented or edited instead, but I do not have enough reputation to comment, and an edit requires changing at least 6 characters.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow