Django middleware for adding rel="nofollow" to all external links
Question
I'm trying to mark all external URLs with the rel="nofollow" attribute.
I wrote this simple middleware:
import re
NOFOLLOW_RE = re.compile(u'<a (?![^>]*rel=["\']nofollow[\'"])'\
    u'(?![^>]*href=["\']mysite\.com[\'"])',
    re.UNICODE|re.IGNORECASE)
class NofollowLinkMiddleware(object):
    def process_response(self, request, response):
        if ("text" in response['Content-Type']):
            response.content = re.sub(NOFOLLOW_RE, u'<a rel="nofollow" ', response.content.decode('UTF8') )
            return response
        else:
            return response
It works, but it adds nofollow to all links, internal and external alike. I also don't know how to wrap each link in a <noindex></noindex> tag.
Solution
First, you forgot the 'http://' prefix and the URL path, so the pattern only treats a link as internal when its href is literally "mysite.com". Your regexp should be:
NOFOLLOW_RE = re.compile(u'<a (?![^>]*rel=["\']nofollow[\'"])'\
    u'(?![^>]*href=["\']http://mysite\.com(/[^\'"]*)?[\'"])',
    re.U|re.I)
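A quick way to see the difference is to run both patterns over one internal and one external link. This is only a sketch: the names ORIGINAL, FIXED and rewrite, and the sample URLs, are made up for illustration, and raw strings are used here in place of the u'' literals.

```python
import re

# Pattern from the question: only href="mysite.com" counts as internal.
ORIGINAL = re.compile(
    r'<a (?![^>]*rel=["\']nofollow[\'"])'
    r'(?![^>]*href=["\']mysite\.com[\'"])',
    re.U | re.I,
)

# Pattern with the scheme and an optional path added.
FIXED = re.compile(
    r'<a (?![^>]*rel=["\']nofollow[\'"])'
    r'(?![^>]*href=["\']http://mysite\.com(/[^\'"]*)?[\'"])',
    re.U | re.I,
)

def rewrite(pattern, html):
    """Insert rel="nofollow" into every <a> tag the pattern accepts."""
    return pattern.sub('<a rel="nofollow" ', html)

# Hypothetical sample links.
internal = '<a href="http://mysite.com/about/">About</a>'
external = '<a href="http://other.org/">Other</a>'

print(rewrite(ORIGINAL, internal))  # internal link gets nofollow (the bug)
print(rewrite(FIXED, internal))     # internal link is left alone
print(rewrite(FIXED, external))     # external link gets nofollow
```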
Then, you also need to treat hrefs starting with "/" or "#" as internal links:
NOFOLLOW_RE = re.compile(u'<a (?![^>]*rel=["\']nofollow[\'"])'\
    u'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"])',
    re.U|re.I)
You may also want to account for third-level domains (subdomains such as blog.mysite.com) and for the "https://" protocol, which the https? in the pattern above already covers.
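One possible way to cover subdomains is to allow an optional chain of labels in front of the domain. The sketch below checks a bare href value rather than a whole <a> tag, to keep the idea readable. INTERNAL_HREF and is_internal are hypothetical names, and treating every subdomain of mysite.com as internal is an assumption you may not share.

```python
import re

# Hypothetical pattern for an href *value* that points at our own site:
#   - http(s)://mysite.com or any subdomain, optionally followed by /, ? or #
#   - a site-relative path starting with a single "/"
#   - a fragment starting with "#"
# Protocol-relative URLs ("//host/...") are deliberately NOT treated as
# internal here; that is an assumption of this sketch.
INTERNAL_HREF = re.compile(
    r'^(?:https?://(?:[\w-]+\.)*mysite\.com(?:[/?#].*)?|/(?!/).*|#.*)$',
    re.I,
)

def is_internal(href):
    """Return True if href looks like a link to our own site."""
    return INTERNAL_HREF.match(href) is not None
```

Because the domain must be followed by "/", "?", "#" or the end of the string, a look-alike host such as mysite.com.evil.org is not accepted as internal.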
For the <noindex> tag you can use a named group and refer to it in the replacement; see re.sub() in the Python docs:
NOFOLLOW_RE = re.compile(u'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'\
    u'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
    re.U|re.I)
...
response.content = NOFOLLOW_RE.sub(u'<noindex><a rel="nofollow" \g<link></noindex>', your_html)
This regexp is quirky. I strongly suggest you write a test for it, covering every combination of <a> tags and their attributes you can think of. If you find an issue in this code later, the test will help you fix it without breaking everything else.
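Such a test could start out as a small table of input/expected pairs. The sketch below reuses the <noindex> pattern from above, written with raw strings. The add_nofollow and run_cases names and the sample links are made up for illustration, and the list of cases is far from exhaustive.

```python
import re

NOFOLLOW_RE = re.compile(
    r'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'
    r'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
    re.U | re.I,
)

def add_nofollow(html):
    """Wrap external links in <noindex> and add rel="nofollow"."""
    return NOFOLLOW_RE.sub(r'<noindex><a rel="nofollow" \g<link></noindex>', html)

# (input, expected output) pairs; extend with every odd case you run into.
CASES = [
    # external link: rewritten
    ('<a href="http://other.org/">x</a>',
     '<noindex><a rel="nofollow" href="http://other.org/">x</a></noindex>'),
    # external link, single quotes, extra attribute first: rewritten
    ("<a class='c' href='http://other.org/'>x</a>",
     "<noindex><a rel=\"nofollow\" class='c' href='http://other.org/'>x</a></noindex>"),
    # internal absolute, site-relative and fragment links: untouched
    ('<a href="http://mysite.com/page">x</a>', '<a href="http://mysite.com/page">x</a>'),
    ('<a href="/about/">x</a>', '<a href="/about/">x</a>'),
    ('<a href="#top">x</a>', '<a href="#top">x</a>'),
    # already marked nofollow: untouched
    ('<a rel="nofollow" href="http://other.org/">x</a>',
     '<a rel="nofollow" href="http://other.org/">x</a>'),
]

def run_cases():
    for source, expected in CASES:
        assert add_nofollow(source) == expected, source

run_cases()
```

Note that .*? does not cross newlines unless re.DOTALL is set, so a link whose text spans several lines is worth adding as a case of its own.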
OTHER TIPS
I know I am very late, but I am leaving this answer for others. @HighCat gave the right answer for all cases except one: the regex above also adds nofollow to a link to the site's own bare domain with no trailing slash, e.g. http://example.com when example.com is your site.
So the regex in this case should be:
import re
NOFOLLOW_RE = re.compile(u'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'\
    u'(?![^>]*href=["\'](?:https?://example\.com/?(?:[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
    re.U|re.I)
class NofollowLinkMiddleware(object):
    def process_response(self, request, response):
        if ("text" in response['Content-Type']):
            response.content = NOFOLLOW_RE.sub(u'<a rel="nofollow" target="_blank" \g<link>', response.content.decode('UTF8') )
            return response
        else:
            return response
This is a minor change. I would have left a comment or edited the answer, but I don't have enough reputation to comment, and an edit must change at least 6 characters.
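The effect of the optional slash can be checked in isolation. The sketch below compares only the "internal href" part of the two patterns, anchored with $ so it can be tested against a bare href value. OLD_INTERNAL, NEW_INTERNAL and classify are hypothetical names.

```python
import re

# Only the "own domain" alternative from each pattern.
# OLD requires a "/" after the domain; NEW makes it optional.
OLD_INTERNAL = re.compile(r'https?://example\.com(?:/[^\'"]*)$', re.I)
NEW_INTERNAL = re.compile(r'https?://example\.com/?(?:[^\'"]*)$', re.I)

def classify(href):
    """Return (old_says_internal, new_says_internal) for an href value."""
    return (OLD_INTERNAL.match(href) is not None,
            NEW_INTERNAL.match(href) is not None)

for href in ('http://example.com',
             'http://example.com/',
             'http://example.com/blog/',
             'http://example.com.evil.org/'):
    print(href, classify(href))
```

The bare domain is internal only under the new pattern, while the two URLs with a path are internal under both. One caveat: because anything may now follow the domain, the new pattern also accepts a look-alike host such as example.com.evil.org as internal, so a stricter boundary after the domain may be worth adding.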