Question

In an attempt to remove unwanted or unsafe tags and attributes from input, I am using the code below (taken almost entirely from http://djangosnippets.org/snippets/1655/):

def html_filter(value, allowed_tags = 'p h1 h2 h3 div span a:href:title img:src:alt:title table:cellspacing:cellpadding th tr td:colspan:rowspan ol ul li br'):
    js_regex = re.compile(r'[\s]*(&#x.{1,7})?'.join(list('javascript')))
    allowed_tags = [tag.split(':') for tag in allowed_tags.split()]
    allowed_tags = dict((tag[0], tag[1:]) for tag in allowed_tags)
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in allowed_tags:
            tag.hidden = True
        else:
            tag.attrs = [(attr, js_regex.sub('', val)) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name]]
    return soup.renderContents().decode('utf8')

It works well for unwanted or whitelisted tags, for attributes that are not whitelisted, and even for badly formatted HTML. However, if any whitelisted attributes are present, it raises

'list' object has no attribute 'items'

at the last line, which does not help me much. type(soup) is <class 'bs4.BeautifulSoup'> whether or not the error is raised, so I don't know which object the message refers to.

Traceback:
[...]
File "C:\Users\Mark\Web\www\fnwidjango\src\base\functions\html_filter.py" in html_filter
  30.     return soup.renderContents().decode('utf8')
File "C:\Python27\lib\site-packages\bs4\element.py" in renderContents
  1098.             indent_level=indentLevel, encoding=encoding)
File "C:\Python27\lib\site-packages\bs4\element.py" in encode_contents
  1089.         contents = self.decode_contents(indent_level, encoding, formatter)
File "C:\Python27\lib\site-packages\bs4\element.py" in decode_contents
  1074.                                   formatter))
File "C:\Python27\lib\site-packages\bs4\element.py" in decode
  1021.             indent_contents, eventual_encoding, formatter)
File "C:\Python27\lib\site-packages\bs4\element.py" in decode_contents
  1074.                                   formatter))
File "C:\Python27\lib\site-packages\bs4\element.py" in decode
  1021.             indent_contents, eventual_encoding, formatter)
File "C:\Python27\lib\site-packages\bs4\element.py" in decode_contents
  1074.                                   formatter))
File "C:\Python27\lib\site-packages\bs4\element.py" in decode
  1021.             indent_contents, eventual_encoding, formatter)
File "C:\Python27\lib\site-packages\bs4\element.py" in decode_contents
  1074.                                   formatter))
File "C:\Python27\lib\site-packages\bs4\element.py" in decode
  983.             for key, val in sorted(self.attrs.items()):

Exception Type: AttributeError at /"nieuws"/article/3-test/
Exception Value: 'list' object has no attribute 'items'

Solution

Try replacing

tag.attrs = [(attr, js_regex.sub('', val)) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name]]

with

tag.attrs = dict((attr, js_regex.sub('', val)) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name])

OTHER TIPS

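The last frame of the traceback shows bs4 calling self.attrs.items() while rendering, so it expects tag.attrs to be a dict (which has an items method) rather than the list of tuples you assign. That is why the AttributeError only appears later, when renderContents() serializes the tag, and not on the line where attrs is set.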
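The difference can be seen without BeautifulSoup at all. The snippet below is only a small illustration; the variable names and the example URL are made up for this sketch and are not part of the original code.

```python
# A list of (name, value) tuples, like the one the original line assigns to tag.attrs.
attrs_as_list = [("href", "http://example.com"), ("title", "Example")]

# The same data as a dict, which is the shape the rendering code relies on.
attrs_as_dict = dict(attrs_as_list)

try:
    attrs_as_list.items()  # lists have no .items() method
    list_error = None
except AttributeError as exc:
    list_error = str(exc)

# This mirrors the sorted(self.attrs.items()) call seen in the traceback.
sorted_items = sorted(attrs_as_dict.items())

print(list_error)
print(sorted_items)
```

Calling .items() on the list fails with the same kind of AttributeError as in the traceback, while the dict version sorts cleanly.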

To fix the error, you can use a dict comprehension (available in Python 2.7 and in Python 3):

tag.attrs = {attr: js_regex.sub('', val) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name]}

Before Python 2.7, dict comprehensions aren't supported, so you should pass a generator expression to the dict constructor instead:

tag.attrs = dict((attr, js_regex.sub('', val)) for attr, val in tag.attrs.items() if attr in allowed_tags[tag.name])
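The attribute-filtering step can also be tried in isolation on a plain dict, which makes it easy to check before wiring it back into the filter. This is a minimal sketch: filter_attrs is a hypothetical helper name, the sample attributes are invented, and only the regex is copied from the question.

```python
import re

# Same pattern as in the question: the letters of "javascript", optionally
# separated by whitespace or hex character entities.
JS_REGEX = re.compile(r'[\s]*(&#x.{1,7})?'.join(list('javascript')))


def filter_attrs(attrs, allowed):
    """Return a new dict with only the allowed attribute names.

    'attrs' is a dict of attribute name -> string value; 'allowed' is a
    collection of permitted names. Any "javascript" match is stripped
    from the values that are kept. (Hypothetical helper for illustration.)
    """
    return dict(
        (name, JS_REGEX.sub('', value))
        for name, value in attrs.items()
        if name in allowed
    )


# Invented sample input: one unsafe href, one disallowed attribute, one safe title.
sample = {'href': 'javascript:alert(1)', 'onclick': 'evil()', 'title': 'Home'}
cleaned = filter_attrs(sample, ['href', 'title'])
print(cleaned)
```

Inside the loop from the question, this would presumably be used as tag.attrs = filter_attrs(tag.attrs, allowed_tags[tag.name]), which keeps tag.attrs a dict.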
Licensed under: CC-BY-SA with attribution