Question

I'm testing feedparser on my RSS feed. It works like a charm and I get all the entries.

Some news items have an embedded YouTube player, but it does not appear in the values feedparser returns.

My code is simply:

import feedparser

d = feedparser.parse('http://feeds.feedburner.com/NotciasPs3evita-Mypst')

This returns (excerpt):

          'guidislink': False,
          'id': u'http://mypst.com.br/forum/index.php?/topic/17336-gamegen-call-of-duty-black-ops-2-ganha-trailer-com-acao-real-e-muitas-surpresas/',
          'link': u'http://mypst.com.br/forum/index.php?/topic/17336-gamegen-call-of-duty-black-ops-2-ganha-trailer-com-acao-real-e-muitas-surpresas/',
          'links': [{'href': u'http://mypst.com.br/forum/index.php?/topic/17336-gamegen-call-of-duty-black-ops-2-ganha-trailer-com-acao-real-e-muitas-surpresas/',
                     'rel': u'alternate',
                     'type': u'text/html'}],
          'published': u'Mon, 29 Oct 2012 14:53:58 +0000',
          'published_parsed': time.struct_time(tm_year=2012, tm_mon=10, tm_mday=29, tm_hour=14, tm_min=53, tm_sec=58, tm_wday=0, tm_yday=303, tm_isdst=0),
          'summary': u'A Activision revelou hoje um novo trailer de Call of Duty: Black Ops 2, substituindo as cenas de a\xe7\xe3o do jogo por cenas de a\xe7\xe3o na vida real. O trailer traz diversas \u201csurpresas\u201d e alguns zumbis.<br />\n<br />\n<br />\n<br />\nCall of Duty: Black Ops 2 chegar\xe1 no dia 13 de novembro nos Estados Unidos.<br />\n<br />\n<br />\n<em class="bbc"><strong class="bbc">Fonte: <a class="bbc_url" href="http://www.gamegen.com.br/playstation3/call-of-duty-black-ops-2-ganha-trailer-com-acao-real-e-muitas-surpresas/" rel="nofollow external" title="Link externo">GameGeneration</a></strong></em>',
          'summary_detail': {'base': u'http://feeds.feedburner.com/NotciasPs3evita-Mypst',
                             'language': None,
                             'type': u'text/html',
                             'value': u'A Activision revelou hoje um novo trailer de Call of Duty: Black Ops 2, substituindo as cenas de a\xe7\xe3o do jogo por cenas de a\xe7\xe3o na vida real. O trailer traz diversas \u201csurpresas\u201d e alguns zumbis.<br />\n<br />\n<br />\n<br />\nCall of Duty: Black Ops 2 chegar\xe1 no dia 13 de novembro nos Estados Unidos.<br />\n<br />\n<br />\n<em class="bbc"><strong class="bbc">Fonte: <a class="bbc_url" href="http://www.gamegen.com.br/playstation3/call-of-duty-black-ops-2-ganha-trailer-com-acao-real-e-muitas-surpresas/" rel="nofollow external" title="Link externo">GameGeneration</a></strong></em>'},
          'title': u'[GameGen] Call of Duty: Black Ops 2 ganha trailer com a\xe7\xe3o real e muitas surpresas',
          'title_detail': {'base': u'http://feeds.feedburner.com/NotciasPs3evita-Mypst',
                           'language': None,
                           'type': u'text/plain',
                           'value': u'[GameGen] Call of Duty: Black Ops 2 ganha trailer com a\xe7\xe3o real e muitas surpresas'}},

Everything is in place except for the YouTube player's <object> tag. Is this a feedparser bug or a problem with my RSS feed? Is there another Python library that can do this?


Solution

feedparser sanitizes HTML input, and the <object>, <param> and <embed> tags are stripped by default.

You need to either disable sanitization (only if you truly trust the source), or whitelist the YouTube tags.

To disable sanitization, set SANITIZE_HTML to False:

feedparser.SANITIZE_HTML = False

To add to the whitelist, add elements to the feedparser._HTMLSanitizer.acceptable_elements set:

feedparser._HTMLSanitizer.acceptable_elements.update(['object', 'param', 'embed'])

Both methods carry inherent risks: either way you are opening yourself to attacks through the feed's HTML. The approach I'd use is to switch off feedparser's sanitization altogether and then sanitize the HTML some other way, probably with lxml.html.clean, using a tag whitelist and listing YouTube in host_whitelist.
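
To illustrate the host-whitelist idea without depending on lxml, here is a minimal standard-library sketch. It is not lxml's Cleaner and not a full sanitizer: it only lists the URLs that embedding tags point at and flags the ones whose host is not on a whitelist. The names TRUSTED_EMBED_HOSTS, EmbedCollector, is_trusted and untrusted_embeds are made up for this example, and the host list is an assumption you should adjust.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist; adjust it to the hosts you actually trust.
TRUSTED_EMBED_HOSTS = {"www.youtube.com", "youtube.com", "www.youtube-nocookie.com"}

# Tags that can pull in external content, and the attribute holding the URL.
EMBED_URL_ATTRS = {"embed": "src", "iframe": "src", "object": "data", "param": "value"}

# <param> elements carry many settings; only these names hold a URL.
URL_PARAM_NAMES = {"movie", "src"}


class EmbedCollector(HTMLParser):
    """Collect the URLs referenced by embedding tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attr_name = EMBED_URL_ATTRS.get(tag)
        if attr_name is None:
            return
        attr_map = dict(attrs)
        if tag == "param" and (attr_map.get("name") or "").lower() not in URL_PARAM_NAMES:
            return
        value = attr_map.get(attr_name)
        if value:
            self.urls.append(value)


def is_trusted(url):
    """Return True only if the URL's host is exactly one of the trusted hosts."""
    try:
        host = urlparse(url).hostname
    except ValueError:
        return False
    return host is not None and host in TRUSTED_EMBED_HOSTS


def untrusted_embeds(html):
    """Return the embedded URLs in `html` whose host is not whitelisted."""
    collector = EmbedCollector()
    collector.feed(html)
    collector.close()
    return [url for url in collector.urls if not is_trusted(url)]
```

With sanitization switched off you could call untrusted_embeds(entry.summary) on each entry and drop or re-clean any entry for which the list is not empty. Scripts, event-handler attributes and the like are not touched by this check, so a real sanitizer is still needed.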

Other suggestions

In response to the above solution, it looks as though more recently the SANITIZE_HTML setting has moved under feedparser.api:

feedparser.api.SANITIZE_HTML = False
feedparser.api.mixin.SANITIZE_HTML = False
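
Since the flag seems to live in different places depending on the feedparser version, one option is a small helper that flips it wherever it already exists. This is only a sketch: disable_sanitization and CANDIDATE_PATHS are hypothetical names, and the three locations are just the ones mentioned in this thread, so other releases may keep the setting elsewhere.

```python
# Dotted paths, relative to the feedparser package, named in this thread.
# The empty string stands for the package itself.
CANDIDATE_PATHS = ("", "api", "api.mixin")


def disable_sanitization(package):
    """Set SANITIZE_HTML = False at each candidate location where it exists.

    Returns the list of locations that were changed, so the caller can
    tell whether anything was found at all.
    """
    changed = []
    for path in CANDIDATE_PATHS:
        target = package
        # Walk down the dotted path; stop if a component is missing.
        for part in filter(None, path.split(".")):
            target = getattr(target, part, None)
            if target is None:
                break
        if target is not None and hasattr(target, "SANITIZE_HTML"):
            target.SANITIZE_HTML = False
            changed.append(path or "<package>")
    return changed
```

Usage would be import feedparser followed by disable_sanitization(feedparser). An empty return value means none of the known locations matched, which is a hint to check where your installed version keeps the setting.
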
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow