RSS/Python - Parsing Single Image URL
20-04-2021
Question
I'm learning to parse XML and RSS feeds correctly and have run into a small problem. I'm using feedparser in Python to parse a specific entry from an RSS feed, but I can't figure out how to extract just a single img src from the content section.
Here's what I have so far.
import dirFeedparser.feedparser as feedparser
feedurl = feedparser.parse('http://dustinheroin.chompblog.com/index.php?cat=22&feed=rss2')
statusupdate = feedurl.entries[0].content
print statusupdate
Now, when I print the content I get this:
[{'base': u'http://dustinheroin.chompblog.com/index.php?cat=22&feed=rss2', 'type': u'text/html', 'value': u'<p><a href="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg"><img alt="20120129-154945.jpg" class="alignnone size-full" src="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg" /></a></p>', 'language': None}]
What method would be best to get the IMG SRC from that? Any help is appreciated, thanks!
Solution
@Lattyware, there is a problem with how you set up soup.
@user1130601, you can check the following code:
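#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
import feedparser
feedurl = feedparser.parse('http://dustinheroin.chompblog.com/index.php?cat=22&feed=rss2')
# content is a list of dicts; the HTML sits under the 'value' key of the first item
statusupdate = feedurl.entries[0].content
soup = BeautifulSoup(statusupdate[0]['value'])
# find() returns the first matching tag; its attributes are read like dict keys
print(soup.find("img")["src"])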
Output:
http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-171134.jpg
Other tips
You can also try lxml. With lxml you can use XPath expressions.
Here, x is your statusupdate (the list returned by entries[0].content).
from lxml import etree
st = x[0]["value"]
doc = etree.fromstring(st)
value = doc.xpath("//img/@src")  # returns a list of every matching src value
print("".join(value))
Output:
http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg
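If lxml is not installed, roughly the same lookup can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not what the answer above uses: ElementTree supports only a subset of XPath and cannot select @src directly, and it works here only because the sample fragment happens to be well-formed XML. The content list below is a stand-in copied from the output printed in the question.

```python
import xml.etree.ElementTree as ET

# Stand-in for feedurl.entries[0].content, copied from the question's output.
content = [{
    "type": "text/html",
    "value": (
        '<p><a href="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg">'
        '<img alt="20120129-154945.jpg" class="alignnone size-full" '
        'src="http://dustinheroin.chompblog.com/wp-content/uploads/2012/01/20120129-154945.jpg" />'
        '</a></p>'
    ),
}]

# Parse the fragment as XML; this raises ParseError if the HTML is not well-formed.
doc = ET.fromstring(content[0]["value"])

# ElementTree cannot evaluate //img/@src, so walk the img elements
# and read the attribute from each one.
srcs = [img.get("src") for img in doc.iter("img")]
print(srcs[0])
```

For real-world feeds, where the HTML is often not valid XML, a forgiving HTML parser is likely the safer choice.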
If you want to get a good HTML parser, try BeautifulSoup.
It's easy to parse with it:
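from BeautifulSoup import BeautifulSoup
# content is a list of dicts, so index the first item before reading 'value'
soup = BeautifulSoup(statusupdate[0]['value'])
# tag attributes are read with dict-style access; .src would look for a child <src> tag
url = soup.find('img')['src']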
feedparser gives you the entry's content as an HTML string, so you then want to use a separate HTML parser to parse that HTML and get the img tag's src attribute. You might want to look into Beautiful Soup.
For example:
from BeautifulSoup import BeautifulSoup
import feedparser
feedurl = feedparser.parse('http://dustinheroin.chompblog.com/index.php?cat=22&feed=rss2')
statusupdate = feedurl.entries[0].content[0]
soup = BeautifulSoup(statusupdate["value"])
print(soup.find("img")["src"])
Note that this simply uses the first img tag it finds. If you need to be more selective, look at findAll.
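To make the "more selective" case concrete, here is a minimal sketch that collects every img src instead of only the first, then filters on the class attribute. It uses the standard library's html.parser rather than Beautiful Soup so that it runs without extra installs. The ImgSrcCollector class, the sample HTML, and the example.com URLs are all made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: records (src, class) for every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("src"), attr_map.get("class") or ""))


# Made-up sample with two images, one thumbnail and one full size.
html = (
    '<p><img class="alignnone size-thumbnail" src="http://example.com/thumb.jpg" />'
    '<img class="alignnone size-full" src="http://example.com/full.jpg"></p>'
)

collector = ImgSrcCollector()
collector.feed(html)

# Every image source, in document order.
all_srcs = [src for src, _ in collector.images]

# Only the images whose class list contains "size-full".
full_size = [src for src, cls in collector.images if "size-full" in cls.split()]

print(all_srcs)
print(full_size)
```

With Beautiful Soup, the same idea would be a loop over soup.findAll("img"), reading ["src"] from each tag and skipping the ones you don't want.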