Question

I'm trying to parse rss data from this feed: http://fulltextrssfeed.com/feeds.bbci.co.uk/news/rss.xml, which was generated using using the FullTextRssFeed site. The only problem is that when I try to get the description, I receive '<', everything else is normal!. I've tried to use JSoup with this, but I'm not sure how to. Could you suggest how? The code I've used is the same as that used in this tutorial, but I've substituted the RSS URL used. Thanks again! *Here is my RSS reader in action*

Was it helpful?

Solution 3

While searching the web for ideas on how to do this, I found out that doing this is actually illegal as it this method of getting content violates the terms of use of many of the web sources I was hoping to use. For now you will have to stick with short RSS feeds.

OTHER TIPS

Your issue is because the description inside of your RSS feed contains html, rather than plain text. Here is the description content:

<div><span class="story-date"><span class="date">3 April 2013</span> <span class="time-text">Last updated at</span> <span class="time">23:25 ET</span></span> <p><img src="http://news.bbcimg.co.uk/media/images/66739000/jpg/_66739180_philpotts.jpg" width="464" height="261" alt="Mick and Mairead Philpott, Paul Mosley"/><span class="c2">Mick and Mairead Philpott, and Paul Mosley, will be sentenced on Thursday</span></p> <p class="introduction" id="story_continues_1">A couple convicted of killing six of their children in a house fire in Derby are due to be sentenced later.</p> <p>Mick and Mairead Philpott will reappear at Nottingham Crown Court where they were found guilty of six counts of manslaughter, along with their friend Paul Mosley, on Tuesday.</p> <p>The maximum sentence for the crime is life imprisonment.</p> <p>Mrs Justice Thirlwall was due to pass sentence on Wednesday but needed more time to consider mitigation.</p> <p>The court was told that Philpott, 56, was jailed for seven years in 1978 for attempting to murder a previous girlfriend and given a concurrent five-year sentence for stabbing the woman's mother.</p> <p>In 1991 he received a conditional discharge for assault after he head-butted a colleague</p> <p>And in 2010 he was given a police caution after slapping Mairead and dragging her outside by her hair.</p> <p>When Philpott set fire to his house in Victory Road, Derby, he was also facing trial over a road rage incident in which he punched a motorist in the face.</p> <p>He had admitted common assault in relation to the incident but denied dangerous driving.</p> <span class="cross-head">Rape allegation</span> <p>Police have also confirmed that they intend to "thoroughly" investigate an allegation that Philpott raped a woman several years ago.</p> <p>She made the allegation after the death of Philpott's children, but police decided to wait until the end of the manslaughter trial before investigating the complaint further.</p> <p>On Tuesday the jury returned unanimous manslaughter verdicts on Philpott and Mosley, 46, while Mairead Philpott, 32, was convicted by a majority.</p> <p>Jade Philpott, 10, John, nine, Jack, eight, Jesse, six, and Jayden, five, died on the morning of the fire on 11 May 2012.</p> <p>Mairead Philpott's son from a previous relationship, 13-year-old Duwayne, died later in hospital.</p> </div><img src="http://pixel.quantserve.com/pixel/p-89EKCgBk8MZdE.gif" border="0" height="1" width="1" />

You'll need to alter the parser in some way that it can ignore the that are within the html content inside of description. Once you get the full html snippet out you can render it in a WebView. I think generally CDATA is used when there is some other type of XML content (in this case HTML) that is within an XML piece of data such as an RSS feed. Honestly though I am not to familiar with the ins and outs of it, I could be incorrect.

The HTML you get from myRssFeed.getDescription() looks something like this:

<div><span class="story-date"><span class="date">6 April 2013</span> <span class="time-text">Last updated at</span> <span class="time">08:57 ET</span></span> <p><img src="http://news.bbcimg.co.uk/media/images/51606000/jpg/_51606573_fa1d16c0-9c6c-4f82-b0b8-ab66ddd94f78.jpg" width="304" height="171" alt="Breaking news"/></p> <p class="introduction">Nelson Mandela has been discharged from hospital after treatment for pneumonia, South Africa's government has said.</p> <p>It said there had been "a sustained and gradual improvement in his condition".</p> <p>The 94-year-old was admitted on 27 March for a recurring lung infection and had fluid drained at the undisclosed hospital.</p> <p>Mr Mandela served as South Africa's first black president from 1994 to 1999 and is regarded by many as the father of the nation.</p> <p>The <a href="http://redirect.viglink.com?key=11fe087258b6fc0532a5ccfc924805c0&u=http%3A%2F%2Fwww.thepresidency.gov.za%2Fpebble.asp%3Frelid%3D15178">presidency statement read</a>: "Former President Nelson Mandela has been discharged from hospital today, 6 April, following a sustained and gradual improvement in his general condition.</p> <p>"The former president will now receive home-based high care. President [Jacob] Zuma thanks the hard working medical team and hospital staff for looking after Madiba so efficiently."</p> <p>Madiba is Mr Mandela's clan name.</p> <p>The statement continued: "[Mr Zuma] also extended his gratitude to all South Africans and friends of the Republic in Africa and around the world for support."</p> </div><img src="http://pixel.quantserve.com/pixel/p-89EKCgBk8MZdE.gif" border="0" height="1" width="1" />

Using Jsoup you can try this (untested):

Instead of

feedDescribtion.setText(myRssFeed.getDescription());

use this:

feedDescribtion.setText(extractDescriptionText(myRssFeed.getDescription());

with the following method:

private String extractDescriptionText(String description) {
    StringBuffer b = new StringBuffer();
    Document dom = Jsoup.parse(description);
    Elements paragraphs = dom.getElementsByTag("p");
    for (int i=1; i<paragraphs.size(); i++) { // start with 1 to skip the 'breaking news' paragraph
        Element p = paragraphs.get(i);
        b.append(p.text());
        b.append("\n"); // line-break after each paragraph
    }
    return b.toString();
}

This should work. Maybe some fine-tuning is necessary but that can be achieved quite easily with Jsoup's help.

EDIT:

This is what extractDescriptionText() gives for the example above:

Nelson Mandela has been discharged from hospital after treatment for pneumonia, South Africa's government has said. It said there had been "a sustained and gradual improvement in his condition". The 94-year-old was admitted on 27 March for a recurring lung infection and had fluid drained at the undisclosed hospital. Mr Mandela served as South Africa's first black president from 1994 to 1999 and is regarded by many as the father of the nation. The presidency statement read: "Former President Nelson Mandela has been discharged from hospital today, 6 April, following a sustained and gradual improvement in his general condition. "The former president will now receive home-based high care. President [Jacob] Zuma thanks the hard working medical team and hospital staff for looking after Madiba so efficiently." Madiba is Mr Mandela's clan name. The statement continued: "[Mr Zuma] also extended his gratitude to all South Africans and friends of the Republic in Africa and around the world for support."

I would comment but I don't have enough points.

I would recommend using yahoo pipes to redirect your rss feeds. You can even choose it to be redirected as json rather than xml.

http://pipes.yahoo.com/pipes/

If your parser is working OK on most websites you visited this would be the easiest way to fix your problem.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top