Question

I'm iterating through all the data at this webpage (sample xml below) and I'm confused as to exactly how to get the required values.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/i/xml/xsl_formatting_rss.xml"?>
<rss xmlns:blogChannel="http://backend.userland.com/blogChannelModule" version="2.0">
    <channel>
        <title>Ariana Resources News</title>
        <link>http://www.iii.co.uk/investment/detail?code=cotn:AAU.L&amp;display=news</link>
        <description />
    <item>
        <title>Ariana Resources PLC - Environmental Impact Assessment Submitted for Kiziltepe</title>
        <link>http://www.iii.co.uk/investment/detail?code=cotn:AAU.L&amp;display=news&amp;action=article&amp;articleid=9084833&amp;from=rss</link>
        <description>Some Article information</description>
        <pubDate>Fri, 30 Aug 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
        <title>Ariana Resources PLC - Directors' Dealings and Holding in Company</title>
        <link>http://www.iii.co.uk/investment/detail?code=cotn:AAU.L&amp;display=news&amp;action=article&amp;articleid=9053338&amp;from=rss</link>
        <description>Some Article information</description>
        <pubDate>Wed, 31 Jul 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
        <title>Ariana Resources PLC - Directorship Changes</title>
        <link>http://www.iii.co.uk/investment/detail?code=cotn:AAU.L&amp;display=news&amp;action=article&amp;articleid=9046582&amp;from=rss</link>
        <description>Some Article information</description>
        <pubDate>Wed, 24 Jul 2013 09:31:00 GMT</pubDate>
    </item>
    <item>
        <title>Ariana Resources PLC - Ariana Resources plc : Capital Reorganisation</title>
        <link>http://www.iii.co.uk/investment/detail?code=cotn:AAU.L&amp;display=news&amp;action=article&amp;articleid=9038706&amp;from=rss</link>
        <description>Some Article information</description>
        <pubDate>Wed, 24 Jul 2013 09:31:00 GMT</pubDate>
    </item>
    <item>
</channel>
</rss>

I've had a look at the dom4j quickstart guide, although I suspect I'm just not quite getting it.

How can I iterate in such a fashion that I:

  1. Go through each item that has today's date, and...
  2. Get the values of each of its child elements specifically

At this point I've got the below, and I think it's very wrong on the second loop... any help is hugely appreciated:

    //Create a null Document Object
    Document theXML = null;

    //Get the document of the XML and assign to Document object
    theXML = parseXML(url);

    //Place the root element of theXML into a variable
    Element root = theXML.getRootElement();


    // iterate through child elements of root
    for ( Iterator i = root.elementIterator(); i.hasNext(); ) {
        Element element = (Element) i.next();
        // do something

        // iterate through child elements of root with element name "item"
        for ( Iterator j = root.elementIterator( "item" ); j.hasNext(); ) {
            Element foo = (Element) j.next();

            String rnsHeadline = "";
            String rnsLink = "";
            String rnsFullText = "";
            String rnsConstituentName = "";



            Rns rns = new Rns(null, null, null, null);

        }

Solution

With XPath functionality of dom4j:

// Select every item element under rss/channel with an XPath expression
List<? extends Node> items =
        (List<? extends Node>)theXML.selectNodes("//rss/channel/item");

// RFC-dictated date format used with RSS
DateFormat dateFormatterRssPubDate =
        new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss Z", Locale.ENGLISH);

// today started at this time
DateTime timeTodayStartedAt = new DateTime().withTimeAtStartOfDay();

for (Node node: items) {
     String pubDate = node.valueOf( "pubDate" );
     DateTime date = new DateTime(dateFormatterRssPubDate.parse(pubDate));
     if (!date.isBefore(timeTodayStartedAt)) {
         // published at or after midnight today, do something!
         System.out.println("Today: " + date);
     } else {
         System.out.println("Not today: " + date);
     }
}

dom4j needs the jaxen dependency on the classpath for XPath to work. I used Joda-Time to compare the dates, as it is a lot cleaner than Java's built-in date classes. Here's the full example.
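If adding Joda-Time is not an option, the same "is it today?" check can be sketched with the java.time classes that ship with Java 8 and later. This is only a minimal sketch and not part of the original answer: the class name PubDateCheck and the method isOnDay are made up for illustration, and it assumes the pubDate values follow the RFC 1123 layout shown in the sample feed.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of dom4j or the original answer.
class PubDateCheck {

    // Returns true when the RSS pubDate text falls on the given calendar day,
    // with the day boundary evaluated in the given time zone.
    static boolean isOnDay(String pubDate, LocalDate day, ZoneId zone) {
        OffsetDateTime published =
                OffsetDateTime.parse(pubDate.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
        return published.atZoneSameInstant(zone).toLocalDate().equals(day);
    }

    public static void main(String[] args) {
        String pubDate = "Wed, 31 Jul 2013 07:00:00 GMT"; // taken from the sample feed
        // Check against a fixed day in UTC, then against "today" in the local zone.
        System.out.println(isOnDay(pubDate, LocalDate.of(2013, 7, 31), ZoneOffset.UTC));
        System.out.println(isOnDay(pubDate, LocalDate.now(), ZoneId.systemDefault()));
    }
}
```

Inside the loop over the item nodes you would pass node.valueOf("pubDate") as the first argument. Passing the zone explicitly makes it visible which midnight counts as the start of "today", which matters because the feed publishes its times in GMT.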

Note that dom4j is not really maintained, so you might also be interested in this discussion about dom4j alternatives.
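One alternative that needs no extra dependency is the JAXP API bundled with the JDK (javax.xml.parsers and javax.xml.xpath). The following is a rough sketch under assumptions, not a drop-in replacement for the code above: the class JaxpRssTitles and the method titles are hypothetical names, and the feed is passed in as a string rather than fetched from a URL.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class that uses only JDK classes.
class JaxpRssTitles {

    // Collects the text of the title child of every rss/channel/item element.
    static List<String> titles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList items = (NodeList) xpath.evaluate(
                "/rss/channel/item", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Node item = items.item(i);
            // A relative XPath is evaluated against the current item node.
            result.add(xpath.evaluate("title", item));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss version=\"2.0\"><channel><title>Feed</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(titles(xml));
    }
}
```

The same relative-path call works for link, description and pubDate, so the values needed to build an Rns object can be read one by one from each item.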

OTHER TIPS

Your second loop is on the right track: you have to navigate down the element hierarchy to reach the elements you are interested in. Note that the item elements are children of channel, not of the root rss element, so you need to iterate over channel first. Here is how you could continue:

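import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

public class Dom4JRssParser {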

    private void parse(Date day) throws DocumentException, ParseException {
        Date dayOnly = removeTime(day);

        // Fri, 30 Aug 2013 07:00:00 GMT
        SimpleDateFormat sdfXml = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z", Locale.ENGLISH);
        System.out.println("Day: " + sdfXml.format(dayOnly));

        SAXReader reader = new SAXReader();
        Document doc = reader.read(getClass().getResourceAsStream("/com/so/dom4j/parser/rss/example_01.xml"));
        Element root = doc.getRootElement(); // rss
        for(Iterator rootIt = root.elementIterator("channel"); rootIt.hasNext(); ) {
            Element channel = (Element) rootIt.next();
            for(Iterator itemIt = channel.elementIterator("item"); itemIt.hasNext(); ) {
                Element item = (Element) itemIt.next();
                Element pubDate = item.element("pubDate");
                if(pubDate != null) {
                    if(removeTime(sdfXml.parse(pubDate.getTextTrim())).equals(dayOnly)) {
                        Rns rns = new Rns(item.element("title"), 
                                item.element("link"), 
                                item.element("description"), 
                                item.element("constituent"));
                        System.out.println(rns.toString());
                        System.out.println();
                    }
                }
            }
        }
    }

    private Date removeTime(Date day) {
        Calendar c = Calendar.getInstance(Locale.ENGLISH);
        c.setTime(day);
        c.set(Calendar.HOUR_OF_DAY, 0);
        c.set(Calendar.MINUTE, 0);
        c.set(Calendar.SECOND, 0);
        c.set(Calendar.MILLISECOND, 0);
        return c.getTime();
    }

    public static void main(String... args) throws ParseException, DocumentException {
        Dom4JRssParser o = new Dom4JRssParser();
        if(args.length == 0) {
            o.parse(new Date());
        } else {
            SimpleDateFormat sdfInput = new SimpleDateFormat("yyyyMMdd");
            for(String arg : args) {
                o.parse(sdfInput.parse(arg));
            }
        }
    }
}

Test run with the argument

20130731

Output

Day: Wed, 31 Jul 2013 00:00:00 CEST
Rns [rnsHeadline=Ariana Resources PLC - Directors' Dealings and Holding in Company
rnsLink=http://www.iii.co.uk/investment/detail?code=cotn:AAU.L&display=news&action=article&articleid=9053338&from=rss
rnsFullText=Some Article information
rnsConstituentName=]

You may also want to consider the XPath API (see the section "Powerful Navigation with XPath" in the quick-start guide you linked), as it is more convenient; see eis's answer.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow