Question

I am extracting metadata from PNG images using javax.imageio. This works fine. But the getAsTree method to get to the actual metadata returns XML that is invalid. So I don't know how to parse this XML in order to get certain metadata:

run:
Format name: javax_imageio_png_1.0
<javax_imageio_png_1.0>
    <IHDR width="256" height="256" bitDepth="8" colorType="RGBAlpha" compressionMethod="deflate" filterMethod="adaptive" interlaceMethod="none"/>
    <cHRM whitePointX="31269" whitePointY="32899" redX="63999" redY="33001" greenX="30000" greenY="60000" blueX="15000" blueY="5999"/>
    <gAMA value="45454"/>
    <iTXt>
        <iTXtEntry keyword="XML:com.adobe.xmp" compressionFlag="FALSE" compressionMethod="0" languageTag="" translatedKeyword="" text="<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01        ">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xmp="http://ns.adobe.com/xap/1.0/"
    xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
    xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
    xmlns:lr="http://ns.adobe.com/lightroom/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmp:MetadataDate="2012-12-05T21:36:19+01:00"
   xmpMM:InstanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
   xmpMM:DocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE"
   xmpMM:OriginalDocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE">
   <xmpMM:History>
    <rdf:Seq>
     <rdf:li
      stEvt:action="saved"
      stEvt:instanceID="xmp.iid:FC7F11740720681192B0AE5890E66CAE"
      stEvt:when="2012-12-04T00:23:34+01:00"
      stEvt:changed="/metadata"/>
     <rdf:li
      stEvt:action="saved"
      stEvt:instanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
      stEvt:when="2012-12-05T21:36:19+01:00"
      stEvt:changed="/metadata"/>
    </rdf:Seq>
   </xmpMM:History>
   <lr:hierarchicalSubject>
    <rdf:Bag>
     <rdf:li>Component|Software</rdf:li>
     <rdf:li>Places|Paris</rdf:li>
     <rdf:li>Product|Christensen</rdf:li>
     <rdf:li>Product|Simba</rdf:li>
    </rdf:Bag>
   </lr:hierarchicalSubject>
   <dc:subject>
    <rdf:Bag>
     <rdf:li>Christensen</rdf:li>
     <rdf:li>Paris</rdf:li>
     <rdf:li>Simba</rdf:li>
     <rdf:li>Software</rdf:li>
    </rdf:Bag>
   </dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>"/>
    </iTXt>
    <pHYs pixelsPerUnitXAxis="2835" pixelsPerUnitYAxis="2835" unitSpecifier="meter"/>
</javax_imageio_png_1.0>
Format name: javax_imageio_1.0
<javax_imageio_1.0>
    <Chroma>
        <ColorSpaceType name="RGB"/>
        <NumChannels value="4"/>
        <Gamma value="0.45453998"/>
        <BlackIsZero value="TRUE"/>
    </Chroma>
    <Compression>
        <CompressionTypeName value="deflate"/>
        <Lossless value="TRUE"/>
        <NumProgressiveScans value="1"/>
    </Compression>
    <Data>
        <PlanarConfiguration value="PixelInterleaved"/>
        <SampleFormat value="UnsignedIntegral"/>
        <BitsPerSample value="8 8 8 8"/>
    </Data>
    <Dimension>
        <PixelAspectRatio value="1.0"/>
        <ImageOrientation value="Normal"/>
        <HorizontalPixelSize value="0.35273367"/>
        <VerticalPixelSize value="0.35273367"/>
    </Dimension>
    <Text>
        <TextEntry keyword="XML:com.adobe.xmp" value="<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01        ">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xmp="http://ns.adobe.com/xap/1.0/"
    xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
    xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
    xmlns:lr="http://ns.adobe.com/lightroom/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmp:MetadataDate="2012-12-05T21:36:19+01:00"
   xmpMM:InstanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
   xmpMM:DocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE"
   xmpMM:OriginalDocumentID="xmp.did:FC7F11740720681192B0AE5890E66CAE">
   <xmpMM:History>
    <rdf:Seq>
     <rdf:li
      stEvt:action="saved"
      stEvt:instanceID="xmp.iid:FC7F11740720681192B0AE5890E66CAE"
      stEvt:when="2012-12-04T00:23:34+01:00"
      stEvt:changed="/metadata"/>
     <rdf:li
      stEvt:action="saved"
      stEvt:instanceID="xmp.iid:EF7F11740720681192B08F682498C71D"
      stEvt:when="2012-12-05T21:36:19+01:00"
      stEvt:changed="/metadata"/>
    </rdf:Seq>
   </xmpMM:History>
   <lr:hierarchicalSubject>
    <rdf:Bag>
     <rdf:li>Component|Software</rdf:li>
     <rdf:li>Places|Paris</rdf:li>
     <rdf:li>Product|Christensen</rdf:li>
     <rdf:li>Product|Simba</rdf:li>
    </rdf:Bag>
   </lr:hierarchicalSubject>
   <dc:subject>
    <rdf:Bag>
     <rdf:li>Christensen</rdf:li>
     <rdf:li>Paris</rdf:li>
     <rdf:li>Simba</rdf:li>
     <rdf:li>Software</rdf:li>
    </rdf:Bag>
   </dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>" language="" compression="none"/>
    </Text>
    <Transparency>
        <Alpha value="nonpremultipled"/>
    </Transparency>
</javax_imageio_1.0>
BUILD SUCCESSFUL (total time: 3 seconds)

The invalid XML starts at the iTXtEntry element, which has the xpacket bit and encloses child elements although it has a self-closing tag format, instead of an end tag. So when I try to parse this using a DOM document and xpath, I get an error saying that this element cannot contain ">" in the content of the element.

I have disabled DTD validation on the DocumentBuilderFactory. This doesn't help. I feel like I'm down to using regex, but that doesn't seem right. Why do I get invalid XML in the first place from the getAsTree method in imageio, and what can I do about this?

Was it helpful?

Solution

Your question is nonsensical because IIOMetaData.getAsTree() returns a DOM Node object which is the root of a Node tree. This is an in-memory representation of XML. It's not parsed from anywhere, so it can't be invalid. An xml document string can be invalid, but there's no string here that's being parsed. The getAsTree method created the XML directly, in-memory.

The problem is with your output producing invalid XML. Whatever is serializing your Node from getAsTree() is doing so incorrectly. Namely, it is not properly escaping the value of the text attribute, which is itself an XML document string.

Below is a complete example that demonstrates how to get image metadata and serialize to a (valid) XML string.

import java.io.*;
import java.util.*;

// for imageio metadata
import javax.imageio.*;
import javax.imageio.stream.*;
import javax.imageio.metadata.*;

// for xml handling
import org.w3c.dom.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;

public class imgmeta {
    // Very lazy exception handling
    // This is just a quick example
    public static void main(String[] args) throws Exception {
        String filename = args[0];

        File file = new File(filename);
        ImageInputStream imagestream = ImageIO.createImageInputStream(file);

        // get a reader which is able to read this file
        Iterator<ImageReader> readers = ImageIO.getImageReaders(imagestream);
        ImageReader reader = readers.next();

        // feed image to reader
        reader.setInput(imagestream, true);

        // get metadata of first image
        IIOMetadata metadata = reader.getImageMetadata(0);

        // get any metadata format name
        // (you should prefer the native one, but not all images have one)
        // String mdataname = metadata.getNativeMetadataFormatName(); // might be null
        String[] mdatanames = metadata.getMetadataFormatNames();

        String mdataname = mdatanames[0];

        Node metadatadom = metadata.getAsTree(mdataname);

        // metadatadom is now a DOM Node root of a DOM tree
        // representing metadata in the image
        // Since it's in-memory, it can't be "invalid"
        // because it's already been parsed


        // now let's serialize to an XML string
        // javax.xml.transform.Transformer takes xml sources
        // in one representation and transforms them to xml
        // in another representation
        // Representations include: DOM, JAXB, SAX, stream, etc
        DOMSource source = new DOMSource(metadatadom);

        StringWriter writer = new StringWriter();
        StreamResult result = new StreamResult(writer);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(source, result);

        // THIS is what you want:
        String metadata_in_xml = writer.toString();

        // now print it:
        System.out.print(metadata_in_xml);
    }
}

This is test output run using an image I had around:

$ java imgtest testimage.png | xmllint --format -
<?xml version="1.0" encoding="UTF-8"?>
<javax_imageio_png_1.0>
  <IHDR width="149" height="237" bitDepth="8" colorType="RGBAlpha" compressionMethod="deflate" filterMethod="adaptive" interlaceMethod="none"/>
  <iTXt>
    <iTXtEntry keyword="XML:com.adobe.xmp" compressionFlag="0" compressionMethod="0" languageTag="" translatedKeyword="" text="&lt;?xpacket begin=&quot;?&quot; id=&quot;W5M0MpCehiHzreSzNTczkc9d&quot;?&gt; &lt;x:xmpmeta xmlns:x=&quot;adobe:ns:meta/&quot; x:xmptk=&quot;Adobe XMP Core 5.0-c061 64.140949, 2010/12/07-10:57:01        &quot;&gt; &lt;rdf:RDF xmlns:rdf=&quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&quot;&gt; &lt;rdf:Description rdf:about=&quot;&quot; xmlns:xmp=&quot;http://ns.adobe.com/xap/1.0/&quot; xmlns:xmpMM=&quot;http://ns.adobe.com/xap/1.0/mm/&quot; xmlns:stRef=&quot;http://ns.adobe.com/xap/1.0/sType/ResourceRef#&quot; xmp:CreatorTool=&quot;Adobe Photoshop CS5.1 Macintosh&quot; xmpMM:InstanceID=&quot;xmp.iid:D281E43D34DC11E2BFE69DA1E5D17E5F&quot; xmpMM:DocumentID=&quot;xmp.did:D281E43E34DC11E2BFE69DA1E5D17E5F&quot;&gt; &lt;xmpMM:DerivedFrom stRef:instanceID=&quot;xmp.iid:D281E43B34DC11E2BFE69DA1E5D17E5F&quot; stRef:documentID=&quot;xmp.did:D281E43C34DC11E2BFE69DA1E5D17E5F&quot;/&gt; &lt;/rdf:Description&gt; &lt;/rdf:RDF&gt; &lt;/x:xmpmeta&gt; &lt;?xpacket end=&quot;r&quot;?&gt;"/>
  </iTXt>
  <tEXt>
    <tEXtEntry keyword="Software" value="Adobe ImageReady"/>
  </tEXt>
</javax_imageio_png_1.0>

The XML produced is valid:

$ java imgmeta testimage.png | xmllint --noout -
$

(No output means valid.)

Notice how the value of the iTXtEntry's text attribute is escaped. If you want to retrieve the data inside this attribute, you need to retrieve the string and then parse that as its own XML document to get another DOM (or whatever) tree. This attribute: keyword="XML:com.adobe.xmp" is a signal that the value of the text attribute is an XML document with XMP data in it.

UPDATE: parsing XMP data

Here is some sample code demonstrating extracting the attribute value and parsing it to and from XML and a DOM tree.

public class XMPExample {
public static String transformXML(Node xml) throws Exception {
    StringWriter writer = new StringWriter();

    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.transform(new DOMSource(xml), new StreamResult(writer));

    return writer.toString();
}

public static Document transformXML(String xml) throws Exception {
    StringReader reader = new StringReader(xml);
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Transformer transformer = TransformerFactory.newInstance().newTransformer();

    transformer.transform(new StreamSource(reader), new DOMResult(doc));
    return doc;
}

public static String getXMP(Element metadata_dom) throws Exception {
            // (Element) type because getElementsByTagName() method is required

    // There are many more robust ways of selecting nodes
    // (e.g. javax.xml.xpath), but this is for a simple example
    // that only uses the native DOM methods

    // This is very brittle because we're making assumptions about
    // the metadata_dom structure. There are two sources of brittleness:

    // 1. The metadata format from `metadata.getMetadataFormatNames()`.
    //    You should probably settle on a standard one you know will
    //    exist, like 'javax_imageio_1.0'
    // 2. How the image stores the metadata. Usually XMP data will
    //    be in a text field with keyword 'XML:com.adobe.xmp', but
    //    I don't know that this is *always* the case.

    // the code below assumes "javax_imageio_png_1.0" format
    NodeList iTXtEntries = metadata_dom.getElementsByTagName("iTXtEntry");
    Element iTXtEntry = null;
    Element entry = null;
    for (int i = 0; i < iTXtEntries.getLength(); i++) {
        entry = (Element) iTXtEntries.item(i);
        if (entry.getAttribute("keyword").equals("XML:com.adobe.xmp")) {
            iTXtEntry = entry;
            break;
        }
    }
    if (iTXtEntry == null) {
        return null;
    }

    String xmp_xml_doc = iTXtEntry.getAttribute("text");

    return xmp_xml_doc;

}
}
// Use like so:
Node metadatanode = metadata.getAsTree(metadataname);

String xmp_xml = XMPExample.getXMP((Element) metadatanode);

// xmp_xml is now an xml document STRING
System.out.print(xmp_xml);

// If you want to parse it as an XML document, use an XML parser.
Document xmp_dom = XMPExample.transformXML(xmp_xml);

// ...and you can serialize it again when you are done.
String xmp_xml_roundtripped = XMPExample.transformXML(xmp_dom);
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top