Question

The following small program should take a formatted XML file and print it to another file with no newlines or tabs. However, I can't figure out why the resulting file always contains tabs and newlines instead of a single line of XML.

When I print to the console, the newlines and tabs are removed, but the file always contains them.

open FH, ">tst.out";
MakeSourceFile($ARGV[0]);
close FH;

sub MakeSourceFile
{
    my $sourceFile  = shift;

    eval { require XML::Parser; import XML::Parser; };
    return if $@;

    my $parser = new XML::Parser();
    $parser->setHandlers(
        Start   => \&start,
        End     => \&end,
        Char    => \&data
    );
    $parser->parsefile($sourceFile);
}

sub start
{
    my ($parseinst, $element, %attrs) = @_;
    print FH "<$element";
    my $attrStr = "";
    map { $attrStr .= " $_=\"$attrs{$_}\""; } keys %attrs;
    print FH "$attrStr>";
}

sub data
{
    my ($parseinst, $data) = @_;
    print FH $data;
}

sub end
{
    my ($parseinst, $element, %attrs) = @_;
    print FH "</$element>";
}

input file (test.xml):

<stuff>
    <Profile id="a"></Profile>
    <Profile id="b"></Profile>
    <Profile id="theprofile" extends="a"></Profile>
    <Group>
        <Group>
            <elem stuff="st">stuff here</elem>
        </Group>
    </Group>
</stuff>

output file (tst.out):

<stuff>
    <Profile id="a"></Profile>
    <Profile id="b"></Profile>
    <Profile id="theprofile" extends="a"></Profile>
    <Group>
        <Group>
            <elem stuff="st">stuff here</elem>
        </Group>
    </Group>
</stuff>

expected file output (tst.out):

<stuff><Profile id="a"></Profile><Profile id="b"></Profile><Profile id="theprofile" extends="a"></Profile><Group><Group><elem stuff="st">stuff here</elem></Group></Group></stuff>

I wondered whether vi applies some kind of auto-formatting when I open the file, but that isn't the case: when I have Perl write the same output to a file without XML::Parser involved, it is not formatted. What is going on here?

Solution

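Whitespace is character data just the same as any other text content, so XML::Parser passes the newlines and indentation between elements to your Char handler, which prints them.

If you want to remove whitespace-only nodes, then change the print in your data handler to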

print FH $data if $data =~ /\S/;

You may want to go further and remove leading and trailing whitespace from $data.
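
A minimal sketch of that trimming, assuming the same three handler names as in the question. XML::Parser may deliver one run of text to the Char handler in several calls, so trimming each call separately could eat a space in the middle of a sentence. This version buffers the text instead and trims it when the next start or end tag arrives. The `$out` and `$buf` variables and the `flush_text` helper are made up for illustration, and the handlers are called by hand here so that the snippet runs without a parser.

```perl
use strict;
use warnings;

my $out = '';   # collected output; the original script prints to FH instead
my $buf = '';   # text seen since the last start or end tag

# Trim the buffered text, append whatever is left, and reset the buffer.
sub flush_text {
    $buf =~ s/^\s+|\s+$//g;
    $out .= $buf;
    $buf = '';
}

sub start {
    my ($parseinst, $element, @attrs) = @_;
    flush_text();
    $out .= "<$element";
    # Attributes arrive as a flat name/value list; walk it pairwise.
    while (my ($name, $value) = splice(@attrs, 0, 2)) {
        $out .= qq{ $name="$value"};
    }
    $out .= '>';
}

sub data {
    my ($parseinst, $data) = @_;
    $buf .= $data;   # only collect here; trimming happens in flush_text
}

sub end {
    my ($parseinst, $element) = @_;
    flush_text();
    $out .= "</$element>";
}

# Simulate the calls a parser would make, with the text split into two chunks.
start(undef, 'elem', stuff => 'st');
data(undef, "\n    stuff ");
data(undef, "here\n");
end(undef, 'elem');
print "$out\n";
```

Taking the attributes as a list rather than a hash also keeps them in the order they were passed, whereas `keys %attrs` returns them in no particular order.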

OTHER TIPS

It seems (I don't know the XML spec perfectly) that whitespace is treated as data, either by the XML spec or by the library. Printing only the data that contains a non-whitespace character avoids it:

if ($data =~ /\S/) {
    print FH $data;
}

That fixes your specific issue.

XML::Twig will automatically strip extraneous whitespace when parsing and printing an XML file.

use strict;
use warnings;

use XML::Twig;

my $data = do { local $/; <DATA> };

my $t = XML::Twig->new();
$t->parse( $data );
$t->print;

__DATA__
<stuff>
    <Profile id="a"></Profile>
    <Profile id="b"></Profile>
    <Profile id="theprofile" extends="a"></Profile>
    <Group>
        <Group>
            <elem stuff="st">stuff here</elem>
        </Group>
    </Group>
</stuff>

Outputs:

<stuff><Profile id="a"></Profile><Profile id="b"></Profile><Profile extends="a" id="theprofile"></Profile><Group><Group><elem stuff="st">stuff here</elem></Group></Group></stuff>

In fact, to get XML::Twig to emit indentation you have to ask for it, by passing pretty_print => 'indented' to the constructor.
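
As a rough illustration of that option, the sketch below parses a shortened, single-line version of the sample document and asks XML::Twig to indent it. It assumes XML::Twig is installed. The variable names are arbitrary, and `sprint` is used only so the result is available as a string before it is printed.

```perl
use strict;
use warnings;
use XML::Twig;

# Shortened, single-line version of the sample document.
my $xml = '<stuff><Group><elem stuff="st">stuff here</elem></Group></stuff>';

# pretty_print => 'indented' asks XML::Twig to add newlines and indentation.
my $t = XML::Twig->new( pretty_print => 'indented' );
$t->parse($xml);

my $pretty = $t->sprint;   # serialized document as a string
print $pretty;
```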

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow