Question

I am trying to parse an HTML document that contains a meta tag like this:

<meta name="id" content=""12345.this.is.a.sample:id:required.67890"@abc.com">

HTML::Parser returns an empty string ("") for the content attribute instead of the actual value. This is my start event handler:

sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq 'meta') {
        print "Meta found: ", $attr->{name}, $attr->{content}, "\n";
    }
}

Any ideas on how to get the full attribute value?


Solution 2

Okay, I found a way to get the content for this particular problem. The $origtext variable above receives the value of the text argspec identifier, which the documentation defines as:

Text causes the source text (including markup element delimiters) to be passed.

So, basically,

print $origtext;

would give me the source text as output:

<meta name="id" content=""12345.this.is.a.sample:id:required.67890"@abc.com">

I can then apply a regex to $origtext to extract the desired value.
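
For illustration, here is a minimal sketch of such a regex. The helper name extract_content is made up for this example, and the pattern assumes that content is the last attribute in the tag and that the value is wrapped in double quotes. It is a workaround for this specific malformed input, not a general attribute parser.

```perl
use strict;
use warnings;

# The raw tag text, as it would arrive in $origtext.
my $origtext = '<meta name="id" content=""12345.this.is.a.sample:id:required.67890"@abc.com">';

# Hypothetical helper: capture everything between content=" and the last
# double quote before the closing >. The greedy .* deliberately runs to
# the LAST quote, so embedded quotes stay part of the captured value.
sub extract_content {
    my ($text) = @_;
    return $text =~ /\bcontent\s*=\s*"(.*)"\s*\/?>\s*\z/is ? $1 : undef;
}

my $content = extract_content($origtext);
print "$content\n" if defined $content;
```

Because the match is anchored to the end of the tag, it would capture too much if another attribute followed content, so the pattern would need adjusting for differently shaped input.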

OTHER TIPS

I think a quote from Charles Babbage is appropriate here:

On two occasions I have been asked, — “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” […] I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

Passages from the Life of a Philosopher (1864)

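This is known as the Garbage-In, Garbage-Out (GIGO) principle. In your case, you have malformed HTML. If you feed that to an HTML parser, you'll necessarily get bogus output. Here the attribute value opens with a double quote and is closed by the very next one, so the parser is right to report content as an empty string; everything after that second quote is no longer part of the value. The HTML standard is already quite lax in order to deal with all kinds of common errors, but your example is much more broken than that.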

There is, of course, one solution: Don't treat your input as HTML, but as some derived format where your example happens to be legal input. You'd have to write a custom parser of your own or adapt an existing HTML parser to your needs, but that would work.

However, I think that fixing the source of the input would be easier than writing your own parser. All that is needed is for the quotes inside the attribute value to be escaped, or for the attribute to be delimited with single quotes:

<meta name="id" content="&quot;12345.this.is.a.sample:id:required.67890&quot;@abc.com">
<meta name="id" content='"12345.this.is.a.sample:id:required.67890"@abc.com'>
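
If the code that generates the page is under your control, the escaping can be done at output time. The following is only a sketch: escape_attr is a hypothetical helper written for this example, and $id stands in for wherever the real value comes from. In practice encode_entities from HTML::Entities (shipped with the HTML-Parser distribution) would do the same job.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that would otherwise end or
# corrupt a double-quoted attribute value. & must be replaced first so
# the entities produced by the later substitutions are not re-escaped.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/"/&quot;/g;
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    return $value;
}

# Assumed sample value, taken from the question.
my $id  = '"12345.this.is.a.sample:id:required.67890"@abc.com';
my $tag = sprintf '<meta name="id" content="%s">', escape_attr($id);
print "$tag\n";
```

This prints the first of the two corrected tags shown above, which an HTML parser will decode back to the original value including its quotes.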
Licensed under: CC-BY-SA with attribution