How to parse malformed HTML using HTML::Parser

Question 1

Okay, I found a way to get the content for this particular problem. The $origtext variable above receives the value of the argspec identifier text, which the documentation defines as:

Text causes the source text (including markup element delimiters) to be passed.

So, basically,

print $origtext;

would give me the source text as output:

<meta name="id" content=""12345.this.is.a.sample:id:required.67890"@abc.com">

I can then run a regex over the value held in $origtext to extract the part I need.
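For illustration, here is a minimal sketch of such a regex. It assumes the tag text has the shape shown above, with content as the last attribute, so that the value runs up to the final double quote before the closing >. The helper name extract_content is hypothetical and not part of HTML::Parser.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the raw content value out of the tag text that
# the handler received as $origtext. Assumes content is the last attribute
# in the tag, so the value ends at the final double quote before '>'.
sub extract_content {
    my ($origtext) = @_;
    # The greedy (.*) runs to the LAST '"' that is followed only by optional
    # whitespace, an optional '/', and the closing '>'.
    return $origtext =~ /\bcontent\s*=\s*"(.*)"\s*\/?>\s*\z/s ? $1 : undef;
}

my $origtext = '<meta name="id" content=""12345.this.is.a.sample:id:required.67890"@abc.com">';
my $content  = extract_content($origtext);
print "$content\n" if defined $content;
```

For the sample tag this yields the value with its inner quotes intact; if other attributes can follow content, the pattern would need to be tightened.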

Question 2

I think a quote from Charles Babbage is appropriate here:

On two occasions I have been asked, — “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” […] I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

– Passages from the Life of a Philosopher (1864)

This is known as the Garbage-In, Garbage-Out (GIGO) principle. In your case, the input is malformed HTML: the unescaped double quotes inside the content attribute end the value early. If you feed that to an HTML parser, you should expect bogus output. The HTML standard is already quite lax about many common errors, but your example is broken beyond what that leniency covers.

There is, of course, one solution: don't treat your input as HTML, but as some derived format in which your example happens to be legal input. You'd have to write a custom parser or adapt an existing HTML parser to your needs, but that would work.
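As a rough sketch of what such a derived format could look like, the snippet below parses a single tag under one assumed rule: a double-quoted attribute value ends only at a quote that is followed by the next name=" pair or by the end of the tag. Both the rule and the helper name parse_lenient_tag are assumptions made for illustration; a value that itself contains something like foo=" would still be split incorrectly.

```perl
use strict;
use warnings;

# Hypothetical lenient parser for ONE tag of the assumed derived format.
# Returns the lower-cased tag name and a hash reference of attributes,
# or an empty list if the text does not look like a tag.
sub parse_lenient_tag {
    my ($tag) = @_;

    # Split into the tag name and the raw attribute text.
    my ($name, $rest) = $tag =~ /\A<\s*([A-Za-z][\w:-]*)(.*?)\/?>\s*\z/s
        or return;

    my %attr;
    # A value ends at the first '"' that is followed either by another
    # name=" pair or by the end of the attribute text.
    while ($rest =~ /\G\s*([A-Za-z_:][\w:.-]*)\s*=\s*"(.*?)"(?=\s+[A-Za-z_:][\w:.-]*\s*=\s*"|\s*\z)/gcs) {
        $attr{lc $1} = $2;
    }
    return (lc $name, \%attr);
}

my ($tag, $attr) = parse_lenient_tag(
    '<meta name="id" content=""12345.this.is.a.sample:id:required.67890"@abc.com">'
);
printf "%s: %s\n", $tag, $attr->{content};
```

This only handles double-quoted attributes on a single tag; anything more general quickly turns into the custom parser mentioned above.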

However, I think that fixing the source of the input would be easier than writing your own parser. All that is needed is for the double quotes inside the attribute value to be escaped, or for the attribute value to be delimited with single quotes:

<meta name="id" content="&quot;12345.this.is.a.sample:id:required.67890&quot;@abc.com">
<meta name="id" content='"12345.this.is.a.sample:id:required.67890"@abc.com'>
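If whatever generates the page is under your control, the escaping can be done when the tag is written. The following is a minimal sketch, assuming the generator is Perl; escape_attr is a hypothetical helper written out here to keep the example self-contained (the HTML::Entities module on CPAN provides encode_entities for the same job).

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are unsafe inside a
# double-quoted HTML attribute value.
sub escape_attr {
    my ($value) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $value =~ s/([&<>"])/$entity{$1}/g;
    return $value;
}

my $id = '"12345.this.is.a.sample:id:required.67890"@abc.com';
printf qq{<meta name="id" content="%s">\n}, escape_attr($id);
```

The printed tag matches the first corrected form above, and any HTML parser will decode &quot; back to a double quote when it reads the attribute.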