HTML Lexer in Java

https://stackoverflow.com/questions/4394728

10-10-2019

Question

I am trying to write a simple lexer to understand how lexers work. I am looking for a good regular expression (built from POSIX character classes) that matches opening HTML tags of any type. I wrote one that almost works, but it fails on more complex tags such as meta tags. So far this is what I have:

"<\\p{Alnum}+(\\p{Space}\\p{Alnum}+\\p{Space}*=\"*\\p{Space}*\\p{Alnum}+\"*)*\\p{Space}*>"

This pattern catches a lot of tags but misses some, like meta tags and DOC tags. Here is a tag that it failed on:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Any help would be much appreciated. I know this might not be the best way to make a lexer, but this is just to help me understand how regular expressions work.

Solution

Anything except quotes

For the value of an attribute the correct way to scan is to match anything that's not a quote. The regex for just that portion would look like:

    \"[^\"]*\"

I am not sure why you have \"*; a quote cannot be repeated, since a value has exactly one opening and one closing quote. There are other things to handle, such as allowing whitespace everywhere it is permitted and accepting single quotes in addition to double quotes (name='value' is an alternative to name="value"). But there's a bigger issue, so I won't nitpick.
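As a minimal sketch of the "anything except quotes" idea, the snippet below pulls the quoted values out of the meta tag from the question. The class name AttributeValueDemo, the method quotedValues, and the single-quote alternative are illustrative assumptions, not part of the original pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeValueDemo {
    // A double-quoted run containing no double quote, or the
    // single-quoted equivalent (an assumed extension).
    static final Pattern QUOTED_VALUE = Pattern.compile("\"[^\"]*\"|'[^']*'");

    // Returns every quoted value found in the input, quotes included.
    static List<String> quotedValues(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED_VALUE.matcher(input);
        while (m.find()) {
            values.add(m.group());
        }
        return values;
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        // Prints "Content-Type" and "text/html; charset=utf-8", one per line, quotes included.
        for (String value : quotedValues(tag)) {
            System.out.println(value);
        }
    }
}
```

Because the character class excludes the closing quote, the slash, semicolon, space, and equals sign inside the second value need no special handling.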

Overreaching lexer

A more important concern is that you are cramming too much parsing into your lexer. A lexer's job is to turn a stream of characters into a stream of tokens. Tokens are the small, indivisible units in a text. I would not try to match an entire opening tag (element name, attributes, and all) as a single token.

Instead, you should pry out the smaller pieces of a tag: open angle bracket, identifier, identifier, equal sign, string, close angle bracket. Have the lexer recognize those pieces and leave it to the parser to figure out that those tokens in that order constitute an element tag.
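A rough sketch of such a token-level lexer might look like the following. The class name TagLexer, the token kind names, and the identifier rule (a letter followed by letters, digits, or hyphens) are assumptions made for illustration. The sketch only understands the inside of a tag; text between tags would need a separate lexer mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagLexer {
    // One named group per token kind; whitespace is matched but not emitted.
    static final Pattern TOKEN = Pattern.compile(
        "(?<OPEN><)"
        + "|(?<CLOSE>>)"
        + "|(?<SLASH>/)"
        + "|(?<EQUALS>=)"
        + "|(?<STRING>\"[^\"]*\"|'[^']*')"
        + "|(?<IDENT>[A-Za-z][A-Za-z0-9-]*)"
        + "|(?<SPACE>\\s+)");

    // Token kinds that are reported to the caller (SPACE is skipped).
    static final String[] KINDS = {"OPEN", "CLOSE", "SLASH", "EQUALS", "STRING", "IDENT"};

    // Splits the input into tokens rendered as KIND(text).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            // lookingAt() anchors the match at the current position.
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            for (String kind : KINDS) {
                if (m.group(kind) != null) {
                    tokens.add(kind + "(" + m.group(kind) + ")");
                    break;
                }
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        for (String token : tokenize(tag)) {
            System.out.println(token);
        }
    }
}
```

For the meta tag this yields OPEN, IDENT(meta), IDENT(http-equiv), EQUALS, a STRING, IDENT(content), EQUALS, a second STRING, and CLOSE. Deciding that this sequence forms an element tag with two attributes is then the parser's job.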

OTHER TIPS

In your pattern "<\\p{Alnum}+(\\p{Space}\\p{Alnum}+\\p{Space}*=\"*\\p{Space}*\\p{Alnum}+\"*)*\\p{Space}*>", it seems you are not handling the hyphen in http-equiv.
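The failure can be reproduced in a few lines. In the sketch below, ORIGINAL is the pattern from the question, while REVISED is a hypothetical variant (not taken from any answer) that adds the hyphen to attribute names and uses the "anything except quotes" rule for values. The class name HyphenDemo is likewise made up.

```java
import java.util.regex.Pattern;

class HyphenDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
        "<\\p{Alnum}+(\\p{Space}\\p{Alnum}+\\p{Space}*=\"*\\p{Space}*\\p{Alnum}+\"*)*\\p{Space}*>");

    // Assumed revision: hyphens allowed in attribute names,
    // values matched as a quoted run with no quote inside.
    static final Pattern REVISED = Pattern.compile(
        "<\\p{Alnum}+(\\p{Space}+[-\\p{Alnum}]+\\p{Space}*=\\p{Space}*\"[^\"]*\")*\\p{Space}*>");

    public static void main(String[] args) {
        String meta = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        System.out.println(ORIGINAL.matcher(meta).matches()); // false
        System.out.println(REVISED.matcher(meta).matches());  // true
    }
}
```

Fixing the hyphen alone is not enough for this tag, because the value text/html; charset=utf-8 also contains characters that \p{Alnum}+ rejects.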

EDIT: A very crude regular expression could be written as follows:

"</?\\w+((\\s+(\\w|\\w[\\w-]*\\w)(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>"

So for input like this:

<html>
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   </head>
   <body>
     <h4>Test Page</h4>
   </body>
</html>

The output (the matched tags) will be:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <h4>
    </h4>
  </body>
</html>

Take care if you use the above regular expression: processing instructions, CDATA sections, and #text nodes are not taken into account.
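To try the expression yourself, a small harness along these lines should do. The class name CrudeTagMatcher and the method findTags are made up for the example; the pattern is copied unchanged from above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrudeTagMatcher {
    // The crude tag pattern quoted above, copied unchanged.
    static final Pattern TAG = Pattern.compile(
        "</?\\w+((\\s+(\\w|\\w[\\w-]*\\w)(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>");

    // Collects every substring that the pattern recognizes as a tag.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
            "<html>",
            "   <head>",
            "     <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">",
            "   </head>",
            "   <body>",
            "     <h4>Test Page</h4>",
            "   </body>",
            "</html>");
        // Prints the nine tags, one per line; the text "Test Page" is skipped.
        for (String tag : findTags(html)) {
            System.out.println(tag);
        }
    }
}
```

The matches come back without indentation, and anything that is not a start or end tag is silently skipped rather than reported as an error.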

Hope this will help.

Licensed under: CC-BY-SA with attribution
