I am looking at regexes to validate and parse well-known text, which is a format used to transfer spatial data and looks like:

POLYGON((51.124 -3.973, 51.1 -3.012, ....))

or

MULTIPOLYGON(((POLYGON((51.124 -3.973, 51.1 -3.012, ....)),POLYGON((50.14 -13.973, 51.1 -13.012, ....))

among other variations.

There is a good answer here: Parsing a WKT-file which uses the regex:

\d+(?:\.\d*)?

From other places I have also seen

\d*\.\d+|\d+

and

(\d*\.)?\d+

These all seem to do the same thing, but it got me wondering about the relative workings of these 3 regexes, and if there are any performance issues or subtleties under the hood to be aware of.

To be clear, I am aware that there are libraries for parsing WKT in various languages. My question is purely about the relative behavior of number extracting regexes.

有帮助吗?

解决方案

It depends what number formats you need to allow, example:

 format 1:   22
 format 2:   22.2
 format 3:   .2
 format 4:   2.
  • the 1st pattern \d+(?:\.\d*)? matches 1,2,4
  • the 2nd pattern \d*\.\d+|\d+ matches 1,2,3
  • the 3rd pattern (\d*\.)?\d+ matches 1,2,3 (and have an uneeded capturing group)

Note: pattern 2 and 3 are slower to succeed than the first if the number is an integer, because they must match all digits until the dot, backtrack to the start and retry the same digits one more time. (see the schema below)

str  |  pattern       |  state
-----+----------------+-----------------------------
123  |  \d*\.\d+|\d+  |  START
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  go to the next alternative
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK => SUCCESS

if you want to match the four cases, you can use:

\.\d+|\d+(?:\.\d*)?

(+) if the number doesn't begin with a dot, the first alternative fails immediatly and the second alternative will match all other cases. The backtracking is limited to the minimum.
(-) if you have few numbers that start with a dot the first alternative will be tested and will fail each times. However, the first alternative fails quickly.(in other words, for the same reason). In this case, it is better to use \d+(?:\.\d*)?|\.\d+

Obviously, if you want to support negative values you need to add -?:

-?(?:\.\d+|\d+(?:\.\d*)?)
许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top