Question

I have several small place marks such as 'א,א' 'א,ב'. If we use the comma as the center point, i need at most 2 characters before the comma, and up to the next space after the comma.

I have (.-,.-)%s but its not doing what I need. Any idea?

Also as you can see there not latin letters so using %l will not work.

Was it helpful?

Solution

There are couple of issues here. First, a minor issue: .-, will match as little as possible before the coma, that is zero characters. You should anchor the beginning of the matched string.

The more complicated issue is that you use Hebrew letters. The problem is that Lua has no concept of multi-byte characters.

If you use a 8-bit encoding such as Windows-1255, or ISO-8859-8, then you probably can simply match against a character class [ת-א]. If you have properly set Hebrew locale, %l should work fine for you.

If you use UTF-8 or any other encoding that uses multi-byte characters, then you must construct a regex that has all the Hebrew alphabet escaped as a sequence of octets. The aleph is U+05D0x, which in UTF-8 will be represented as 0xD7 0x90. The tav is U+05EA, which will be encoded as 0xD7 0xAA.

In Lua you can escape any 8-bit character with a backslash + decimal code. All the hebrew characters encoded in UTF-8 have the first byte the same -- 0xD7, that is "\215". The second character can be anything from "\144" to "\170". Thus, the regex that will match a single Hebrew letter is: "\215[\144-\170]". Put that in your original regex, where you had single dots that match any character.

Of course, the above reasoning must be modified for encodings different than UTF-8. Right-to-left writing direction in Hebrew is another thing to keep in mind.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top