grep uppercase words to lowercase while excluding Roman numerals

https://stackoverflow.com/questions/9271156

29-04-2021

Question

I'm trying to write a single regular expression to convert all uppercase words to lowercase while excluding uppercase Roman numerals from being converted.

The only way I found was to convert all uppercased words that are followed by a space, comma, or period, as well as hyphenated words into lowercase. Then convert all Roman numerals back to uppercase.

I used this to convert to lowercase:

(\u+[ ,.-])

Then I had to go through and find and replace all suspected Roman numerals.

What is a better way to do this? I tried negative lookahead expressions with no luck but I'm not very strong at writing them.

The sample that I'm testing this on is the U.S. Constitution. Here's a sample of the input:

WE, the PEOPLE of the UNITED STATES, in order to form a more perfect union, establish justice, ensure domestic tranquility, provide for the common defence, promote the general welfare, and secure the blessings of liberty to ourselves and our posterity, do ordain and establish this Constitution for the United States of America.

ARTICLE I.

Sect. 1. ALL legislative powers, herein granted, shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.

Sect. 2. The House of Representatives shall be composed of Members chosen every second year by all the people of the several States, and the Electors in each State shall have the qualifications requisite for Electors of the most numerous branch of the State Legislature. No person shall be a Representative who shall not have attained to the age of twenty-five years, and been seven years a citizen of the United States, and who shall not, when elected, be an inhabitant of that State in which he shall be chosen.

ARTICLE IV.

ARTICLE V.

ARTICLE VI.

Solution

If the regex flavour supports negative lookaheads, you could try:

\b(?![LXIVCDM]+\b)([A-Z]+)\b

which says "any whole upper-case word that isn't entirely composed of the letters L, X, I, V, C, D, M" (the Roman numeral letters).

It also conveniently stops the word "I" from being converted. (As an aside, if you want to keep every one-letter capital word from being converted, use [A-Z]{2,} in place of [A-Z]+. This leaves a capital "A" at the start of a sentence, as well as "I", in their normal case, which is usually what you want.)

The downside is that it also skips real words that consist entirely of these letters. The only ones I can think of are "DID", and perhaps "DIV" (as in HTML), "DIM" (as in dimension), "MID", "MIDI", "VIC" (as in Victoria?)...
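As a rough illustration, here is how that pattern might be applied in Python, where the lowercasing is done by a replacement function passed to re.sub. The names UPPER_NOT_ROMAN and lower_shouting are made up for this sketch; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: whole uppercase words that are not made up
# solely of the Roman numeral letters.
UPPER_NOT_ROMAN = re.compile(r"\b(?![LXIVCDM]+\b)([A-Z]+)\b")

def lower_shouting(text: str) -> str:
    """Lowercase every word the pattern matches; leave the rest untouched."""
    return UPPER_NOT_ROMAN.sub(lambda m: m.group(1).lower(), text)

sample = "WE, the PEOPLE of the UNITED STATES. ARTICLE IV. ALL legislative powers"
print(lower_shouting(sample))
# → we, the people of the united states. article IV. all legislative powers

# Words built only from numeral letters are skipped, as described above.
print(lower_shouting("DID HE MIX IT"))
# → DID he MIX it
```

The second call shows the trade-off: "DID" is left in capitals even though it is not a numeral.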

You could, however, tighten the Roman numeral regex so that it respects the rules for forming numerals, e.g.

(?=[MDCLXVI])M{0,3}(C[DM]|DC{0,3}|C{1,3})?(X[LC]|LX{0,3}|X{1,3})?(I[XV]|VI{0,3}|I{1,3})?

Explanation:

(?=[MDCLXVI])           # make sure we match at least something
                        # (since everything in this regex is optional)
M{0,3}                  # Can have 0 to 3 Ms, being thousands
(C[DM]|DC{0,3}|C{1,3})? # for the hundreds column can have CD, CM, 
                        # C, CC, CCC, D, DC, DCC, DCCC
(X[LC]|LX{0,3}|X{1,3})? # for the tens column can have XL, XC, 
                        # L, LX, LXX, LXXX, X, XX, XXX
(I[XV]|VI{0,3}|I{1,3})? # for the ones column can have IX, IV,
                        # V, VI, VII, VIII, I, II, III.

I think that covers all standard Roman numerals (1 to 3999, since at most three Ms are allowed).
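To get a feel for what the stricter pattern accepts, it can be tried against whole words. The sketch below assumes Python's re.fullmatch, which anchors the pattern to the entire string; the helper name is_roman is hypothetical.

```python
import re

# The stricter pattern from the answer, unchanged. fullmatch() requires it
# to cover the entire word, so the leading lookahead rules out "".
ROMAN = re.compile(
    r"(?=[MDCLXVI])M{0,3}(C[DM]|DC{0,3}|C{1,3})?"
    r"(X[LC]|LX{0,3}|X{1,3})?(I[XV]|VI{0,3}|I{1,3})?"
)

def is_roman(word: str) -> bool:
    """Return True if the whole word is a well-formed Roman numeral."""
    return ROMAN.fullmatch(word) is not None

print(is_roman("IV"), is_roman("MCMXCIV"), is_roman("IIII"))
# → True True False

# Of the ambiguous words listed earlier, only "DIV" (504) is still a numeral.
words = ["DID", "DIV", "DIM", "MID", "MIDI", "VIC"]
print([w for w in words if is_roman(w)])
# → ['DIV']
```

So with the stricter pattern most of the problem words would be lowercased as intended, while a few genuine collisions such as "DIV" or "MIX" remain.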

If your regex flavour doesn't support negative lookaheads, maybe you could do something like:

\b((ROMAN_NUMERAL_REGEX)|([A-Z]+))\b

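And replace with "$2$3_converted_to_lower_case", i.e. $2 followed by the lowercased $3 (sorry, I don't know how to do the actual conversion itself).

The above would work because the regex only ever matches either the Roman numeral regex (captured in $2) or the other regex (captured in $3), so one of $2 or $3 is always empty. Note that this numbering assumes the groups inside the Roman numeral regex are written as non-capturing groups, (?:...); otherwise $3 would refer to one of its inner groups.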
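A sketch of this alternation idea in Python follows, with two adjustments that are assumptions of this example rather than part of the answer. First, the inner groups of the numeral pattern are non-capturing and the outer wrapper is dropped, so the numeral lands in group 1 and any other uppercase word in group 2. Second, the trailing \b is replaced by (?!\w): because every part of the numeral pattern is optional, it could otherwise succeed on an empty match at the start of a word such as "DID". The names EITHER and fix_case are hypothetical.

```python
import re

# Stricter numeral pattern with non-capturing inner groups.
ROMAN = (
    r"(?=[MDCLXVI])M{0,3}(?:C[DM]|DC{0,3}|C{1,3})?"
    r"(?:X[LC]|LX{0,3}|X{1,3})?(?:I[XV]|VI{0,3}|I{1,3})?"
)

# Group 1: a Roman numeral; group 2: any other all-uppercase word.
# (?!\w) stands in for the trailing \b to rule out an empty numeral match.
EITHER = re.compile(r"\b(?:(" + ROMAN + r")|([A-Z]+))(?!\w)")

def fix_case(text: str) -> str:
    def repl(m: re.Match) -> str:
        roman, other = m.group(1), m.group(2)
        # Keep numerals as they are; lowercase everything else.
        return roman if roman is not None else other.lower()
    return EITHER.sub(repl, text)

print(fix_case("ARTICLE IV. ALL DID MIX"))
# → article IV. all did MIX
```

Because the numeral alternative is tried first, a word is only lowercased when it fails to parse as a numeral, which is why "DID" is converted here but "MIX" is not.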

Licensed under: CC-BY-SA with attribution
