Question

Is it possible to use a regex lookbehind expression to match every word that immediately precedes text in square brackets, i.e. the words consectetur and libero in this example?

Lorem ipsum dolor sit amet, consectetur [adipiscing] elit. Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. Cras sed ipsum. Nunc a libero [quis] risus sollicitudin imperdiet.

I want to separate dictionary entries in MS Word from the entry contents. Each entry is followed by a phonetic transcription in square brackets; once the entries are selected this way, I'd increase their font size so that they stand out from the rest of the text and are clearly delimited from the content.

EDIT: The expression that Kent gave works perfectly with one-word entries, e.g.:

boiling ['boilin] adj 1. vreo, uzavreo, kipući 2. razjaren, uzrujan

and with hyphenated two-word entries such as:

boiling-point ['boilin point] s vrelište

but the first word of phrasal verbs and other two-word entries is left out, which means that in entries such as:

bolt out ['bault'aut] vt isključiti; izlanuti

the match is "out" and not "bolt out", which is what I need.

Since this is a dictionary and I can apply the regex to each letter range separately, I could solve this problem with a regular expression that finds the first word starting with a specific letter before the brackets and matches that word together with the word that follows it. For "B" entries, as in my examples, the expression would match single words beginning with the letter B, hyphenated two-word entries such as boiling-point, and, in phrasal verbs such as "bolt out", both "bolt" and the preposition that follows it, i.e. "out" in this case.

There may be only a few, if any, two-word entries in my dictionary where the words in these entries begin with the same letter, and I really can live with such a small margin of error.

EDIT2: I put paragraph breaks before square brackets and now I have my entries at the end of the previous line, like this:

[aidwulf] s zool vrsta hijene (Proteles cristata) Aron's beard

[earanzrod] s bot divizma (Verbascum Thapsus) Abacca

[a'baid'on] vi biti na pomoći, stajati uz bok abide with

Aaron's beard is an entry for the second line beginning with square brackets, Abacca is the entry for the third line beginning with square brackets and so on.

To solve my problem I need two regular expressions. First, I need a regular expression that matches the letter A or a at the start of every word that begins with it, but only in words before the last word of each line. In my examples that would match A in Aaron's in the first example and a in abide in the third example. Then I'd replace this letter with an asterisk to get *ron's beard and *bide with

The second regular expression would match the last word of every line (including hyphenated two-word compounds) as well as the words beginning with the asterisk I created in the previous step.

Thank you for the help.


Solution

You need a lookahead, not a lookbehind:

\w+(?=\s*\[[^\]])

Test with grep:

kent$  echo "Lorem ipsum dolor sit amet, consectetur [adipiscing] elit. Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. Cras sed ipsum. Nunc a libero [quis] risus sollicitudin imperdiet."|grep -Po '\w+(?=\s*\[[^\]])'
consectetur
libero
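
For anyone checking this outside the shell, here is a minimal sketch of the same lookahead using Python's re module. Python is only a stand-in here, chosen on the assumption that its regex syntax is close enough to grep -P for this pattern; the variable names are made up for illustration, and MS Word's own search syntax may differ.

```python
import re

text = (
    "Lorem ipsum dolor sit amet, consectetur [adipiscing] elit. "
    "Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. "
    "Cras sed ipsum. Nunc a libero [quis] risus sollicitudin imperdiet."
)

# \w+              the word to capture
# (?=\s*\[[^\]])   lookahead: optional whitespace, an opening bracket,
#                  then at least one character that is not a closing bracket
pattern = re.compile(r"\w+(?=\s*\[[^\]])")

# The lookahead consumes nothing, so only the word itself is returned.
words_before_brackets = pattern.findall(text)
print(words_before_brackets)  # ['consectetur', 'libero']
```

The bracketed text is never part of the match, which is why the words can be selected or restyled without touching the transcription.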

EDIT

Try this regex:

[bB].+?(?=\s*\[[^]])

Again, test with grep:

kent$  cat file
boiling ['boilin] adj 1. vreo, uzavreo, kipući 2. razjaren, uzrujan
with hyphenated two-word entries such as:
boiling-point ['boilin point] s vrelište
but the first word of phrasal verbs and other two-word entries is left out, which means that in the entries such as:
bolt out ['bault'aut] vt isključiti; izlanuti

kent$  grep -oP '[bB].+?(?=\s*\[[^]])' file
boiling
boiling-point
bolt out
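
The same check can be sketched in Python, again only as an assumed stand-in for grep -P. The list holds the three dictionary lines from the question; entries, headword and headwords are hypothetical names, and the closing bracket in the character class is escaped, which should be equivalent here.

```python
import re

entries = [
    "boiling ['boilin] adj 1. vreo, uzavreo, kipući 2. razjaren, uzrujan",
    "boiling-point ['boilin point] s vrelište",
    "bolt out ['bault'aut] vt isključiti; izlanuti",
]

# [bB]             the first b or B found in the line
# .+?              as few characters as possible ...
# (?=\s*\[[^\]])   ... up to optional whitespace and an opening bracket
headword = re.compile(r"[bB].+?(?=\s*\[[^\]])")

headwords = []
for line in entries:
    match = headword.search(line)  # first match per line only
    if match:
        headwords.append(match.group(0))

print(headwords)  # ['boiling', 'boiling-point', 'bolt out']
```

Note that the pattern starts at the first b anywhere in the line, even in the middle of a word. Prefixing it with \b is one possible way to tighten it, though that is an untested assumption about the rest of the dictionary.
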
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow