Question

I want to replace a phonetic symbol between phonetic transcription slashes like this:

/anycharacter*ou*anycharacter/

to

/anycharacter*au*anycharacter/

I mean I want to replace "ou" by "au" between any two phonetic slashes in all cases. For example:

<font size=+2 color=#E66C2C> jocose /dʒə'kous/</font>
    =  suj vour ver / suwj dduaf 

into

<font size=+2 color=#E66C2C> jocose /dʒə'kaus/</font>
    =  suj vour ver / suwj dduaf  
  • The text file contains HTML code and some text forward slashes (like A/B instead of A or B)
  • The string "anycharacter" can be any characters, one or more or no character. For example: /folou/, /houl/, /sou/, /dʒə'kousnis/...

So far, I have been using:

Find: \/(.*?)\bou*\b(.*?)\/\s
Replace: /\1au\2\3\4/ 

but it finds all the strings between any /.../, including the normal forward slashes and HTML slashes, and when replacing it skips items such as /gou/, /tou/, etc. With the above example, the output is:

<font size=+2 color=#E66C2C> jocose /dʒə'kaus/</font>
    =  suj vaur ver / suwj dduaf 

Note: replacing "vour" (before the normal slash) with "vaur" is not what I want.

Could you please guide me how to solve the above problem? Thanks a lot.


Solution

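The simplest match expression that might satisfy your needs is shown below. Note that it is not strictly POSIX ERE compliant, because the reluctant quantifier *? is a Perl-style extension: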

(/[^ \t/<>]*?)ou([^ \t/<>]*?/)

broken down, this means:

(             # Capture the following into back-reference #1
  /           #   match a literal '/'
  [^ \t/<>]   #   match any character that is not a space, tab, slash, or angle bracket...
    *?        #     ...any number of times (even zero times), being reluctant
)             # end capture
ou            # match the letters 'ou'
(             # Capture the following into back-reference #2
  [^ \t/<>]   #   match any character that is not a space, tab, slash, or angle bracket...
    *?        #     ...any number of times (even zero times), being reluctant
  /           #   match a literal '/'
)             # end capture

Then use the replace expression \1au\2

This will ignore text between / characters if there is a space, tab, angle bracket (< or >) or another forward slash (/) in between them. If there are other characters you know will not occur in one of these expressions, add them to the character classes (the [] groups).
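
As a quick check outside an editor, the pattern and replacement above can be tried in Python, whose re module accepts the same syntax (the reluctant *? and \t inside a character class). This is only a sketch; the names PATTERN and sample are illustrative and not part of the original answer.

```python
import re

# The answer's pattern: group 1 is the opening slash plus any leading
# characters, group 2 is any trailing characters plus the closing slash.
PATTERN = re.compile(r"(/[^ \t/<>]*?)ou([^ \t/<>]*?/)")

# Sample text taken from the answer (names here are illustrative).
sample = (
    "<font size=+2 color=#E66C2C> jocose /dʒə'kous/</font>\n"
    "    =  suj vour ver / suwj dduaf.\n"
    "Either A/B or B/C might happen, but <b>at any time</b> C/D might also occur\n"
)

# \1au\2 keeps both captured parts and swaps only the "ou" between them.
result = PATTERN.sub(r"\1au\2", sample)
print(result)
```

Only the transcription should change here: "vour" sits before a slash that is followed by a space, so no match can start there.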

In my emulator, it turns this text:

<font size=+2 color=#E66C2C> jocose /dʒə'kous/</font>
    =  suj vour ver / suwj dduaf. 
Either A/B or B/C might happen, but <b>at any time</b> C/D might also occur

...into this text:

<font size=+2 color=#E66C2C> jocose /dʒə'kaus/</font>
    =  suj vour ver / suwj dduaf. 
Either A/B or B/C might happen, but <b>at any time</b> C/D might also occur

Just ask if there is something that you don't understand! If you would like, I can also explain a few problems with the one you were trying to use before.

EDIT:

The above expression matches the entire phonetic transcription set, and replaces it entirely, using certain parts of the match and replacing others. The next attempt at a match will begin after the current match.

For this reason, if ou might occur more than once in a /-delimited phonetic expression, the above regex will need to be run multiple times. For a once-through execution, a language or tool needs to support both variable-length look-ahead and look-behind (collectively, look-around).
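
If the text can be processed by a script, "run multiple times" can be automated by repeating the substitution until a pass changes nothing. A minimal sketch in Python follows; the function name replace_until_stable and the transcription /foulou/ are made up for illustration.

```python
import re

# Same capture-group pattern as in the answer.
PATTERN = re.compile(r"(/[^ \t/<>]*?)ou([^ \t/<>]*?/)")


def replace_until_stable(text):
    """Apply the substitution repeatedly until a pass makes no replacement."""
    while True:
        # subn returns the new string and the number of replacements made.
        new_text, count = PATTERN.subn(r"\1au\2", text)
        if count == 0:
            return new_text
        text = new_text


# A single pass fixes only one "ou" per transcription...
single_pass = PATTERN.sub(r"\1au\2", "/foulou/")
# ...while looping catches the rest.
all_passes = replace_until_stable("/foulou/")
print(single_pass, all_passes)
```

The loop ends because every pass removes at least one "ou" and the replacement "au" cannot create a new one.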

As far as I know, only Microsoft's .NET Regex and the JGSoft "flavor" of regex (in tools such as EditPad Pro and RegexBuddy) support this. POSIX (which UNIX grep requires) does not support any kind of look-around, and Python (which I THINK TextWrangler uses) does not support variable-length look-behind. I believe a single-pass regex find-and-replace would not be possible without variable-length look-around.

An expression that requires variable-length look-around and does what you need could be like this:

(?<=/[^ \t/<>]*?)ou(?=[^ \t/<>]*?/)

...and the replacement expression will need to be modified as well, since you are matching (and thus replacing) only the characters which are to be replaced:

au

It works much the same except that it only matches the ou, then runs a check (called a zero-width assertion) to make sure that it is immediately preceded by a / and any number of certain characters, and immediately followed by any number of certain characters then a /.
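
Where a script is available rather than only an editor's find/replace dialog, a different technique can give the same once-through result without look-behind: match each whole transcription, then replace every ou inside it with a callback. This is an alternative to the look-around expression, not a translation of it. The sketch below uses Python, and the names TRANSCRIPTION and fix_transcriptions are hypothetical.

```python
import re

# One whole slash-delimited run, using the answer's character class:
# no spaces, tabs, slashes or angle brackets between the two slashes.
TRANSCRIPTION = re.compile(r"/[^ \t/<>]*/")


def fix_transcriptions(text):
    """Replace every "ou" with "au", but only inside matched transcriptions."""
    # The callback receives each match and returns its replacement text.
    return TRANSCRIPTION.sub(lambda m: m.group(0).replace("ou", "au"), text)


print(fix_transcriptions("jocose /dʒə'kous/</font>"))
print(fix_transcriptions("/foulou/"))  # made-up transcription with two "ou"
```

Text outside a matched /.../ run is never passed to the callback, so words such as "vour" next to an ordinary slash are left alone.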

Licensed under: CC-BY-SA with attribution