Question

I have a file like this:

(a) a. lo mfana
(20) Juan - il- -ech (lik) !
EB1: Incwadi [esi-yi-funda-yo: isitshudeni] in-de
Papa-wu-rna parlapiya-nyi paja-lkura.
b) kupi-Nku Nia taca-mu
i. gaan1 fong2 hak1 maa1 maa1, nei5 dim2 tai2 dou2 syu1 gaa3 i. 
4. a. ngo5 lou5gung1 ci3ci3 faan1lei4. dou1 haak1 saai3 gam2
(n-ngwathel-l-e )
M: sik6-saai3 laa3!   (17)
U3 :      O?i=j se-si-ni           duy-ur-am-yor-du (17)

I want to remove the bullet-like substring at the start of each line. These bullets are either enclosed in round brackets, or start with a maximum of 3 alphanumeric characters and end with a ., ) or :.

The desired output is:

a. lo mfana
Juan - il- -ech (lik) !
Incwadi [esi-yi-funda-yo: isitshudeni] in-de
Papa-wu-rna parlapiya-nyi paja-lkura.
kupi-Nku Nia taca-mu
gaan1 fong2 hak1 maa1 maa1, nei5 dim2 tai2 dou2 syu1 gaa3 i. 
a. ngo5 lou5gung1 ci3ci3 faan1lei4. dou1 haak1 saai3 gam2
(n-ngwathel-l-e )
sik6-saai3 laa3!   (17)
O?i=j se-si-ni           duy-ur-am-yor-du (17)

I've been trying to do this with regexes, but my attempts fail because:

  • Using src = re.sub(r'\([^)]*\)', '', src), I was removing more than the heading (...)

    [in]: (20) Juan - il- -ech (lik) !

    [out]: Juan - il- -ech !

    [need]: Juan - il- -ech (lik) !

  • Using src = re.sub(r'^\([^)]*\)', '', src), I was able to specify the start of line with ^ in the regex but it didn't get the maximum of 3 alphanumeric condition.

    [in]: (n-ngwathel-l-e )

    [out]:

    [need]: (n-ngwathel-l-e )

    [in]: U3 : O?i=j se-si-ni duy-ur-am-yor-du (17)

    [out]:

    [need]: O?i=j se-si-ni duy-ur-am-yor-du (17)

  • Using re.sub(r'^:[^)]*\)', '',src) and re.sub(r'^\.[^)]*\)', '',src), I was not able to make the regex detect [0-9a-zA-z][0-9a-zA-z][0-9a-zA-z] followed by a . or :

    [in]: 4. a. ngo5 lou5gung1 ci3ci3 faan1lei4. dou1 haak1 saai3 gam2

    [out]: 4. a. ngo5 lou5gung1 ci3ci3 faan1lei4. dou1 haak1 saai3 gam2

    [need]: a. ngo5 lou5gung1 ci3ci3 faan1lei4. dou1 haak1 saai3 gam2

    [in]: EB1: Incwadi [esi-yi-funda-yo: isitshudeni] in-de

    [out]: EB1: Incwadi [esi-yi-funda-yo: isitshudeni] in-de

    [need]: Incwadi [esi-yi-funda-yo: isitshudeni] in-de

How should I form a single regex, or a chain of regex substitutions, so that handling one case doesn't break the others?

Solution

^\(?\w{1,3}\s*[):.]\s*
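As a minimal sketch of how the pattern above could be applied in Python, the snippet below runs it over some of the lines from the question. The helper name strip_bullet is hypothetical, and processing the input line by line (rather than the whole file with re.MULTILINE) is an assumption made so that the trailing \s* cannot swallow a newline.

```python
import re

# Pattern from the answer: optional "(", 1-3 word characters, optional
# whitespace, one of ")", ":" or ".", then any trailing whitespace.
BULLET = re.compile(r'^\(?\w{1,3}\s*[):.]\s*')

def strip_bullet(line):
    """Remove one leading bullet-like prefix from a single line (hypothetical helper)."""
    return BULLET.sub('', line, count=1)

# Sample lines copied from the question.
lines = [
    "(a) a. lo mfana",
    "(20) Juan - il- -ech (lik) !",
    "EB1: Incwadi [esi-yi-funda-yo: isitshudeni] in-de",
    "Papa-wu-rna parlapiya-nyi paja-lkura.",
    "b) kupi-Nku Nia taca-mu",
    "4. a. ngo5 lou5gung1 ci3ci3 faan1lei4. dou1 haak1 saai3 gam2",
    "(n-ngwathel-l-e )",
    "U3 :      O?i=j se-si-ni           duy-ur-am-yor-du (17)",
]

cleaned = [strip_bullet(line) for line in lines]
for line in cleaned:
    print(line)
```

Because the pattern is anchored with ^, only the first bullet on a line is removed, so "(a) a. lo mfana" keeps its inner "a.". Note that \w also matches the underscore and, in Python 3, non-ASCII letters; if only ASCII alphanumerics should count, [0-9A-Za-z] could be substituted.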

Description

[Figure: regular expression visualization]

Demo

http://regexr.com?37j1p

Discussion

I'll take each of your regexes and output the problem in each:

1. Using src = re.sub(r'\([^)]*\)', '', src), I was removing more than the heading (...)

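The * quantifier is greedy, but [^)]* can never run past the first ), so greediness is not what removes the extra text. The pattern has no ^ anchor, and re.sub replaces every match in the string, so the later group (lik) is removed along with the heading (20).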
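To check what the first attempt actually matches on the sample line from the question, a small experiment with re.findall can help. This is only an illustrative sketch; the variable names are made up for the example.

```python
import re

src = "(20) Juan - il- -ech (lik) !"

# Without an anchor, the pattern matches every parenthesized group in the line.
groups = re.findall(r'\([^)]*\)', src)
print(groups)

# re.sub replaces all of those matches, not just the first one.
unanchored = re.sub(r'\([^)]*\)', '', src)
print(repr(unanchored))

# Anchoring with ^ restricts the match to a group at the very start of the string.
anchored = re.sub(r'^\([^)]*\)\s*', '', src)
print(repr(anchored))
```

Each match stops at its own closing bracket, so two separate groups are found and both are replaced unless the pattern is anchored.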

2. Using src = re.sub(r'^\([^)]*\)', '', src), I was able to specify the start of line with ^ in the regex but it didn't get the maximum of 3 alphanumeric condition.

The * quantifier means zero or more, and [^)] matches any character except ), not only alphanumerics. If you want at most 3 alphanumeric characters, use \w with the quantifier {1,3}, which means one, two or three times.
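The difference between the two quantifiers can be sketched on the bracketed lines from the question. The names loose and strict are chosen here purely for illustration.

```python
import re

loose = re.compile(r'^\([^)]*\)\s*')     # any run of characters other than ")"
strict = re.compile(r'^\(\w{1,3}\)\s*')  # one to three word characters only

samples = ["(20) Juan - il- -ech (lik) !", "(n-ngwathel-l-e )"]

loose_out = [loose.sub('', s) for s in samples]
strict_out = [strict.sub('', s) for s in samples]

print(loose_out)
print(strict_out)
```

With {1,3}, the long bracketed word no longer qualifies as a bullet and the line is left untouched, while the short heading (20) is still removed.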

3. Using re.sub(r'^:[^)]*\)', '',src) and re.sub(r'^\.[^)]*\)', '',src), I was not able to make the regex detect [0-9a-zA-z][0-9a-zA-z][0-9a-zA-z] followed by a . or :

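Here both patterns require the line to begin with : or . and then to contain a closing ). Nothing in them matches the one to three alphanumeric characters that come before the . or :, so lines such as 4. a. ... and EB1: ... are left unchanged.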
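A short comparison, again only as a sketch, of the third attempt against a pattern that puts the alphanumeric part in front of the delimiter. The names attempt and prefix are illustrative.

```python
import re

attempt = re.compile(r'^:[^)]*\)')          # from the question: the line must start with ":"
prefix = re.compile(r'^\w{1,3}\s*[.:]\s*')  # 1-3 word characters, then "." or ":"

samples = [
    "4. a. ngo5 lou5gung1 ci3ci3 faan1lei4. dou1 haak1 saai3 gam2",
    "EB1: Incwadi [esi-yi-funda-yo: isitshudeni] in-de",
    "Papa-wu-rna parlapiya-nyi paja-lkura.",
]

attempt_out = [attempt.sub('', s) for s in samples]
prefix_out = [prefix.sub('', s) for s in samples]

for line in prefix_out:
    print(line)
```

The last sample shows that a line starting with a longer word is not mistaken for a bullet. As a side note, the character class [0-9a-zA-z] in the question spans A to z and therefore also accepts characters such as [ and _; [0-9a-zA-Z] or \w is probably what was intended.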

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow