Question

I wish to read in a text, use regex to find all instances of a pattern, then print the matching strings. If I use the re.search() method, I can successfully grab and print the first instance of the desired pattern:

import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

match = re.search(r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)', text)
print(match.group())

Unfortunately, the re.search() method only finds the first instance of the desired pattern, so I substituted re.findall():

import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

match = re.findall(r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)', text)
print(match)

This routine finds both instances of the target pattern in the sample text, but I can't find a way to print the sentences in which the patterns occur. Because the pattern contains several capturing groups, the print call in this latter bit of code yields a list of tuples: [('Cello', ' with', 'Lillian'), ('Cello', ' yellow', 'Lillian')], instead of the output I desire: "Cello is a yellow parakeet who sings with Lillian. Cello is a yellow Lillian."

Is there a way to modify the second bit of code so as to obtain this desired output? I would be most grateful for any advice anyone can lend on this question.

Solution 2

I would just make a big capturing group around the two endpoints:

import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

for match in re.findall(r'(Cello(?:\W{1,80}\w{1,60}){0,9}\W{0,20}Lillian)', text, flags=re.I):
    print(match)

Now, you get the two sentences:

Cello is a yellow parakeet who sings with Lillian
Cello is a yellow Lillian

Some tips:

  • flags=re.I makes the regex case-insensitive, so Cello matches both cello and Cello.
  • (?:foo) groups just like (foo), except that it does not capture, so its text won't appear in the results returned by findall. It's useful for grouping things without capturing them.
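
If you would rather leave the question's pattern untouched, one alternative is re.finditer(), which yields match objects whose group(0) is the whole matched text regardless of how many capturing groups the pattern has. A minimal sketch (the variable names group_tuples and full_matches are only illustrative):

```python
import re

text = ("Cello is a yellow parakeet who sings with Lillian. "
        "Toby is a clown who doesn't sing. Willy is a Wonka. "
        "Cello is a yellow Lillian.")

# The pattern from the question, with its three capturing groups unchanged.
pattern = re.compile(
    r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)')

# findall returns one tuple of group contents per match ...
group_tuples = pattern.findall(text)

# ... while finditer yields match objects; group(0) is the full match.
full_matches = [m.group(0) for m in pattern.finditer(text)]

print(group_tuples)
for sentence in full_matches:
    print(sentence)
```

The repeated group (the second one) only keeps its last repetition, which is why the tuples show ' with' and ' yellow' rather than the whole middle of the sentence.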

OTHER TIPS

Description

Use lookaheads, as in the regex below, to capture complete sentences that contain both Cello and Lillian, in either order.

(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)).*?\.(?=\s|$))

The expression breaks down into these functional components:

  • (?:(?<=\.)\s+|^) start matching a sentence either after a . followed by one or more whitespace characters, or at the start of the string
  • ( start capture group 1, which will capture the entire sentence
  • (?= start the lookahead
    • (?:(?!\.(?:\s|$)).)*? consume characters lazily, but never step past a sentence boundary, i.e. a . followed by either whitespace or the end of the string
    • \b match a word boundary
    • [Cc]ello match the desired text, either all lower case or with a capital initial
    • (?=\s|\.|$) look ahead to ensure the word is followed by whitespace, a ., or the end of the string
    • ) end of the lookahead
  • (?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)) this does essentially the same, but for Lillian
  • .*?\.(?=\s|$) capture the rest of the sentence up to and including the period, and make sure the period is followed by either whitespace or the end of the string
  • ) end of the sentence capture group 1
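
To see how the sentence-boundary guard behaves on its own, here is a small Python sketch. The sample string and the name within_sentence are made up for illustration, and it uses the greedy form of the component rather than the lazy *? from the full expression, so that it runs as far as it is allowed to.

```python
import re

# Consume one character at a time, but never a "." that is followed by
# whitespace or the end of the string (treated here as a sentence boundary).
within_sentence = r'(?:(?!\.(?:\s|$)).)*'

text = "Cello costs 3.50 today. Lillian paid."

# The "." inside 3.50 is followed by a digit, so it is consumed;
# the "." after "today" is followed by a space, so matching stops there.
first = re.match(within_sentence, text).group(0)
print(first)
```

Note that this guard treats any period followed by whitespace as the end of a sentence, so abbreviations such as "Dr. Lillian" would split a sentence early.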

Code example

I don't know Python well enough, so I offer a PHP example. Note that in the match statement I'm using the s option, which allows the . expression to match newline characters.

Input text

Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.
Cello likes Lillian and kittens.
Lillian likes Cello and dogs.  Cello has no friends. And Lillian also hasn't met anyone.

Code

<?php
$sourcestring="your source string";
preg_match_all('/(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)).*?\.(?=\s|$))/s',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => Cello is a yellow parakeet who sings with Lillian.
            [1] =>  Cello is a yellow Lillian.
            [2] => 
Cello likes Lillian and kittens.
            [3] => 
Lillian likes Cello and dogs.
        )

    [1] => Array
        (
            [0] => Cello is a yellow parakeet who sings with Lillian.
            [1] => Cello is a yellow Lillian.
            [2] => Cello likes Lillian and kittens.
            [3] => Lillian likes Cello and dogs.
        )

)
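
Since the question is about Python, the PHP call above might be ported roughly as follows. This is only a sketch: the name SENTENCE_WITH_BOTH is made up, re.S plays the role of the PHP s option, and re.findall returns just capture group 1 because the pattern contains exactly one capturing group.

```python
import re

# Same expression as above, split across lines for readability.
SENTENCE_WITH_BOTH = re.compile(
    r'(?:(?<=\.)\s+|^)'                                   # sentence start
    r'('                                                  # capture group 1
    r'(?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))'     # contains Cello
    r'(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))'   # contains Lillian
    r'.*?\.(?=\s|$)'                                      # rest of sentence
    r')',
    flags=re.S,  # let "." match newlines, like the PHP /s option
)

text = (
    "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who "
    "doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.\n"
    "Cello likes Lillian and kittens.\n"
    "Lillian likes Cello and dogs.  Cello has no friends. "
    "And Lillian also hasn't met anyone."
)

sentences = SENTENCE_WITH_BOTH.findall(text)
for sentence in sentences:
    print(sentence)
```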

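If you absolutely need to match only sentences where the string Cello appears before Lillian, then use an expression like this. Here I've simply moved a single closing parenthesis, so the Lillian lookahead is nested inside the Cello lookahead and only starts searching after Cello has been found.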

(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$)(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))).*?\.(?=\s|$))

Input text

Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.
Cello likes Lillian and kittens.
Lillian likes Cello and dogs.  Cello has no friends. And Lillian also hasn't met anyone.

Output for capture group 1

[1] => Array
    (
        [0] => Cello is a yellow parakeet who sings with Lillian.
        [1] => Cello is a yellow Lillian.
        [2] => Cello likes Lillian and kittens.
    )
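
A rough Python equivalent of the ordered variant, under the same assumptions as before (the name CELLO_BEFORE_LILLIAN is made up, and re.S stands in for the PHP s option):

```python
import re

# The Lillian lookahead now sits inside the Cello lookahead, so it is
# evaluated from the position just after "Cello".
CELLO_BEFORE_LILLIAN = re.compile(
    r'(?:(?<=\.)\s+|^)'                                   # sentence start
    r'('                                                  # capture group 1
    r'(?='
    r'(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$)'         # find Cello first
    r'(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))'   # then Lillian
    r')'
    r'.*?\.(?=\s|$)'                                      # rest of sentence
    r')',
    flags=re.S,
)

text = (
    "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who "
    "doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.\n"
    "Cello likes Lillian and kittens.\n"
    "Lillian likes Cello and dogs.  Cello has no friends. "
    "And Lillian also hasn't met anyone."
)

ordered = CELLO_BEFORE_LILLIAN.findall(text)
for sentence in ordered:
    print(sentence)
```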
Licensed under: CC-BY-SA with attribution