Question

Short question:

I have a string:

title="Announcing Elasticsearch.js For Node.js And The Browser"

I want to find all pairs of words where each word is properly capitalized.

So, expected output should be:

['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']

What I have right now is this:

'[A-Z][a-z]+[\s-][A-Z][a-z.]*'

This gives me the output:

['Announcing Elasticsearch.js', 'For Node.js', 'And The']

How can I change my regex to give desired output?


Solution

You can use this:

#!/usr/bin/python
import re

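title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'

# Note: with the trailing "TEst" the result also contains 'Browser T',
# because [a-z.]* is allowed to match zero characters.
print(re.findall(pattern, title))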

A "normal" pattern can't match overlapping substrings, because each character is consumed by at most one match. A lookahead (?=...) (i.e. "followed by"), however, is only a check and consumes nothing, so the regex engine can test it at every position in the string. Thus, if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
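To make the difference concrete, here is a minimal sketch that runs the same two-word pattern twice on the title from the question, once as a plain consuming pattern and once wrapped in a lookahead with a capturing group. The variable names are only illustrative.

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"

# Plain pattern: every match consumes both words, so the search resumes
# after the second word and every other pair is skipped.
consuming = r'[A-Z][a-z.]*[\s-][A-Z][a-z.]*'

# Same pattern inside a lookahead: the match itself is empty, so the
# engine retries at each position and the group captures overlapping pairs.
overlapping = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'

consumed_pairs = re.findall(consuming, title)
overlapping_pairs = re.findall(overlapping, title)

print(consumed_pairs)     # 3 pairs
print(overlapping_pairs)  # 6 pairs
```

The lookbehind (?<![A-Za-z.]) keeps a pair from starting in the middle of a word.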

OTHER TIPS

There's probably a more efficient way to do this, but you could use a regex like this:

(\b[A-Z][a-z.-]+\b)

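Then iterate through the matches, testing each one and the one before it against this regex: (^[A-Z][a-z.-]+$), to ensure that both words of the pair are capitalized. Note that findall returns only the capitalized words, so neighbors in the list are neighbors in the title only when no lowercase word sits between them.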

Working example:

import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
# Collect every capitalized word, in order of appearance.
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
# Start at 1 so that m[i - 1] never wraps around to the last word.
for i in range(1, len(m)):
    if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
        matchlist.append([m[i - 1], m[i]])

print(matchlist)

Output:

[
    ['Announcing', 'Elasticsearch.js'],
    ['Elasticsearch.js', 'For'],
    ['For', 'Node.js'],
    ['Node.js', 'And'],
    ['And', 'The'],
    ['The', 'Browser']
]
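If the title may contain lowercase words, one possible variant is to split on whitespace and pair each word with its actual neighbor. This is only a sketch: the helper name capitalized_pairs and the second, mixed-case title are made up for illustration, and unlike the [\s-] in the question it treats only whitespace as a separator.

```python
import re

# Same word shape as the regex above; fullmatch replaces the ^...$ anchors.
CAPITALIZED = re.compile(r"[A-Z][a-z.-]+")

def capitalized_pairs(text):
    """Return [left, right] for each pair of adjacent capitalized words."""
    words = text.split()
    return [
        [left, right]
        for left, right in zip(words, words[1:])
        if CAPITALIZED.fullmatch(left) and CAPITALIZED.fullmatch(right)
    ]

title = "Announcing Elasticsearch.js For Node.js And The Browser"
pairs = capitalized_pairs(title)

# Hypothetical title with a lowercase "for": pairs touching it are dropped.
mixed_pairs = capitalized_pairs("Announcing Elasticsearch.js for Node.js And The Browser")

print(pairs)
print(mixed_pairs)
```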

If your Python code at the moment is this

import re

title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)

then findall skips every second pair, because each match consumes both of its words and the next search resumes after them. An easy solution is to search for the pattern again after skipping the first word, like this:

m = re.match(r"[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
# The first word must allow "." here, or "Elasticsearch.js For" and "Node.js And" are missed.
results2 = re.findall(r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*", title_without_first_word)

Now just combine results and results2, interleaving them if you need the pairs in their original order.
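The combining step could look roughly like the sketch below. It assumes every word in the title is capitalized, so the two passes alternate strictly, and it lets the first word of a pair contain "." so that pairs starting with Elasticsearch.js or Node.js are found. The names first_pass, second_pass and merged are illustrative.

```python
import re
from itertools import chain, zip_longest

title = "Announcing Elasticsearch.js For Node.js And The Browser"
pattern = r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*"

# First pass: pairs 1, 3, 5 (each match consumes two words).
first_pass = re.findall(pattern, title)

# Second pass: skip the first word, which shifts the pairing by one word.
first_word = re.match(r"[A-Z][a-z.]+[\s-]", title)
second_pass = re.findall(pattern, title[first_word.end():])

# Interleave the passes; zip_longest pads the shorter list with None.
merged = [
    pair
    for pair in chain.from_iterable(zip_longest(first_pass, second_pass))
    if pair is not None
]

print(merged)
```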

Licensed under: CC-BY-SA with attribution