Question

I'm trying to use regex in Python to match acronyms separated by periods. I have the following code:

import re
test_string = "U.S.A."
pattern = r'([A-Z]\.)+'
print(re.findall(pattern, test_string))

The result of this is:

['A.']

I'm confused as to why this is the result. I know + is greedy, but why are the first occurrences of [A-Z]\. ignored?


Solution 2

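Parentheses (...) in a regex create a capturing group, and when a pattern contains a group, re.findall returns what the group captured rather than the whole match. Because the group is repeated by +, it keeps only its last repetition, which is A. here. I suggest switching to a non-capturing group: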

pattern = r'(?:[A-Z]\.)+'
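To see the difference side by side, here is a minimal sketch that runs both patterns on the string from the question. The variable names are only illustrative, and the finditer line is an alternative not mentioned in the answer, shown for the case where you want to keep the capturing group:

```python
import re

test_string = "U.S.A."

# Capturing group: findall returns what the group captured,
# and a repeated group only remembers its final repetition.
capturing = re.findall(r'([A-Z]\.)+', test_string)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'(?:[A-Z]\.)+', test_string)

# Alternative: keep the original pattern and read the whole match via group(0).
whole_matches = [m.group(0) for m in re.finditer(r'([A-Z]\.)+', test_string)]

print(capturing)      # ['A.']
print(non_capturing)  # ['U.S.A.']
print(whole_matches)  # ['U.S.A.']
```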

Other suggestions

Description

This regex will:

  • capture acronyms like U.S.A. anywhere in a sentence
  • avoid matching uppercase words at the end of a sentence (such as RADAR.)

(?:(?<=\.|\s)[A-Z]\.)+


Example

Live Example: http://www.rubular.com/r/9bslFxvfzQ

Sample Text

This is the U.S.A. we have RADAR.

Matches

U.S.A.
Licensed under: CC-BY-SA with attribution