Regex: How to find and extract acronyms and their corresponding definitions from a text?

StackOverflow https://stackoverflow.com/questions/18313419

  •  25-06-2022

Question

I would like to do something like what is suggested in the question "Regular Expression for Acronyms", but on a more general level.

Input example:

"In a seminal set of papers, Feddersen and Pesendorfer (1996, 1999), hereafter FP, incorporate ... has been labeled the “swing voter’s curse,” from now on SVC. The prediction ... the best way to begin using a Static Application Security Testing (SAST) tool.. from Latin ante meridiem (A.M.) meaning before noon..."

Result:

  1. ['Feddersen and Pesendorfer', 'FP']
  2. ['swing voter’s curse', 'SVC']
  3. ['Static Application Security Testing', 'SAST']
  4. ['ante meridiem', 'A.M.']

There are of course many possible 'signals' of an acronym. I've listed some below:

  • Parentheses: ... (...)
  • ... hereafter ...
  • ... from now on ...
  • ... after this ...
  • ... referred to as ...
  • ... subsequently ...
  • ... hence ...
  • ... henceforth ...
  • ... hereinafter ...
  • etc.

Perhaps it would be beneficial to have two regular expressions: one for the parentheses and one for all the others, since they differ quite substantially in structure.

I am only focusing on first-letter acronyms, i.e. ignoring cases such as sonar, created from SOund Navigation And Ranging.

Is it possible to do such a thing with regex, and if so, how would you go about it?

Solution

Yes

It is possible. I would first define all the individual rules which describe a Series of Words followed by an Acronym Definition (SOWFBAAD), then stitch these rules together in a define statement.

For example, if you were looking for an email address, you could use this Perl Compatible Regular Expressions (PCRE) pattern, which first defines all the rules from RFC 5322 and then looks for things which look like email addresses:

(?x)
    (?(DEFINE)

        (?<addr_spec> (?&local_part) @ (?&domain) )
        (?<local_part> (?&dot_atom) | (?&quoted_string) | (?&obs_local_part) )
        (?<domain> (?&dot_atom) | (?&domain_literal) | (?&obs_domain) )
        (?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dtext) )* (?&FWS)? \] (?&CFWS)? )
        (?<dtext> [\x21-\x5a] | [\x5e-\x7e] | (?&obs_dtext) )
        (?<quoted_pair> \\ (?: (?&VCHAR) | (?&WSP) ) | (?&obs_qp) )
        (?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)? )
        (?<dot_atom_text> (?&atext) (?: \. (?&atext) )* )
        (?<atext> [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+ )
        (?<atom> (?&CFWS)? (?&atext) (?&CFWS)? )
        (?<word> (?&atom) | (?&quoted_string) )
        (?<quoted_string> (?&CFWS)? " (?: (?&FWS)? (?&qcontent) )* (?&FWS)? " (?&CFWS)? )
        (?<qcontent> (?&qtext) | (?&quoted_pair) )
        (?<qtext> \x21 | [\x23-\x5b] | [\x5d-\x7e] | (?&obs_qtext) )

        # comments and whitespace
        (?<FWS> (?: (?&WSP)* \r\n )? (?&WSP)+ | (?&obs_FWS) )
        (?<CFWS> (?: (?&FWS)? (?&comment) )+ (?&FWS)? | (?&FWS) )
        (?<comment> \( (?: (?&FWS)? (?&ccontent) )* (?&FWS)? \) )
    #   non-recursive alternative (no nested comments):
    #   (?<ccontent> (?&ctext) | (?&quoted_pair) )
        (?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment) )
        (?<ctext> [\x21-\x27] | [\x2a-\x5b] | [\x5d-\x7e] | (?&obs_ctext) )

        # obsolete tokens
        (?<obs_domain> (?&atom) (?: \. (?&atom) )* )
        (?<obs_local_part> (?&word) (?: \. (?&word) )* )
        (?<obs_dtext> (?&obs_NO_WS_CTL) | (?&quoted_pair) )
        (?<obs_qp> \\ (?: \x00 | (?&obs_NO_WS_CTL) | \n | \r ) )
        (?<obs_FWS> (?&WSP)+ (?: \r\n (?&WSP)+ )* )
        (?<obs_ctext> (?&obs_NO_WS_CTL) )
        (?<obs_qtext> (?&obs_NO_WS_CTL) )
        (?<obs_NO_WS_CTL> [\x01-\x08] | \x0b | \x0c | [\x0e-\x1f] | \x7f )

        # character class definitions
        (?<VCHAR> [\x21-\x7E] )
        (?<WSP> [ \t] )
    )
    ((?&addr_spec))

Of course, this expression uses recursion (ccontent can contain a comment, which itself contains ccontent), and recursion doesn't play well with many flavors of regex. To avoid that, you could simply comment out the active ccontent rule and uncomment the other one, provided you accept that the expression will no longer match nested comments.
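In flavors that lack (?(DEFINE)...) and named subroutine calls altogether, such as Python's built-in re module, the same "define the rules, then stitch them together" idea can be approximated by building the pattern from named string fragments. The sketch below does this for the parenthesis signal only. The fragment names (WORD, PHRASE, ACRONYM), the function find_parenthesized, and the rule that the last N words must supply the N initials are all assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical rule fragments, defined once and reused (names are illustrative).
WORD = r"[A-Za-z][A-Za-z'’-]*"
PHRASE = rf"{WORD}(?:\s+{WORD})*"
# Either plain capitals (SAST) or dotted capitals (A.M.).
ACRONYM = r"(?:[A-Z]{2,}|(?:[A-Z]\.){2,})"

# Stitch the fragments into one rule: a phrase followed by "(ACRONYM)".
PAREN_RULE = re.compile(
    rf"\b(?P<phrase>{PHRASE})\s*\((?P<acronym>{ACRONYM})\)"
)


def find_parenthesized(text):
    """Return (definition, acronym) pairs for 'Some Long Name (SLN)' patterns."""
    results = []
    for match in PAREN_RULE.finditer(text):
        acronym = match.group("acronym")
        letters = acronym.replace(".", "")
        words = match.group("phrase").split()
        # Assumption: the definition is the last len(letters) words,
        # and each word contributes exactly one initial.
        candidate = words[-len(letters):]
        if len(candidate) == len(letters) and all(
            word[0].upper() == letter
            for word, letter in zip(candidate, letters)
        ):
            results.append((" ".join(candidate), acronym))
    return results
```

On the question's example text this picks up the SAST and A.M. pairs. Anything in parentheses that is not a run of capitals, such as "(1996, 1999)", is ignored.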

However

Constructing this as a regex alone may leave you with an expression which is incredibly difficult to read, debug, or modify later. So you would probably be better off looping through a list of SOWFBAAD definitions.
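As a rough illustration of looping through a list of definitions, here is a minimal Python sketch for the keyword signals from the question. It is one possible reading of the idea, not a definitive implementation: the names SIGNALS, CONNECTORS, MAX_WORDS, expand, and find_signalled are hypothetical, and the choices to skip small connector words such as "and" and to search only the last few words before the signal are assumptions.

```python
import re

# Signal phrases taken from the question's list.
SIGNALS = ["hereafter", "from now on", "after this", "referred to as",
           "subsequently", "hence", "henceforth", "hereinafter"]
# Assumption: these small words may sit inside a definition without an initial.
CONNECTORS = {"and", "of", "the", "for", "in"}
# Assumption: the definition lies within this many words before the signal.
MAX_WORDS = 12

ACRONYM = r"(?:[A-Z]{2,}|(?:[A-Z]\.){2,})"
WORD_RE = re.compile(r"[^\W\d_]+(?:[’'-][^\W\d_]+)*")

# One small compiled rule per signal instead of one giant expression.
RULES = [
    re.compile(rf"\b{re.escape(signal)}\s+(?P<acronym>{ACRONYM})(?![a-z])")
    for signal in SIGNALS
]


def expand(acronym, before):
    """Find the nearest run of words in `before` whose initials spell `acronym`."""
    letters = acronym.replace(".", "").upper()
    words = WORD_RE.findall(before)[-MAX_WORDS:]
    # Try candidate start words, nearest to the signal first.
    for start in range(len(words) - 1, -1, -1):
        pos = 0  # index of the next acronym letter to match
        for i in range(start, len(words)):
            word = words[i]
            if word[0].upper() == letters[pos]:
                pos += 1
                if pos == len(letters):
                    return " ".join(words[start:i + 1])
            elif pos == 0 or word.lower() not in CONNECTORS:
                break  # start word must match; only connectors may be skipped
    return None


def find_signalled(text):
    """Apply every rule in turn; return (definition, acronym) in text order."""
    found = []
    for rule in RULES:
        for match in rule.finditer(text):
            phrase = expand(match.group("acronym"), text[:match.start()])
            if phrase:
                found.append((match.start(), phrase, match.group("acronym")))
    return [(phrase, acronym) for _, phrase, acronym in sorted(found)]
```

Each rule stays short enough to read and debug on its own, and adding a new signal only means appending to the list. The initial-matching step is done in ordinary code because checking that the initials of a phrase spell a captured acronym is awkward to express in a regex alone.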

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow