Pyparsing: extract variable length, variable content, variable whitespace substring

https://stackoverflow.com/questions/10855951

12-06-2021

Question

I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always have the word Gleason and two numbers that add up to another number. Humans typed these in over two decades. Various conventions of whitespace and modifiers are included. Below is my Backus-Naur form so far, and two example records. Just for prostatectomies, we're looking at upwards of a thousand cases.

I am using pyparsing because I'm learning Python, and have no fond memories of my very limited exposure to regex writing.

My question: how can I pluck out these Gleason grades without parsing every single other optional piece of data that may or may not be in these final diagnoses?

num = Word(nums)
record ::= accessionDate + accessionNumber + patMedicalRecordNum + finalDxText
accessionDate ::= num + "/" + num + "/" + num
accessionNumber ::= "S" + num + "-" + num
patMedicalRecordNum ::= num + "/" + num + "-" + num + "-" + num
finalDxText ::= listOfParts + optionalComment + optionalpTNMStage
listOfParts ::= OneOrMore(part)
part ::= <multiline idiosyncratic freetext which may contain a Gleason score I want> + optionalpTNMStage
optionalComment ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
optionalpTNMStage ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>


01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                           
                                   -  ADENOCARCINOMA.                                                      

                                   TOTAL GLEASON SCORE:  GLEASON 5+4=9                                     
                                   TUMOR LOCATION:  BILATERAL                                              
                                   TUMOR QUANTITATION:  15% OF PROSTATE INVOLVED BY TUMOR                  
                                   EXTRAPROSTATIC EXTENSION:  PRESENT AT RIGHT POSTERIOR                   
                                   SEMINAL VESICLE INVASION:  PRESENT                                      
                                   MARGINS:  UNINVOLVED                                                    
                                   LYMPHOVASCULAR INVASION:  PRESENT                                       
                                   PERINEURAL INVASION:  PRESENT                                           
                                   LYMPH NODES (SPECIMENS B AND C):                                        
                                      NUMBER EXAMINED:  25                                                 
                                      NUMBER INVOLVED:  1                                                  
                                      DIAMETER OF LARGEST METASTASIS:  1.7 mm                              
                                   ADDITIONAL FINDINGS:  HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,   
                                      ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVE    
                                      CARCINOMA                                                            

                                   PATHOLOGIC STAGE:  pT3b N1 MX                                           

                               B.  LYMPH NODES, RIGHT PELVIC, EXCISION:                                    
                                   -  ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).         

                               C.  LYMPH NODES, LEFT PELVIC, EXCISION:                                     
                                   -  EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).                     
01/02/11  S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                               
                                  - ADENOCARCINOMA.                                                        
                                    GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5.                                             
                                    TUMOR QUANTITATION:  APPROXIMATELY 10% BY VOLUME.                      
                                    TUMOR LOCATION:  BILATERAL.                                            
                                    EXTRAPROSTATIC EXTENSION:  NOT IDENTIFIED.                             
                                    MARGINS:  NEGATIVE.                                                    
                                    PERINEURAL INVASION:  IDENTIFIED.                                      
                                    LYMPH-VASCULAR INVASION:  NOT IDENTIFIED.                              
                                    SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.              
                                    LYMPH NODES:  NONE SUBMITTED.                                          
                                    OTHER:  HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.                
                               PATHOLOGIC STAGE (pTNM):  pT2c NX.

Full disclosure: I'm a physician doing research; this is my first real work with Python. I have read Lutz's Learning Python, Shaw's Learn Python the Hard Way, and worked through various problem sets. I have reviewed numerous pyparsing-related questions on this forum, the pyparsing wiki, and I bought and read Mr McGuire's Getting Started with Pyparsing. Perhaps I am asking a question when I should really be told I am standing at "The death spiral of frustration that is so common when you have to write parsers" (McGuire, 17)? I don't know. So far I'm just happy to be working on what may actually be a real project.

Solution

Here is a sample to pull out the patient data and any matching Gleason data.

from pyparsing import Combine, Group, Optional, Word, nums

num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
# Comparing a string to a pyparsing expression tests whether the expression matches it
assert 'GLEASON 5+4=9' == gleason
assert 'GLEASON SCORE:  3 + 3 = 6' == gleason

patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
assert '01/02/11  S11-4444 20/111-22-3333' == patientData

# Scan for either a patient header or a Gleason score, ignoring everything else
partMatch = patientData("patientData") | gleason("gleason")

# data is the full text of the flat file, read in beforehand
lastPatientData = None
for match in partMatch.searchString(data):
    if match.patientData:
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            print("bad!")  # a Gleason score appeared before any patient header
            continue
        print("{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
                        lastPatientData.patientData, match.gleason
                        ))

Prints:

01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)
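
Since the question says the two grades always add up to the total, the named results can also be used as a sanity check on hand-typed entries. The following is only a sketch: is_consistent and the sample string are made up for illustration, and the second score in the sample is deliberately wrong.

```python
from pyparsing import Group, Optional, Word, nums

num = Word(nums)
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))

def is_consistent(g):
    """Hypothetical helper: True when the two grades sum to the stated total."""
    return int(g.left) + int(g.right) == int(g.total)

# Invented sample text; the second score does not add up on purpose
sample = "GLEASON 5+4=9 ... GLEASON SCORE: 3 + 4 = 6"

# Each search hit wraps one Group, so m[0] is the group carrying the named fields
matches = [m[0] for m in gleason.searchString(sample)]
flags = [is_consistent(g) for g in matches]
```

Records flagged False are probably worth a manual look rather than silent correction.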

OTHER TIPS

Take a look at the SkipTo parse element in pyparsing. If you define a pyparsing structure for the num+num=num part, you should be able to use SkipTo to skip anything between "Gleason" and that. Roughly like this (untested pseudo-pyparsing):

score = num + "+" + num + "=" + num
Gleason = "Gleason" + SkipTo(score) + score

PyParsing by default skips whitespace anyway, and with SkipTo you can skip anything that doesn't match your desired format.
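
A slightly fuller sketch of the SkipTo idea, run against the two score lines from the question. The choice of CaselessLiteral (so that "GLEASON" in the records matches) and the results names are assumptions added here, not part of the suggestion above.

```python
from pyparsing import CaselessLiteral, SkipTo, Suppress, Word, nums

num = Word(nums)
score = num("left") + Suppress("+") + num("right") + Suppress("=") + num("total")

# CaselessLiteral matches "Gleason", "GLEASON", and so on.
# SkipTo(score) consumes whatever sits between the keyword and the arithmetic.
gleason = CaselessLiteral("gleason") + SkipTo(score)("skipped") + score

sample = """TOTAL GLEASON SCORE:  GLEASON 5+4=9
GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5."""

found = [(int(m.left), int(m.right), int(m.total))
         for m in gleason.searchString(sample)]
```

One caveat: SkipTo is not limited to the current line, so a "Gleason" with no score near it will be paired with the next score that appears anywhere later in the text.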

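Alternatively, with a plain regex:

import re

# Spaces are stripped before matching, so the pattern has none; "score:" is optional
gleason = re.compile(r"gleason(?:score:)?(\d+\+\d+=\d+)")
scores = set()
for record in records:  # records: an iterable of record strings
    for line in record.lower().split("\n"):
        found = gleason.search(line.replace(" ", ""))
        if found:
            scores.add(found.group(1))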

Or something
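
A variant of the same regex idea that tolerates whitespace inside the pattern instead of deleting spaces first, and keeps the three numbers as separate groups. This is a sketch under assumptions: the name GLEASON_RE and the exact set of optional modifiers it allows ("SCORE" with an optional colon) are guesses based on the two sample records only.

```python
import re

# \s* between every token absorbs the variable spacing typed by hand
GLEASON_RE = re.compile(
    r"gleason\s*(?:score\s*:?\s*)?(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)",
    re.IGNORECASE,
)

sample = """TOTAL GLEASON SCORE:  GLEASON 5+4=9
GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5."""

# finditer scans the whole text, so no per-line splitting is needed
scores = [tuple(int(n) for n in m.groups()) for m in GLEASON_RE.finditer(sample)]
```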

Licensed under: CC-BY-SA with attribution
