Finding content between two words withou RegEx, BeautifulSoup, lXml … etc

https://stackoverflow.com/questions/1116172

12-09-2019
|

Question

How to find out the content between two words or two sets of random characters?

The scraped page is not guaranteed to be Html only and the important data can be inside a javascript block. So, I can't remove the JavaScript.

consider this:

<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY

</body>

Some Java Scripts code STARTXXXX "Extract JS Code" ENDXXXX.

</html>

So as you see the html markup may not be complete. I can fetch the page, and then without worrying about anything, I want to find the content called "Extract the name" and "Extract the data here in a JavaScript".

What I am looking for is in python:

Like this:

data = FindBetweenText(UniqueTextBeforeContent, UniqueTextAfterContent, page)

Where page is downloaded and data would have the text I am looking for. I rather stay away from regEx as some of the cases can be too complex for RegEx.

Solution

Here's my attempt, this is tested. While recursive, there should be no unnecessary string duplication, although a generator might be more optimal

def bracketed_find(s, start, end, startat=0):
    startloc=s.find(start, startat)
    if startloc==-1:
        return []
    endloc=s.find(end, startloc+len(start))
    if endloc == -1:
        return [s[startloc+len(start):]]
    return [s[startloc+len(start):endloc]] + bracketed_find(s, start, end, endloc+len(end))

and here is a generator version

def bracketed_find(s, start, end, startat=0):
    startloc=s.find(start, startat)
    if startloc==-1:
        return
    endloc=s.find(end, startloc+len(start))
    if endloc == -1:
        yield s[startloc+len(start):]
        return
    else:
        yield s[startloc+len(start):endloc]

    for found in bracketed_find(s, start, end, endloc+len(end)):
        yield found

OTHER TIPS

if you are sure your markers are unique, do something like this

s="""
<html>
<body>
<div>StartYYYY "Extract HTML", ENDYYYY

</body>

Some Java Scripts code STARTXXXX "Extract JS Code" ENDXXXX.

</html>
"""

def FindBetweenText(startMarker, endMarker, text):
    startPos = text.find(startMarker)
    if startPos < 0: return
    endPos = text.find(endMarker)
    if endPos < 0: return

    return text[startPos+len(startMarker):endPos]

print FindBetweenText('STARTXXXX', 'ENDXXXX', s)

Well, this is what it would be in PHP. No doubt there's a much sexier Pythonic way.

function FindBetweenText($before, $after, $text) {
    $before_pos = strpos($text, $before);
    if($before_pos === false)
        return null;
    $after_pos = strpos($text, $after);
    if($after_pos === false || $after_pos <= $before_pos)
        return null;
    return substr($text, $before_pos, $after_pos - $before_pos);
}

[Slightly tested]

def bracketed_find_first(prefix, suffix, page, start=0):
    prefixpos = page.find(prefix, start)
    if prefixpos == -1: return None # NOT ""
    startpos = prefixpos + len(prefix)
    endpos = page.find(suffix, startpos) # DRY
    if endpos == -1: return None # NOT ""
    return page[startpos:endpos]

Note: the above returns only the first occurrence. Here is a generator which yields each occurrence.

def bracketed_finditer(prefix, suffix, page, start_at=0):
    while True:
        prefixpos = page.find(prefix, start_at)
        if prefixpos == -1: return # StopIteration
        startpos = prefixpos + len(prefix)
        endpos = page.find(suffix, startpos)
        if endpos == -1: return
        yield page[startpos:endpos]
        start_at = endpos + len(suffix)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow