Question

I have a file that contains this:

<html>
  <head>
    <title> Hello! - {{ today }}</title>
  </head>
  <body>
    {{ runner_up }} 
         avasd
         {{ blabla }}
        sdvas
        {{ oooo }}
   </body>
</html>

What is the best or most Pythonic way to extract the {{today}}, {{runner_up}}, etc.?

I know it can be done with splits/regular expressions, but I wondered if there were another way.

PS: consider the data loaded in a variable called thedata.

Edit: I think the HTML example was a bad choice, because it pointed some commenters toward BeautifulSoup. So, here is some new input data:

Fix grammatical or {{spelling}} errors.

Clarify meaning without changing it.

Correct minor {{mistakes}}.

Add related resources or links.

Always respect the original {{author}}.

Output:

spelling
mistakes
author

Solution

Mmkay, well here's a generator solution that seems to work well for me. You can also provide different open and close tags if you like.

def get_tags(s, open_delim='{{', close_delim='}}'):
    """Yield the stripped text between each pair of delimiters in s."""
    pos = 0
    while True:
        # Search for the next opening delimiter at or after pos
        start = s.find(open_delim, pos)
        if start == -1:
            return

        # Skip the length of the open delimiter
        start += len(open_delim)

        # Look for the closing delimiter only after the opening one,
        # so a stray close_delim earlier in the text does not end the search
        end = s.find(close_delim, start)
        if end == -1:
            return

        # Spit out the tag
        yield s[start:end].strip()

        # Continue searching just past the last match
        pos = end + len(close_delim)

Run against your target input like so:

# prints today, runner_up, blabla, oooo (one per line)
for tag in get_tags(thedata):
    print(tag)

Edit: it also works against your new example :). In my obviously quick testing, it also seemed to handle malformed tags in a reasonable way, though I make no guarantees of its robustness!

OTHER TIPS

Try templatemaker, a reverse-template maker. It can actually learn templates automatically from examples!

I know you said no regex/split, but I couldn't help but try for a one-liner solution:

import re
# Note: .* is greedy, so two tags on the same line come back as one match
for s in re.findall(r"\{\{.*\}\}", thedata):
    print(s.replace("{", "").replace("}", ""))

EDIT (JFS): the greedy .* matches everything up to the last }} on the line.

Compare:

>>> re.findall(r'\{\{.*\}\}', '{{a}}b{{c}}')
['{{a}}b{{c}}']
>>> re.findall('{{(.+?)}}', '{{a}}b{{c}}')
['a', 'c']
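
The same comparison can be run as a script. This is only a sketch: it also tries a third pattern, a negated character class, on the assumption that tag names never contain braces themselves. TAG_RE and sample are names made up for the illustration.

```python
import re

# Hypothetical pattern: allow anything except braces between the delimiters,
# so a match can never run past the closing "}}" of its own tag.
TAG_RE = re.compile(r"\{\{([^{}]+)\}\}")

sample = "{{a}}b{{c}}"

greedy = re.findall(r"\{\{.*\}\}", sample)   # one match spanning both tags
lazy = re.findall(r"\{\{(.+?)\}\}", sample)  # one match per tag
negated = TAG_RE.findall(sample)             # one match per tag

print(greedy)   # ['{{a}}b{{c}}']
print(lazy)     # ['a', 'c']
print(negated)  # ['a', 'c']
```

Unlike the dot, the negated class also matches newlines, so a tag split across two lines would be captured as well.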

If the data is that straightforward, a simple regex would do the trick.

J.F. Sebastian wrote this in a comment but I thought it was good enough to deserve its own answer:

re.findall(r'{{(.+?)}}', thestring)

I know the OP was asking for a way that didn't involve splits or regexes - so maybe this doesn't quite answer the question as stated. But this one line of code definitely gets my vote as the most Pythonic way to accomplish the task.
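
One caveat: on the question's first sample, that pattern returns the names with their surrounding spaces (' today ' rather than 'today'). Here is a minimal sketch that absorbs the padding inside the pattern. extract_tags is a hypothetical helper name, and thedata stands in for the variable from the question with a few of its lines.

```python
import re

def extract_tags(text):
    """Return the names found between {{ and }}, without surrounding whitespace."""
    # \s* on both sides absorbs the padding; (.+?) stays lazy so each
    # match ends at the first closing delimiter.
    return re.findall(r"\{\{\s*(.+?)\s*\}\}", text)

# A few lines taken from the two samples in the question
thedata = """<title> Hello! - {{ today }}</title>
    {{ runner_up }}
         {{ blabla }}
Always respect the original {{author}}."""

print(extract_tags(thedata))  # ['today', 'runner_up', 'blabla', 'author']
```

Calling .strip() on each result of the original one-liner would work just as well; putting \s* in the pattern simply keeps it to a single expression.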

Licensed under: CC-BY-SA with attribution