Question

How do I merge the bigrams below into a single string?

_bigrams=['the school', 'school boy', 'boy is', 'is reading']
_split=(' '.join(_bigrams)).split()
_newstr=[]
_filter=[_newstr.append(x) for x in _split if x not in _newstr]
_newstr=' '.join(_newstr)
print(_newstr)

Output: 'the school boy is reading'. This is the desired output, but the approach is too long and not very efficient given the large size of my data. Also, it does not support duplicate words in the final string, e.g. 'the school boy is reading, is he?': only one 'is' would be kept in that case.

Any suggestions on how to make this work better? Thanks.


Solution

# Multi-for generator expression allows us to create a flat iterable of words
all_words = (word for bigram in _bigrams for word in bigram.split())

def no_runs_of_words(words):
    """Takes an iterable of words and returns one with any runs condensed."""
    prev_word = None
    for word in words:
        if word != prev_word:
            yield word
        prev_word = word

final_string = ' '.join(no_runs_of_words(all_words))

This uses generators to evaluate lazily, so the full sequence of words is not held in memory at once before the final string is built.
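For illustration, here is a self-contained sketch of the same idea applied to the sentence with a repeated 'is' from the question. The helper names `words_from_bigrams` and `merge_bigrams` and the sample list are made up for this example and are not part of the original answer.

```python
def words_from_bigrams(bigrams):
    """Yield every word of every bigram, in order (hypothetical helper)."""
    for bigram in bigrams:
        for word in bigram.split():
            yield word


def no_runs_of_words(words):
    """Takes an iterable of words and yields them with any runs condensed."""
    prev_word = None
    for word in words:
        if word != prev_word:
            yield word
        prev_word = word


def merge_bigrams(bigrams):
    """Join overlapping bigrams into one string (hypothetical wrapper)."""
    return ' '.join(no_runs_of_words(words_from_bigrams(bigrams)))


# Sample data assumed for this sketch: bigrams of 'the school boy is reading, is he?'
sample = ['the school', 'school boy', 'boy is', 'is reading,',
          'reading, is', 'is he?']
print(merge_bigrams(sample))  # the school boy is reading, is he?
```

Both occurrences of 'is' survive because only adjacent repeats are dropped. The flip side is that a word genuinely doubled in the source text (for example 'had had') would also be collapsed to a single occurrence, so this approach assumes the original text has no such runs.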

OTHER TIPS

If you really wanted a one-liner, something like this could work:

' '.join(val.split()[0] for val in _bigrams) + ' ' + _bigrams[-1].split()[-1]
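As a rough sketch, the same first-word idea can be wrapped in a function that also copes with an empty list, which the bare one-liner would not (indexing `[-1]` on an empty list raises `IndexError`). The name `join_bigrams` is made up here, and the sketch assumes the bigrams are consecutive and each contains exactly two words.

```python
def join_bigrams(bigrams):
    """Rebuild a string from consecutive, overlapping bigrams (hypothetical helper)."""
    if not bigrams:
        return ''
    # The first word of each bigram covers every word except the final one...
    first_words = [bigram.split()[0] for bigram in bigrams]
    # ...which is the second word of the last bigram.
    last_word = bigrams[-1].split()[-1]
    return ' '.join(first_words + [last_word])


print(join_bigrams(['the school', 'school boy', 'boy is', 'is reading']))
# the school boy is reading
```

Because it never compares words with each other, this variant keeps repeated words as they are, including adjacent ones.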

Would this do it? It simply takes the first word of every bigram except the last, which is kept whole.

_bigrams=['the school', 'school boy', 'boy is', 'is reading']

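# Slice by position rather than comparing values, so an earlier bigram
# that happens to equal the last one is not kept whole by mistake
clause = [a.split()[0] for a in _bigrams[:-1]] + _bigrams[-1:]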

print(' '.join(clause))

Output

the school boy is reading

As for performance, though, Amber's solution is probably the better option.

Licensed under: CC-BY-SA with attribution