I'm still not sure what exactly your problem is, or what your code is supposed to do.
But this line seems to be the key:
test = [w[0] for w in dictSentCheck(sentCheck)]
That gives you a list of all words. It includes things like lt and gt as words. And you want to strip out anything inside an lt and gt pair.
And, as you say in your comments, "I may set the required number of consecutive words to 7".
So, something like this:
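def split_on_angle_brackets(words):
    para = []
    bracket_stack = 0  # nesting depth of unclosed 'lt' tokens
    for word in words:
        if bracket_stack:
            # Inside brackets: track nesting, discard everything else.
            if word == 'gt':
                bracket_stack -= 1
            elif word == 'lt':
                bracket_stack += 1
        else:
            if word == 'lt':
                # A bracket opens: keep the run so far only if it is long enough.
                if len(para) >= 7:
                    yield ' '.join(para)
                para = []
                bracket_stack = 1
            else:
                para.append(word)
    # Whatever is left after the last bracket is emitted without a length check.
    if para:
        yield ' '.join(para)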
If you use it with your sample data:
print('\n'.join(split_on_angle_brackets(test)))
You get this:
English cricket cuts ties with Zimbabwe Wednesday June text
print EMAIL THIS ARTICLE your name your email address recipient's name recipient's email address
add another recipient your comment Send Mail
The England and Wales Cricket Board ECB announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year
That doesn't match your sample output, but I can't think of any rule that would provide your sample output, so instead I'm trying to implement the rule you described.
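Since you said the required number of consecutive words may change, here is a minimal sketch of the same idea with the threshold as a parameter. The name split_outside_brackets and the min_words argument are my own, not from your code. I've also assumed you'd want the final run held to the same threshold, which the version above doesn't do. The toy word list is made up purely to show the nesting behaviour.

```python
def split_outside_brackets(words, min_words=7):
    """Yield runs of at least min_words words found outside lt ... gt pairs.

    Hypothetical variant: min_words and the function name are assumptions.
    """
    para = []
    depth = 0  # how many unclosed 'lt' tokens we are currently inside
    for word in words:
        if depth:
            # Inside brackets: only track nesting, discard everything else.
            if word == 'gt':
                depth -= 1
            elif word == 'lt':
                depth += 1
        elif word == 'lt':
            # A bracket opens: emit the run collected so far if long enough.
            if len(para) >= min_words:
                yield ' '.join(para)
            para = []
            depth = 1
        else:
            para.append(word)
    # Assumption: the trailing run must meet the threshold too.
    if len(para) >= min_words:
        yield ' '.join(para)


# Made-up data: 'd e' is too short, and 'x lt y gt z' is nested inside brackets.
toy = 'a b c lt p gt d e lt x lt y gt z gt f g h'.split()
print(list(split_outside_brackets(toy, min_words=3)))
# ['a b c', 'f g h']
```

If you'd rather keep a short trailing run, change the last check back to a plain "if para:".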