Split HTML after N words in Python

https://stackoverflow.com/questions/360036

21-08-2019

Question

Is there any way to split a long string of HTML after N words? Obviously I could use:

' '.join(foo.split(' ')[:n])

to get the first n words of a plain text string, but that might split in the middle of an HTML tag, and it won't produce valid HTML because it won't close the tags that have been opened.

I need to do this in a Zope / Plone site. If something that ships as standard with those products can do it, that would be ideal.

For example, say I have the text:

<p>This is some text with a 
  <a href="http://www.example.com/" title="Example link">
     bit of linked text in it
  </a>.
</p>

If I ask it to split after 5 words, it should return:

<p>This is some text with</p>

After 7 words:

<p>This is some text with a 
  <a href="http://www.example.com/" title="Example link">
     bit
  </a>
</p>

Solution

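Take a look at the truncate_html_words function in django.utils.text. Even if you aren't using Django, the code there does exactly what you want. (Later Django releases replaced this function with the Truncator class in the same module; Truncator(text).words(n, html=True) is the equivalent call.)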

OTHER TIPS

I've heard that Beautiful Soup is very good at parsing HTML. It will probably be able to help you get correct HTML out.

I was going to mention the base HTMLParser that's built into Python. Since I'm not sure what end result you're trying to get to, it may or may not get you there. You'll work primarily with its handler methods.
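To make the handler idea concrete, here is a minimal sketch using only the standard library's html.parser. The names WordTruncator, truncate_html_words and VOID_TAGS are made up for this illustration, and the code assumes reasonably well-formed input: it copies the markup through until the word limit is reached, cuts the text there, and then closes whatever tags are still open.

```python
import re
from html import escape
from html.parser import HTMLParser

# Elements that never get a closing tag (a partial list, assumed to be
# enough for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class WordTruncator(HTMLParser):
    """Hypothetical helper: copies HTML through until `limit` words are seen."""

    def __init__(self, limit):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.limit = limit
        self.seen = 0         # words copied so far
        self.out = []         # pieces of the output document
        self.open_tags = []   # stack of tags still waiting for a close
        self.done = False     # set once the word limit has been reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # original tag text
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied but never stacked.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only a close that matches the innermost open tag is copied.
        if not self.done and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.done:
            return
        words = list(re.finditer(r"\S+", data))
        remaining = self.limit - self.seen
        if len(words) < remaining:
            # Whole text node fits; keep it, whitespace included.
            self.seen += len(words)
            self.out.append(escape(data, quote=False))
            return
        # The limit falls inside this text node: cut after the last allowed word.
        cut = words[remaining - 1].end()
        self.out.append(escape(data[:cut], quote=False))
        self.done = True


def truncate_html_words(html_text, limit):
    if limit <= 0:
        return ""
    parser = WordTruncator(limit)
    parser.feed(html_text)
    parser.close()
    # Close any tags left open by the cut, innermost first.
    closing = "".join("</%s>" % t for t in reversed(parser.open_tags))
    return "".join(parser.out) + closing


sample = ('<p>This is some text with a <a href="http://www.example.com/" '
          'title="Example link">bit of linked text in it</a>.</p>')
print(truncate_html_words(sample, 5))
# <p>This is some text with</p>
print(truncate_html_words(sample, 7))
# <p>This is some text with a <a href="http://www.example.com/" title="Example link">bit</a></p>
```

Words here are simply runs of non-whitespace characters, so a lone "." after a closing tag counts as a word. Text inside script or style elements would be escaped as if it were ordinary text, which is one of several reasons to treat this as a starting point rather than a finished tool.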

You can use a mix of regex and BeautifulSoup or Tidy (I prefer BeautifulSoup). The idea is simple: strip all the HTML tags first and find the nth word (n=7 here). Then count how many times that word appears within the first n words, because only its last occurrence should be used as the truncation point. Finally, cut the original markup there and let BeautifulSoup or Tidy close the open tags.

Here is a piece of code. It is a bit messy, but it works:

import re
from bs4 import BeautifulSoup  # BeautifulSoup 4; the original used the BeautifulSoup 3 import
import tidy  # uTidylib bindings for HTML Tidy

def remove_html_tags(data):
    # Strip anything that looks like a tag, leaving only the text.
    p = re.compile(r'<.*?>')
    return p.sub('', data)

input_string = ('<p>This is some text with a <a href="http://www.example.com/" '
                'title="Example link">bit of linked text in it</a></p>')

n = 7
words = remove_html_tags(input_string).split(' ')[:n]
last_word = words[-1]

# The nth word may be something like 'a', 'and' or 'is' that occurs several
# times within the first n words, so locate its k-th occurrence in the
# original markup, where k is how often it appears among those n words.
# Caveat: str.find also matches inside tags and inside longer words.
k = words.count(last_word)
end = 0
for _ in range(k):
    end = input_string.find(last_word, end) + len(last_word)

output_string = input_string[:end]

# Both libraries close the tags left open by the cut.
print("\nBeautifulSoup")
print(BeautifulSoup(output_string, "html.parser"))
print("\nTidy")
print(tidy.parseString(output_string))

The output is what you want:

BeautifulSoup
<p>This is some text with a <a href="http://www.example.com/" title="Example link">bit</a></p>

Tidy
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 6 November 2007), see www.w3.org">
<title></title>
</head>
<body>
<p>This is some text with a <a href="http://www.example.com/"
title="Example link">bit</a></p>
</body>
</html>

Hope this helps

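Edit: A better regex for remove_html_tags. It also matches tags that span several lines and does not run past a stray '<' in the text:

`p = re.compile(r'<[^<]*?>')`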
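A quick way to see the difference between the two patterns is to run both on a tag that spans two lines and on text containing a stray '<'. The variable names and sample strings below are made up for illustration.

```python
import re

naive = re.compile(r'<.*?>')       # pattern from the original code
better = re.compile(r'<[^<]*?>')   # pattern from the edit

# A tag broken across two lines, and text with a stray '<' before a real tag.
multiline = '<a\nhref="http://www.example.com/">link</a>'
stray = 'a < b <i>c</i>'

# '.' does not match a newline, so the naive pattern misses the opening tag.
naive_multiline = naive.sub('', multiline)
better_multiline = better.sub('', multiline)
print(repr(naive_multiline))   # '<a\nhref="http://www.example.com/">link'
print(repr(better_multiline))  # 'link'

# The naive pattern starts at the stray '<' and swallows ' b ' with the tag.
naive_stray = naive.sub('', stray)
better_stray = better.sub('', stray)
print(repr(naive_stray))   # 'a c'
print(repr(better_stray))  # 'a < b c'
```

Neither pattern is a real HTML parser (a '>' inside an attribute value still confuses both), so for anything beyond simple markup a parser is the safer choice.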

Licensed under: CC-BY-SA with attribution