Split HTML after N words in Python

https://stackoverflow.com/questions/360036

21-08-2019

Question

Is there any way to split a long string of HTML after N words? Obviously I could use:

' '.join(foo.split(' ')[:n])

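to get the first n words of a plain-text string, but that might split in the middle of an HTML tag, and it won't produce valid HTML because it won't close the tags that have been opened.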
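To make the failure concrete, here is a small demonstration using the sample markup from this question collapsed onto one line. The helper name `naive_truncate` is made up for illustration; it only wraps the one-liner above.

```python
def naive_truncate(text, n):
    # Same idea as the one-liner above: split on single spaces, keep n pieces.
    return ' '.join(text.split(' ')[:n])


sample = ('<p>This is some text with a '
          '<a href="http://www.example.com/" title="Example link">'
          'bit of linked text in it</a>.</p>')

# The tag name "<a" and each attribute count as "words", so the cut
# lands inside the opening <a ...> tag and nothing gets closed.
print(naive_truncate(sample, 7))
# <p>This is some text with a <a
```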

I need to do this in a Zope/Plone site. If there is something standard in those products that can do it, that would be ideal.

For example, say I have the text:

<p>This is some text with a 
  <a href="http://www.example.com/" title="Example link">
     bit of linked text in it
  </a>.
</p>

And I ask it to split after 5 words, it should return:

<p>This is some text with</p>

For 7 words:

<p>This is some text with a 
  <a href="http://www.example.com/" title="Example link">
     bit
  </a>
</p>

Solution

Take a look at the truncate_html_words function in django.utils.text. Even if you aren't using Django, the code there does exactly what you want.

Other tips

I've heard that Beautiful Soup is very good at parsing HTML. It will probably be able to help you get valid HTML out.

I was going to mention the base HTMLParser that's built into Python. Since I'm not sure what end result you're trying to get, it may or may not get you there; you'll mostly be working with the handlers.
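As a rough illustration of the handler-based approach, here is a minimal sketch built only on the standard library's `html.parser.HTMLParser`. The names `WordTruncator` and `truncate_html` are invented for this example, and the sketch assumes reasonably well-nested input: it copies tags through, counts whitespace-separated words in text nodes, stops at the limit, and then closes whatever is still open.

```python
import html
import re
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class WordTruncator(HTMLParser):
    """Hypothetical helper: copies markup through until max_words words are seen."""

    def __init__(self, max_words):
        super().__init__()
        self.max_words = max_words
        self.seen = 0         # words emitted so far
        self.open_tags = []   # stack of currently open elements
        self.out = []         # output fragments
        self.done = False     # set once the word limit is reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # original tag text, attributes included
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only close a tag that matches the innermost open element.
        if not self.done and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        words = list(re.finditer(r"\S+", data))
        remaining = self.max_words - self.seen
        if len(words) < remaining:
            self.seen += len(words)
        else:
            # Cut right after the last allowed word and stop copying.
            data = data[:words[remaining - 1].end()]
            self.done = True
        # Character references arrive decoded, so escape them again on output.
        self.out.append(html.escape(data, quote=False))


def truncate_html(html_text, max_words):
    if max_words <= 0:
        return ""
    parser = WordTruncator(max_words)
    parser.feed(html_text)
    parser.close()
    # Close whatever is still open, innermost element first.
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


sample = ('<p>This is some text with a '
          '<a href="http://www.example.com/" title="Example link">'
          'bit of linked text in it</a>.</p>')

print(truncate_html(sample, 5))
# <p>This is some text with</p>
print(truncate_html(sample, 7))
# <p>This is some text with a <a href="http://www.example.com/" title="Example link">bit</a></p>
```

This sketch silently drops comments and doctype declarations and does nothing special for the contents of script or style elements, so treat it as a starting point rather than a finished tool.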

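You can use a mix of regex, BeautifulSoup or Tidy (I prefer BeautifulSoup). The idea is simple: first strip out all the HTML tags. Then find the nth word (n=7 here) and count how many times that word appears within the first n words, because you are only looking for its last occurrence, which is where the string gets truncated.

Here is a piece of code, a bit messy, but it works: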

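import re
from bs4 import BeautifulSoup  # BeautifulSoup 4; the original imported the Python 2-only BeautifulSoup 3
import tidy  # provided by the uTidylib package

def remove_html_tags(data):
    """Strip anything that looks like an HTML tag."""
    p = re.compile(r'<.*?>')
    return p.sub('', data)

input_string = '<p>This is some text with a <a href="http://www.example.com/" '\
    'title="Example link">bit of linked text in it</a></p>'

# First 7 words of the tag-free text.
s = remove_html_tags(input_string).split(' ')[:7]

# Only the last occurrence of the nth word should be used for truncating,
# because the nth word could be 'a'/'and'/'is' etc., which may occur
# several times within the first n words. Step through each occurrence,
# resuming the search right after the previous one.
# Caveat: the search runs over the raw HTML, so a word that also shows up
# inside a tag, an attribute or a longer word can throw the position off.
k = s.count(s[-1])
end = 0
for _ in range(k):
    end = input_string.find(s[-1], end) + len(s[-1])
output_string = input_string[:end]

# Both libraries repair the truncated markup by closing the open tags.
print("\nBeautifulSoup\n", BeautifulSoup(output_string, "html.parser"))
print("\nTidy\n", tidy.parseString(output_string))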

The output is what you want:

BeautifulSoup
<p>This is some text with a <a href="http://www.example.com/" title="Example link">bit</a></p>

Tidy
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 6 November 2007), see www.w3.org">
<title></title>
</head>
<body>
<p>This is some text with a <a href="http://www.example.com/"
title="Example link">bit</a></p>
</body>
</html>

Hope this helps.

Edit: a better regex

`p = re.compile(r'<[^<]*?>')`
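A quick way to see what the revised pattern changes is to run both on text containing a stray `<` that is not part of a tag. The sample string and variable names below are invented for illustration.

```python
import re

lazy_any = re.compile(r'<.*?>')       # original pattern
no_nested = re.compile(r'<[^<]*?>')   # pattern from the edit

text = '1 < 2 and <b>bold</b>'

# The original pattern starts at the stray "<" and swallows everything
# up to the first ">", taking real text with it.
print(lazy_any.sub('', text))    # 1 bold

# The revised pattern refuses to cross another "<", so the stray one
# is left alone and only the real tags are removed.
print(no_nested.sub('', text))   # 1 < 2 and bold
```

Neither pattern is a real HTML parser; a `>` inside an attribute value, for instance, would still confuse both.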

License: CC-BY-SA with attribution
