Create a table of contents from HTML with Python

https://stackoverflow.com/questions/2514931

22-09-2019

Question

I'm trying to generate a table of contents from a block of HTML (not a complete file, just content) based on its <h2> and <h3> tags.

My plan so far was to:

Extract a list of headers using BeautifulSoup
Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents). There might be a method for replacing inside BeautifulSoup?
Output a nested list of links to the headers in a predefined spot.

It sounds easy when I put it like that, but it is proving to be a bit of a pain in the rear.

Is there something out there that does all this for me in one go, so I don't waste the next couple of hours reinventing the wheel?

Example:

<p>This is an introduction</p>

<h2>This is a sub-header</h2>
<p>...</p>

<h3>This is a sub-sub-header</h3>
<p>...</p>

<h2>This is a sub-header</h2>
<p>...</p>

Solution

Use lxml.html.

It can deal with invalid HTML just fine.
It is very fast.
It allows you to easily create missing elements and move elements between trees.
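
The answer above gives no code, so here is a minimal sketch of how the same task might look with lxml.html, assuming lxml is installed. The function name `build_toc` and the `section-N` id scheme are illustrative choices and not part of the answer.

```python
import lxml.html


def build_toc(fragment):
    """Return (toc, modified_html) for an HTML fragment.

    toc is a flat list of (tag, anchor_id, title) tuples in document order.
    """
    # create_parent wraps the fragment in a <div>, so the fragment may
    # contain several top-level elements
    root = lxml.html.fragment_fromstring(fragment, create_parent="div")

    toc = []
    for number, header in enumerate(root.iter("h2", "h3"), start=1):
        anchor_id = "section-%d" % number  # hypothetical id scheme
        header.set("id", anchor_id)
        # text_content() also collects text from nested tags such as <em>
        toc.append((header.tag, anchor_id, header.text_content().strip()))

    # The serialized result keeps the wrapper <div> created above
    return toc, lxml.html.tostring(root, encoding="unicode")
```

The flat list keeps the tag name of each header, so a separate rendering step can decide how to nest the entries.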

Other tips

Quickly hacked together an ugly piece of code:

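from bs4 import BeautifulSoup

# `html` is the HTML fragment to process (a str); html.parser does not
# wrap the fragment in <html>/<body>
soup = BeautifulSoup(html, 'html.parser')

toc = []
header_id = 1
current_list = toc
previous_tag = None

for header in soup.find_all(['h2', 'h3']):
    # Give each header an id so the table of contents can link to it
    header['id'] = str(header_id)

    if previous_tag == 'h2' and header.name == 'h3':
        # First h3 after an h2: start a nested list
        current_list = []
    elif previous_tag == 'h3' and header.name == 'h2':
        # Back at the top level: store the finished nested list
        toc.append(current_list)
        current_list = toc

    # get_text() also works when the header contains nested tags
    current_list.append((header_id, header.get_text()))

    header_id += 1
    previous_tag = header.name

# Store a nested list that is still open at the end
if current_list is not toc:
    toc.append(current_list)


def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#%s">%s</a></li>' % item)
    result.append("</ul>")
    return "\n".join(result)

# Table of contents
print(list_to_html(toc))

# Modified HTML
print(soup)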
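
One caveat: `list_to_html` places each nested `<ul>` directly inside its parent `<ul>`, while HTML only permits `<li>` children there. As a sketch of one possible alternative, the function below renders a flat list of `(level, anchor_id, title)` tuples and puts each sub-list inside the `<li>` of its parent entry. The name `render_toc` and the tuple layout are assumptions made for illustration; titles are escaped with the standard library's `html.escape`.

```python
from html import escape


def render_toc(entries):
    """Render (level, anchor_id, title) tuples as nested <ul> lists.

    level is the heading depth (2 for <h2>, 3 for <h3>, ...). Each
    sub-list is emitted inside the <li> of its parent entry.
    """
    parts = []
    stack = []  # heading levels whose <ul> is currently open

    for level, anchor_id, title in entries:
        # Close every list that is deeper than this heading
        while stack and level < stack[-1]:
            parts.append("</li></ul>")
            stack.pop()
        if stack and level == stack[-1]:
            parts.append("</li>")  # sibling: close the previous item
        else:
            parts.append("<ul>")  # first entry or deeper heading: new list
            stack.append(level)
        parts.append('<li><a href="#%s">%s</a>'
                     % (escape(str(anchor_id)), escape(title)))

    # Close whatever is still open
    parts.extend("</li></ul>" for _ in stack)
    return "".join(parts)
```

With this layout the loop over the headers only needs to collect `(int(header.name[1]), header_id, title)` tuples, and the nesting is decided at render time.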

How can I generate a table of contents for HTML text in Python?

But I think you are on the right track, and reinventing the wheel will be fun.

I came up with an extended version of the solution proposed by Łukasz.

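from bs4 import BeautifulSoup
# Assumption: slugify comes from the python-slugify package;
# django.utils.text.slugify would work as well
from slugify import slugify


def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#{}">{}</a></li>'.format(item[0], item[1]))
    result.append("</ul>")
    return "\n".join(result)

# `article` is the HTML fragment to process; the 'html5lib' parser needs
# the html5lib package and wraps the fragment in <html>/<body>
soup = BeautifulSoup(article, 'html5lib')

toc = []
# Index of the most recent list at each level, used to find the parent
h2_prev = 0
h3_prev = 0
h4_prev = 0
h5_prev = 0

# Assumes heading levels are not skipped (e.g. no h4 directly under an h2)
for header in soup.find_all(['h2', 'h3', 'h4', 'h5', 'h6']):
    # get_text() also works when the header contains nested tags
    title = header.get_text()
    slug = slugify(title)
    header['id'] = slug  # anchor target for the link in the table of contents
    data = [(slug, title)]

    if header.name == "h2":
        toc.append(data)
        h3_prev = 0
        h4_prev = 0
        h5_prev = 0
        h2_prev = len(toc) - 1
    elif header.name == "h3":
        toc[h2_prev].append(data)
        h3_prev = len(toc[h2_prev]) - 1
    elif header.name == "h4":
        toc[h2_prev][h3_prev].append(data)
        h4_prev = len(toc[h2_prev][h3_prev]) - 1
    elif header.name == "h5":
        toc[h2_prev][h3_prev][h4_prev].append(data)
        h5_prev = len(toc[h2_prev][h3_prev][h4_prev]) - 1
    elif header.name == "h6":
        toc[h2_prev][h3_prev][h4_prev][h5_prev].append(data)

toc_html = list_to_html(toc)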
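
Note that slug-based ids collide when two headers share the same text, as the two "This is a sub-header" headers in the question's example do, and with duplicate ids both links lead to the first header. A minimal sketch of one way to deduplicate them follows. Here `simple_slugify` is a hypothetical stand-in for whichever `slugify` is actually used, and the `-2`, `-3` suffix convention is an assumption.

```python
import re


def simple_slugify(text):
    """Hypothetical stand-in for a real slugify(): lowercase, hyphen-separated."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def unique_slugs(titles):
    """Return one slug per title, adding -2, -3, ... to slugs already taken."""
    used = set()
    slugs = []
    for title in titles:
        base = simple_slugify(title) or "section"  # fallback for empty slugs
        slug, counter = base, 1
        # Keep incrementing the suffix until the slug is unused
        while slug in used:
            counter += 1
            slug = "%s-%d" % (base, counter)
        used.add(slug)
        slugs.append(slug)
    return slugs
```

For the headers in the question this yields `this-is-a-sub-header`, `this-is-a-sub-sub-header` and `this-is-a-sub-header-2`.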

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow