Python regular expression for HTML parsing (BeautifulSoup)

https://stackoverflow.com/questions/55391

09-06-2019

Question

I want to grab the value of a hidden input field in HTML.

<input type="hidden" name="fooId" value="12-3456789-1111111111" />

I want to write a regular expression in Python that will return the value of fooId, given that I know the line in the HTML follows the format

<input type="hidden" name="fooId" value="**[id is here]**" />

Can someone provide an example in Python that parses the HTML for the value?

Solution

For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust. I'm just contributing the BeautifulSoup example, given that you already know which regexp to use :-)

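from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "BeautifulSoup" module is Python 2 only

# Or retrieve it from the web, etc.
with open('/yourwebsite/page.html', 'r') as f:
    html_data = f.read()

# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')
# Find the proper tag; the attributes go in attrs because "name" is already
# the keyword for the tag-name argument of find()
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
# BeautifulSoup 3 kept attributes as a list of pairs, so fooId.attrs[2][1]
# (the value of the third attribute) worked there; in bs4, index by name
value = fooId['value']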

Other answers

I agree with Vinko: BeautifulSoup is the way to go. However, I suggest using fooId['value'] to get the attribute rather than relying on value being the third attribute.

from bs4 import BeautifulSoup
# Or retrieve it from the web, etc.
with open('/yourwebsite/page.html', 'r') as f:
    html_data = f.read()
# Create the soup object from the HTML data
soup = BeautifulSoup(html_data, 'html.parser')
# Find the proper tag ("name" must go through attrs, since find() uses it for the tag name)
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = fooId['value']  # The value attribute

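import re
# Sample line taken from the question
inputHTML = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
# Capture whatever sits between the quotes of the value attribute
reg = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)" />')
value = reg.search(inputHTML).group(1)
print('Value is', value)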

Parsing is one of those areas where you really don't want to roll your own if you can avoid it, as you'll be chasing down edge cases and bugs for years to come.

I'd recommend using BeautifulSoup. It has a very good reputation, and judging from the docs it looks pretty easy to use.
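
To illustrate the kind of edge cases meant here, the following is a minimal sketch that uses the standard library's html.parser rather than BeautifulSoup (a swapped-in technique, chosen only so the example has no third-party dependency). The class name HiddenInputFinder and the sample markup are hypothetical and not part of the original answers.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Collect the value of every hidden <input> with a given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, and the attributes
        # are looked up by name, so their order does not matter.
        attr_map = dict(attrs)
        if (tag == "input"
                and attr_map.get("type") == "hidden"
                and attr_map.get("name") == self.field_name):
            self.values.append(attr_map.get("value"))


# Hypothetical sample: a reordered, differently cased tag and a commented-out one
sample = """
<input type="hidden" name="fooId" value="12-3456789-1111111111" />
<INPUT value='reordered' NAME="fooId" type="hidden">
<!-- <input type="hidden" name="fooId" value="commented out" /> -->
"""

finder = HiddenInputFinder("fooId")
finder.feed(sample)
finder.close()
print(finder.values)
```

A fixed-order regex would need a separate alternative for each of these variations, whereas the parser reports both live tags and skips the commented-out one.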

Pyparsing is a good interim step between BeautifulSoup and regex. It is more robust than plain regexes, since its HTML tag parsing understands variations in case, whitespace, and attribute presence/absence/order, yet it is simpler for this kind of basic tag extraction than using BS.

Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example showing several variations on your input tag that would give regexes fits, and it also shows how NOT to match a tag if it is within a comment:

html = """<html><body>
<input type="hidden" name="fooId" value="**[id is here]**" />
<blah>
<input name="fooId" type="hidden" value="**[id is here too]**" />
<input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
<INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
<!--
<input type="hidden" name="fooId" value="**[don't report this id]**" />
-->
<foo>
</body></html>"""

from pyparsing import makeHTMLTags, withAttribute, htmlComment

# use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
# opening and closing tags, we're only interested in the opening tag
inputTag = makeHTMLTags("input")[0]

# only want input tags with special attributes
inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))

# don't report tags that are commented out
inputTag.ignore(htmlComment)

# use searchString to skip through the input 
foundTags = inputTag.searchString(html)

# dump out first result to show all returned tags and attributes
print(foundTags[0].dump())
print()

# print out the value attribute for all matched tags
for inpTag in foundTags:
    print(inpTag.value)

Prints:

['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
- empty: True
- name: fooId
- startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
  - empty: True
  - name: fooId
  - type: hidden
  - value: **[id is here]**
- type: hidden
- value: **[id is here]**

**[id is here]**
**[id is here too]**
**[id is HERE too]**
**[and id is even here TOO]**

You can see that pyparsing not only matches these unpredictable variations, it also returns the data in an object that makes it easy to read out the individual tag attributes and their values.

/<input type="hidden" name="fooId" value="([\d-]+)" \/>/

/<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*\/>/

>>> import re
>>> s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
>>> re.match(r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>', s).groups()
('fooId', '12-3456789-1111111111')
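
If the regex route is taken anyway, it may help to wrap the pattern in a small function that returns None when nothing matches, instead of raising AttributeError when .groups() is called on None. The sketch below assumes the fixed attribute order (type, name, value) from the question; the function name hidden_value is hypothetical.

```python
import re


def hidden_value(html, field_name):
    """Return the value of the hidden input named field_name, or None."""
    # re.escape keeps any special characters in the field name literal
    pattern = re.compile(
        r'<input\s+type="hidden"\s+name="' + re.escape(field_name)
        + r'"\s+value="([^"]*)"\s*/>'
    )
    match = pattern.search(html)
    return match.group(1) if match else None


s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
print(hidden_value(s, "fooId"))  # the id from the question
print(hidden_value(s, "barId"))  # no such field, so None
```

This still breaks as soon as the attributes are reordered or quoted differently, which is the limitation the parser-based answers avoid.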

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow