python - Parsing an HTML form with lxml.html using XPath syntax

https://stackoverflow.com//questions/22027488

21-12-2019

Question

Here is the form. The exact same form appears twice in the page source.

<form method="POST" action="/login/?session=sess">
<input type="text" id="usern" name="username" value="" placeholder="Username"/>
<input type="password" id="passw" name="password" placeholder="Password"/>
<input type="hidden" name="ses_token" value="token"/>
<input id="login" type="submit" name="login" value="Log In"/>
</form>

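I use this Python code to get the "action" attribute:

import lxml.html

# pagesource holds the HTML of the page as a string
tree = lxml.html.fromstring(pagesource)
# '//@action' selects every attribute named "action" in the document
print(tree.xpath('//@action'))
input()  # wait for Enter so the console window stays open

Since there are two forms, it prints both attributes: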

['/login/?session=sess', '/login/?session=sess']

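How can I make it print only one? I need just one, since the two forms are identical.

I also have a second question.

How can I get the value of the token? I am talking about this line: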

 <input type="hidden" name="ses_token" value="token"/>

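I tried similar code:

import lxml.html

# pagesource holds the HTML of the page as a string
tree = lxml.html.fromstring(pagesource)
# '//@value' selects every attribute named "value" in the document
print(tree.xpath('//@value'))
input()  # wait for Enter so the console window stays open

However, since there are multiple attributes named value, it prints: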

['', 'token', 'Log In', '', 'token', 'Log In'] # or something close to that

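How can I get only the token? And only one of them?

Is there a better way?

Solution

Use find() instead of xpath(), since find() returns only the first match.

Here is an example based on the code you provided: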

import lxml.html


pagesource = """<form method="POST" action="/login/?session=sess">
<input type="text" id="usern" name="username" value="" placeholder="Username"/>
<input type="password" id="passw" name="password" placeholder="Password"/>
<input type="hidden" name="ses_token" value="token"/>
<input id="login" type="submit" name="login" value="Log In"/>
</form>
<form method="POST" action="/login/?session=sess">
<input type="text" id="usern" name="username" value="" placeholder="Username"/>
<input type="password" id="passw" name="password" placeholder="Password"/>
<input type="hidden" name="ses_token" value="token"/>
<input id="login" type="submit" name="login" value="Log In"/>
</form>
"""

tree = lxml.html.fromstring(pagesource)
form = tree.find('.//form')

# form.action and input.value are properties provided by lxml.html elements
print("Action:", form.action)
print("Token:", form.find('.//input[@name="ses_token"]').value)

Prints:

Action: /login/?session=sess
Token: token
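
If you would rather stay with xpath(), the sketch below shows one way to do it. It assumes Python 3 and a shortened, hypothetical copy of the form repeated twice (the FORM and pagesource strings are illustrative, not the real page). The expression (//form)[1] limits the search to the first form in the document, and the @ prefix selects an attribute rather than an element. The last lines use the forms and fields helpers that lxml.html provides, as another possible route to the same values.

```python
import lxml.html

# Hypothetical sample page: a shortened copy of the login form, repeated twice.
FORM = """<form method="POST" action="/login/?session=sess">
<input type="text" id="usern" name="username" value="" placeholder="Username"/>
<input type="hidden" name="ses_token" value="token"/>
<input id="login" type="submit" name="login" value="Log In"/>
</form>
"""
pagesource = FORM * 2

tree = lxml.html.fromstring(pagesource)

# '(//form)[1]' keeps only the first form in the document;
# '/@action' then selects its "action" attribute.
actions = tree.xpath('(//form)[1]/@action')
tokens = tree.xpath('(//form)[1]//input[@name="ses_token"]/@value')

# xpath() always returns a list, so take the first item if there is one.
action = actions[0] if actions else None
token = tokens[0] if tokens else None

print("Action:", action)
print("Token:", token)

# Alternative without XPath: lxml.html exposes the forms on the page as a list,
# and each form's fields as a dict-like object keyed by input name.
first_form = tree.forms[0]
print("Token via fields:", first_form.fields["ses_token"])
```

Because xpath() returns a list even for a single match, the explicit check for an empty list avoids an IndexError when the page does not contain the form.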

Hope that helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow