Question

How can I strip this out and keep the rest, using Python and Beautiful Soup? The other items in the td cells need to be kept.

<td style="background:#aaccff" width="50"></td>
<td align="left" style="background:#aaccff" width="150">Device Type</td>
<td align="left" style="background:#aaccff" width="115">IP Address</td>
<td align="left" style="background:#aaccff" width="100">Device Name</td>
<td align="left" style="background:#aaccff" width="215">Notes</td>
<td width="50"></td>

Here is the full HTML:

<td style="background:#aaccff" width="50"></td>
<td align="left" style="background:#aaccff" width="150">Device Type</td>
<td align="left" style="background:#aaccff" width="115">IP Address</td>
<td align="left" style="background:#aaccff" width="100">Device Name</td>
<td align="left" style="background:#aaccff" width="215">Notes</td>
<td width="50"></td>
<td align="left" width="150">AudioCodes Gateway</td>
<td align="left" width="115">172.31.31.2</td>
<td align="left" width="100"></td>
<td align="left" width="215">FXO</td>
<td style="background:#aaccff" width="50"></td>
<td align="left" style="background:#aaccff" width="150">Device Type</td>
<td align="left" style="background:#aaccff" width="115">IP Address</td>
<td align="left" style="background:#aaccff" width="100">Device Name</td>
<td align="left" style="background:#aaccff" width="215">Notes</td>
<td width="50"></td>
<td align="left" width="150">IC Server</td>
<td align="left" width="115">172.31.56.151</td>
<td align="left" width="100">IND056GIC151</td>
<td align="left" width="215">NAT'd IP = PENDING MPLS, Voice IP = 172.31.52.151</td>
<td width="50"></td>
<td align="left" width="150">IC Server</td>
<td align="left" width="115">172.31.56.152</td>
<td align="left" width="100">IND056GIC152</td>
<td align="left" width="215">NAT'd IP = PENDING MPLS, Voice IP = 172.31.52.152</td>
<td width="50"></td>
<td align="left" width="150">Media Server</td>
<td align="left" width="115">IND1106HMS07</td>
<td align="left" width="100">IND1106HMS07</td>
<td align="left" width="215"></td>
<td width="50"></td>
<td align="left" width="150">Media Server</td>
<td align="left" width="115">IND1106HMS07</td>
<td align="left" width="100">IND1106HMS07</td>
<td align="left" width="215"></td>

Here is the code I have so far:

from ntlm import HTTPNtlmAuthHandler
from bs4 import BeautifulSoup
import requests, os, bleach, urllib2, cookielib

os.system('clear')
user = 'user'
password = "pass"
url = "url"

cookies = cookielib.CookieJar()
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman))

pagedata=opener.open(url)
soup=BeautifulSoup(pagedata)

def myfunction(b):
    table = b.find('ul', {'class': 'dfwp-column dfwp-list'})

    # Remove every link inside the list before printing the cells.
    for a in table.findAll('a'):
        a.decompose()

    for tr in table.findAll('tr'):
        for td in tr.findAll('td'):
            print td

myfunction(soup)

Here is the current output:

Device Type IP Address Device Name Notes

AudioCodes Gateway 172.31.31.2

FXO

Device Type IP Address Device Name Notes

IC Server 172.31.56.151 IND056GIC151 NAT'd IP = PENDING MPLS, Voice IP = 172.31.52.151

IC Server 172.31.56.152 IND056GIC152 NAT'd IP = PENDING MPLS, Voice IP = 172.31.52.152

Media Server IND1106HMS07 IND1106HMS07

Media Server IND1106HMS07 IND1106HMS07


Solution

Generally, when people ask how to "remove" something with bs4, they're really asking how to exclude it from a find operation.

You want to exclude the empty cells (i.e. tags with tag.text == '') and those four "column header" tags. You can do the latter with CSS selectors, but the former needs to be filtered explicitly. So it's easiest to do both at once, which is also more declarative in my opinion:

soup = BeautifulSoup(that_long_html_you_gave)

blacklist = {'Device Type','IP Address','Device Name','Notes'}

table = soup.body # to match your variable name.  I think.

table.find_all(lambda tag: tag.text and tag.text not in blacklist)
Out[45]: 
[<td align="left" width="150">AudioCodes Gateway</td>,
 <td align="left" width="115">172.31.31.2</td>,
 <td align="left" width="215">FXO</td>,
 <td align="left" width="150">IC Server</td>,
 <td align="left" width="115">172.31.56.151</td>,
 <td align="left" width="100">IND056GIC151</td>,
 <td align="left" width="215">NAT'd IP = PENDING MPLS, Voice IP = 172.31.52.151</td>,
 <td align="left" width="150">IC Server</td>,
 <td align="left" width="115">172.31.56.152</td>,
 <td align="left" width="100">IND056GIC152</td>,
 <td align="left" width="215">NAT'd IP = PENDING MPLS, Voice IP = 172.31.52.152</td>,
 <td align="left" width="150">Media Server</td>,
 <td align="left" width="115">IND1106HMS07</td>,
 <td align="left" width="100">IND1106HMS07</td>,
 <td align="left" width="150">Media Server</td>,
 <td align="left" width="115">IND1106HMS07</td>,
 <td align="left" width="100">IND1106HMS07</td>]
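For reference, here is a self-contained sketch of the same filtering idea, written for Python 3 with the built-in "html.parser" backend (both are assumptions, since the question uses Python 2 and does not name a parser). The sample markup is a shortened copy of the cells from the question, and the names HEADER_LABELS and data_cells are made up for illustration. It also shows one possible form of the CSS-selector route mentioned above, which drops the styled header cells by their style attribute but still needs a separate check for empty cells.

```python
from bs4 import BeautifulSoup

# Shortened sample of the cells from the question (assumed representative).
HTML = """
<td style="background:#aaccff" width="50"></td>
<td align="left" style="background:#aaccff" width="150">Device Type</td>
<td align="left" style="background:#aaccff" width="115">IP Address</td>
<td align="left" style="background:#aaccff" width="100">Device Name</td>
<td align="left" style="background:#aaccff" width="215">Notes</td>
<td width="50"></td>
<td align="left" width="150">AudioCodes Gateway</td>
<td align="left" width="115">172.31.31.2</td>
<td align="left" width="100"></td>
<td align="left" width="215">FXO</td>
"""

# Hypothetical name: the header texts to exclude, as in the blacklist above.
HEADER_LABELS = {"Device Type", "IP Address", "Device Name", "Notes"}


def data_cells(soup):
    """Return <td> tags that contain text and are not column headers."""
    def keep(tag):
        text = tag.get_text(strip=True)
        return tag.name == "td" and bool(text) and text not in HEADER_LABELS
    return soup.find_all(keep)


soup = BeautifulSoup(HTML, "html.parser")

# Approach 1: one predicate handles both empty cells and header cells.
values = [td.get_text(strip=True) for td in data_cells(soup)]

# Approach 2: a CSS selector skips cells carrying a style attribute (the
# headers in this sample); empty cells are then filtered separately.
unstyled = soup.select("td:not([style])")
css_values = [td.get_text(strip=True) for td in unstyled
              if td.get_text(strip=True)]

print(values)
print(css_values)
```

Neither approach modifies the parsed tree, so the header and spacer cells remain available if they are needed later. The selector variant relies on the headers being the only styled cells, which holds for the sample here but may not hold for the full page.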
Licensed under: CC-BY-SA with attribution