Question

Here is what I have so far:

import urllib2
from ntlm import HTTPNtlmAuthHandler
from bs4 import BeautifulSoup
import requests
import os
import bleach
def stripAllTags( html ):
    if html is None:
            return None
    return ''.join( BeautifulSoup( html ).findAll( text = True ) ) 
os.system('clear')

user = '<user>'
password = "<pass>"
url = "<some url>"

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
# create the NTLM authentication handler
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)

# create and install the opener
opener = urllib2.build_opener(auth_NTLM)
urllib2.install_opener(opener)

data = urllib2.urlopen(url)

soup = BeautifulSoup(data)

table = soup.find('ul', {'class': 'dfwp-column dfwp-list'})
td = table.findAll('td')
tr = table.findAll('tr')
for td in table:
    for tr in td:
        clean = bleach.clean(tr, tags=[], strip=True)
        print clean

How can I properly turn this into a function?

table = soup.find('ul', {'class': 'dfwp-column dfwp-list'})
td = table.findAll('td')
tr = table.findAll('tr')
for td in table:
    for tr in td:
        clean = bleach.clean(tr, tags=[], strip=True)
        print clean

I want to call it in a 'for' loop.


Solution

Okay, firstly, a note on the opener. You build a urllib2 opener, install it globally with urllib2.install_opener(), and then fetch the page with urllib2.urlopen(). That works, because urlopen() goes through whatever opener is installed, but calling opener.open(url) directly is more explicit and does not change global state. Also, with a username and password being specified in your code, I'm assuming you'll be logging into a website at some point. If that's the case, you'll be in a world of hurt without cookie handling. I've reorganized your code a bit, and I think the following should be a good starting point for you.

Also, here is a walkthrough of what the loop you posted actually does:

  • It searches the entire BeautifulSoup object for an unordered list with the class dfwp-column dfwp-list.
  • The td variable is set to all 'td' tags in that match.
  • The tr variable is set to all 'tr' tags in that same match.
  • Even though you haven't done anything with those two variables yet, you throw them away by writing loops that reuse the same names, so the values you just looked up are never used.
  • The loops then walk the single matched element itself: 'for td in table' iterates over its direct children and 'for tr in td' iterates over each child's own contents, not over the 'td' and 'tr' tags you found, and the cleaned result of each item is printed.

It doesn't do what it looks like it does.
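The name reuse described above can be reproduced without any HTML at all. The following is a minimal sketch in Python 3 with plain lists standing in for the BeautifulSoup results; the list contents are made up for illustration and only the variable names mirror the question.

```python
# Plain lists stand in for the BeautifulSoup objects (hypothetical data).
table = [["a", "b"], ["c"]]   # plays the role of the matched <ul> element
td = ["cell-1", "cell-2"]     # plays the role of table.findAll('td')
tr = ["row-1"]                # plays the role of table.findAll('tr')

seen = []
for td in table:              # rebinds td to each child of table
    for tr in td:             # rebinds tr to each item of that child
        seen.append(tr)

print(seen)    # ['a', 'b', 'c']
print(td, tr)  # ['c'] c
```

The lists first assigned to td and tr are never read: the loops only ever see what is inside table.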

To avoid this, here is a new function that performs the operations you specified:

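def myfunction(b):
    """Print the tag-stripped text of every 'tr' in every 'td' of the list.

    b is a BeautifulSoup instance.
    """
    table = b.find('ul', {'class': 'dfwp-column dfwp-list'})
    for td in table.findAll('td'):
        for tr in td.findAll('tr'):
            # newer bleach versions only accept text, so convert the Tag first
            print bleach.clean(unicode(tr), tags=[], strip=True)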

That is much less code, and this way it finds the correct data and iterates correctly, like so:

  • table is the unordered list with the 'dfwp-column dfwp-list' class
  • it prints the result of the bleach operation for every 'tr' tag found in every 'td' tag found in the table
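Since the question asks for something that can be called in a 'for' loop, one option is a generator that yields the cleaned text instead of printing it. This is only a sketch under several assumptions that are not in the original: it is written for Python 3, the name iter_row_text and the sample HTML are hypothetical, it collects every 'tr' under the list directly instead of going through 'td' first, and it swaps bleach.clean for BeautifulSoup's own get_text() to strip the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page structure may differ.
SAMPLE_HTML = """
<ul class="dfwp-column dfwp-list">
  <li><table>
    <tr><td><b>First</b> row</td></tr>
    <tr><td>Second <i>row</i></td></tr>
  </table></li>
</ul>
"""


def iter_row_text(soup):
    """Yield the tag-free text of each <tr> inside the target <ul>."""
    container = soup.find("ul", {"class": "dfwp-column dfwp-list"})
    if container is None:
        return  # nothing matched, so the generator yields nothing
    for tr in container.find_all("tr"):
        # get_text drops the markup; strip=True trims each text fragment
        yield tr.get_text(" ", strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for text in iter_row_text(soup):
    print(text)
```

Returning the values this way leaves it to the caller to decide whether to print, store, or filter them.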

Just trying to be helpful: I've cleaned up and reordered your code to eliminate some waste and added the things already mentioned. Try this for now:

from ntlm import HTTPNtlmAuthHandler
from bs4 import BeautifulSoup
import requests, os, bleach, urllib2, cookielib

user = 'XXX'
password = 'XXX'
url = 'URL'

# keep cookies between requests so a login session survives
cookies = cookielib.CookieJar()
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
# one opener that handles both cookies and NTLM authentication
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),
                              HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman))

# fetch the page through the opener itself rather than urllib2.urlopen()
pagedata = opener.open(url)
soup = BeautifulSoup(pagedata)
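The code above targets Python 2. For readers on Python 3, the same opener-plus-cookies setup might look roughly like the sketch below, where urllib2 becomes urllib.request and cookielib becomes http.cookiejar. Two things here are assumptions for illustration only: the URL and credentials are placeholders, and HTTPBasicAuthHandler stands in for the NTLM handler so that the example needs nothing beyond the standard library. No request is actually sent.

```python
import http.cookiejar
import urllib.request

# Placeholder values, not taken from the original post.
url = "http://example.invalid/"
user = "user"
password = "secret"

cookies = http.cookiejar.CookieJar()
passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)

opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookies),
    # stand-in for HTTPNtlmAuthHandler, which is not in the standard library
    urllib.request.HTTPBasicAuthHandler(passman),
)

# Inspect the handlers the opener will use; opener.open(url) is not called here.
handler_names = [type(h).__name__ for h in opener.handlers]
print("HTTPCookieProcessor" in handler_names)
```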
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow