Question

I am trying to remove all of the tags from the links that I got from crawling.

Here is the code:

request = urllib2.Request("http://sport.detik.com/sepakbola/")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

   for a in soup.findAll('a'):
   if 'http://sport.detik.com/sepakbola/read/' in a['href']:
            urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', a)

Unfortunately, it does not work: it fails with "expected string or buffer" in findall(). Is the value I get from the for loop not a string? Any help will be appreciated.

Thanks

Solution 2

a in your loop is not a string; it's a BeautifulSoup.Tag, which behaves like a dictionary when you look up its attributes. In your if statement you correctly get the href string from the tag to compare with, but when matching the regex you pass the tag itself.

Simply using the string a['href'] instead of the tag a in the regex match will fix your runtime error:

for a in soup.findAll('a'):
  if 'http://sport.detik.com/sepakbola/read/' in a['href']:
    urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', a['href'])
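To see the difference in isolation, here is a minimal sketch that does not need a network connection or BeautifulSoup. FakeTag is a hypothetical stand-in for a tag (it is not part of any library), and the example URL path is made up; the pattern is the one from the question.

```python
import re

# Pattern copied from the question.
URL_RE = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)


class FakeTag(object):
    """Hypothetical stand-in for a BeautifulSoup Tag: not a string, but it
    supports tag['href'] style attribute lookup."""

    def __init__(self, attrs):
        self.attrs = attrs

    def __getitem__(self, key):
        return self.attrs[key]


# The URL path below is an invented example.
a = FakeTag({'href': 'http://sport.detik.com/sepakbola/read/example'})

error = None
try:
    URL_RE.findall(a)              # passing the tag object itself
except TypeError as exc:
    error = exc                    # re only accepts string-like input

urls = URL_RE.findall(a['href'])   # passing the attribute string works
```

The first call raises a TypeError because the regex engine is handed an object rather than text; the second call receives the href string and returns a list of matches.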

OTHER TIPS

The indentation of your code is not correct; please fix it. Then change the last line to:

urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', a['href'])

Here a is of type <class 'bs4.element.Tag'>, not a string, which is why you get the error. Change it to a['href'], which is a string.
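One more thing to watch for: a['href'] raises a KeyError for an anchor that has no href attribute. Since the href is already a plain string, a possible way to collect the article links is to skip the regex and just filter on the prefix. The sketch below is only an illustration: article_links is a made-up helper name, plain dicts stand in for the tags so it runs on its own, and it uses startswith where the question used an "in" check.

```python
PREFIX = 'http://sport.detik.com/sepakbola/read/'


def article_links(tags, prefix=PREFIX):
    """Return the href of every tag whose href starts with prefix."""
    links = []
    for tag in tags:
        href = tag.get('href')      # None when the tag has no href
        if href and href.startswith(prefix):
            links.append(href)
    return links


# Plain dicts stand in for tags here (invented sample data); with
# BeautifulSoup you would pass soup.findAll('a') instead.
sample = [
    {'href': 'http://sport.detik.com/sepakbola/read/example-1'},
    {'href': 'http://sport.detik.com/other'},
    {'name': 'anchor-without-href'},
]
links = article_links(sample)
```

Using .get('href') instead of ['href'] means anchors without a link are skipped quietly rather than stopping the loop.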

Licensed under: CC-BY-SA with attribution