Unicode object error in parsing XML using BeautifulSoup

https://stackoverflow.com/questions/23264479

08-07-2023

Question

Reading the contents of the 'name' tag from the XML output below with BeautifulSoup raises the following error:

AttributeError: 'unicode' object has no attribute 'get_text'

XML Output:

<show>
  <stud>
    <__readonly__>
      <TABLE_stud>
        <ROW_stud>
          <name>rice</name>
          <dept>chem</dept>
          .
          .
          .
        </ROW_stud>
      </TABLE_stud>
    </__readonly__>
  </stud>
</show>

However, accessing the contents of other tags, such as 'dept', works fine.

stud_info = output_xml.find_all('row_stud')
for eachStud in range(len(stud_info)):
    print stud_info[eachStud].dept.get_text()   # Gives 'chem'
    print stud_info[eachStud].name.get_text()   # Raises the AttributeError above

Can any Python/BeautifulSoup experts help me resolve this? (I know BeautifulSoup is not ideal for parsing XML, but let's just say I'm compelled to use it.)

Solution

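Tag.name is an attribute containing the tag name; its value here is row_stud. That value is a plain unicode string, not a Tag, which is why calling .get_text() on it raises the AttributeError.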

Attribute access to contained tags is a shortcut for .find(attributename), but only works if there isn't already an attribute in the API with the same name. Use .find() instead:

print stud_info[eachStud].find('name').get_text()
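To see the difference concretely, here is a minimal sketch, assuming Python 3 with the bs4 package and the built-in html.parser backend. The XML is a trimmed copy of the sample from the question (the __readonly__ wrapper and the elided fields are left out), and the variable names are illustrative only. Under Python 3 the tag name is a str rather than unicode, so the equivalent error message would mention 'str'.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's XML (assumption: wrapper tags omitted).
xml = (
    "<show><stud><TABLE_stud><ROW_stud>"
    "<name>rice</name><dept>chem</dept>"
    "</ROW_stud></TABLE_stud></stud></show>"
)

# html.parser lowercases tag names, so the row is found as 'row_stud'.
soup = BeautifulSoup(xml, "html.parser")
row = soup.find("row_stud")

# .name is part of the Tag API: it holds the tag's own name as a string.
tag_name = row.name

# .find('name') looks up the child <name> element instead.
student_name = row.find("name").get_text()

# 'dept' does not clash with any Tag attribute, so the shortcut works.
dept = row.dept.get_text()

print(tag_name, student_name, dept)
```

Because row.name is a string, row.name.get_text() fails, while row.find('name') returns the child Tag as intended.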

You can loop over the stud_info result list directly, no need to use range() here:

stud_info = output_xml.find_all('row_stud')
for eachStud in stud_info:
    print eachStud.dept.get_text()
    print eachStud.find('name').get_text()

I notice that you are searching for row_stud in lower-case. If you are parsing XML with BeautifulSoup, make sure that you have lxml installed and tell BeautifulSoup it is XML you are processing, so that it won't HTML-ize your tags (lowercase them):

soup = BeautifulSoup(source, 'xml')
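As a rough illustration of the effect on tag case, the sketch below assumes Python 3 with both bs4 and lxml installed; source here is a trimmed stand-in for the question's XML, and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the question's XML (assumption: wrapper tags omitted).
source = (
    "<show><stud><TABLE_stud><ROW_stud>"
    "<name>rice</name><dept>chem</dept>"
    "</ROW_stud></TABLE_stud></stud></show>"
)

# The 'xml' feature selects lxml's XML parser, which keeps tag case intact.
soup = BeautifulSoup(source, "xml")

# With the XML parser, searches must use the original spelling of the tag.
exact_case = soup.find_all("ROW_stud")
lower_case = soup.find_all("row_stud")

names = [row.find("name").get_text() for row in exact_case]
print(len(exact_case), len(lower_case), names)
```

With the XML parser the lower-case search no longer matches, so the find_all() call from the question would need to use ROW_stud.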

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow