As BeautifulSoup doesn't support XPath, the best way would be to use lxml.
Extracting specific data with BeautifulSoup
11-10-2022
Question
I want to extract a bit of data from this snippet:
<div id="information_content">
<b>Name:</b> file.rar <br>
<b>Date Modified:</b> 2 days ago <br>
<b>Size:</b> 212.19 MB <br>
<b>Type:</b> Archive <br>
<b>Permissions:</b> Public </div>
</div>
I want to extract only 212.19 MB.
I have extracted the snippet using soup.find('div', attrs={'id': 'information_content'})
but I can't figure out how to drill further down to get what I need.
Can anybody help?
No correct solution
OTHER TIPS
If the div always has the same structure, you can do this with BeautifulSoup alone. Once you have extracted the div, build a list from its text, split on '\n'. Then just select the right element of the list.
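A minimal sketch of that idea, assuming the markup has a line break after each <br> as in the question's snippet (the variable names `lines` and `fields` are only illustrative). Instead of relying on a fixed list position, it looks the value up by its label, which should hold up a little better if the order of the fields changes:

```python
from bs4 import BeautifulSoup

# The snippet from the question, with one field per line (an assumption
# about how the real page is laid out).
html = """<div id="information_content">
<b>Name:</b> file.rar <br>
<b>Date Modified:</b> 2 days ago <br>
<b>Size:</b> 212.19 MB <br>
<b>Type:</b> Archive <br>
<b>Permissions:</b> Public </div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", attrs={"id": "information_content"})

# Split the div's text on newlines and drop empty entries.
lines = [line.strip() for line in div.get_text().split("\n") if line.strip()]

# Turn each "Label: value" line into a dictionary entry.
fields = {}
for line in lines:
    label, value = line.split(":", 1)
    fields[label.strip()] = value.strip()

size = fields["Size"]
print(size)
```

If the page puts all fields on a single line, the split on '\n' will not separate them, so this only works for markup shaped like the snippet above.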
I've done something similar and here I explained everything I did: Python and BeautifulSoup: extracting prizes from Quiniela - http://www.manejandodatos.es/2014/2/python-beautifulsoup-extracting-prizes-quiniela
I hope it helps!
As said previously, if the structure of these divs is always the same, the size will be in the third string when you split on '<br>'.
>>> x = '<div id="information_content"> <b>Name:</b> file.rar <br> <b>Date Modified:</b> 2 days ago <br> <b>Size:</b> 212.19 MB <br> <b>Type:</b> Archive <br> <b>Permissions:</b> Public </div> </div>'
>>> x.split('<br>')[2]
' <b>Size:</b> 212.19 MB '
From there you can use a regular expression to get just the part you need. For example, this pattern matches values in this format:
\d+\.\d\d\s.B
It matches 10.00 kB as well as 1000.34 TB. The dot is escaped so that it only matches a literal decimal point.
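As a rough sketch, the pattern could be applied with Python's `re` module like this. The helper name `extract_size` is hypothetical, and the decimal point is written as `\.` here so that it cannot match an arbitrary character; like the pattern above, it assumes exactly two decimal places:

```python
import re

# One or more digits, a literal dot, two digits, whitespace, then a
# two-character unit ending in "B" (kB, MB, GB, TB, ...).
SIZE_PATTERN = re.compile(r"\d+\.\d\d\s.B")


def extract_size(text):
    """Return the first size-like value in text, or None if there is none."""
    match = SIZE_PATTERN.search(text)
    return match.group(0) if match else None


fragment = ' <b>Size:</b> 212.19 MB '
print(extract_size(fragment))
```

Returning None when nothing matches lets the caller decide how to handle pages where the size is missing or formatted differently.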