Question

I want to extract a bit of data from this snippet:

<div id="information_content">
    <b>Name:</b> file.rar <br>
    <b>Date Modified:</b> 2 days ago <br>
    <b>Size:</b> 212.19 MB <br>
    <b>Type:</b> Archive <br>
    <b>Permissions:</b> Public </div>
</div>

I want to extract only 212.19 MB.

I have extracted the snippet using soup.find('div', attrs={'id': 'information_content'}) but I can't figure out how to drill further down to get what I need.

Can anybody help?

No correct solution

OTHER TIPS

Since BeautifulSoup doesn't support XPath, the best way would be to use lxml.
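Here is a minimal sketch of the lxml route, assuming lxml is installed and the markup looks like the snippet in the question (reproduced here without the stray closing tag). The XPath expression is just one way to write it: it finds the <b> element whose text is "Size:" and takes the first text node that follows it.

```python
from lxml import html

# The snippet from the question, hard-coded for illustration.
snippet = """<div id="information_content">
    <b>Name:</b> file.rar <br>
    <b>Date Modified:</b> 2 days ago <br>
    <b>Size:</b> 212.19 MB <br>
    <b>Type:</b> Archive <br>
    <b>Permissions:</b> Public </div>"""

tree = html.fromstring(snippet)

# Select the text node immediately after <b>Size:</b> inside the div.
nodes = tree.xpath(
    '//div[@id="information_content"]'
    '/b[text()="Size:"]/following-sibling::text()[1]'
)

# xpath() returns a list; it is empty if the label is missing.
size = nodes[0].strip() if nodes else None
print(size)
```

Because this looks the value up by its label, it should keep working if the fields are reordered, as long as the "Size:" label itself stays the same.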

If the div always has the same structure, you can follow these instructions using BeautifulSoup. Once you have extracted the div, build a list from its text, split on '\n'. Then just select the right element of the list.
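The steps above could look roughly like this. It is only a sketch, assuming the div always has one field per line as in the question and that the built-in "html.parser" backend is good enough. The names snippet, lines and fields are made up for the example.

```python
from bs4 import BeautifulSoup

# The snippet from the question, hard-coded for illustration.
snippet = """<div id="information_content">
    <b>Name:</b> file.rar <br>
    <b>Date Modified:</b> 2 days ago <br>
    <b>Size:</b> 212.19 MB <br>
    <b>Type:</b> Archive <br>
    <b>Permissions:</b> Public </div>"""

soup = BeautifulSoup(snippet, "html.parser")
div = soup.find("div", attrs={"id": "information_content"})

# get_text() drops the tags; splitting on newlines leaves one field per entry.
lines = [line.strip() for line in div.get_text().split("\n") if line.strip()]

# Option 1: pick the field by position (third entry, index 2).
size_line = lines[2]

# Option 2: build a label -> value mapping so the order no longer matters.
fields = {}
for line in lines:
    label, _, value = line.partition(":")
    fields[label.strip()] = value.strip()

size = fields["Size"]
print(size)
```

Picking by position is shorter, but the label lookup will not silently return the wrong field if the page adds or reorders a line.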

I've done something similar, and I explain everything I did here: Python and BeautifulSoup: extracting prizes from Quiniela - http://www.manejandodatos.es/2014/2/python-beautifulsoup-extracting-prizes-quiniela

I hope it helps!

As said previously, if the structure of these divs is always the same, the size will be in the third string if you split on '<br>'.

>>> x = '<div id="information_content"> <b>Name:</b> file.rar <br> <b>Date Modified:</b> 2 days ago <br> <b>Size:</b> 212.19 MB <br> <b>Type:</b> Archive <br> <b>Permissions:</b> Public </div> </div>'
>>> x.split('<br>')[2]
' <b>Size:</b> 212.19 MB '

From there you can use regular expressions to get just the part you need. For example this pattern matches all values of this kind of formatting:

\d+\.\d\d\s.B

It matches 10.00 kB as well as 1000.34 TB.
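To make the regex step concrete, here is a small sketch using Python's re module on the fragment produced by the split. The decimal point is escaped so that it matches only a literal dot, and the names fragment and SIZE_PATTERN are just for illustration.

```python
import re

# The third piece you get after splitting the snippet on '<br>'.
fragment = ' <b>Size:</b> 212.19 MB '

# digits, a literal dot, two digits, whitespace, any one character, then 'B'
SIZE_PATTERN = re.compile(r'\d+\.\d\d\s.B')

match = SIZE_PATTERN.search(fragment)

# search() returns None when nothing matches, so guard before using it.
size = match.group(0) if match else None
print(size)
```

Note that this pattern expects exactly two decimal places and a two-letter unit, so a value such as "5 MB" or "12.5 GB" would need a looser pattern.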

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow