Question

The following fragment is repeated on a web page:

<div class="txt ext">
 <strong class="param">param_value1</strong>            
 <strong class="param">param_value2</strong>                                                
</div>

I would like to extract the values param_value1 and param_value2 separately using XPath. How can I do it?

I have tried the following constructions:

'//strong[@class="param"]/text()[0]'
'//strong[@class="txt ext"]/strong[@class="param"][0]/text()'
'//strong[@class="param"]'

None of them returned param_value1 and param_value2 separately.

P.S. I am using Python 2.7 and the latest version of Scrapy.

Solution

Here is my testing code:

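from scrapy.selector import HtmlXPathSelector

test_content = '<div class="txt ext"><strong class="param">param_value1</strong><strong class="param">param_value2</strong></div>'

sel = HtmlXPathSelector(text=test_content)
# extract() returns a plain Python list, so it is indexed from 0.
# (Later Scrapy releases replace HtmlXPathSelector/.select() with Selector/.xpath().)
values = sel.select('//div/strong[@class="param"]/text()').extract()
values[0]  # first value
values[1]  # second value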

OTHER TIPS

// is shorthand for the descendant-or-self axis, so //strong selects every strong element anywhere in the document. [...] is a predicate, which restricts the selection according to a boolean test. No strong element has a class attribute equal to txt ext (that class belongs to the div), so your second expression matches nothing and can be ruled out.

Your last expression returns a node-set of all the strong elements whose class attribute equals param. You can then pick individual nodes from that node-set (use [1], [2]) and get their text contents (use text()).
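As a rough illustration of selecting the whole node-set first and picking values afterwards, here is a minimal sketch using Python's standard-library xml.etree.ElementTree instead of Scrapy. It assumes the fragment is well-formed XML (real HTML usually needs an HTML parser such as Scrapy's selectors), and ElementTree only supports a subset of XPath: there is no text() step, so the text is read from the .text attribute.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, assumed to be well-formed markup.
fragment = (
    '<div class="txt ext">'
    '<strong class="param">param_value1</strong>'
    '<strong class="param">param_value2</strong>'
    '</div>'
)
root = ET.fromstring(fragment)

# Select every strong element whose class attribute equals "param".
nodes = root.findall(".//strong[@class='param']")

# Read the text of each node; the result is an ordinary Python list.
values = [node.text for node in nodes]

# Python lists are indexed from 0, unlike XPath positions.
first = values[0]
second = values[1]
print(first, second)
```

Indexing the resulting list in Python is the same idea as the extract()[0] and extract()[1] calls in the accepted solution.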

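Your first expression is close: without the [0] it selects the text contents of both nodes. The predicate is wrong on two counts, though. It is in the wrong place (it applies to the text nodes, not to the strong elements), and XPath positions start at 1, so there is no node zero. If you want the text contents of the first node you should use: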

//strong[@class="param"][1]/text()

and you can use

//strong[@class="param"][2]/text()

for the second text.
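One thing to keep in mind when the block is repeated on the page: a positional predicate such as [1] counts among the siblings inside each parent, so the expressions above match once per div rather than once per document. The sketch below illustrates this with the standard-library xml.etree.ElementTree rather than Scrapy. The two-block page and the values a1, a2, b1, b2 are made up for the example, the markup is assumed to be well-formed, and ElementTree implements only a subset of XPath (no text() step).

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the block from the question repeated twice.
page = ET.fromstring(
    '<body>'
    '<div class="txt ext">'
    '<strong class="param">a1</strong><strong class="param">a2</strong>'
    '</div>'
    '<div class="txt ext">'
    '<strong class="param">b1</strong><strong class="param">b2</strong>'
    '</div>'
    '</body>'
)

# [1] and [2] are 1-based and are evaluated inside each parent div,
# so each query returns one element per repeated block.
first_values = [e.text for e in page.findall(".//strong[@class='param'][1]")]
second_values = [e.text for e in page.findall(".//strong[@class='param'][2]")]

# Alternative: walk the blocks and collect each block's values together.
pairs = [
    tuple(s.text for s in div.findall("strong[@class='param']"))
    for div in page.findall(".//div[@class='txt ext']")
]

print(first_values)   # ['a1', 'b1']
print(second_values)  # ['a2', 'b2']
print(pairs)          # [('a1', 'a2'), ('b1', 'b2')]
```

In a full XPath 1.0 engine, wrapping the path in parentheses, as in (//strong[@class="param"])[1], selects the first match in the whole document instead; ElementTree does not support that form.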

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow