Scrapy selectors have built-in support for regular expressions.
Call re() after getting the link texts:
sel.xpath('//a/text()').re(r'(?:Pen|Graphite Pencil)\s(.*)')
where:
- sel is your Selector instance
- (?:Pen|Graphite Pencil) is a non-capturing group
- (.*) is a capturing group
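To see why the prefix is wrapped in a non-capturing group, here is a minimal sketch using Python's standard re module (Scrapy patterns are ordinary Python regular expressions). The link text is made up for illustration and is not taken from the page in the question.

```python
import re

# Hypothetical link text, assumed for illustration only.
text = "Graphite Pencil Kohinoor 1500"

# Non-capturing prefix: only the part after the prefix is returned.
non_capturing = re.findall(r'(?:Pen|Graphite Pencil)\s(.*)', text)

# Capturing prefix: the prefix itself also shows up in each result.
capturing = re.findall(r'(Pen|Graphite Pencil)\s(.*)', text)

print(non_capturing)  # ['Kohinoor 1500']
print(capturing)      # [('Graphite Pencil', 'Kohinoor 1500')]
```

With the non-capturing form, the prefix is still required for a match but stays out of the returned values.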
UPD:
If you want to get only the word that follows Pen or Graphite Pencil, use this regular expression: r'(?:Pen|Graphite Pencil)\s(\w+)'. Here only a run of alphanumeric (and _) characters is captured after Pen or Graphite Pencil and a space.
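The difference between the two patterns can be sketched with plain Python, again using the standard re module. The link texts below are hypothetical stand-ins, not the actual contents of index.html.

```python
import re

# Hypothetical link texts, assumed for illustration only.
link_texts = ["Pen Bic Cristal", "Graphite Pencil Palomino Blackwing"]

# (.*) captures everything after the prefix and the whitespace character.
greedy = re.compile(r'(?:Pen|Graphite Pencil)\s(.*)')
# (\w+) stops at the first character that is not alphanumeric or _.
one_word = re.compile(r'(?:Pen|Graphite Pencil)\s(\w+)')

rest_of_text = [m for t in link_texts for m in greedy.findall(t)]
first_words = [m for t in link_texts for m in one_word.findall(t)]

print(rest_of_text)  # ['Bic Cristal', 'Palomino Blackwing']
print(first_words)   # ['Bic', 'Palomino']
```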
Demo using scrapy shell:
$ scrapy shell index.html
>>> sel.xpath('//a/text()').re(r'(?:Pen|Graphite Pencil)\s(\w+)')
[u'Bic', u'Kohinoor', u'Parker', u'Bic', u'Palomino']