Question

I'm using Scrapy to scrape a website that has a product list on it. I want to remove unwanted words from the product title strings with a regex. There are two repeating prefixes I want to remove, Pen and Graphite Pencil, so that I scrape only the brand names.

Any suggestions?

<a name="this-link" href="some url here">Pen Bic Crystal</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Graphite Pencil Kohinoor Carpenter</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Pen Parker Jotter</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Pen Bic Other Model</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Graphite Pencil Palomino Blackwing Pearl</a>

Solution

Scrapy selectors have built-in support for regular expressions.

Call re() after getting link texts:

sel.xpath('//a/text()').re(r'(?:Pen|Graphite Pencil)\s(.*)')

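where:

(?:Pen|Graphite Pencil) is a non-capturing group that matches either Pen or Graphite Pencil
\s matches a single whitespace character
(.*) is a capturing group that takes the rest of the link text, and re() returns only this captured part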
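
To see what the pattern does without running Scrapy, here is a minimal sketch that applies the same expression with Python's standard re module. The titles are copied from the question; the helper name strip_prefix is made up for illustration and is not part of Scrapy.

```python
import re

# Link texts copied from the question's sample markup.
titles = [
    "Pen Bic Crystal",
    "Graphite Pencil Kohinoor Carpenter",
    "Pen Parker Jotter",
    "Pen Bic Other Model",
    "Graphite Pencil Palomino Blackwing Pearl",
]

# Same expression as the one passed to .re() above.
PREFIX = re.compile(r'(?:Pen|Graphite Pencil)\s(.*)')


def strip_prefix(title):
    """Hypothetical helper: return the text after the prefix, or the title unchanged."""
    match = PREFIX.match(title)
    return match.group(1) if match else title


remainders = [strip_prefix(t) for t in titles]
print(remainders)
# ['Bic Crystal', 'Kohinoor Carpenter', 'Parker Jotter', 'Bic Other Model', 'Palomino Blackwing Pearl']
```

Titles that do not start with either prefix are returned as they are, which is a choice made for this sketch rather than something the question asks for.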

UPD:

If you want only the word that follows Pen or Graphite Pencil, use this regular expression: r'(?:Pen|Graphite Pencil)\s(\w+)'. Here only a run of alphanumeric characters (and _) is captured after Pen or Graphite Pencil and a whitespace character.

Demo using scrapy shell:

$ scrapy shell index.html
>>> sel.xpath('//a/text()').re(r'(?:Pen|Graphite Pencil)\s(\w+)')
[u'Bic', u'Kohinoor', u'Parker', u'Bic', u'Palomino']
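
One thing to keep in mind with the \w+ variant is that it stops at the first character that is not alphanumeric or _. The sketch below uses only Python's re module to show this; first_word_after_prefix is a hypothetical helper name, and the hyphenated title is an invented example that does not come from the question.

```python
import re

# Same expression as in the scrapy shell demo above.
BRAND = re.compile(r'(?:Pen|Graphite Pencil)\s(\w+)')


def first_word_after_prefix(title):
    """Hypothetical helper: first word after the prefix, or None if the prefix is absent."""
    match = BRAND.search(title)
    return match.group(1) if match else None


titles = [
    "Pen Bic Crystal",
    "Graphite Pencil Kohinoor Carpenter",
    "Pen Parker Jotter",
    "Pen Bic Other Model",
    "Graphite Pencil Palomino Blackwing Pearl",
]

brands = [first_word_after_prefix(t) for t in titles]
print(brands)  # ['Bic', 'Kohinoor', 'Parker', 'Bic', 'Palomino']

# Invented title: \w does not match a hyphen, so the capture stops before it.
print(first_word_after_prefix("Pen Faber-Castell Grip"))  # Faber
```

If the real product list contains hyphenated or multi-word brand names, the capture group would need to be widened, for example to a character class such as [\w-]+.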
Licensed under: CC-BY-SA with attribution