removing words from string using regex

https://stackoverflow.com/questions/23473750

15-07-2023

Question

I'm using Scrapy to scrape a website that has a product list on it. I want to remove unwanted words from the product title strings with a regex. There are two different repeating prefixes I want to remove, Pen and Graphite Pencil, so that I scrape only the brand names.

Any suggestions?

<a name="this-link" href="some url here">Pen Bic Crystal</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Graphite Pencil Kohinoor Carpenter</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Pen Parker Jotter</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Pen Bic Other Model</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Graphite Pencil Palomino Blackwing Pearl</a>

Solution

Scrapy selectors have built-in support for regular expressions.

Call re() on the selector after selecting the link texts:

sel.xpath('//a/text()').re(r'(?:Pen|Graphite Pencil)\s(.*)')

where:

sel is your Selector instance
(?:Pen|Graphite Pencil) is a non-capturing group that matches either prefix without returning it
(.*) is a capturing group that returns the rest of the title
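
To see what this pattern captures without running a spider, here is a minimal sketch using only Python's standard re module on the link texts from the question. The helper name strip_prefix and the choice to return non-matching titles unchanged are assumptions made for illustration; Scrapy's re() would simply omit strings that do not match.

```python
import re

# Link texts copied from the question's markup.
titles = [
    "Pen Bic Crystal",
    "Graphite Pencil Kohinoor Carpenter",
    "Pen Parker Jotter",
    "Pen Bic Other Model",
    "Graphite Pencil Palomino Blackwing Pearl",
]

# Same pattern as in the answer: skip the prefix, capture everything after it.
PREFIX = re.compile(r'(?:Pen|Graphite Pencil)\s(.*)')

def strip_prefix(title):
    """Hypothetical helper: return the text after the prefix, or the title unchanged."""
    match = PREFIX.search(title)
    return match.group(1) if match else title

cleaned = [strip_prefix(t) for t in titles]
print(cleaned)
# ['Bic Crystal', 'Kohinoor Carpenter', 'Parker Jotter', 'Bic Other Model', 'Palomino Blackwing Pearl']
```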

UPD:

If you want only the first word after Pen or Graphite Pencil, use this regular expression instead: r'(?:Pen|Graphite Pencil)\s(\w+)'. Here (\w+) captures only the run of alphanumeric characters (and _) that follows Pen or Graphite Pencil and a whitespace character.

Demo using scrapy shell:

$ scrapy shell index.html
>>> sel.xpath('//a/text()').re(r'(?:Pen|Graphite Pencil)\s(\w+)')
[u'Bic', u'Kohinoor', u'Parker', u'Bic', u'Palomino']
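
The same result can be reproduced outside the Scrapy shell. The sketch below is an approximation using the standard re module, not Scrapy's own implementation; extract_brands is a hypothetical helper that collects the captured group from every string into one flat list. On Python 3 the strings print without the u prefix shown above.

```python
import re

# Link texts copied from the question's markup.
link_texts = [
    "Pen Bic Crystal",
    "Graphite Pencil Kohinoor Carpenter",
    "Pen Parker Jotter",
    "Pen Bic Other Model",
    "Graphite Pencil Palomino Blackwing Pearl",
]

# Capture only the first word (letters, digits, underscore) after the prefix.
BRAND = re.compile(r'(?:Pen|Graphite Pencil)\s(\w+)')

def extract_brands(texts):
    """Hypothetical helper: gather every captured brand into a flat list."""
    brands = []
    for text in texts:
        # With a single capturing group, findall returns the captured strings.
        brands.extend(BRAND.findall(text))
    return brands

brands = extract_brands(link_texts)
print(brands)
# ['Bic', 'Kohinoor', 'Parker', 'Bic', 'Palomino']
```

Note that the match is case-sensitive, so the lowercase "pen" inside "Carpenter" does not trigger a second match.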

Licensed under: CC-BY-SA with attribution
