Question

I'm using Scrapy to scrape a website that has a product list on it. I want to remove unwanted words from the product title strings with a regex. Two prefixes repeat across the titles, Pen and Graphite Pencil; I want to strip them and scrape only the brand names.

Any suggestions?

<a name="this-link" href="some url here">Pen Bic Crystal</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Graphite Pencil Kohinoor Carpenter</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Pen Parker Jotter</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Pen Bic Other Model</a>

some divs and other DOM structure

<a name="this-link" href="some url here">Graphite Pencil Palomino Blackwing Pearl</a>

Solution

Scrapy selectors have built-in support for regular expressions.

Call re() after getting link texts:

sel.xpath('//a/text()').re(r'(?:Pen|Graphite Pencil)\s(.*)')

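where:

(?:Pen|Graphite Pencil) is a non-capturing group that matches either Pen or Graphite Pencil
\s matches a single whitespace character
(.*) is a capturing group that matches the rest of the link text, which is what re() returns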

UPD:

If you want only the first word after Pen or Graphite Pencil, use this regular expression instead: r'(?:Pen|Graphite Pencil)\s(\w+)'. Here \w+ captures only a run of alphanumeric (and _) characters following Pen or Graphite Pencil and a whitespace character.
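The difference between the two patterns can be checked outside Scrapy with the standard re module. This is only a minimal sketch: it assumes the link texts have already been extracted into a list (the titles are copied from the question), and capture is a hypothetical helper, not part of Scrapy.

```python
import re

# Link texts copied from the question's sample markup.
titles = [
    "Pen Bic Crystal",
    "Graphite Pencil Kohinoor Carpenter",
    "Pen Parker Jotter",
    "Pen Bic Other Model",
    "Graphite Pencil Palomino Blackwing Pearl",
]

# Captures everything after the prefix.
REST = re.compile(r'(?:Pen|Graphite Pencil)\s(.*)')
# Captures only the first word after the prefix.
FIRST_WORD = re.compile(r'(?:Pen|Graphite Pencil)\s(\w+)')


def capture(pattern, texts):
    """Return the first captured group for every text that matches."""
    results = []
    for text in texts:
        match = pattern.search(text)
        if match:
            results.append(match.group(1))
    return results


print(capture(REST, titles))
print(capture(FIRST_WORD, titles))
```

The first call keeps the model names along with the brand, while the second keeps only the word directly after the prefix.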

Demo using scrapy shell:

$ scrapy shell index.html
>>> sel.xpath('//a/text()').re(r'(?:Pen|Graphite Pencil)\s(\w+)')
[u'Bic', u'Kohinoor', u'Parker', u'Bic', u'Palomino']
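The patterns above are unanchored, so they would also match a title where Pen appears later in the text (for example a hypothetical "Fountain Pen Lamy"). If that matters for your data, one possible variation is to anchor the prefix to the start of the title. The sketch below uses only the standard re module; brand_from_title is a hypothetical helper name, and it assumes the brand is always a single word.

```python
import re

# Anchored variant: the prefix must be at the very start of the title.
PREFIX = re.compile(r'^(?:Graphite Pencil|Pen)\s+(\w+)')


def brand_from_title(title):
    """Return the word after a leading 'Pen' or 'Graphite Pencil', or None."""
    match = PREFIX.match(title.strip())
    return match.group(1) if match else None


print(brand_from_title("Pen Parker Jotter"))
print(brand_from_title("Fountain Pen Lamy"))
```

Titles that do not start with one of the two prefixes yield None, so they can be filtered out or handled separately.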
License: CC-BY-SA with attribution
Not affiliated with StackOverflow