Question

I am using Scrapy and XPath to parse a website in Russian.

In this topic, alecxe showed me how to construct the XPath expression to get the values. However, I don't understand how to handle the case where Param1_name is in Russian.

Here is the XPath expression:

//*[text()="Param1_name_in_russian"]/following-sibling::text()

HTML snippet:

<div class="obj-params">
            <div class="wrap">
                <div class="obj-params-col" style="min-width:50%;">
                      <p>
                         <b>Param1_name_in_russian</b>" Param1_value"</p>
                      <p>
                         <strong>Param2_name_in_russian</strong>" Param2_value</p>
                      <p>
                         <strong>Param3_name_in_russian</strong>" Param3_value"</p>
                </div>
              </div>
            <div class="wrap">
                <div class="obj-params-col">
                    <p>
                       <b>Param4_name_in_russian</b>Param4_value</p>
                <div class="inline-popup popup-hor left">
                   <b>Param5_name</b>
                      <a target="_blank" href="link">Param5_value</a></div></div>

EDITED based on comments

I assume I didn't state the question properly, since none of the suggested solutions worked for me: when I tested the suggested XPath expressions in the Scrapy console, the output was empty. So here is more detailed information about the website I need to parse:

  1. link to the website: link to real-estate web site
  2. screenshot of what I need to parse:

screen_shot


Solution

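Consider declaring your encoding at the beginning of the file as utf-8, and save the file in that encoding. Python 2 needs an encoding declaration whenever the source contains non-ASCII characters, and latin-1 cannot represent Cyrillic, so utf-8 is the safer choice. See the documentation for a thorough explanation as to why.

I'll be using lxml instead of Scrapy below, but the logic is the same.

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from lxml import html

# Unicode literal, so lxml receives decoded text rather than raw bytes.
markup = u"""<div class="obj-params">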
            <div class="wrap">
                <div class="obj-params-col" style="min-width:50%;">
                      <p>
                         <b>Некий текст</b>" Param1_value"</p>
                      <p>
                         <strong>Param2_name_in_russian</strong>" Param2_value</p>
                      <p>
                         <strong>Param3_name_in_russian</strong>" Param3_value"</p>
                </div>
              </div>
            <div class="wrap">
                <div class="obj-params-col">
                    <p>
                       <b>Param4_name_in_russian</b>Param4_value</p>
                <div class="inline-popup popup-hor left">
                   <b>Param5_name</b>
                      <a target="_blank" href="link">Param5_value</a></div></div>"""

tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")

print pone_val

Result:

['" Param1_value"']

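Note that in Python 2 the u prefix on the XPath string is necessary because the expression contains non-ASCII characters, as @warwaruk also pointed out in a comment on your question. In Python 3 every str is already Unicode, so the prefix is optional.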
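As a rough Python 3 sketch of the same idea, the label can be passed to lxml as an XPath variable instead of being pasted into the expression, and normalize-space() makes the match tolerant of stray whitespace. The markup, the Russian labels, and the helper name value_after_label are made up for illustration and are not taken from the real site.

```python
from lxml import html

# Hypothetical snippet modelled on the question's markup; the labels and
# values are placeholders, not data from the real site.
MARKUP = """
<div class="obj-params-col">
  <p><b>Комнат:</b> 2</p>
  <p><strong>Площадь:</strong> 54 м²</p>
</div>
"""


def value_after_label(tree, label):
    """Return the text following the element whose text equals `label`, or None."""
    hits = tree.xpath(
        "//*[normalize-space(text())=$label]/following-sibling::text()",
        label=label,  # lxml binds keyword arguments to XPath variables
    )
    return hits[0].strip() if hits else None


tree = html.fromstring(MARKUP)
rooms = value_after_label(tree, "Комнат:")
area = value_after_label(tree, "Площадь:")
print(rooms, area)
```

Passing the label as a variable also avoids quoting problems when the label itself contains an apostrophe or a double quote.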

Let us know if this helps.

EDIT:

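Based on the site's markup, there's actually a better way to get the values. Again, I'm using lxml and not Scrapy, since the only difference between the two here is the .extract() call anyway. Basically, check my XPath expressions for the name, room, square, and floor: each one selects its value by position inside the listing block rather than by matching the Russian label.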

import requests as rq
from lxml import html

url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)

divs = tree.xpath("//div[@class='obj-left']")

for div in divs:

    name = div.xpath("./h3/span/a/text()")[0]
    details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
    room = details.xpath("./p[1]/text()[last()]")[0]
    square = details.xpath("./p[2]/text()[last()]")[0]
    floor = details.xpath("./p[3]/text()[last()]")[0]

    print name.encode("utf-8")
    print room.encode("utf-8")
    print square.encode("utf-8")
    print floor.encode("utf-8")

The output doesn't print cleanly on my end (I get some [Decode error - output not utf-8] messages), which looks like a console encoding problem rather than a problem with the XPath. Encoding aside, I believe this approach is much better scraping practice overall.
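One way to sidestep the console problem is to collect the values into records and serialise them as UTF-8 instead of printing them. The following is only a sketch in Python 3: the markup is an invented stand-in for the listing page (the class names follow the XPath above, the values are made up), and first_text and parse_listings are hypothetical helper names. It also returns None instead of raising IndexError when a listing lacks one of the paragraphs.

```python
import json

from lxml import html

# Hypothetical listing markup; class names follow the answer's XPath, while
# the values are invented for illustration.
PAGE = """
<html><body>
<div class="obj-left">
  <h3><span><a href="#">Квартира на Подоле</a></span></h3>
  <div class="obj-params-col">
    <p><b>Комнат:</b> 2</p>
    <p><b>Площадь:</b> 54 м²</p>
    <p><b>Этаж:</b> 3/9</p>
  </div>
</div>
<div class="obj-left">
  <h3><span><a href="#">Студия</a></span></h3>
  <div class="obj-params-col">
    <p><b>Комнат:</b> 1</p>
  </div>
</div>
</body></html>
"""

DETAILS = ".//div[@class='obj-params-col'][1]"


def first_text(node, path):
    """Return the first XPath text hit, stripped, or None when nothing matches."""
    hits = node.xpath(path)
    return hits[0].strip() if hits else None


def parse_listings(page):
    """Build one dict per listing block, selecting values by position."""
    tree = html.fromstring(page)
    records = []
    for div in tree.xpath("//div[@class='obj-left']"):
        records.append({
            "name": first_text(div, "./h3/span/a/text()"),
            "rooms": first_text(div, DETAILS + "/p[1]/text()[last()]"),
            "area": first_text(div, DETAILS + "/p[2]/text()[last()]"),
            "floor": first_text(div, DETAILS + "/p[3]/text()[last()]"),
        })
    return records


records = parse_listings(PAGE)
# Serialise to UTF-8 bytes so the result does not depend on the console encoding.
payload = json.dumps(records, ensure_ascii=False).encode("utf-8")
```

The payload bytes can then be written to a file opened in binary mode, which keeps the Cyrillic text intact regardless of what the terminal supports.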

Let us know what you think.
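A side note on the URL used above: the long %D0%... path is just the percent-encoded UTF-8 form of a Cyrillic path. A small Python 3 sketch using only the standard library's urllib.parse shows the round trip; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

# Path segment copied from the URL in the answer.
ENCODED = (
    "%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-"
    "%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-"
    "%D0%BA%D0%B8%D0%B5%D0%B2"
)

# unquote() decodes the escapes as UTF-8 by default.
readable = unquote(ENCODED)

# quote() encodes as UTF-8 and leaves the hyphens untouched.
round_trip = quote(readable)

print(readable)
```

Keeping the readable form in the source and calling quote() when building the request URL can make such scripts easier to review.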

License: CC-BY-SA with attribution
Not affiliated with StackOverflow