How can I use non-ASCII characters?

Question

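Consider declaring the source encoding at the top of the file, and make sure it matches the encoding the file is actually saved in. The snippet contains Cyrillic text, which latin-1 cannot represent, so utf-8 is the safer choice here. See PEP 263 in the Python documentation for a thorough explanation as to why.

I'll be using lxml instead of Scrapy below, but the logic is the same.

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from lxml import html

# Unicode literal, so lxml receives decoded text rather than raw bytes.
markup = u"""<div class="obj-params">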
            <div class="wrap">
                <div class="obj-params-col" style="min-width:50%;">
                      <p>
                         <b>Некий текст</b>" Param1_value"</p>
                      <p>
                         <strong>Param2_name_in_russian</strong>" Param2_value</p>
                      <p>
                         <strong>Param3_name_in_russian</strong>" Param3_value"</p>
                </div>
              </div>
            <div class="wrap">
                <div class="obj-params-col">
                    <p>
                       <b>Param4_name_in_russian</b>Param4_value</p>
                <div class="inline-popup popup-hor left">
                   <b>Param5_name</b>
                      <a target="_blank" href="link">Param5_value</a></div></div>"""

tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")

print pone_val

Result:

['" Param1_value"']
[Finished in 0.5s]

Note that because the XPath expression contains non-ASCII characters, the u prefix on the XPath string is necessary in Python 2, as @warwaruk pointed out in a comment on your question.
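
To see why the prefix matters, here is a minimal sketch (not part of the original code, written to run on both Python 2 and 3) that compares the unicode text with its encoded bytes. The variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Illustrative names only; the label is the text used in the XPath above.
label = u"Некий текст"             # unicode text: one item per character
as_bytes = label.encode("utf-8")   # what a UTF-8 encoded byte string holds

print(len(label))      # 11 characters
print(len(as_bytes))   # 21 bytes (each Cyrillic letter takes two)

# Reading those bytes with the wrong codec yields different text,
# so an equality test such as text()='...' no longer matches.
misread = as_bytes.decode("latin-1")
print(misread == label)   # False
```

With the u prefix, the expression handed to lxml is already decoded text, so the comparison is made character by character rather than byte by byte.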

Let us know if this helps.

EDIT:

Based on the site's markup, there's actually a better way to get the values. I'm again using lxml rather than Scrapy, since the only difference between the two here is Scrapy's .extract() call. Note the XPath expressions for name, room, square, and floor.

import requests as rq
from lxml import html

url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)

divs = tree.xpath("//div[@class='obj-left']")

for div in divs:

    name = div.xpath("./h3/span/a/text()")[0]
    details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
    room = details.xpath("./p[1]/text()[last()]")[0]
    square = details.xpath("./p[2]/text()[last()]")[0]
    floor = details.xpath("./p[3]/text()[last()]")[0]

    print name.encode("utf-8")
    print room.encode("utf-8")
    print square.encode("utf-8")
    print floor.encode("utf-8")
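
One caveat with the loop above: each [0] lookup raises IndexError when an XPath matches nothing, for example on a listing with a missing field. A small helper can guard against that. This is only a sketch, and first is a hypothetical name, not something lxml or Scrapy provides.

```python
def first(nodes, default=u""):
    """Return the first XPath result, stripped, or a default if there is none."""
    if not nodes:
        return default
    return nodes[0].strip()


# Plain lists stand in for what div.xpath(...) returns.
print(first([u"  3 rooms  "]))   # 3 rooms
print(first([], u"n/a"))         # n/a
```

In the loop, this would be used as room = first(details.xpath("./p[1]/text()[last()]")), so a missing field yields an empty string instead of stopping the scrape.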

This doesn't print everything cleanly on my end (I get some [Decode error - output not utf-8] messages). Encoding aside, though, I believe this approach is better scraping practice overall.
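
If the console is what mangles the output, one possible workaround is to write the values to a UTF-8 file instead of printing them. A minimal sketch, assuming a made-up file name and sample values taken from the markup above; io.open accepts an encoding argument on both Python 2 and 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Sample values standing in for the scraped name, room, square and floor.
rows = [u"Некий текст", u"Param1_value"]

# results.txt is a hypothetical file name, placed in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "results.txt")

with io.open(path, "w", encoding="utf-8") as out:
    for value in rows:
        out.write(value + u"\n")

# Read the file back to confirm the text survived the round trip.
with io.open(path, encoding="utf-8") as check:
    restored = check.read().splitlines()
```

Opening the file in an editor that reads UTF-8 then shows the Cyrillic text regardless of what the build console supports.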

Let us know what you think.