Question

Here is the website I would like to parse: [website in Russian][1]

Here is the code that extracts the info that I need:

# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from flats.items import FlatsItem

class DmozSpider(Spider):
    name = "dmoz"
    start_urls = ['http://rieltor.ua/flats-sale/?ncrnd=6510']

    def parse(self, response):
        sel=Selector(response)
        flats=sel.xpath('//*[@id="content"]')
        flats_stored_info=[]
        flat_item=FlatsItem()
        for flat in flats:
            flat_item['square']=[s.encode("utf-8") for s in sel.xpath('//div/strong[@class="param"][1]/text()').extract()]
            flat_item['rooms_floor_floors']=[s.encode("utf-8") for s in sel.xpath('//div/strong[@class="param"][2]/text()').extract()]
            flat_item['address']=[s.encode("utf-8") for s in flat.xpath('//*[@id="content"]//h2/a/text()').extract()]
            flat_item['price']=[s.encode("utf-8") for s in flat.xpath('//div[@class="cost"]/strong/text()').extract()]
            flat_item['subway']=[s.encode("utf-8") for s in flat.xpath('//span[@class="flag flag-location"]/a/text()').extract()]
            flats_stored_info.append(flat_item)
        return flats_stored_info

This is how I dump the items to a JSON file:

scrapy crawl dmoz -o items.json -t json

The problem is that when I change the code above to print the extracted info to the console, i.e. like this:

    flat_item['square']=sel.xpath('//div/strong[@class="param"][1]/text()').extract()
    for bla in flat_item['square']:
        print bla

the script properly displays the information in Russian.

But when I dump the scraped information using the first version of the script (with encoding to UTF-8), it writes something like this to the JSON file:

[{"square": ["2-\u043a\u043e\u043c\u043d., 16 \u044d\u0442\u0430\u0436 16-\u044d\u0442. \u0434\u043e\u043c", "1-\u043a\u043e\u043c\u043d., 

How can I dump the information into the JSON file in Russian? Thank you for your advice.

[1]: http://rieltor.ua/flats-sale/?ncrnd=6510


Solution

The data is correctly encoded; the json library simply escapes non-ASCII characters by default, so each \uXXXX sequence stands for the same Cyrillic character.
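To see the default escaping side by side with readable output, here is a minimal sketch using only the standard json module. It assumes Python 3 (the question's code is Python 2), and the sample item is copied from the question's output purely for illustration:

```python
import json

# Sample data copied from the question's output (illustrative only).
items = [{"square": ["2-комн., 16 этаж 16-эт. дом"]}]

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(items)

# ensure_ascii=False keeps the Cyrillic characters as they are.
readable = json.dumps(items, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to exactly the same data.
assert json.loads(escaped) == json.loads(readable)
```

If the readable form is written to a file, the file should be opened with an explicit UTF-8 encoding so the characters are stored correctly.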

You can load the data and use it (copying data from your example):

>>> import json
>>> print json.loads('"2-\u043a\u043e\u043c\u043d., 16 \u044d\u0442\u0430\u0436 16-\u044d\u0442. \u0434\u043e\u043c"')
2-комн., 16 этаж 16-эт. дом
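
If the goal is for Scrapy itself to write readable Cyrillic into the feed, newer Scrapy releases (1.2 and later, to my knowledge) offer a FEED_EXPORT_ENCODING setting. A sketch, assuming a project-level settings.py and a Scrapy version that supports this setting; the version used in the question may predate it:

```python
# settings.py (assumes a Scrapy release that supports this setting)
# Write feed exports as UTF-8 text instead of ASCII with \uXXXX escapes.
FEED_EXPORT_ENCODING = "utf-8"
```

With that in place, the same `scrapy crawl dmoz -o items.json` command should produce a file containing the Russian text directly.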
Licensed under: CC-BY-SA with attribution