Question

I am writing a crawler in Java that examines an IMDB movie page and extracts some info such as the name and year. The user types (or copy/pastes) the link to the title and my program should do the rest.

After examining the HTML source of several IMDB pages and reading up on how crawlers work, I managed to write some code.

The info I get (for example title) is in my mother tongue. If there is no info in my mother tongue I get the original title. What I want is to get the title in a specific language of my choosing.

I'm fairly new to this, so correct me if I'm wrong, but I think I get the results in my mother tongue because IMDB "sees" that I'm from Serbia and then customizes the results for me. So basically I need to tell it somehow that I prefer results in English. Is that possible (I imagine it is), and how do I do it?

Edit: The program crawls like this: it takes the URL as a String, converts it to a URL, reads the whole source with a BufferedReader and inspects what it gets. I'm not sure that is the right way to do it, but it works (minus the language problem). Code:

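import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static Info crawlUrl(String urlPath) throws IOException {
    Info info = new Info();

    // Open a connection to the page and read it line by line as UTF-8.
    URL url = new URL(urlPath);
    URLConnection uc = url.openConnection();
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(uc.getInputStream(), "UTF-8"))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Print the line that holds the page title.
            if (inputLine.contains("<title>")) {
                System.out.println(inputLine);
            }
        }
    }
    return info;
}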

This code goes through a page and prints the main title to the console.


Solution 2

Try looking at the request headers your crawler sends. Mine contain Accept-Language: fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4, so I get the title in French.

EDIT:

I checked with the ModifyHeaders add-on in Google Chrome, and setting the value to en-US gets me the English title for the movie =)
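
On the Java side, the same header can be set on the URLConnection before the response is read. Below is a minimal sketch of that idea, not a definitive implementation: the class and method names (EnglishTitleFetcher, extractTitle, fetchTitle) are made up for illustration, and it assumes the site honors Accept-Language the way the browser test above suggests.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class EnglishTitleFetcher {

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text inside the first <title> element, or null if there is none. */
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Downloads the page while asking the server for English content. */
    public static String fetchTitle(String urlPath) throws IOException {
        URLConnection uc = new URL(urlPath).openConnection();
        // Request headers must be set before the connection is actually
        // opened, i.e. before getInputStream() is called.
        uc.setRequestProperty("Accept-Language", "en-US,en;q=0.8");

        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(uc.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return extractTitle(page.toString());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java EnglishTitleFetcher <title-page-url>");
            return;
        }
        System.out.println(fetchTitle(args[0]));
    }
}
```

Matching the title with a regular expression over the whole page, rather than per line, also copes with a title element that spans several lines. For anything beyond the title, a real HTML parser would probably be a better fit.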

OTHER TIPS

You don't need to crawl IMDB, you can use the dumps they provide: http://www.imdb.com/interfaces

There's also a parser for that data: https://code.google.com/p/imdbdumpimport/. It's not perfect, but it may help (expect to spend some effort getting it to work).

An alternative parser: https://github.com/dedeler/imdb-data-parser

EDIT: You say you want to crawl IMDB anyway for learning purposes. In that case you'll probably have to rely on content negotiation (http://en.wikipedia.org/wiki/Content_negotiation), as suggested in the other answer:

uc.setRequestProperty("Accept-Language", "de; q=1.0, en; q=0.5");
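
The value in that call is a comma-separated list of language tags, each optionally followed by a q weight between 0 and 1; a higher weight means a stronger preference. As a rough sketch, a small helper could build such a value from an ordered list of preferences. The class name AcceptLanguage, the method build, and the 0.1 step between weights are all assumptions made for this example, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the name and the weighting scheme are illustrative only.
class AcceptLanguage {

    /**
     * Builds a header value from language tags ordered from most to least
     * preferred. The first tag gets no explicit weight (implicitly q=1);
     * each later tag gets a weight 0.1 lower than the previous one,
     * never going below 0.1.
     */
    public static String build(String... tags) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < tags.length; i++) {
            if (i == 0) {
                parts.add(tags[i]);
            } else {
                int tenths = Math.max(1, 10 - i);
                parts.add(tags[i] + ";q=0." + tenths);
            }
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        // prints "en-US,en;q=0.9,sr;q=0.8"
        System.out.println(build("en-US", "en", "sr"));
    }
}
```

The result can be passed straight to uc.setRequestProperty("Accept-Language", ...). To prefer English, put the English tags first.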
Licensed under: CC-BY-SA with attribution