Question

I'm trying to understand the idea of NoSQL databases, and more precisely the concept behind the Neo4j graph database. I have experience with SQL databases (MySQL, MS SQL), but their limitations in managing hierarchical data made me want to expand my knowledge. Now I have some questions I can't find answers to (maybe I don't know what to search for).

Imagine we have a list of the countries in the world. Each country has its GDP for every year, and that GDP is calculated by different sources: the World Bank, the country's own government, the CIA, etc. What's the best way to organise the data in this case?

The simplest thing that came to mind is to have a single node per country (the values are imaginary):

China:
  GDPByWorldBank2012: 999,
  GDPByCIA2011: 994,
  GDPByGovernment2012: 1102,

In a relational database, I would split the data into three tables: Countries, Sources and Values, where Values would hold the GDP value, the year, the id of the country and the id of the source.
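For comparison, the relational layout described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table is named gdp_values rather than Values (VALUES is an SQL keyword), the column names and the composite primary key are my own choices, and the figures are the imaginary ones from the node example.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: table and column names are assumptions, not a fixed design.
conn.executescript("""
    CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE sources   (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE gdp_values (
        country_id INTEGER NOT NULL REFERENCES countries(id),
        source_id  INTEGER NOT NULL REFERENCES sources(id),
        year       INTEGER NOT NULL,
        value      REAL    NOT NULL,
        PRIMARY KEY (country_id, source_id, year)  -- one value per country/source/year
    );
""")

conn.execute("INSERT INTO countries (id, name) VALUES (1, 'China')")
conn.executemany(
    "INSERT INTO sources (id, name) VALUES (?, ?)",
    [(1, "World Bank"), (2, "CIA"), (3, "Government")],
)
# Imaginary values, same as in the node example.
conn.executemany(
    "INSERT INTO gdp_values (country_id, source_id, year, value) VALUES (?, ?, ?, ?)",
    [(1, 1, 2012, 999), (1, 2, 2011, 994), (1, 3, 2012, 1102)],
)

# All GDP figures recorded for China, one row per source and year.
rows = conn.execute("""
    SELECT s.name, v.year, v.value
    FROM gdp_values v
    JOIN sources s   ON s.id = v.source_id
    JOIN countries c ON c.id = v.country_id
    WHERE c.name = 'China'
    ORDER BY s.name
""").fetchall()
print(rows)
```

Each row of gdp_values plays the role that a single relationship with year and value properties would play in a graph.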

Another thing that came to mind is to create nodes for the CIA and the World Bank, but a Government node looks really weird. Even so, the idea is to have relationships (valueOfGDP):

CIA -> valueOfGDP - {year: 2011, value: 994} -> China
World Bank -> valueOfGDP - {year: 2012, value: 999} -> China

This looks pretty weird to me. What's more, what happens when we add the values for all the years from one source? Would we have multiple relationships, or what?

I'm sorry if my questions are too dumb. I would be happy if someone could explain this to me or point me to a book/article to read.

Thanks in advance. :)


Solution

Your questions are perfectly legitimate, and you're not the only one who has difficulty grasping graph modelling at first ;)

It is always easier to start by thinking about the questions you want to answer with your data, rather than modelling it up front.

Let's imagine you want to retrieve the GDP for the year 2012, as computed by the CIA, for all countries.

A simple way to achieve this is to label country nodes uniformly and give each one a name attribute that holds the country's name.

Moreover, CIA/WorldBank/Government are all "sources" in this domain, so let's label them uniformly as well.

For instance, that could give something like:

(ORGANIZATION {name: "CIA"})-[:HAS_COMPUTED_GDP {year: 2011, value: 994}]->(COUNTRY {name: "China"})

With the Cypher Query Language, and following this model, you would execute the following query:

START cia = node:nodes(name = "CIA")
MATCH cia-[gdp:HAS_COMPUTED_GDP]->(country)
WHERE gdp.year = 2012
RETURN cia, country, gdp

In this query, I used an index lookup as the starting point to retrieve CIA by name (rather than IDs, which are an internal technical notion that shouldn't be relied on). The query then matches the relevant subgraph and returns CIA, the GDP relationships that satisfy the year constraint, and the countries they point to.

Although Neo4j is totally schemaless, this does not mean your data model should be totally free-form. A little structure will always make your queries and traversals easier to read.

If you're not familiar with the Cypher Query Language (which is not the only way to read or write data in the graph), have a look at the excellent Neo4j documentation (Cypher: http://docs.neo4j.org/chunked/stable/cypher-query-lang.html, complete: http://docs.neo4j.org/chunked/stable/index.html) and try some queries in the online console: http://console.neo4j.org/

And to answer your second question: if you want to add another year of GDP computations, this just boils down to adding a new "HAS_COMPUTED_GDP" relationship between the organization and the country, no more, no less. So yes, you end up with multiple relationships between the same two nodes, one per year.
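To make the "one relationship per year" idea concrete, here is a minimal in-memory sketch in Python. It is not Neo4j and not its API: the Graph class, its method names and the gdp_by helper are hypothetical and exist only for illustration, and the 2012 CIA value is made up just like the other figures.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy property graph: labelled nodes plus relationships carrying properties."""
    nodes: dict = field(default_factory=dict)  # node name -> label
    rels: list = field(default_factory=list)   # (start, type, properties, end)

    def add_node(self, label, name):
        self.nodes[name] = label

    def add_rel(self, start, rel_type, end, **props):
        # Nothing prevents several relationships between the same two nodes.
        self.rels.append((start, rel_type, props, end))

    def gdp_by(self, source, year):
        """Country name -> GDP value as computed by `source` for `year`."""
        return {
            end: props["value"]
            for start, rel_type, props, end in self.rels
            if start == source
            and rel_type == "HAS_COMPUTED_GDP"
            and props["year"] == year
        }


g = Graph()
g.add_node("ORGANIZATION", "CIA")
g.add_node("ORGANIZATION", "World Bank")
g.add_node("COUNTRY", "China")

g.add_rel("CIA", "HAS_COMPUTED_GDP", "China", year=2011, value=994)
g.add_rel("World Bank", "HAS_COMPUTED_GDP", "China", year=2012, value=999)

# Another year from the same source is just one more relationship
# (the value 1001 is invented for this example).
g.add_rel("CIA", "HAS_COMPUTED_GDP", "China", year=2012, value=1001)

print(g.gdp_by("CIA", 2012))
```

The gdp_by filter mirrors what the Cypher query does: start from one organization, follow its HAS_COMPUTED_GDP relationships, and keep those whose year property matches.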

Hope it helps :)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow