Neo4j linked list performance when retrieving by date

https://stackoverflow.com/questions/23483523

16-07-2023

Question

I don't think the title exactly represents the question I have in mind.

I am currently designing an app which for all intents and purposes is exactly like an RSS feed reader. I am using neo4j because of social features which will be integrated later in the life of the app.

The way I have structured my app is like so:

user:USER -[:HAS_FEED]-> feed:FEED
feed:FEED -[:HAS_SUBSCRIPTION]-> subscription:SUBSCRIPTION
subscription:SUBSCRIPTION -[:NEXT_POST]-> post:POST -[:NEXT_POST]-> post:POST etc etc

Essentially, each subscription has a linked list of posts. Each post has a date (unix time). I originally decided to build a linked list because the posts are organized by date and retrieved by date as well (for obvious reasons). For not so obvious reasons, however, RSS feeds are not always ordered by the date of the posts they contain. Since I want to display posts in the order they were released and not in the order I retrieved them, I started to worry about performance as the app grows and a linked list could have tens of thousands of posts under each subscription (a feed can also be made up of many subscriptions, which might further affect performance).

Currently I fetch the feeds using the following Cypher query:

START feed = node({id})
OPTIONAL MATCH feed -[:HAS_SUBSCRIPTION]-> (subscriptions:SUBSCRIPTION) <-[:NEXT_POST*1..]- posts
WHERE HAS (posts.date) AND posts.date > 00000000000000
RETURN DISTINCT posts
ORDER BY posts.date DESC
LIMIT 100

Essentially, I am wondering how efficient this query is.

My first question is: will Neo4j necessarily traverse the entire linked list, get every post, and then filter by date?

My second question is: how will this scale with feeds that have a hundred or more subscriptions, each of which may have tens of thousands of posts?

My third question is: will it be more efficient, in this case, to ditch the linked list and instead connect each post directly to its subscription (i.e. subscription:SUBSCRIPTION -[:HAS_POST]-> post:POST)?

If you have alternatives as to how I can organize this, I am open to suggestions. However, ditching Neo4j is not an option (particularly since I spent the time to write my own Neo4j driver for node.js for this particular project!).

I am using neo4j > 2.0

Answer

In this case, yes, Neo4j will traverse the entire list, filtering things out by the date predicate. If you select very small date ranges and you have very large linked lists or heavy load, this may become an issue.

Fundamentally, with the data structure you lay out here, any implementation will have to scan the whole list. Since the dates are not sorted, you can never know if you've found the last entry with the given date until you've looked at all entries.
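
To make the point concrete, here is a minimal in-memory sketch in Python. It is not Neo4j code: the `Post` class, `build_chain`, and `posts_after` are hypothetical names used only to model a chain of posts linked by NEXT_POST whose dates are not in order. Every node has to be visited before the result is known to be complete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    date: int                      # unix time, as in the question
    next: Optional["Post"] = None  # stands in for the NEXT_POST relationship


def build_chain(dates):
    """Link posts in the given (possibly unsorted) order; return the head."""
    head = None
    for d in reversed(dates):
        head = Post(d, head)
    return head


def posts_after(head, since):
    """Collect dates newer than `since`; also count how many nodes were visited."""
    matches, visited = [], 0
    node = head
    while node is not None:
        visited += 1
        if node.date > since:
            matches.append(node.date)
        # No early exit is possible: a newer post may still follow an older one.
        node = node.next
    return matches, visited


chain = build_chain([50, 10, 40, 20, 30])
matches, visited = posts_after(chain, since=35)
print(matches, visited)
```

Only two posts match here, but all five nodes are visited, because the match at date 40 sits behind the older post at date 10.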

There are two things you can do here:

1. You add a second date property that you can order by, for example separating "createDate" from "publishDate", where the chain is kept sorted by publishDate. Because Cypher does not know that publishDate is sorted, you will still need to use something like the traversal framework to write an imperative traversal through this chain that stops at the right point.
2. You add a time index structure to the graph, one for each feed, which you can then use to match sections of a feed by arbitrary time spans. With this approach you can use Cypher, but you will need to keep the time index up to date. See this SO question for details on this approach: Time-based data in neo4j
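
The first option can be sketched as follows. This is again an in-memory Python model rather than the Neo4j traversal framework API, and all names (`Post`, `build_sorted_chain`, `newest_first`, `latest_posts`) are hypothetical. It assumes each subscription's chain is kept sorted newest-first by publish date, so a traversal can stop at the first post that is too old. Merging the per-subscription streams lazily means that only about as many nodes are visited as the limit requires, even when a feed has many subscriptions.

```python
import heapq
from dataclasses import dataclass
from itertools import islice
from typing import Optional


@dataclass
class Post:
    publish_date: int              # the sorted date property (assumption)
    next: Optional["Post"] = None  # stands in for the NEXT_POST relationship


def build_sorted_chain(dates):
    """Build a chain ordered newest-first, as option 1 requires."""
    head = None
    for d in sorted(dates):  # ascending; prepending each one yields newest-first
        head = Post(d, head)
    return head


def newest_first(head, since, visited):
    """Yield publish dates newer than `since`, stopping at the first older post."""
    node = head
    while node is not None:
        visited[0] += 1
        if node.publish_date <= since:
            return  # the chain is sorted, so everything after this is older too
        yield node.publish_date
        node = node.next


def latest_posts(heads, since, limit, visited):
    """Lazily merge several newest-first chains and take the first `limit` dates."""
    streams = [newest_first(h, since, visited) for h in heads]
    merged = heapq.merge(*streams, reverse=True)  # inputs are sorted descending
    return list(islice(merged, limit))


subscriptions = [
    build_sorted_chain([10, 30, 50, 70, 90]),
    build_sorted_chain([20, 40, 60, 80]),
]
visited = [0]
print(latest_posts(subscriptions, since=35, limit=3, visited=visited))
print(visited[0], "of 9 nodes visited")
```

The trade-off is on the write side: inserting a post that arrives out of order means walking the chain to find its correct position, which is the cost of keeping the publish-date ordering intact.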

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow