Question

Recently, I've been building a search page where the user can input their search criteria and the server side will use a scoring and sorting algorithm to return the 'best' results first. The current flow is like so:

  1. User inputs their search criteria.
  2. Server-side code (PHP) builds a query to fetch all the results that match that criteria.
  3. The server uses a function to assign a 'score' to each result.
  4. The server loops through all the results and orders them from highest score to lowest using a "quick sort" algorithm.
  5. The array is spliced in order to only take the top 15 results. These are then passed into the webpage to be displayed as results.

At the bottom of the results is a 'Load More' button. When this is clicked, it runs the same process as above (via AJAX), but offsets the array by a certain amount, again taking only the next 15 results and returning them to the webpage as JSON before appending them to the results via JavaScript.
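
In simplified form, the server-side code looks something like this (just a sketch: buildQuery() and scoreResult() are stand-ins for the real query builder and scoring function, not the actual code):

```php
<?php
// Simplified sketch of the current flow; buildQuery() and scoreResult()
// are placeholders for the real query builder and scoring function.
function search(PDO $pdo, array $criteria, int $offset = 0): array
{
    // 1-2. Fetch every row that matches the criteria.
    $stmt = $pdo->prepare(buildQuery($criteria));   // parameterised SQL
    $stmt->execute($criteria);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    // 3. Assign a score to each result.
    foreach ($rows as &$row) {
        $row['score'] = scoreResult($row, $criteria);
    }
    unset($row);

    // 4. Sort all results from highest score to lowest.
    usort($rows, fn ($a, $b) => $b['score'] <=> $a['score']);

    // 5. Keep only the 15 results for the requested page.
    return array_slice($rows, $offset, 15);
}
```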

After consideration, this seems really inefficient. I'm querying the database for matching results each time the page is loaded or 'Load More' is clicked, scoring and then sorting each and every one, only to take 15 of these results.

Of course, I have to query all the matching results in order to pick the 'best' results from the whole table, rather than from just those 15.

I was thinking that a better way to do it might be to score and sort the entire table via a cron job and give each row a 'position rating'. When the user loads the webpage, the code can just query the database for matching results and, in the same SQL statement, order them by the 'position rating' column and limit the results to 15 (using an offset for the AJAX query).
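
In other words, serving a page would boil down to something like this (a sketch only; the table name, criteria column and position_rating column are placeholders):

```php
<?php
// Sketch of the proposed approach: a cron job has already written a
// precomputed 'position_rating' to each row, so serving a page is a
// single ordered, paginated SELECT. All names here are placeholders.
$stmt = $pdo->prepare(
    'SELECT *
       FROM items
      WHERE category = :category      -- whatever the search criteria map to
   ORDER BY position_rating DESC
      LIMIT 15 OFFSET :offset'
);
$stmt->bindValue(':category', $category);
$stmt->bindValue(':offset', $offset, PDO::PARAM_INT);
$stmt->execute();
$page = $stmt->fetchAll(PDO::FETCH_ASSOC);
```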

Would this process be much quicker? The only downside I can think of is that the ratings would be outdated, depending on how often the cron job re-sorted the results.

Is it worth changing to this new process, or is my efficiency concern here not well founded?


Solution

Is my efficiency concern here not well founded?

Before you go down the route of making changes that could be expensive, I would suggest benchmarking your current solution to see whether it is performant enough to be released as is for the moment. That may or may not be the case, depending on the number of items in your database, the number of users of your service, and the algorithm used. Measure or estimate the production workload on your servers and test the response delay to determine whether your algorithm really is too slow; if it only might be, chances are this is not a development priority.
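
As a rough first step, you could simply time the existing code path against realistic data and query volumes, for example (runSearch() here stands in for your current query-score-sort-slice code):

```php
<?php
// Crude timing of the existing search path; runSearch() is a placeholder
// for the current query + score + sort + slice code.
$start = microtime(true);
$results = runSearch($criteria, 0);
$elapsedMs = (microtime(true) - $start) * 1000;
error_log(sprintf('search returned %d rows in %.1f ms', count($results), $elapsedMs));
```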


If you find out that it really is, there are several routes you can take from there:

  • Tailor your data model to index your rows with full-text rankings (as suggested in the comments by Dan Wilson). This is the least disruptive path and could lead to significant efficiency gains; a sketch of this approach follows this list. I am not a specialist in pure SQL text-search solutions, though, so expect some limits in terms of relevance-scoring flexibility.

  • Use a search-oriented database for relevance matching, e.g. Elasticsearch (not affiliated). This is what I would favor if I wanted fine-grained relevance search rather than raw text matching, but it comes at the price of splitting the data across two stores, which brings a lot of associated constraints.

  • Rank and filter in your JavaScript. This is a sensible route if there is a small, fixed set of items (below roughly 10k), since it shifts the work to the client and keeps response times low for a large number of users. If your set is likely to grow, though, this could be too big a hit on page load.

  • Use a search API or service for this problem, e.g. Algolia (not affiliated). This is a sensible route if you want to minimize development time without compromising features and quality, but relying on an external service may not be what you are after.
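
To illustrate the first option, here is roughly what a MySQL full-text search could look like (the names books, title and author are hypothetical; adapt them to your schema and check what your engine supports):

```php
<?php
// Illustration only: MySQL full-text search with relevance ranking.
// Table and column names are hypothetical placeholders.

// One-time schema change (e.g. in a migration), adding a FULLTEXT index:
//   ALTER TABLE books ADD FULLTEXT INDEX ft_title_author (title, author);

// The database scores, sorts and paginates in a single query.
$stmt = $pdo->prepare(
    'SELECT *,
            MATCH (title, author) AGAINST (:q IN NATURAL LANGUAGE MODE) AS relevance
       FROM books
      WHERE MATCH (title, author) AGAINST (:q2 IN NATURAL LANGUAGE MODE)
   ORDER BY relevance DESC
      LIMIT 15 OFFSET :offset'
);
// The same search string is bound twice because some drivers do not allow
// reusing a named placeholder within one statement.
$stmt->bindValue(':q', $searchTerms);
$stmt->bindValue(':q2', $searchTerms);
$stmt->bindValue(':offset', $offset, PDO::PARAM_INT);
$stmt->execute();
$page = $stmt->fetchAll(PDO::FETCH_ASSOC);
```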

OTHER TIPS

I was thinking maybe a better way to do it, is sort and score the entire table via a cronjob and give each row a 'position rating'.

The crux of the issue here is that your proposed solution is only valid if the score is independent of the search query. And that seems like a bad situation.

Does this mean that in your current scoring algorithm, when a specific row is "hit" (i.e. returned), its score is already set in stone no matter how well it matches the search criteria?

I can't think of a scoring system in which this would be a good idea. For example:

You have two books in your database: "Hello" (written by World Man) and "Hello World" (written by Man). Your cron job has run and has arbitrarily decided that "Hello" has a higher score than "Hello World".

The user queries for "Hello World". Both books are registered as a hit.

Your predetermined score is going to put "Hello" above "Hello World".

This makes no sense. The book whose name exactly matches the search criteria should be ranked higher than the book whose name and author receive a partial match.

The score should exist specifically to express how well the hit matches the search criteria. Logically, it's impossible to decide a score before you know what the user's search criteria are.
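
To make that concrete, here is a toy scoring function that depends on the query (purely illustrative, not a real relevance algorithm):

```php
<?php
// Toy illustration: a score only makes sense relative to the search query.
// This is not a real relevance algorithm, just a demonstration of the idea.
function scoreAgainstQuery(array $book, string $query): int
{
    $score = 0;
    // Count query words found in the title or author (partial matches).
    foreach (preg_split('/\s+/', trim($query)) as $word) {
        if (stripos($book['title'], $word) !== false) {
            $score += 1;
        }
        if (stripos($book['author'], $word) !== false) {
            $score += 1;
        }
    }
    // Big bonus for an exact title match.
    if (strcasecmp($book['title'], $query) === 0) {
        $score += 10;
    }
    return $score;
}

$books = [
    ['title' => 'Hello',       'author' => 'World Man'],
    ['title' => 'Hello World', 'author' => 'Man'],
];

// For the query "Hello World", the exact-match book wins (12 vs 2) --
// something a score precomputed before the query could never capture.
foreach ($books as $book) {
    echo $book['title'], ': ', scoreAgainstQuery($book, 'Hello World'), PHP_EOL;
}
```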

Licensed under: CC-BY-SA with attribution