Question

Suppose I have a collection (be it an array, generic List, or whatever is the fastest solution to this problem) of a certain class, let's call it ClassFoo:

class ClassFoo
{
    public string word;
    public float score;
    //... etc ...
} 

Assume there are going to be around 50,000 items in the collection, all in memory. Now I want to obtain, as fast as possible, all the instances in the collection that satisfy a condition on their word member, for example like this:

List<ClassFoo> result = new List<ClassFoo>();
foreach (ClassFoo cf in collection)
{
    if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
        result.Add(cf);
}

How do I get the results as fast as possible? Should I consider some advanced indexing techniques and data structures?

The application domain for this problem is an autocompleter, which takes a query and returns a collection of suggestions as a result. Assume that the condition doesn't get any more complex than this. Assume also that there are going to be a lot of searches.

Solution

With the constraint that the condition clause can be "anything", you're limited to scanning the entire list and applying the condition.

If there are limitations on the condition clause, then you can look at organizing the data to more efficiently handle the queries.

For example, the code sample with the "indexByFirstLetter" dictionary (in the tips below) doesn't help at all with an "EndsWith" query.

So, it really comes down to what queries you want to do against that data.

In databases, this problem is the job of the "query optimizer". In a typical database with no indexes, every query is going to be a table scan. As you add indexes to a table, the optimizer can use them to build more sophisticated query plans and get to the data more efficiently. That's essentially the problem you're describing.

Once you have a more concrete subset of the types of queries, you can make a better decision as to which structure is best. Also, you need to consider the amount of data. If you have a list of 10 elements, each less than 100 bytes, a scan of everything may well be the fastest thing you can do, since there's so little data. Obviously that doesn't scale to 1M elements, but even clever access techniques carry a cost in setup, maintenance (like index maintenance), and memory.

EDIT, based on the comment

If it's an autocompleter and the data is static, then sort it and use a binary search. You're really not going to get faster than that.
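
A minimal sketch of that static case, assuming an ordinal sort so that all words sharing a prefix sit in one contiguous run (the PrefixMatches name is invented; the EndsWith half of the query would need a second list sorted on reversed words, as one of the tips below describes):

using System;
using System.Collections.Generic;
using System.Linq;

// Sort once, with an ordinal comparer, so all words sharing a prefix are contiguous.
List<ClassFoo> sorted = collection.OrderBy(cf => cf.word, StringComparer.Ordinal).ToList();
List<string> keys = sorted.Select(cf => cf.word).ToList();

List<ClassFoo> PrefixMatches(string query)
{
    var result = new List<ClassFoo>();
    // BinarySearch returns the match index, or the bitwise complement
    // of the index of the first element greater than the query.
    int i = keys.BinarySearch(query, StringComparer.Ordinal);
    if (i < 0) i = ~i;
    // With duplicates, BinarySearch may land mid-run; step back to the first hit.
    while (i > 0 && keys[i - 1].StartsWith(query, StringComparison.Ordinal)) i--;
    // Walk forward while the prefix still matches.
    while (i < keys.Count && keys[i].StartsWith(query, StringComparison.Ordinal))
        result.Add(sorted[i++]);
    return result;
}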

If the data is dynamic, then store it in a balanced tree and search that. That's effectively a binary search, and it lets you keep adding data as you go.
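
In .NET, SortedSet<T> is one such balanced tree (a red-black tree under the hood). A sketch of a prefix lookup against it, again assuming ordinal ordering; the query + char.MaxValue upper bound is a practical cutoff for ordinary text, not a rigorous one:

using System;
using System.Collections.Generic;

// A red-black tree: O(log n) inserts, so new words can arrive at any time.
var words = new SortedSet<string>(StringComparer.Ordinal) { "foo", "foobar", "fox" };

string query = "foo";
// Every ordinary word starting with "foo" sorts between "foo" and "foo" + '\uffff'.
foreach (string w in words.GetViewBetween(query, query + char.MaxValue))
    Console.WriteLine(w);   // prints: foo, foobar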

Anything else is some specialization on these concepts.

OTHER TIPS

var answers = myList.Where(item => item.word.StartsWith(query) || item.word.EndsWith(query));

That's the easiest in my opinion, and it should execute rather quickly.

Not sure I understand... All you can really do is optimize the rule; that's the part that needs to be fastest. You can't speed up the loop itself without just throwing more hardware at it.

You could parallelize if you have multiple cores or machines.
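
For instance, PLINQ will partition the scan across cores (a sketch; only worth it when the list is large enough to amortize the partitioning overhead):

using System.Linq;

var answers = collection.AsParallel()
                        .Where(cf => cf.word.StartsWith(query) || cf.word.EndsWith(query))
                        .ToList();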

I'm not up on my Java right now, but I would think about the following things.

How are you creating your list? Perhaps you can create it already ordered in a way which cuts down on comparison time.

If you are just doing a straight loop through your collection, you won't see much difference between storing it as an array or as a linked list.

For storing the results, the structure could make a difference depending on how you are collecting them (but assuming Java's generic structures are smart, it won't). As I said, I'm not up on my Java, but I assume the generic linked list keeps a tail pointer, in which case it wouldn't really make a difference. Someone with more knowledge of the underlying array vs. linked-list implementations, and of how they end up looking in the bytecode, could probably tell you whether appending to a linked list with a tail pointer or inserting into an array is faster (my guess would be the array). On the other hand, to use an array you would need to know the size of your result set in advance, or sacrifice some storage space and make it as big as the whole collection you are iterating through.

Optimizing your comparison query by figuring out which comparison is most likely to be true and doing that one first could also help. I.e., if in general 10% of the time a member of the collection starts with your query, and 30% of the time a member ends with it, you would want to do the end comparison first.
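
With those hypothetical percentages, that just means flipping the operands, since || stops evaluating as soon as one side is true:

// EndsWith is true ~30% of the time vs. ~10% for StartsWith (figures
// assumed above), so test it first and short-circuit on a hit.
if (cf.word.EndsWith(query) || cf.word.StartsWith(query))
    result.Add(cf);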

For your particular example, sorting the collection would help: you could binary-chop to the first item that starts with the query and terminate early when you reach the first one that doesn't. For the second clause, you could also build a table of pointers to collection items sorted by the reverse of each string.

In general, if you know the structure of the query in advance, you can sort your collection (or build several sorted indexes for your collection if there are multiple clauses) appropriately; if you do not, you will not be able to do better than linear search.
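
A sketch of that two-index idea, under the same ordinal-sort assumption as the binary search above (the Reversed helper and variable names are invented for illustration):

using System;
using System.Linq;

static string Reversed(string s) => new string(s.Reverse().ToArray());

// Two sorted views of the same items: one by word, one by reversed word.
var byWord    = collection.OrderBy(cf => cf.word, StringComparer.Ordinal).ToList();
var byReverse = collection.OrderBy(cf => Reversed(cf.word), StringComparer.Ordinal).ToList();

// StartsWith(query): binary-chop byWord to the first word >= query,
// then scan forward while words still start with the query.
// EndsWith(query):   a word ends with query iff its reversal starts with
// Reversed(query), so run the same chop-and-scan over byReverse.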

If it's something where you populate the list once and then do many lookups (thousands or more), you could build some kind of lookup dictionary that maps starts-with/ends-with values to their actual items. That would be a fast lookup, but it would use much more memory. If you aren't doing that many lookups, or you know you're going to be repopulating the list at least semi-frequently, I'd go with the LINQ query that CQ suggested.
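
A sketch of that trade (names are hypothetical): every prefix and every suffix of every word becomes a dictionary key, which is exactly where the extra memory goes.

using System.Collections.Generic;

var lookup = new Dictionary<string, List<ClassFoo>>();

void Register(string key, ClassFoo cf)
{
    if (!lookup.TryGetValue(key, out var list))
        lookup[key] = list = new List<ClassFoo>();
    list.Add(cf);
}

// Populate once; lookups afterwards are a single dictionary probe.
foreach (var cf in collection)
{
    for (int len = 1; len <= cf.word.Length; len++)
    {
        Register(cf.word.Substring(0, len), cf);                   // prefix
        if (len < cf.word.Length)                                  // the full word is
            Register(cf.word.Substring(cf.word.Length - len), cf); // already a prefix
    }
}

// Query time: a single probe (dedupe if a word can both start and end with the key).
List<ClassFoo> hits = lookup.TryGetValue(query, out var found) ? found : new List<ClassFoo>();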

You can create some sort of index, and it might be faster.

We can build an index like this:

var indexByFirstLetter = new Dictionary<char, List<ClassFoo>>();
foreach (var cf in collection) {
  char first = cf.word[0];
  char last = cf.word[cf.word.Length - 1];

  if (!indexByFirstLetter.TryGetValue(first, out var byFirst))
    indexByFirstLetter[first] = byFirst = new List<ClassFoo>();
  byFirst.Add(cf);

  // Also index the last letter, so some EndsWith candidates land in a bucket too.
  if (last != first) {  // avoid adding the same item to a bucket twice
    if (!indexByFirstLetter.TryGetValue(last, out var byLast))
      indexByFirstLetter[last] = byLast = new List<ClassFoo>();
    byLast.Add(cf);
  }
}

Then use it like this:

if (indexByFirstLetter.TryGetValue(query[0], out var bucket)) {
  foreach (ClassFoo cf in bucket) {
    if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
      result.Add(cf);
  }
}

Now we possibly don't have to loop through as many ClassFoo instances as in your example, but then again we have to keep the index up to date. There is no guarantee that it is faster, but it is definitely more complicated.

Depends. Are all your objects always going to be loaded in memory? Do you have a finite limit of objects that may be loaded? Will your queries have to consider objects that haven't been loaded yet?

If the collection will get large, I would definitely use an index.

In fact, if the collection can grow to an arbitrary size and you're not sure that you will be able to fit it all in memory, I'd look into an ORM, an in-memory database, or another embedded database. XPO from DevExpress (an ORM) and SQLite.Net (an in-memory database) come to mind.

If you don't want to go that far, make a simple index consisting of the "word" member references mapping to class references.

If the set of possible criteria is fixed and small, you can assign a bitmask to each element in the list. The size of the bitmask is the size of the criteria set. When you create an element or add it to the list, you check which criteria it satisfies and set the corresponding bits in that element's bitmask. Matching elements from the list is then as easy as matching their bitmasks against the target bitmask. A more general method is the Bloom filter.
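
A minimal sketch of the bitmask idea, with an invented criteria set purely for illustration (imagine these as members of whatever class owns the list):

using System;
using System.Linq;

[Flags]
enum Criteria
{
    None     = 0,
    IsShort  = 1 << 0,   // e.g. word.Length <= 4
    HasDigit = 1 << 1,
    AllLower = 1 << 2,
}

// Computed once, when the element is created or added to the list.
static Criteria Classify(string word)
{
    var c = Criteria.None;
    if (word.Length <= 4) c |= Criteria.IsShort;
    if (word.Any(char.IsDigit)) c |= Criteria.HasDigit;
    if (word == word.ToLowerInvariant()) c |= Criteria.AllLower;
    return c;
}

// Matching a target is then a single AND plus a compare.
static bool Matches(Criteria element, Criteria target) => (element & target) == target;

// Matches(Classify("a1"), Criteria.IsShort | Criteria.HasDigit) == true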

Licensed under: CC-BY-SA with attribution