Question

I am somewhat struggling with the terminology and the complexity of my explanation here, so feel free to edit it.

I have 1,000 - 20,000 objects. Each one can contain several name words (first, second, middle, last, title...), normalized numbers (home, business...), email addresses, or even physical addresses and spouse names.

I want to implement a search that lets users freely combine word parts and number parts.
When I search for "LL 676" I want to find all objects that contain both "LL" AND "676" in any of their strings.
Currently I iterate over every object and every property of each object, split the search string on " ", and call stringInstance.Contains(searchWord).

This is too slow, so I am looking for a better solution.

What is the appropriate language agnostic data structure for this?
In my case I need it for C#.

Is the following data structure a good solution?


It's based on a HashMap/Dictionary.
First I create a string that contains all the name parts and phone numbers I want to search through; one example would be: "William Bill Henry Gates III 3. +436760000 billgatesstreet 12".
Then I split on " " and, for every word x, I create all possible substrings y that fulfill x.Contains(y). I put each of those substrings into the HashMap/Dictionary.
On lookup/search I just call the search for every search word and then intersect the results. Naturally, the lookup speed is blazingly fast (native HashMap/Dictionary speed).

EDIT: Inserts are very fast as well (insignificant time) now that I use a smarter algorithm to get the substrings.
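To illustrate the substring keys described above, here is a minimal sketch (the helper name is mine, not from the question) that enumerates every distinct substring of one word; each substring would become a dictionary key:

```csharp
using System;
using System.Collections.Generic;

static class SubstringDemo
{
    // Enumerate all distinct substrings of a word: every start position,
    // every length. A word of length n yields n(n+1)/2 substrings before
    // de-duplication.
    public static HashSet<string> AllSubstrings(string word)
    {
        var result = new HashSet<string>();
        for (int start = 0; start < word.Length; start++)
            for (int len = 1; len <= word.Length - start; len++)
                result.Add(word.Substring(start, len));
        return result;
    }
}
```

For "BILL" this yields B, I, L, BI, IL, LL, BIL, ILL, BILL — 9 distinct keys out of the 4·5/2 = 10 generated (the two single "L" occurrences collapse into one).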


Solution

It's possible I've misunderstood your algorithm or requirement, but this seems like it could be a potential performance improvement:

foreach (string arg in searchWords)
{
    if (String.IsNullOrEmpty(arg))
        continue;

    tempList = new List<T>();

    // TryGetValue avoids the double lookup of ContainsKey + indexer.
    if (dictionary.TryGetValue(arg, out List<T> matches))
        foreach (T obj in matches)
            if (list.Contains(obj))
                tempList.Add(obj);

    list = tempList;
}

The idea is that you do the first search word separately before this, and only put all the subsequent words into the searchWords list.

That should allow you to remove your final foreach loop entirely. Results only stay in your list as long as they keep matching every search word, rather than first piling in everything that matches any single word and then filtering it back out at the end.
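A self-contained sketch of that idea (the index shape and identifiers are my assumptions, not the asker's class): seed the candidate list from the first search word, then intersect it with the postings of each further word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class IntersectSearch
{
    // index maps a substring key to the list of objects containing it.
    public static List<T> Search<T>(Dictionary<string, List<T>> index, string query)
    {
        var words = query.Split(' ').Where(w => w.Length > 0).ToList();
        if (words.Count == 0) return new List<T>();

        // Seed from the first word; an unknown word means no matches at all.
        if (!index.TryGetValue(words[0], out var seed)) return new List<T>();
        var candidates = new List<T>(seed);

        // Intersect with each subsequent word's postings.
        foreach (var word in words.Skip(1))
        {
            if (!index.TryGetValue(word, out var postings)) return new List<T>();
            var set = new HashSet<T>(postings);
            candidates = candidates.Where(set.Contains).ToList();
        }
        return candidates;
    }
}
```

The candidate list can only shrink, so a rare word early in the query keeps every later intersection cheap.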

Other tips

In case anyone cares for my solution:
Disclaimer:
This is only a rough draft.
I have only done some synthetic testing, and I have written a lot of it without testing it again.
I have revised my code: inserts now take ((n^2)/2)+(n/2) operations per word instead of 2^n-1, which is vastly faster, so word length is no longer a practical concern.

namespace MegaHash
{
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public class GenericConcurrentMegaHash<T>
{
    // After doing a bulk add, call AwaitAll() to ensure all data was added!
    private ConcurrentBag<Task> bag = new ConcurrentBag<Task>();

    private ConcurrentDictionary<string, List<T>> dictionary = new ConcurrentDictionary<string, List<T>>();

    // consider changing this to include for example '-'
    public char[] splitChars;

    public GenericConcurrentMegaHash()
        : this(new char[] { ' ' })
    {
    }

    public GenericConcurrentMegaHash(char[] splitChars)
    {
        this.splitChars = splitChars;
    }

    public void Add(string keyWords, T o)
    {
        keyWords = keyWords.ToUpper();

        foreach (string keyWord in keyWords.Split(splitChars))
        {
            if (string.IsNullOrEmpty(keyWord))
                continue; // skip empty tokens; 'return' here would abort the whole Add

            string word = keyWord; // copy for the closure (required before C# 5)
            this.bag.Add(Task.Factory.StartNew(() => AddInternal(word, o)));
        }
    }

    public void AwaitAll()
    {
        lock (this.bag)
        {
            foreach (Task t in bag)
                t.Wait();

            this.bag = new ConcurrentBag<Task>();
        }
    }

    private void AddInternal(string key, T y)
    {
        // Generate every substring of key: for each length (key.Length - i),
        // take all valid start positions i2 — ((n^2)/2)+(n/2) substrings in total.
        for (int i = 0; i < key.Length; i++)
        {
            for (int i2 = 0; i2 < i + 1; i2++)
            {
                string desire = key.Substring(i2, key.Length - i);

                // GetOrAdd avoids the check-then-insert race of ContainsKey + indexer.
                List<T> l = dictionary.GetOrAdd(desire, _ => new List<T>());
                lock (l)
                {
                    if (!l.Contains(y))
                        l.Add(y);
                }
            }
        }
    }

    public IList<T> FulltextSearch(string searchString)
    {
        searchString = searchString.ToUpper();

        string[] searchWords = searchString.Split(splitChars);

        // First pass: collect every object that matches at least one search word.
        List<T> list = new List<T>();
        foreach (string arg in searchWords)
        {
            if (string.IsNullOrEmpty(arg))
                continue;

            if (dictionary.TryGetValue(arg, out List<T> matches))
                foreach (T obj in matches)
                    if (!list.Contains(obj))
                        list.Add(obj);
        }

        // Second pass: keep only the objects that match every non-empty word.
        // TryGetValue also guards against words with no entry at all.
        List<T> returnList = new List<T>();
        foreach (T o in list)
        {
            bool matchesAll = true;
            foreach (string arg in searchWords)
            {
                if (string.IsNullOrEmpty(arg))
                    continue;

                if (!dictionary.TryGetValue(arg, out List<T> matches) || !matches.Contains(o))
                {
                    matchesAll = false;
                    break;
                }
            }
            if (matchesAll)
                returnList.Add(o);
        }

        return returnList;
    }
}

}
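A hypothetical usage sketch for the class above (the record values and call order are my assumptions, not the author's):

```csharp
// Index a few contacts, wait for the background insert tasks, then search.
var index = new GenericConcurrentMegaHash<string>();

index.Add("William Bill Henry Gates III +436760000", "gates-record");
index.Add("Ada Lovelace +436761111", "lovelace-record");
index.AwaitAll(); // ensure all queued Add tasks have finished

// Matches any record containing both "LL" and "676" as substrings;
// the search is case-insensitive because Add and FulltextSearch uppercase.
IList<string> hits = index.FulltextSearch("ll 676");
```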

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow