Optimal Data Structure for Searching though String

https://softwareengineering.stackexchange.com/questions/408597

09-03-2021
|

Pergunta

I am current developing a karaoke app in JavaFX, I need advise on the most optimal data structure storing a huge list of songs (library). I am inclined to use Binary Search Trees due to its Big O Performance of O(log(n)).

I am using the String compareTo function in Java for insertion but searching proves difficult with compareTo. Therefore, I am not sure that is the best for searching though strings.

I am using https://www.bigocheatsheet.com/ for Big O Performance Metrics.

Should I just ditch Binary Search Trees or use an another simple structure like List? (However performance will be O(n))

Song Data & Sorting: I load the sample song data provided to me by loading the file and storing it into the data structure. I want to search only by Song Name (Using Java String Contains method) since it's a basic program. Here is the sample song data:

Decades (2007 Remaster) Joy Division    374 test.mp4
Lets Stay Together  Al Green    199 test.mp4
Jump For Joy    New York Trio   286 test.mp4
Victims Of The Revolution   Bad Religion    197 test.mp4
Unstable Condition  John Tejada 348 test.mp4
Go or Linger    Natas loves you 173 test.mp4
Ocean Front Property    George Strait   197 test.mp4
Negai (Album-Mix)   Perfume 298 test.mp4
A Little Bit More   Jamie Lidell    186 test.mp4
Good Position   Yin Yoga Academy    219 test.mp4
Weekend Kane Brown  226 test.mp4
Oh Industry Bette Midler    244 test.mp4

Solução

You can't use a binary search when searching for substrings, e.g. title.contains(searchTerm). There are two reasonable alternatives:

Tokenize the titles into words, remove stopwords, and index all words so that you can narrow your search down to titles that contain a particular word. So you'd have an index that tells you which titles contain the word foo and which titles contains the word bar. Then the search algorithm would look something like this:
```
List<String> titles = ...;
HashMap<String, Set<Integer>> index = ...;

// try to narrow down a smaller set of candidates
Set<Integer> candidates = null;  // null representing all entries
for (String word : tokenize(searchTerm)) {
  candidates = intersection(candidates, index.get(word));
}

// scan through all candidate titles
return candidates.stream().map(titles::get).filter(s -> s.contains(searchTerm));
```
The O(…) complexity depends a lot on the distribution of your data. However, you can start with the index entry with the fewest candidates, and stop narrowing down the candidates when the candidate set is small enough. Average complexity will be around O(k·l) where k is the largest index set you consider and l is the average length of the titles.
Modern CPUs are really fast (Gigabytes per second) at scanning memory for a particular pattern – as long as the data is in memory and available in a flat region. Unfortunately you don't have good control over the memory layout in Java. This gives you two options that you should seriously explore:
- Don't be clever, just scan the entire list of records in O(n) fashion. This is likely to be faster than clever algorithms for up to at least a few hundred entries, possibly up until many thousands of entries.
- Hope that your JVM has a super-optimized string search capabilities, and keep all your data in a flat string that you search. For example, load your file contents into a single String, then use indexOf() to find occurrences of the search term, then find the line start and line end of the entry and extract the record in some usable form. There will be some false positives e.g. when the artist name matches the search term. But those can be filtered out in a post-processing step.

You might be tempted to distribute the search across multiple threads that each get a part of the candidate set or the flat string. But this may or may not be useful, because CPUs can often process data faster than it can be served through the memory bus. If your speed requirements are particularly high, the only reasonable approach is to split the workload across multiple machines, MapReduce style. But to be clear: a music manager application is far, far away from such big data considerations.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a softwareengineering.stackexchange