What is the most efficient way to keep track of a specific character's index in a string?

https://stackoverflow.com/questions/36122

09-06-2019
|

Question

Take the following string as an example:

"The quick brown fox"

Right now the q in quick is at index 4 of the string (starting at 0) and the f in fox is at index 16. Now lets say the user enters some more text into this string.

"The very quick dark brown fox"

Now the q is at index 9 and the f is at index 26.

What is the most efficient method of keeping track of the index of the original q in quick and f in fox no matter how many characters are added by the user?

Language doesn't matter to me, this is more of a theory question than anything so use whatever language you want just try to keep it to generally popular and current languages.

The sample string I gave is short but I'm hoping for a way that can efficiently handle any size string. So updating an array with the offset would work with a short string but will bog down with to many characters.

Even though in the example I was looking for the index of unique characters in the string I also want to be able to track the index of the same character in different locations such as the o in brown and the o in fox. So searching is out of the question.

I was hoping for the answer to be both time and memory efficient but if I had to choose just one I care more about performance speed.

Solution

Let's say that you have a string and some of its letters are interesting. To make things easier let's say that the letter at index 0 is always interesting and you never add something before it—a sentinel. Write down pairs of (interesting letter, distance to the previous interesting letter). If the string is "+the very Quick dark brown Fox" and you are interested in q from 'quick' and f from 'fox' then you would write: (+,0), (q,10), (f,17). (The sign + is the sentinel.)

Now you put these in a balanced binary tree whose in-order traversal gives the sequence of letters in the order they appear in the string. You might now recognize the partial sums problem: You enhance the tree so that nodes contain (letter, distance, sum). The sum is the sum of all distances in the left subtree. (Therefore sum(x)=distance(left(x))+sum(left(x)).)

You can now query and update this data structure in logarithmic time.

To say that you added n characters to the left of character c you say distance(c)+=n an then go and update sum for all parents of c.

To ask what is the index of c you compute sum(c)+sum(parent(c))+sum(parent(parent(c)))+...

OTHER TIPS

Your question is a little ambiguous - are you looking to keep track of the first instances of every letter? If so, an array of length 26 might be the best option.

Whenever you insert text into a string at a position lower than the index you have, just compute the offset based on the length of the inserted string.

It would also help if you had a target language in mind as not all data structures and interactions are equally efficient and effective in all languages.

The standard trick that usually helps in similar situations is to keep the characters of the string as leaves in a balanced binary tree. Additionally, internal nodes of the tree should keep sets of letters (if the alphabet is small and fixed, they could be bitmaps) that occur in the subtree rooted at a particular node.

Inserting or deleting a letter into this structure only needs O(log(N)) operations (update the bitmaps on the path to root) and finding the first occurence of a letter also takes O(log(N)) operations - you descend from the root, going for the leftmost child whose bitmap contains the interesting letter.

Edit: The internal nodes should also keep number of leaves in the represented subtree, for efficient computation of the letter's index.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow