Is in-place run length encoding possible in O(1) space given that the output is shorter than the input?

https://cs.stackexchange.com/questions/130419

space-complexity

29-09-2020
|

Pergunta

This is inspired by a problem from here. This is the approximate form of the problem:

Given a string like "aaaa777cbb" (10 symbols long), run length encode it in-place to a string like "a473c1b2" (8 symbols long). You are guaranteed that the input will always be longer than the output.

The precise form of the problem is:

You are given an ordered list $L$ of symbols from a set $S$. Any symbol from $S$ may appear in the list.
$S$ contains all the positive integers up to and including $|L|$ (the length of $L$) and also some other symbols.
Rules of manipulating the input in-place
- You can replace one symbol in the list with another
- You can trim the list to a length of your choice by removing symbols from the end
- You cannot insert symbols
You must overwrite the list of symbols with it's run-length-encoding representation and trim it to length so that it includes only the run-length-encoding representation.
- The run-length-encoding representation replaces each series of 1 or more of the same symbol in the input with that symbol followed by the symbol representing the number of occurrences of the previous symbol.
  - For example: $[a, a, a, a, a, a, a, a, a, a, 7]$ becomes $[a, 10, 7, 1]$ meaning "$a$ ten times followed by $7$ one time"
  - Note that the length of the output list is always even
- You are guaranteed that the length of the input list is always larger than the length of the output list
You must do this with $O(1)$ additional working memory
- Each "word" of working memory contains $log_2 |S|$ bits (put another way, words may be constructed which store constant amounts of information, the position of any element in the input, or any symbol from the input)

Intuitively I don't think this is possible. The solutions provided on the original site seem to break on strings like "abccccc" (length 7) where the output should be "a1b1c5" (length 6), since they start by overwriting "b" with the "1" from "a1" before they have even checked which symbol is in the 2nd position.

I have thought about trying to start by finding the compressible runs of letters (2 or more of the same letter), but I don't know how to tell which symbols are already processed and which are from the original input without using some sort of memory that which would grow with the size of the input (like a bitmap of processed areas) and therefore put me in violation of the $O(1)$ space requirement.

I consider acceptable answers to be proofs that this problem either is or is not solvable in $O(1)$ space.

Solução

An $O(1)$ space algorithm that uses one extra symbol not found in $L$, which I will call $B$ for blank space.

I define an operation, a "shift right" at position $k$. It finds the next blank symbol $B$ after position $k$ , shifts all symbols one to the right, and sets position $k$ to $B$. For example a right shift at the third symbol:

abcdeBfjgB    becomes    abBcdefjgB
  ^                        ^

Similarly a "shift left" at position $k$ assumes there is a $B$ symbol there, and moves it all the way to the end of the string, shifting all other symbols left.

abBdeBfjgB    becomes    abdeBfjgBB
  ^                        ^

Note that you can perform both shifts in $O(1)$ memory.

Now, first we replace all runs of any symbol $x$ with length $l \geq 3$ or greater with $xlB^{l-2}$. This can be done in-place, and leaves such runs identifiable. Also note that these are all the runs that shortens the output compared to the input.

Then, move a single pointer $p$ from left to right:

If the string at the pointer starts with $B$, shift left.
If the string at the pointer starts with $xlB^+$, this is the start of a run with length at least 3. Increment $p$ by $2$.
If the string at the pointer starts with $xx$ replace it with $x 2$ and increment $p$ by 2. Note that $xx$ can never be the start of a run of length 3 or higher, since we already replaced those.
If the string at the pointer has form $xy$, increment $p$ by 1, shift right, and replace the resulting $B$ at $p$ with $1$. Increment $p$ by 1 again. Note that the shift right must succeed due to the guarantee the output is shorter than the input, and we already created all space possible when replacing all runs of length $3$.

If any space is left over at the end, the algorithm will eventually get stuck performing step 1. Detect this, strip the remaining space, and you are done.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange