Pergunta

The problem, coming from an interview question, is:

You have a stream of incoming numbers in range 0 to 60000 and you have a function which will take a number from that range and return the count of occurrence of that number till that moment. Give a suitable Data structure/algorithm to implement this system.

The stream is infinite, so if fixed size data structures re used, i.e. primitive types in Java or C, they will overflow. So there is the need to use data structures that have a size that grows over time. As pointed by the interviewer, the memory occupied by those data structures will diverge.

The model of computation is a Turing machine with three tapes:

  • infinite read-only one-way input tape;
  • constant space bounded read-write two way work tape;
  • infinite write-only one-way output tape.

The main reason to choose the model above is that in the real world there is virtually no limit to the quantity of input that can be acquired using a keyboard or a network connection. Also, there is virtually no limit to the quantity of information that can be displayed on amonitor over time. But memory is limited and expensive.

I modeled the problem as the problem to recognize the language L of all couples (number,number of occurrences so far).

As a corollary of the Theorem 3.13 in Hopcroft-Ullman I know that every language recognized by a constant space bounded machine is regular.

But, in any given moment, the language L is a finite language, because the number of couples to be recognized is finite: 60001. So I can't use the pumping lemma for regular languages to prove that such language is not regular.

Is there a way I can complete my proof?

The original question is here.

Foi útil?

Solução

Let me give an explanation which does not differ in content from the accepted answer, but brings the question back to the realm of regular languages.

The language you are dealing with can be defined as follows. Let $s \in \Sigma^{\mathbb{N}}$ be a (countably) infinite string of symbols from some finite alphabet $\Sigma$, and let $s[1:i]$ be the prefix of the first $i$ symbols in $s$. In your case $s$ is the input and and $\Sigma$ is the nonnegative positive integers $\{0, \ldots, 60000\}$. $L(s)$ can be defined as a language of triples $(s[1:i], a, j)$ where $(s[1:i], a, j) \in L(s)$ if and only if there are $j$ occurrences of $a$ in $s[1:i]$. Convince yourself, using the pumping lemma, that for any fixed $s$, $L(s)$ is not regular. Then convince yourself that if a bounded memory Turing machine could solve your problem, it could also recognize the non-regular language $L(s)$. This argument ought to work even if we fix $a$ and define a similar language $L(s, a)$.

These kinds of examples are the reason why our computational model of choice (our = the algorithms community) is the word RAM machine, and we assume that word size grows with input size. This is supposed to model the fact that as the instances we face in reality grow, so does the memory of our computers. Of course, at some point we'll face physical limits of the hardware: there is only so much memory that can be accessed fast by a single/finitely many processors. One way to go beyond these limits is memory hierarchies and the modeling becomes more complicated, but there is some very nice theory as well (see external memory models and cache-oblivious algorithms).

Outras dicas

For simplicity assume, you don't have 600.000 numbers but only 1. At some time $i$ a certain number of 1s have been sent. We call this number the prefix of the stream. There are infinitely many possible prefixes.

Since there is only a constant size work tape, the TM can have at most a constant number of configurations. Thus, there are at least two different prefixes, that would lead to the same configuration. In both cases the TM has to output the same number, for any following request. Hence the TM can not compute the desired function.

I hope this is convincing, although it is not super-formal.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a cs.stackexchange
scroll top