Trovare il RLE lunghezza minima

https://stackoverflow.com/questions/2261318

20-09-2019
|

Domanda

L'algoritmo RLE classica comprime i dati utilizzando i numeri per rappresentare quante volte il carattere che segue un numero appare nel testo in quella posizione. Ad esempio:

AAABBAAABBCECE => 3A2B3A2B1C1E1C1E

Tuttavia, nell'esempio precedente, tale metodo comporta ancora più spazio utilizzata dal testo compresso. Un'idea migliore sarebbe quella di utilizzare i numeri per rappresentare quante volte il stringa a seguito di un certo numero appare nel testo dato. Ad esempio:

AAABBAAABBCECE => 2AAABB2CE ( "AAABB" due volte, poi "CE" due volte).

Ora, la mia domanda è: come potrei implementare un algoritmo efficiente che scopre il numero minimo di caratteri in un RLE ottimale con questo metodo? metodi di forza bruta esistono, ma ho bisogno di qualcosa di più veloce (al massimo O (lunghezza ²) ). Forse possiamo usare la programmazione dinamica?

Soluzione

Si può fare in ~~quadratica~~ ~~cubica~~ tempo quadratica tramite la programmazione dinamica.

Ecco il codice Python:

import sys
import numpy as np

bignum = 10000

S = sys.argv[1] #'AAABBAAABBCECE'                                                                                                                              
N = len(S)

# length of longest substring match bet s[i:] and s[j:]                                                                                                        
maxmatch = np.zeros( (N+1,N+1), dtype=int)

for i in xrange(N-1,-1,-1):
  for j in xrange(i+1,N):
    if S[i] == S[j]:
      maxmatch[i,j] = maxmatch[i+1,j+1]+1

# P[n,k] = cost of encoding first n characters given that last k are a block                                                                                   
P = np.zeros( (N+1,N+1),dtype=int ) + bignum
# Q[n] = cost of encoding first n characters                                                                                                                   
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string                                                                                                                          
P[0,0]=0
Q[0]=0

for n in xrange(1,N+1):
  for k in xrange(1,n+1):
    if n-2*k >= 0:
#     s1, s2 = S[n-k:n], S[n-2*k:n-k]                                                                                                                          
#     if s1 == s2:                                                                                                                                             
      if maxmatch[n-2*k,n-k] >=k:
        # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k                                                                                     
        P[n,k] = min(P[n,k], P[n-k,k])
        print 'P[%d,%d] = %d' % (n,k,P[n,k])
    # Here we are starting a new block: 1 x_1...x_k                                                                                                            
    P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
    print 'P[%d,%d] = %d' % (n,k,P[n,k])
  for k in xrange(1,n+1):
    Q[n] = min(Q[n], P[n,k])

  print

print Q[N]

È possibile ricostruire la codifica reale ricordando le scelte lungo il percorso.

ho lasciato fuori una ruga minore, che è che potrebbe essere necessario utilizzare un byte in più per tenere C + 1 se C è di grandi dimensioni. Se si utilizza interi a 32 bit, questo non si presenti in ogni contesto in cui tempo di esecuzione di questo algoritmo è fattibile. Se si utilizza a volte interi più brevi per risparmiare spazio, allora si dovrà pensarci, e magari aggiungere un'altra dimensione al vostro tavolo in base alle dimensioni della più recente C. In teoria, questo potrebbe aggiungere un log (N) fattore, ma non credo che questo sarà evidente nella pratica.

Modifica: Per il bene di @Moron, qui è lo stesso codice con più dichiarazioni di stampa, in modo da poter più facilmente vedere che cosa sta pensando l'algoritmo:

import sys
import numpy as np

bignum = 10000

S = sys.argv[1] #'AAABBAAABBCECE'                                                                                                                              
N = len(S)

# length of longest substring match bet s[i:] and s[j:]                                                                                                        
maxmatch = np.zeros( (N+1,N+1), dtype=int)

for i in xrange(N-1,-1,-1):
  for j in xrange(i+1,N):
    if S[i] == S[j]:
      maxmatch[i,j] = maxmatch[i+1,j+1]+1

# P[n,k] = cost of encoding first n characters given that last k are a block                                                                                   
P = np.zeros( (N+1,N+1),dtype=int ) + bignum
# Q[n] = cost of encoding first n characters                                                                                                                   
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string                                                                                                                          
P[0,0]=0
Q[0]=0

for n in xrange(1,N+1):
  for k in xrange(1,n+1):
    if n-2*k >= 0:
#     s1, s2 = S[n-k:n], S[n-2*k:n-k]                                                                                                                          
#     if s1 == s2:                                                                                                                                             
      if maxmatch[n-2*k,n-k] >=k:
        # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k                                                                                     
        P[n,k] = min(P[n,k], P[n-k,k])
        print "P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for P[%d,%d] with %s's count incremented" % (n\
,k,P[n,k],n,P[n-k,k],n-k,k,S[n-k:n])
    # Here we are starting a new block: 1 x_1...x_k                                                                                                            
    P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
    print 'P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for Q[%d] with a new block 1%s' % (n,k,P[n,k],n,Q[\
n-k]+1+k,n-k,S[n-k:n])
  for k in xrange(1,n+1):
    Q[n] = min(Q[n], P[n,k])

  print
  print 'Q[%d] = %d\t I can encode first %d characters of S in only %d characters!' % (n,Q[n],n,Q[n])
  print


print Q[N]

Le ultime righe della sua uscita sul ABCDABCDABCDBCD sono in questo modo:

Q[13] = 7        I can encode first 13 characters of S in only 7 characters!

P[14,1] = 9      I can encode first 14 characters of S in only 9 characters if I use my solution for Q[13] with a new block 1C
P[14,2] = 8      I can encode first 14 characters of S in only 8 characters if I use my solution for Q[12] with a new block 1BC
P[14,3] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[11] with a new block 1DBC
P[14,4] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[10] with a new block 1CDBC
P[14,5] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[9] with a new block 1BCDBC
P[14,6] = 12     I can encode first 14 characters of S in only 12 characters if I use my solution for Q[8] with a new block 1ABCDBC
P[14,7] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[7] with a new block 1DABCDBC
P[14,8] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[6] with a new block 1CDABCDBC
P[14,9] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[5] with a new block 1BCDABCDBC
P[14,10] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[4] with a new block 1ABCDABCDBC
P[14,11] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[3] with a new block 1DABCDABCDBC
P[14,12] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBC
P[14,13] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBC
P[14,14] = 15    I can encode first 14 characters of S in only 15 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBC

Q[14] = 8        I can encode first 14 characters of S in only 8 characters!

P[15,1] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[14] with a new block 1D
P[15,2] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[13] with a new block 1CD
P[15,3] = 11     I can encode first 15 characters of S in only 11 characters if I use my solution for P[12,3] with BCD's count incremented
P[15,3] = 9      I can encode first 15 characters of S in only 9 characters if I use my solution for Q[12] with a new block 1BCD
P[15,4] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[11] with a new block 1DBCD
P[15,5] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[10] with a new block 1CDBCD
P[15,6] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[9] with a new block 1BCDBCD
P[15,7] = 13     I can encode first 15 characters of S in only 13 characters if I use my solution for Q[8] with a new block 1ABCDBCD
P[15,8] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[7] with a new block 1DABCDBCD
P[15,9] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[6] with a new block 1CDABCDBCD
P[15,10] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[5] with a new block 1BCDABCDBCD
P[15,11] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[4] with a new block 1ABCDABCDBCD
P[15,12] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[3] with a new block 1DABCDABCDBCD
P[15,13] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBCD
P[15,14] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBCD
P[15,15] = 16    I can encode first 15 characters of S in only 16 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBCD

Q[15] = 9        I can encode first 15 characters of S in only 9 characters!

Altri suggerimenti

Non credo programmazione dinamica funzionerà qui, come si potrebbe avere sotto-stringhe circa la metà della lunghezza della stringa completa nella soluzione. Sembra che tu abbia bisogno di usare la forza bruta. Per un problema correlato, controlla la Lempel-Ziv- Welch Algoritmo . È un algoritmo efficiente che trova una codifica minimo utilizzando sottostringhe.

Un modo molto comune per codificare i dati compressi RLE è quello di indicare un byte speciale come il "DLE" (scusate, non ricordo che cosa si distingue questo termine per), che significa "il prossimo è un conteggio seguito da un byte ".

In questo modo, solo ripetendo sequenze deve essere codificato. In genere il simbolo DLE viene scelto per ridurre al minimo la possibilità di esso che si verificano naturalmente nel dati non compressi.

Per la vostra originale esempio, impostiamo il punto finale (o punto) come il DLE, questo sarebbe codificare il vostro esempio come segue:

AAABBAAABBCECE => 3A2B3A2B1C1E1C1E <-- your encoding
AAABBAAABBCECE => .3ABB.3ABBCECE   <-- my encoding

Si potrebbe codificare solo una sequenza se finisce in realtà come il risparmio di spazio. Se si limita la lunghezza delle sequenze da 255, in modo che il conteggio si inserisce in un byte, una sequenza assume così 3 byte, il DLE, il conteggio, e il byte di ripetere. Non sarebbe probabilmente codificare sequenze 3 byte o, perché decodificare quelli trasporta leggermente più in alto di un non-codificato sequenza.

Nel tuo esempio banale, il risparmio è inesistente, ma se si tenta di comprimere un bitmap che contiene uno screenshot di un programma per lo più bianchi, come il Blocco note, o un browser, allora vedrete il risparmio di spazio reale.

Se doveste riscontrare il carattere DLE, naturalmente, solo emettono un conteggio di 0, perché sappiamo che non avremmo mai codificare una sequenza di 0-lunghezza, il DLE seguito da uno 0 byte significa che si decodificare come un singolo byte DLE .

modi molto intelligenti di trovare sottostringhe corrispondenti possono portare a considerare gli alberi di suffisso e array suffisso. Pensando array suffisso e compressione può condurvi alla http: //en.wikipedia .org / wiki / Burrows% E2% 80% 93Wheeler_transform . Questo può essere il modo più elegante di levare dal marasma fino run-length encoding.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow