최소 길이 rle 찾기

https://stackoverflow.com/questions/2261318

20-09-2019
|

문제

고전적인 RLE 알고리즘은 숫자를 사용하여 데이터를 압축하여 해당 위치에 텍스트에 숫자가 나타나는 횟수를 나타냅니다. 예를 들어:

AAABBAAABBCECE => 3A2B3A2B1C1E1C1E

그러나 위의 예에서,이 방법은 압축 텍스트에서 더 많은 공간을 사용합니다. 더 나은 아이디어는 숫자를 사용하여 서브 스트링 주어진 텍스트에 숫자가 나타납니다. 예를 들어:

AAABBAAABBCECE => 2AAABB2CE ( "AAABB"두 번, "CE").

이제 내 질문은 다음과 같습니다.이 방법을 사용하여 최적의 RLE에서 최소 문자 수를 찾는 효율적인 알고리즘을 구현할 수있는 방법은 무엇입니까? Brute Force 방법이 존재하지만 더 빨리 무언가가 필요합니다 (최대 O (길이²)). 아마도 우리는 동적 프로그래밍을 사용할 수 있습니까?

해결책

할 수 있습니다 ~~2 차~~ 입방 동적 프로그래밍을 통한 2 차 시간.

다음은 몇 가지 파이썬 코드입니다.

import sys
import numpy as np

bignum = 10000

S = sys.argv[1] #'AAABBAAABBCECE'                                                                                                                              
N = len(S)

# length of longest substring match bet s[i:] and s[j:]                                                                                                        
maxmatch = np.zeros( (N+1,N+1), dtype=int)

for i in xrange(N-1,-1,-1):
  for j in xrange(i+1,N):
    if S[i] == S[j]:
      maxmatch[i,j] = maxmatch[i+1,j+1]+1

# P[n,k] = cost of encoding first n characters given that last k are a block                                                                                   
P = np.zeros( (N+1,N+1),dtype=int ) + bignum
# Q[n] = cost of encoding first n characters                                                                                                                   
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string                                                                                                                          
P[0,0]=0
Q[0]=0

for n in xrange(1,N+1):
  for k in xrange(1,n+1):
    if n-2*k >= 0:
#     s1, s2 = S[n-k:n], S[n-2*k:n-k]                                                                                                                          
#     if s1 == s2:                                                                                                                                             
      if maxmatch[n-2*k,n-k] >=k:
        # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k                                                                                     
        P[n,k] = min(P[n,k], P[n-k,k])
        print 'P[%d,%d] = %d' % (n,k,P[n,k])
    # Here we are starting a new block: 1 x_1...x_k                                                                                                            
    P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
    print 'P[%d,%d] = %d' % (n,k,P[n,k])
  for k in xrange(1,n+1):
    Q[n] = min(Q[n], P[n,k])

  print

print Q[N]

길을 따라 선택을 기억하여 실제 인코딩을 재구성 할 수 있습니다.

나는 작은 주름을 제외하고 C가 큰 경우 C+1을 유지하기 위해 여분의 바이트를 사용해야 할 수도 있습니다. 32 비트 INT를 사용하는 경우이 알고리즘의 런타임이 실현 가능한 컨텍스트에서는 나타나지 않습니다. 때때로 더 짧은 INT를 사용하여 공간을 절약하는 경우, 그것에 대해 생각하고 최신 C의 크기를 기반으로 테이블에 다른 차원을 추가 할 수 있습니다. 나는 이것이 실제로 분명 할 것이라고 생각하지 않습니다.

편집 : @Moron의 이점을 위해 여기에 더 많은 인쇄 문의 동일한 코드가 있으므로 알고리즘이 무엇을 생각하는지 더 쉽게 볼 수 있습니다.

import sys
import numpy as np

bignum = 10000

S = sys.argv[1] #'AAABBAAABBCECE'                                                                                                                              
N = len(S)

# length of longest substring match bet s[i:] and s[j:]                                                                                                        
maxmatch = np.zeros( (N+1,N+1), dtype=int)

for i in xrange(N-1,-1,-1):
  for j in xrange(i+1,N):
    if S[i] == S[j]:
      maxmatch[i,j] = maxmatch[i+1,j+1]+1

# P[n,k] = cost of encoding first n characters given that last k are a block                                                                                   
P = np.zeros( (N+1,N+1),dtype=int ) + bignum
# Q[n] = cost of encoding first n characters                                                                                                                   
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string                                                                                                                          
P[0,0]=0
Q[0]=0

for n in xrange(1,N+1):
  for k in xrange(1,n+1):
    if n-2*k >= 0:
#     s1, s2 = S[n-k:n], S[n-2*k:n-k]                                                                                                                          
#     if s1 == s2:                                                                                                                                             
      if maxmatch[n-2*k,n-k] >=k:
        # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k                                                                                     
        P[n,k] = min(P[n,k], P[n-k,k])
        print "P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for P[%d,%d] with %s's count incremented" % (n\
,k,P[n,k],n,P[n-k,k],n-k,k,S[n-k:n])
    # Here we are starting a new block: 1 x_1...x_k                                                                                                            
    P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
    print 'P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for Q[%d] with a new block 1%s' % (n,k,P[n,k],n,Q[\
n-k]+1+k,n-k,S[n-k:n])
  for k in xrange(1,n+1):
    Q[n] = min(Q[n], P[n,k])

  print
  print 'Q[%d] = %d\t I can encode first %d characters of S in only %d characters!' % (n,Q[n],n,Q[n])
  print


print Q[N]

ABCDABCDABCDBCD에서 출력의 마지막 몇 줄은 그렇습니다.

Q[13] = 7        I can encode first 13 characters of S in only 7 characters!

P[14,1] = 9      I can encode first 14 characters of S in only 9 characters if I use my solution for Q[13] with a new block 1C
P[14,2] = 8      I can encode first 14 characters of S in only 8 characters if I use my solution for Q[12] with a new block 1BC
P[14,3] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[11] with a new block 1DBC
P[14,4] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[10] with a new block 1CDBC
P[14,5] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[9] with a new block 1BCDBC
P[14,6] = 12     I can encode first 14 characters of S in only 12 characters if I use my solution for Q[8] with a new block 1ABCDBC
P[14,7] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[7] with a new block 1DABCDBC
P[14,8] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[6] with a new block 1CDABCDBC
P[14,9] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[5] with a new block 1BCDABCDBC
P[14,10] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[4] with a new block 1ABCDABCDBC
P[14,11] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[3] with a new block 1DABCDABCDBC
P[14,12] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBC
P[14,13] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBC
P[14,14] = 15    I can encode first 14 characters of S in only 15 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBC

Q[14] = 8        I can encode first 14 characters of S in only 8 characters!

P[15,1] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[14] with a new block 1D
P[15,2] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[13] with a new block 1CD
P[15,3] = 11     I can encode first 15 characters of S in only 11 characters if I use my solution for P[12,3] with BCD's count incremented
P[15,3] = 9      I can encode first 15 characters of S in only 9 characters if I use my solution for Q[12] with a new block 1BCD
P[15,4] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[11] with a new block 1DBCD
P[15,5] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[10] with a new block 1CDBCD
P[15,6] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[9] with a new block 1BCDBCD
P[15,7] = 13     I can encode first 15 characters of S in only 13 characters if I use my solution for Q[8] with a new block 1ABCDBCD
P[15,8] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[7] with a new block 1DABCDBCD
P[15,9] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[6] with a new block 1CDABCDBCD
P[15,10] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[5] with a new block 1BCDABCDBCD
P[15,11] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[4] with a new block 1ABCDABCDBCD
P[15,12] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[3] with a new block 1DABCDABCDBCD
P[15,13] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBCD
P[15,14] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBCD
P[15,15] = 16    I can encode first 15 characters of S in only 16 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBCD

Q[15] = 9        I can encode first 15 characters of S in only 9 characters!

다른 팁

솔루션의 전체 문자열 길이의 절반 정도의 하위 스트링을 가질 수 있으므로 동적 프로그래밍이 여기서 작동한다고 생각하지 않습니다. 무자비한 힘을 사용해야하는 것 같습니다. 관련 문제는 Lempel-Ziv-Welch 알고리즘. 하위 문자열을 사용하여 최소 인코딩을 찾는 효율적인 알고리즘입니다.

A very common way to encode RLE compressed data is to designate a special byte as the "DLE" (sorry, I don't remember what that term stands for), which means "the next is a count followed by a byte".

This way, only repeating sequences needs to be encoded. Typically the DLE symbol is chosen to minimize the chance of it occuring naturally in the uncompressed data.

For your original example, let's set the full stop (or dot) as the DLE, this would encode your example as follows:

AAABBAAABBCECE => 3A2B3A2B1C1E1C1E <-- your encoding
AAABBAAABBCECE => .3ABB.3ABBCECE   <-- my encoding

You would only encode a sequence if it actually ends up as saving space. If you limit the length of sequences to 255, so that the count fits in a byte, a sequence thus takes 3 bytes, the DLE, the count, and the byte to repeat. You would probably not encode 3-byte sequences either, because decoding those carries slightly more overhead than a non-encoded sequence.

In your trivial example, the saving is nonexistant, but if you try to compress a bitmap containing a screenshot of a mostly white program, like Notepad, or a browser, then you'll see real space savings.

If you should encounter the DLE character naturally, just emit a count of 0, since we know we would never encode a 0-length sequence, the DLE followed by a 0-byte means that you decode it as a single DLE byte.

Very clever ways of finding matching substrings may lead to considering suffix trees and suffix arrays. Thinking about suffix arrays and compression may lead you to the http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform. That may be the most elegant way of souping up run length encoding.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow