Removing a sequence of characters from a large binary file using Python

https://stackoverflow.com/questions/221386

03-07-2019

Question

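I would like to trim long sequences of the same value from a binary file in Python. A simple way of doing it is to read in the whole file and use re.sub to replace the unwanted sequence. This will of course not work on large binary files. Can it be done in something like numpy?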

Solution

If you don't have the memory to do open("big.file").read(), then numpy won't really help. It uses the same memory as Python variables do (if you have 1GB of RAM, you can only load 1GB of data into numpy).

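The solution is simple: read the file in chunks. Open it with f = open("big.file", "rb"), then do a series of f.read(500) calls, remove the sequence from each chunk, and write the result out to another file object. This is pretty much how you do file reading/writing in C.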
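The chunked loop described above can be sketched as follows. This is a minimal illustration, not the answer author's code: the helper name `copy_in_chunks`, the `process` callback, and the in-memory `io.BytesIO` objects standing in for real files are all assumptions made for the example.

```python
import io

def copy_in_chunks(src, dst, process=lambda chunk: chunk, chunk_size=500):
    """Copy src to dst one chunk at a time, passing each chunk through process()."""
    # iter() with a sentinel keeps calling src.read(chunk_size) until it returns b""
    for chunk in iter(lambda: src.read(chunk_size), b""):
        dst.write(process(chunk))

# Hypothetical stand-ins for real files opened in "rb" / "wb" mode
src = io.BytesIO(b"1234567890")
dst = io.BytesIO()

# Remove b"567" from each 5-byte chunk independently
copy_in_chunks(src, dst, process=lambda c: c.replace(b"567", b""), chunk_size=5)
print(dst.getvalue())
```

Only one chunk is held in memory at a time. Because each chunk is processed on its own, a sequence that straddles two chunks is left untouched in this naive form.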

The problem then is that you can miss the pattern you are replacing when it is split across two chunks. For example:

import io

target_seq = b"567"
input_file = io.BytesIO(b"1234567890")  # stands in for the file on disk

input_file.read(5) # reads 12345, doesn't contain 567
input_file.read(5) # reads 67890, doesn't contain 567

The obvious solution is to start at the first character of the file, check the next len(target_seq) characters, then move forward one character and check again.

For example (pseudo code!):

seek_start = 0
chunk_size = len(target_seq)

while True:
    input_file.seek(seek_start, 0)  # whence=0 means seek from start of file (0 + offset)
    cur_data = input_file.read(chunk_size)  # reads 123 on the first pass
    if not cur_data:
        break  # reached the end of the file
    if target_seq == cur_data:
        # Found it! Write the replacement and skip past the whole match
        out_file.write(b"replacement_string")
        seek_start += chunk_size
    else:
        # Not it: shove only the first byte in the new file, then slide forward by one
        out_file.write(cur_data[:1])
        seek_start += 1

It's not exactly the most efficient solution, but it will work, and it does not require keeping a copy of the file (or two) in memory.
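A less seek-heavy variant of the same idea, sketched here as an alternative rather than as the answer's own method, reads large chunks and holds back the last len(old) - 1 bytes of each one so that a match split across a chunk boundary is still seen whole. The function name `replace_in_stream`, its parameters, and the default chunk size are assumptions for illustration.

```python
import io

def replace_in_stream(src, dst, old, new, chunk_size=64 * 1024):
    """Stream src to dst, replacing every occurrence of old with new."""
    if not old:
        raise ValueError("old must not be empty")
    keep = len(old) - 1  # longest prefix of old that can dangle at a chunk end
    buf = b""
    while True:
        chunk = src.read(chunk_size)
        buf += chunk
        pos = 0
        # Replace every complete match currently in the buffer
        while True:
            hit = buf.find(old, pos)
            if hit < 0:
                break
            dst.write(buf[pos:hit])
            dst.write(new)
            pos = hit + len(old)
        if not chunk:
            # End of input: whatever is left cannot contain a match
            dst.write(buf[pos:])
            break
        # Flush everything except the last `keep` unprocessed bytes
        safe_end = max(pos, len(buf) - keep)
        dst.write(buf[pos:safe_end])
        buf = buf[safe_end:]

# Hypothetical in-memory stand-ins for real binary files
src = io.BytesIO(b"1234567890")
dst = io.BytesIO()
replace_in_stream(src, dst, b"567", b"", chunk_size=5)
print(dst.getvalue())  # b'1234890'
```

Only original, not-yet-processed bytes are carried over, so the replacement may be shorter or longer than the sequence it replaces, and memory use stays at roughly one chunk plus len(old) bytes.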

Other tips

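If two copies fit in memory, then you can easily make a copy. The second copy is the compressed version. Sure, you can use numpy, but you can also use the array package. Additionally, you can treat your big binary object as a string of bytes and manipulate it directly.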
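When both copies do fit in memory, treating the data as a byte string can be as short as the sketch below. It assumes that "trimming" means capping runs of one known value at a maximum length; the function name `trim_runs` and its parameters are hypothetical.

```python
import re

def trim_runs(data, value, max_run):
    """Shorten every run of `value` longer than max_run down to max_run repeats."""
    # Match max_run + 1 or more consecutive copies of value
    pattern = re.compile(b"(?:" + re.escape(value) + b"){%d,}" % (max_run + 1))
    # A function replacement avoids backslash interpretation in the new bytes
    return pattern.sub(lambda match: value * max_run, data)

original = b"ab\x00\x00\x00\x00\x00cd\x00\x00"
trimmed = trim_runs(original, b"\x00", 2)
print(trimmed)  # b'ab\x00\x00cd\x00\x00'
```

With a real file, `original` would come from reading the file in "rb" mode, and `trimmed` is the second, compressed copy.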

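It sounds like your file may be really large, so that you can't fit two copies into memory. (You didn't provide a lot of details, so this is just a guess.) In that case you'll have to do your compression in chunks: read in a chunk, do some processing on that chunk, and write it out. Again, numpy, array, or a simple string of bytes will work fine.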

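dbr's solution is a good idea, but all you really have to do is rewind the file pointer by the length of the sequence you are searching for before you read the next chunk.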

def ReplaceSequence(inFilename, outFilename, oldSeq, newSeq):
    # Caveat: rewinding the output only lines up when len(newSeq) == len(oldSeq)
    chunk = 1024

    with open(inFilename, "rb") as inputFile, open(outFilename, "wb") as outputFile:
        while True:
            data = inputFile.read(chunk)
            outputFile.write(data.replace(oldSeq, newSeq))

            if len(data) < chunk:
                break

            # Step both files back so a match split across two chunks is seen whole
            inputFile.seek(-len(oldSeq), 1)
            outputFile.seek(-len(oldSeq), 1)

Ajmayorga's suggestion is fine unless the old and new sequences differ in size, or the sequence being replaced sits at the end of a chunk.

I fixed it like this:

def ReplaceSequence(inFilename, outFilename, oldSeq, newSeq):
    chunk = 1024
    oldSeqLen = len(oldSeq)

    with open(inFilename, "rb") as inputFile, open(outFilename, "wb") as outputFile:
        while True:
            data = inputFile.read(chunk)
            dataSize = len(data)

            # Rewind only over the unmodified tail after the last match
            # (at most len(oldSeq) bytes), so newSeq may differ in length
            seekLen = dataSize - data.rfind(oldSeq) - oldSeqLen
            if seekLen > oldSeqLen:
                seekLen = oldSeqLen

            outputFile.write(data.replace(oldSeq, newSeq))

            if dataSize < chunk:
                break

            inputFile.seek(-seekLen, 1)
            outputFile.seek(-seekLen, 1)

You need to make your question more precise. Do you know ahead of time the values you want to trim?

Assuming you do, I would probably search for the matching sections by using subprocess to run "fgrep -o -b <search string>", and then change the relevant sections of the file using the Python file object's seek, read, and write methods.
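A rough sketch of that two-step approach follows. It rests on several assumptions that are not in the answer: GNU grep is on PATH, the extra -a flag is added so grep treats binary data as text, the search string contains no newline bytes, and the replacement has exactly the same length as the match (overwriting in place cannot shrink or grow the file). The helper names `find_offsets_with_grep` and `overwrite_at` are hypothetical.

```python
import subprocess

def find_offsets_with_grep(path, pattern):
    """Return the byte offsets of pattern in path, as reported by grep -F -o -b."""
    # grep -F is the modern spelling of fgrep; -a is an added assumption
    result = subprocess.run(
        ["grep", "-F", "-o", "-b", "-a", "-e", pattern, path],
        stdout=subprocess.PIPE,
    )
    # With -o -b each output line looks like b"<offset>:<match>"
    return [int(line.split(b":", 1)[0])
            for line in result.stdout.splitlines() if line]

def overwrite_at(path, offsets, new):
    """Overwrite len(new) bytes at each offset, leaving the file size unchanged."""
    with open(path, "r+b") as f:
        for offset in offsets:
            f.seek(offset)
            f.write(new)
```

A call such as overwrite_at("big.file", find_offsets_with_grep("big.file", b"567"), b"000") would then patch every match in place. If the replacement has a different length, the file has to be rewritten to a new file instead.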

This generator-based version keeps exactly one character of the file content in memory at a time.

Note that I am taking your question title quite literally: you want to reduce runs of the same character to a single character. For replacing patterns in general, this does not work.

import io

def gen_chars(stream):
    # Yield the stream one character at a time until it is exhausted
    while True:
        ch = stream.read(1)
        if ch:
            yield ch
        else:
            break

def gen_unique_chars(stream):
    # Yield a character only when it differs from the previous one
    lastchar = ''
    for char in gen_chars(stream):
        if char != lastchar:
            yield char
        lastchar = char

def remove_seq(infile, outfile):
    for ch in gen_unique_chars(infile):
        outfile.write(ch)

# Represents a file open for reading
infile = io.StringIO("1122233333444555")

# Represents a file open for writing
outfile = io.StringIO()

# Will print "12345"
remove_seq(infile, outfile)
outfile.seek(0)
print(outfile.read())
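Reading one character per call is slow on large files. The same run-collapsing idea can be applied a chunk at a time, as in the sketch below, which is a variation and not part of the answer above. It remembers the last byte of the previous chunk so that a run spanning a chunk boundary is still reduced to one byte. The name `collapse_runs` and the chunk size are assumptions, and the streams are expected to be opened in binary mode.

```python
import io
import re

# One byte followed by one or more copies of itself; DOTALL so "." matches b"\n" too
_RUN = re.compile(rb"(.)\1+", re.DOTALL)

def collapse_runs(src, dst, chunk_size=64 * 1024):
    """Copy src to dst, reducing each run of identical bytes to a single byte."""
    last = b""  # final byte of the previous chunk
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        squeezed = _RUN.sub(rb"\1", chunk)
        # Drop the first byte if it continues the run that ended the previous chunk
        if squeezed[:1] == last:
            squeezed = squeezed[1:]
        dst.write(squeezed)
        last = chunk[-1:]

# Hypothetical in-memory stand-ins for real binary files
infile = io.BytesIO(b"1122233333444555")
outfile = io.BytesIO()
collapse_runs(infile, outfile, chunk_size=3)
print(outfile.getvalue())  # b'12345'
```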

License: CC-BY-SA with attribution

Not affiliated with StackOverflow