Get the last n lines of a file with Python, similar to tail

https://stackoverflow.com/questions/136168

02-07-2019

Question

I'm writing a log file viewer for a web application, and for that I want to paginate through the lines of the log file. The items in the file are line-based, with the newest item at the bottom.

So I need a tail() method that can read n lines from the bottom and supports an offset. This is what I came up with:

def tail(f, n, offset=0):
    """Reads a n lines from f with an offset of offset lines."""
    avg_line_length = 74
    to_read = n + offset
    while 1:
        try:
            f.seek(-(avg_line_length * to_read), 2)
        except IOError:
            # woops.  apparently file is smaller than what we want
            # to step back, go to the beginning instead
            f.seek(0)
        pos = f.tell()
        lines = f.read().splitlines()
        if len(lines) >= to_read or pos == 0:
            return lines[-to_read:offset and -offset or None]
        avg_line_length *= 1.3

Is this a reasonable approach? What is the recommended way to tail log files with offsets?

Solution 6

The code I ended up using. I think this is the best so far:

def tail(f, n, offset=None):
    """Reads a n lines from f with an offset of offset lines.  The return
    value is a tuple in the form ``(lines, has_more)`` where `has_more` is
    an indicator that is `True` if there are more lines in the file.
    """
    avg_line_length = 74
    to_read = n + (offset or 0)

    while 1:
        try:
            f.seek(-(avg_line_length * to_read), 2)
        except IOError:
            # woops.  apparently file is smaller than what we want
            # to step back, go to the beginning instead
            f.seek(0)
        pos = f.tell()
        lines = f.read().splitlines()
        if len(lines) >= to_read or pos == 0:
            return lines[-to_read:offset and -offset or None], \
                   len(lines) > to_read or pos > 0
        avg_line_length *= 1.3
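
For readers on Python 3, the following is a minimal sketch of the same guess-and-grow idea, not the author's code. It assumes the file is opened in binary mode, clamps the seek position instead of relying on an exception, and keeps the guess an integer. The name `tail_page` is hypothetical.

```python
import os


def tail_page(f, n, offset=0, avg_line_length=74):
    """Return (lines, has_more) for a file object opened in binary mode.

    `lines` holds up to n lines that end `offset` lines before the end
    of the file; `has_more` is True if earlier lines exist.
    """
    to_read = n + offset
    f.seek(0, os.SEEK_END)
    size = f.tell()
    guess = avg_line_length * to_read
    while True:
        pos = max(0, size - guess)  # never seek before the start of the file
        f.seek(pos)
        lines = f.read().splitlines()
        # Ask for one extra line because the first one may be partial,
        # unless the read already started at the beginning of the file.
        if len(lines) > to_read or pos == 0:
            end = max(0, len(lines) - offset)
            start = max(0, end - n)
            has_more = pos > 0 or start > 0
            return lines[start:end], has_more
        guess = int(guess * 1.3) + 1  # grow the guess, keeping it an integer
```

Calling `tail_page(f, 10)` would give the newest page and `tail_page(f, 10, offset=10)` the page before it; the lines come back as bytes and would still need decoding.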

Other suggestions

This may be quicker than yours. It makes no assumptions about line length. It backs through the file one block at a time until it has found the right number of '\n' characters.

def tail( f, lines=20 ):
    total_lines_wanted = lines

    BLOCK_SIZE = 1024
    f.seek(0, 2)
    block_end_byte = f.tell()
    lines_to_go = total_lines_wanted
    block_number = -1
    blocks = [] # blocks of size BLOCK_SIZE, in reverse order starting
                # from the end of the file
    while lines_to_go > 0 and block_end_byte > 0:
        if (block_end_byte - BLOCK_SIZE > 0):
            # read the last block we haven't yet read
            f.seek(block_number*BLOCK_SIZE, 2)
            blocks.append(f.read(BLOCK_SIZE))
        else:
            # file too small, start from begining
            f.seek(0,0)
            # only read what was not read
            blocks.append(f.read(block_end_byte))
        lines_found = blocks[-1].count('\n')
        lines_to_go -= lines_found
        block_end_byte -= BLOCK_SIZE
        block_number -= 1
    all_read_text = ''.join(reversed(blocks))
    return '\n'.join(all_read_text.splitlines()[-total_lines_wanted:])

I don't like tricky assumptions about line length when, as a practical matter, you can never know things like that.

Generally, this will locate the last 20 lines on the first or second pass through the loop. If your 74-character figure is actually accurate, make the block size 2048 and you'll tail 20 lines almost immediately.
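
The arithmetic behind that estimate can be checked with a small sketch. The helper name `blocks_needed` is hypothetical, and the estimate ignores the partial line at the start of the oldest block, so it is only a rough lower bound.

```python
import math


def blocks_needed(n_lines, avg_line_length, block_size):
    """Estimate how many blocks must be read to cover n_lines lines."""
    return math.ceil(n_lines * avg_line_length / block_size)


# 20 lines of about 74 characters are roughly 1480 bytes of text.
print(blocks_needed(20, 74, 1024))  # 2
print(blocks_needed(20, 74, 2048))  # 1
```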

Also, I don't burn a lot of brain calories trying to finesse alignment with physical OS blocks. Using these high-level I/O packages, I doubt you'll see any performance consequence of trying to align on OS block boundaries. If you use lower-level I/O, then you might see a speedup.

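Assuming a Unix-like system, on Python 2 you can do:

import os
def tail(f, n, offset=0):
  # f is the file name; n and offset are ints, so convert before concatenating
  stdin,stdout = os.popen2("tail -n "+str(n+offset)+" "+f)
  stdin.close()
  lines = stdout.readlines(); stdout.close()
  # drop the trailing `offset` lines (lines[:-0] would be empty)
  return lines[:-offset] if offset else lines

For Python 3 you can do:

import subprocess
def tail(f, n, offset=0):
    # f is the file name; tail expects the line count as a string
    proc = subprocess.Popen(['tail', '-n', str(n + offset), f], stdout=subprocess.PIPE)
    lines = proc.stdout.readlines()
    proc.wait()
    # drop the trailing `offset` lines (lines[:-0] would be empty)
    return lines[:-offset] if offset else lines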

If reading the whole file is acceptable, use a deque.

from collections import deque
deque(f, maxlen=n)
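
As a slightly fuller sketch of the deque approach, the function below also supports the offset asked for in the question. The name `tail_deque` and the UTF-8 encoding are assumptions made for illustration; the whole file is still read once from start to end.

```python
from collections import deque


def tail_deque(path, n, offset=0):
    """Return the last n lines of `path`, skipping the final `offset` lines."""
    with open(path, "r", encoding="utf-8") as f:
        # The deque discards old lines and keeps only the newest n + offset.
        last = list(deque(f, maxlen=n + offset))
    end = max(0, len(last) - offset)
    return last[max(0, end - n):end]
```

The returned lines keep their trailing newline, as with `f.readlines()`.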

Prior to 2.6, deques didn't have a maxlen option, but it's easy enough to implement.

import itertools
def maxque(items, size):
    items = iter(items)
    q = deque(itertools.islice(items, size))
    for item in items:
        del q[0]
        q.append(item)
    return q

If it's a requirement to read the file from the end, use a galloping (a.k.a. exponential) search.

def tail(f, n):
    assert n >= 0
    pos, lines = n+1, []
    while len(lines) <= n:
        try:
            f.seek(-pos, 2)
        except IOError:
            f.seek(0)
            break
        finally:
            lines = list(f)
        pos *= 2
    return lines[-n:]

Here is my answer. Pure Python. Using timeit it seems pretty fast. Tailing 100 lines of a log file that has 100,000 lines:

>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=10)
0.0014600753784179688
>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=100)
0.00899195671081543
>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=1000)
0.05842900276184082
>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=10000)
0.5394978523254395
>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=100000)
5.377126932144165

Here is the code:

import os


def tail(f, lines=1, _buffer=4098):
    """Tail a file and get X lines from the end"""
    # place holder for the lines found
    lines_found = []

    # block counter will be multiplied by buffer
    # to get the block size from the end
    block_counter = -1

    # loop until we find X lines
    while len(lines_found) < lines:
        try:
            f.seek(block_counter * _buffer, os.SEEK_END)
        except IOError:  # either file is too small, or too many lines requested
            f.seek(0)
            lines_found = f.readlines()
            break

        lines_found = f.readlines()

        # we found enough lines, get out
        # Removed this line because it was redundant the while will catch
        # it, I left it for history
        # if len(lines_found) > lines:
        #    break

        # decrement the block counter to get the
        # next X bytes
        block_counter -= 1

    return lines_found[-lines:]

S.Lott's answer above almost works for me but ends up giving me partial lines. It turns out that it corrupts data on block boundaries because data holds the read blocks in reversed order. When ''.join(data) is called, the blocks are in the wrong order. This fixes that.

def tail(f, window=20):
    """
    Returns the last `window` lines of file `f` as a list.
    f - a byte file-like object
    """
    if window == 0:
        return []
    BUFSIZ = 1024
    f.seek(0, 2)
    bytes = f.tell()
    size = window + 1
    block = -1
    data = []
    while size > 0 and bytes > 0:
        if bytes - BUFSIZ > 0:
            # Seek back one whole BUFSIZ
            f.seek(block * BUFSIZ, 2)
            # read BUFFER
            data.insert(0, f.read(BUFSIZ))
        else:
            # file too small, start from begining
            f.seek(0,0)
            # only read what was not read
            data.insert(0, f.read(bytes))
        linesFound = data[0].count('\n')
        size -= linesFound
        bytes -= BUFSIZ
        block -= 1
    return ''.join(data).splitlines()[-window:]

Simple and fast solution with mmap:

import mmap
import os

def tail(filename, n):
    """Returns last n lines from the filename. No exception handling"""
    size = os.path.getsize(filename)
    with open(filename, "rb") as f:
        # for Windows the mmap parameters are different
        fm = mmap.mmap(f.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
        try:
            for i in xrange(size - 1, -1, -1):
                if fm[i] == '\n':
                    n -= 1
                    if n == -1:
                        break
            return fm[i + 1 if i else 0:].splitlines()
        finally:
            fm.close()

An even cleaner Python 3 compatible version that doesn't insert but appends and reverses:

def tail(f, window=1):
    """
    Returns the last `window` lines of file `f` as a list of bytes.
    """
    if window == 0:
        return b''
    BUFSIZE = 1024
    f.seek(0, 2)
    end = f.tell()
    nlines = window + 1
    data = []
    while nlines > 0 and end > 0:
        i = max(0, end - BUFSIZE)
        nread = min(end, BUFSIZE)

        f.seek(i)
        chunk = f.read(nread)
        data.append(chunk)
        nlines -= chunk.count(b'\n')
        end -= nread
    return b'\n'.join(b''.join(reversed(data)).splitlines()[-window:])

Use it like this:

with open(path, 'rb') as f:
    last_lines = tail(f, 3).decode('utf-8')

I found the Popen above to be the best solution. It's quick and dirty and it works. For Python 2.6 on a Unix machine I used the following:

import subprocess

def GetLastNLines(n, fileName):
    """
    Name:           GetLastNLines
    Description:    Gets last n lines using Unix tail
    Output:         returns last n lines of a file
    Keyword argument:
    n -- number of last lines to return
    fileName -- Name of the file you need to tail into
    """
    p = subprocess.Popen(['tail', '-n', str(n), fileName], stdout=subprocess.PIPE)
    soutput, _ = p.communicate()  # communicate() returns (stdout, stderr)
    return soutput

soutput will contain the last n lines of the file. To iterate through soutput line by line do:

for line in GetLastNLines(50,'myfile.log').split('\n'):
    print line

Update of @papercrane's solution to Python 3. Open the file with open(filename, 'rb') and:

def tail(f, window=20):
    """Returns the last `window` lines of file `f` as a list.
    """
    if window == 0:
        return []

    BUFSIZ = 1024
    f.seek(0, 2)
    remaining_bytes = f.tell()
    size = window + 1
    block = -1
    data = []

    while size > 0 and remaining_bytes > 0:
        if remaining_bytes - BUFSIZ > 0:
            # Seek back one whole BUFSIZ
            f.seek(block * BUFSIZ, 2)
            # read BUFFER
            bunch = f.read(BUFSIZ)
        else:
            # file too small, start from beginning
            f.seek(0, 0)
            # only read what was not read
            bunch = f.read(remaining_bytes)

        bunch = bunch.decode('utf-8')
        data.insert(0, bunch)
        size -= bunch.count('\n')
        remaining_bytes -= BUFSIZ
        block -= 1

    return ''.join(data).splitlines()[-window:]

Posting an answer at the behest of commenters on my answer to a similar question, where the same technique was used to mutate the last line of a file, not just get it.

For a file of significant size, mmap is the best way to do this. To improve on the existing mmap answer, this version is portable between Windows and Linux and should run faster (though it won't work without some modifications on 32-bit Python with files in the GB range; see the other answer for hints on handling this, and for modifying it to work on Python 2).

import io  # Gets consistent version of open for both Py2.7 and Py3.x
import itertools
import mmap

def skip_back_lines(mm, numlines, startidx):
    '''Factored out to simplify handling of n and offset'''
    for _ in itertools.repeat(None, numlines):
        startidx = mm.rfind(b'\n', 0, startidx)
        if startidx < 0:
            break
    return startidx

def tail(f, n, offset=0):
    # Reopen file in binary mode
    with io.open(f.name, 'rb') as binf, mmap.mmap(binf.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # len(mm) - 1 handles files ending w/newline by getting the prior line
        startofline = skip_back_lines(mm, offset, len(mm) - 1)
        if startofline < 0:
            return []  # Offset lines consumed whole file, nothing to return
            # If using a generator function (yield-ing, see below),
            # this should be a plain return, no empty list

        endoflines = startofline + 1  # Slice end to omit offset lines

        # Find start of lines to capture (add 1 to move from newline to beginning of following line)
        startofline = skip_back_lines(mm, n, startofline) + 1

        # Passing True to splitlines makes it return the list of lines without
        # removing the trailing newline (if any), so list mimics f.readlines()
        return mm[startofline:endoflines].splitlines(True)
        # If Windows style \r\n newlines need to be normalized to \n, and input
        # is ASCII compatible, can normalize newlines with:
        # return mm[startofline:endoflines].replace(os.linesep.encode('ascii'), b'\n').splitlines(True)

This assumes the number of lines tailed is small enough that you can safely read them all into memory at once; you could also make this a generator function and manually read a line at a time by replacing the final line with:

        mm.seek(startofline)
        # Call mm.readline n times, or until EOF, whichever comes first
        # Python 3.2 and earlier:
        for line in itertools.islice(iter(mm.readline, b''), n):
            yield line

        # 3.3+:
        yield from itertools.islice(iter(mm.readline, b''), n)

Lastly, this reads in binary mode (necessary to use mmap), so it gives str lines (Py2) and bytes lines (Py3); if you want unicode (Py2) or str (Py3), the iterative approach could be tweaked to decode for you and/or fix newlines:

        lines = itertools.islice(iter(mm.readline, b''), n)
        if f.encoding:  # Decode if the passed file was opened with a specific encoding
            lines = (line.decode(f.encoding) for line in lines)
        if 'b' not in f.mode:  # Fix line breaks if passed file opened in text mode
            lines = (line.replace(os.linesep, '\n') for line in lines)
        # Python 3.2 and earlier:
        for line in lines:
            yield line
        # 3.3+:
        yield from lines

Note: I typed this all up on a machine where I lack access to Python to test. Please let me know if I typoed anything; this was similar enough to my other answer that I think it should work, but the tweaks (e.g. handling an offset) could lead to subtle errors. Please let me know in the comments if there are any mistakes.

Based on S.Lott's top-voted answer (Sep 25 '08 at 21:43), but fixed for small files.

def tail(the_file, lines_2find=20):  
    the_file.seek(0, 2)                         #go to end of file
    bytes_in_file = the_file.tell()             
    lines_found, total_bytes_scanned = 0, 0
    while lines_2find+1 > lines_found and bytes_in_file > total_bytes_scanned: 
        byte_block = min(1024, bytes_in_file-total_bytes_scanned)
        the_file.seek(-(byte_block+total_bytes_scanned), 2)
        total_bytes_scanned += byte_block
        lines_found += the_file.read(1024).count('\n')
    the_file.seek(-total_bytes_scanned, 2)
    line_list = list(the_file.readlines())
    return line_list[-lines_2find:]

    #we read at least 21 line breaks from the bottom, block by block for speed
    #21 to ensure we don't get a half line

Hope this is useful.

There are some existing implementations of tail on PyPI which you can install using pip:

mtFileUtil
multtail
log4tailer
...

Depending on your situation, there may be advantages to using one of these existing tools.

Here is a pretty simple implementation:

with open('/etc/passwd', 'r') as f:
  try:
    f.seek(0,2)
    s = ''
    while s.count('\n') < 11:
      cur = f.tell()
      f.seek((cur - 10))
      s = f.read(10) + s
      f.seek((cur - 10))
    print s
  except Exception as e:
    f.readlines()

Simple:

with open("test.txt") as f:
    data = f.readlines()
tail = data[-2:]
print(''.join(tail))

For efficiency with very large files (common in log file situations where you may want to use tail), you generally want to avoid reading the whole file (even if you do so without reading the whole file into memory at once). However, you do need to somehow work out the offset in lines rather than characters. One possibility is reading backwards with seek() char by char, but this is very slow. Instead, it's better to process in larger blocks.

I have a utility function I wrote a while ago to read files backwards that can be used here.

import os, itertools

def rblocks(f, blocksize=4096):
    """Read file as series of blocks from end of file to start.

    The data itself is in normal order, only the order of the blocks is reversed.
    ie. "hello world" -> ["ld","wor", "lo ", "hel"]
    Note that the file must be opened in binary mode.
    """
    if 'b' not in f.mode.lower():
        raise Exception("File must be opened using binary mode.")
    size = os.stat(f.name).st_size
    fullblocks, lastblock = divmod(size, blocksize)

    # The first(end of file) block will be short, since this leaves 
    # the rest aligned on a blocksize boundary.  This may be more 
    # efficient than having the last (first in file) block be short
    f.seek(-lastblock,2)
    yield f.read(lastblock)

    for i in range(fullblocks-1,-1, -1):
        f.seek(i * blocksize)
        yield f.read(blocksize)

def tail(f, nlines):
    buf = ''
    result = []
    for block in rblocks(f):
        buf = block + buf
        lines = buf.splitlines()

        # Return all lines except the first (since may be partial)
        if lines:
            result.extend(lines[1:]) # First line may not be complete
            if(len(result) >= nlines):
                return result[-nlines:]

            buf = lines[0]

    return ([buf]+result)[-nlines:]


f=open('file_to_tail.txt','rb')
for line in tail(f, 20):
    print line

[Edit] Added more specific version (avoids need to reverse twice)

You can go to the end of your file with f.seek(0, 2) and then read off lines one by one with the following replacement for readline():

def readline_backwards(self, f):
    backline = ''
    last = ''
    while not last == '\n':
        backline = last + backline
        if f.tell() <= 0:
            return backline
        f.seek(-1, 1)
        last = f.read(1)
        f.seek(-1, 1)
    backline = last
    last = ''
    while not last == '\n':
        backline = last + backline
        if f.tell() <= 0:
            return backline
        f.seek(-1, 1)
        last = f.read(1)
        f.seek(-1, 1)
    f.seek(1, 1)
    return backline

Based on Eyecue's answer (Jun 10 '10 at 21:28): this class adds head() and tail() methods to the file object.

class File(file):
    def head(self, lines_2find=1):
        self.seek(0)                            #Rewind file
        return [self.next() for x in xrange(lines_2find)]

    def tail(self, lines_2find=1):  
        self.seek(0, 2)                         #go to end of file
        bytes_in_file = self.tell()             
        lines_found, total_bytes_scanned = 0, 0
        while (lines_2find+1 > lines_found and
               bytes_in_file > total_bytes_scanned): 
            byte_block = min(1024, bytes_in_file-total_bytes_scanned)
            self.seek(-(byte_block+total_bytes_scanned), 2)
            total_bytes_scanned += byte_block
            lines_found += self.read(1024).count('\n')
        self.seek(-total_bytes_scanned, 2)
        line_list = list(self.readlines())
        return line_list[-lines_2find:]

Usage:

f = File('path/to/file', 'r')
f.head(3)
f.tail(3)

Several of these solutions have issues if the file doesn't end in \n, or in ensuring the complete first line is read.

def tail(file, n=1, bs=1024):
    f = open(file)
    f.seek(-1,2)
    l = 1-f.read(1).count('\n') # If file doesn't end in \n, count it anyway.
    B = f.tell()
    while n >= l and B > 0:
            block = min(bs, B)
            B -= block
            f.seek(B, 0)
            l += f.read(block).count('\n')
    f.seek(B, 0)
    l = min(l,n) # discard first (incomplete) line if l > n
    lines = f.readlines()[-l:]
    f.close()
    return lines
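
For comparison, here is a minimal Python 3 sketch that handles both cases (a missing final newline and a possibly partial first line) by reading in binary mode until the buffer holds more than n newlines. The name `tail_bytes` and the default block size are assumptions, not part of the answer above.

```python
import os


def tail_bytes(path, n, block_size=1024):
    """Return the last n lines of `path` as a list of bytes objects."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Read blocks backwards until the buffer holds more than n newlines;
        # the extra newline guarantees the oldest wanted line is complete,
        # whether or not the file ends with a newline.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    return data.splitlines()[-n:]
```

Recounting newlines over the whole buffer on every pass is wasteful for very large n; it is kept here only to keep the sketch short.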

I had to read a specific value from the last line of a file, and stumbled upon this thread. Rather than reinventing the wheel in Python, I ended up with a tiny shell script, saved as /usr/local/bin/get_last_netp:

#! /bin/bash
tail -n1 /home/leif/projects/transfer/export.log | awk {'print $14'}

And in the Python program:

from subprocess import check_output

last_netp = int(check_output("/usr/local/bin/get_last_netp"))

Not the first example using a deque, but a simpler one. This one is general: it works on any iterable object, not just a file.

#!/usr/bin/env python
import sys
import collections
def tail(iterable, N):
    deq = collections.deque()
    for thing in iterable:
        if len(deq) >= N:
            deq.popleft()
        deq.append(thing)
    for thing in deq:
        yield thing
if __name__ == '__main__':
    for line in tail(sys.stdin,10):
        sys.stdout.write(line)

This is my version of tailf

import sys, time, os

filename = 'path to file'

try:
    with open(filename) as f:
        size = os.path.getsize(filename)
        if size < 1024:
            s = size
        else:
            s = 999
        f.seek(-s, 2)
        l = f.read()
        print l
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            print line
except IOError:
    pass

import time

attemps = 600
wait_sec = 5
fname = "YOUR_PATH"

with open(fname, "r") as f:
    where = f.tell()
    for i in range(attemps):
        line = f.readline()
        if not line:
            time.sleep(wait_sec)
            f.seek(where)
        else:
            print line, # already has newline

import itertools
fname = 'log.txt'
offset = 5
n = 10
with open(fname) as f:
    n_last_lines = list(reversed([x for x in itertools.islice(f, None)][-(offset+1):-(offset+n+1):-1]))

abc = "2018-06-16 04:45:18.68"
filename = "abc.txt"
with open(filename) as myFile:
    for num, line in enumerate(myFile, 1):
        if abc in line:
            lastline = num
print "last occurance of work at file is in "+str(lastline)

There is a very useful module that can do this:

from file_read_backwards import FileReadBackwards

with FileReadBackwards("/tmp/file", encoding="utf-8") as frb:

    # getting lines by lines starting from the last line up
    for l in frb:
        print(l)

On second thought, this is probably just as fast as anything here.

def tail( f, window=20 ):
    lines= ['']*window
    count= 0
    for l in f:
        lines[count%window]= l
        count += 1
    print lines[count%window:], lines[:count%window]

It's a lot simpler. And it does seem to rip along at a good pace.

I found this to be probably the easiest way to get the first or last N lines of a file.

Last N lines of a file (for example, N = 10):

N = 10
file = open("xyz.txt", 'r')
liner = file.readlines()
for ran in range(max(0, len(liner) - N), len(liner)):
    print(liner[ran])

First N lines of a file (for example, N = 10):

N = 10
file = open("xyz.txt", 'r')
liner = file.readlines()
for ran in range(min(N, len(liner))):  # range(0, N+1) would print N+1 lines
    print(liner[ran])

It's that simple:

def tail(fname, nl):
    with open(fname) as f:
        data = f.readlines()  # readlines returns a list
        print(''.join(data[-nl:]))

Although this isn't really efficient with big files, this code is pretty straightforward:

It reads the file object, f.
It splits the returned string on newlines, \n.
It gets the last elements of the resulting list, using a negative index to count from the end and : to take a slice.
```
def tail(f,n):
    return "\n".join(f.read().split("\n")[-n:])
```
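
One detail worth checking with this approach: when the text ends with a newline, split("\n") yields a trailing empty string that counts as one of the n items. The sketch below shows the difference using str.splitlines(); `tail_text` is a hypothetical name, and splitlines() also breaks on a few other line separators, which may or may not be what you want.

```python
def tail_text(text, n):
    """Return the last n lines of `text` as a list, ignoring a final newline."""
    return text.splitlines()[-n:] if n > 0 else []


sample = "a\nb\nc\n"
print(sample.split("\n")[-2:])  # ['c', '']
print(tail_text(sample, 2))     # ['b', 'c']
```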

Licensed under: CC-BY-SA with attribution
