Question

I'm new to Python, coming from the R world, and I'm working with large text files structured in data columns (this is lidar data, so generally 60+ million records).

Is it possible to change the field separator (e.g. from tab-delimited to comma-delimited) of such a large file without having to read the file in and run a for loop over the lines?


Solution

No. You need to:

  • Read the file in
  • Change separators for each line
  • Write each line back

This is easily doable with just a few lines of Python (not tested but the general approach works):

# Python - it's so readable, the code basically just writes itself ;-)
#
with open('infile') as infile:
  with open('outfile', 'w') as outfile:
    for line in infile:
      fields = line.split('\t')
      outfile.write(','.join(fields))

I'm not familiar with R, but if it has a library function for this it's probably doing exactly the same thing.

Note that this code only reads one line at a time from the file, so the file can be larger than the physical RAM - it's never wholly loaded in.

Other Tips

You can use the Linux tr command to replace any character with any other character.
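For instance (the filenames here are assumptions), a one-liner that streams the file through tr without ever loading it whole:

```shell
# Replace every tab with a comma, reading stdin and writing stdout;
# tr streams byte by byte, so the file size doesn't matter.
tr '\t' ',' < input.txt > output.txt
```

Note that tr does a blind character swap: it won't quote fields that already contain commas.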

Actually, let's say yes: you can do it without writing an explicit loop, e.g.:

with open('in') as infile:
  with open('out', 'w') as outfile:
      # split on tabs (the original used split('\n'), which was a bug);
      # writelines() also consumes the lazy map object in Python 3
      outfile.writelines(map(lambda line: ','.join(line.split('\t')), infile))

You can't, but I strongly advise you to check out generators.

The point is that you can write a faster, well-structured program without needing to load and store all the data in memory in order to process it.

For instance

file = open("bigfile")                   # open for reading; "w" would truncate it
some_other_file = open("outfile", "w")   # destination file (name assumed)
j = (i.split("\t") for i in file)        # generator: split each line on tabs
s = (",".join(i) for i in j)             # generator: re-join with commas
# and now magic happens
for i in s:
    some_other_file.write(i)

This code only uses enough memory to hold a single line at a time.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow