Question

We have a very old file delivery application (IPGear, if you have heard of it, written in Tcl). We upload our IP files there and our customers download them from the system.

When you upload a file to this application, it adds a .RCA extension to the uploaded file and prepends some metadata to it. If we view the content of any such file in a text editor (usually tgz, pdf and text files), we see the metadata the application added at the top of the file (5-10 lines, readable).

If you download a file from the system, it somehow strips this metadata from the file and returns it as a TGZ file, which works fine (we can extract it).

If we find that RCA file on the storage where the application keeps its files and edit out the metadata it added using a text editor, we are able to extract the file without any problem, which is fine too. But we need to do this for 22K files, so we need to script it.

We are able to find the bits the application adds by opening the file with a StreamReader, strip the metadata, and write the file to disk with a StreamWriter. However, the file we write is corrupted if it is a TGZ file. If we do the same thing for text files, they work.

The content of the tgz file looks like this when we open it in a text editor:

[Screenshot: TGZ content]

The bits on lines 29-38 are the metadata we strip.

It looks like the StreamReader is not able to write this content back to disk, even though we tried different encoding settings.

One more note: the file we are trying to read and write was copied from a Solaris-based server to a local machine (Windows 7) via WinSCP.

So, my question is: what is the best way of reading a TGZ file into memory (as text) for manipulation, and saving it back without corruption? Are StreamReader and StreamWriter not good for this purpose?

I tried to give as much information as I can; please add comments if you need more clarification.


Solution

"It looks like the StreamReader is not able to write this content back to disk, even though we tried different encoding settings."

Yes, because a tgz file isn't plain text. StreamReader and StreamWriter are for text content, not arbitrary binary content.
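To make the corruption concrete, here is a small sketch in Python (not .NET, so it only illustrates the principle rather than the exact behaviour of StreamReader). It assumes a UTF-8 decoder that substitutes a replacement character for invalid input; the six sample bytes are made up to resemble the start of a gzip stream.

```python
# A few bytes resembling the start of a gzip stream (made-up sample data).
original = bytes([0x1F, 0x8B, 0x08, 0x00, 0xFF, 0xFE])

# Decoding as text replaces every byte sequence that is not valid UTF-8 ...
as_text = original.decode("utf-8", errors="replace")

# ... so encoding the text again does not give back the original bytes.
round_tripped = as_text.encode("utf-8")

print(original.hex())
print(round_tripped.hex())
print(round_tripped == original)  # False: the binary content was altered
```

Once a byte has been replaced during decoding, no choice of output encoding can restore it, which would explain why changing encoding settings did not help.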

"So, my question is: what is the best way of reading a TGZ file into memory (as text)"

You don't. You read it as binary data, because it is binary data.

If the TGZ archive contains text files, you'll need to decompress the TGZ to the TAR format, then extract the relevant data from that. Then you can work with it as text. Before that point, it's just binary data.
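As a rough illustration of that decompress-then-extract step, the following Python sketch uses the standard `tarfile` module. The helper names `read_text_member` and `make_sample_tgz`, the member name `notes.txt`, and the UTF-8 encoding are all assumptions made for the example; nothing here is specific to IPGear.

```python
import io
import tarfile


def read_text_member(tgz_bytes: bytes, member_name: str, encoding: str = "utf-8") -> str:
    """Decompress a .tgz held in memory and return one member decoded as text."""
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as archive:
        extracted = archive.extractfile(member_name)
        if extracted is None:
            raise ValueError(f"{member_name} is not a regular file in the archive")
        # Only now, after decompression and extraction, is the data treated as text.
        return extracted.read().decode(encoding)


def make_sample_tgz() -> bytes:
    """Build a tiny .tgz in memory so the example is self-contained."""
    payload = b"hello from inside the archive\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="notes.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


sample = make_sample_tgz()
print(read_text_member(sample, "notes.txt"))
```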

But it sounds like you actually may just want to read text information before the TGZ file... in which case you need to work out where that text information ends, and not read any of the TGZ file as text (because it's not). This is non-trivial, but if you know that the text is in ASCII it'll be a bit easier - you will need to work out how to detect the end of the text and the start of the real content though, and we can't really tell that from the screenshot you've given.
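One possible way to find that boundary, sketched in Python, is to read the whole file as bytes and look for the gzip signature (the bytes `1f 8b` followed by `08` for the deflate method, per the gzip format). This rests on assumptions that cannot be confirmed from the question: that the payload really is gzip, that the metadata is plain readable text which never contains those three bytes, and that nothing is appended after the payload. The header lines in the demo are invented, and `strip_text_header` and `strip_file` are hypothetical helper names.

```python
import gzip
from pathlib import Path

# gzip ID bytes plus the deflate method byte; assumed to mark the payload start.
GZIP_MAGIC = b"\x1f\x8b\x08"


def strip_text_header(data: bytes) -> bytes:
    """Return everything from the first gzip signature onward."""
    start = data.find(GZIP_MAGIC)
    if start == -1:
        raise ValueError("no gzip signature found; the payload may not be gzip")
    return data[start:]


def strip_file(source: Path, target: Path) -> None:
    # read_bytes / write_bytes never decode the content, so the payload is untouched.
    target.write_bytes(strip_text_header(source.read_bytes()))


# Demo with an invented header in front of a real gzip stream.
payload = gzip.compress(b"archive body")
wrapped = b"Hypothetical-Header: value\nAnother-Line: 42\n" + payload
recovered = strip_text_header(wrapped)
print(recovered == payload)
print(gzip.decompress(recovered))
```

For the pdf and text files mentioned in the question this signature search would not apply; those would need a different end-of-header rule, such as a known line count or delimiter, which the question does not specify. The same byte-oriented approach carries over to .NET by reading and writing byte arrays instead of strings.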

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow