How to separate content from a file that is a container for binary and other forms of content

https://stackoverflow.com/questions/822161

03-07-2019
|

Question

I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off or identified within the container with SGML tags. With python I can easily separate the children files. However I am having trouble writing the binary content back out as a binary file (say a gif or jpg). In the simplest case the container might have an embedded html file followed by a graphic that is called by the html. I am assuming that my problem is because I am reading the original .txt file using open(filename,'r'). But that seems the only option to find the sgml tags to split the file.

I would appreciate any help to identify some relevant reading material.

I appreciate the suggestions but I am still struggling with the most basic questions. For example when I open the file with wordpad and scroll down to the section tagged as a gif I see this:

<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4@`("`@,#`P$!`0+^_OW]_?_#P\*"@H.#@X-#0T&!@8!`0
M$+"PL"`@('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?

I can handle finding the section easily enough but where does the gif file begin. Does the header start with 644, the blanks after the word begin or the line beginning with MITE?

Next, when the file is read into python does it do anything to the binary code that has to be undone when it is read back out?

I can find the lines where the graphics begin:

filerefbin=file('myfile.txt','rb')
wholeFile=filerefbin.read()
import re
graphicReg=re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics=graphicReg.finditer(wholeFile)
graphicsTags=[]
for match in locationGraphics:
    graphicsTags.append(match.span())

I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successefully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things so when I double click on h65803h6580301.gif when it has been isolated and saved I get to see the graphic.

Interestingly, when I open the file in rb, the line endings appear to still be present even though they don't seem to have any effect in notebpad. So that is clearly one of my problems I might need to readlines and join the lines together after stripping out the \n

I love this site and I love PYTHON

This was too easy once I read bendin's post. I just had to snip the section that began with the word begin and save that in a txt file and then run the following command:

import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')

I have to work with some other stuff for the rest of the day but I will post more here as I look at this more closely. The first thing I need to discover is how to use something other than a file, that is since I read the whole .txt file into memory and clipped out the section that has the image I need to work with the clipped section instead of writing it out to test2.txt. I am sure that can be done its just figuring out how to do it.

Solution

What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.

The module uu requires the use of temporary files for encoding and decoding. You can accomplish this without resorting to temporary files by using Python's codecs module like this:

import codecs

data       = "Let's just pretend that this is binary data, ok?"
uuencode   = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode   = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)

print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()

OTHER TIPS

You definitely need to be reading in binary mode if the content includes JPEG images.

As well, Python includes an SGML parser, http://docs.python.org/library/sgmllib.html .

There is no example there, but all you need to do is setup do_ methods to handle the sgml tags you wish.

You need to open(filename,'rb') to open the file in binary mode. Be aware that this will cause python to give You confusing, two-byte line endings on some operating systems.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow