Question

I want to scrape string data from some binary files that contain embedded SQL statements. I don't need any fancy cleanup, just some way to extract the readable text. I'm using VB.NET, but a call to an external utility would work too.


Solution 4

Thanks all. Great ideas. Really helped me think. Upvotes all around. It turned out I didn't need to be very sure that they were strings, so I went with a quick, sloppy, ugly hack.

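 'strip out non-string characters
 Dim newByteArray(byteArray.Length - 1) As Byte
 Dim i As Integer = 0
 For Each b As Byte In byteArray
     'keep tab, LF, CR and printable ASCII (32 to 126)
     If b = 9 OrElse b = 10 OrElse b = 13 OrElse (b > 31 AndAlso b < 127) Then
         newByteArray(i) = b
         i += 1
     End If
 Next

 'move it into a string, using only the bytes that were kept
 resultString = System.Text.Encoding.ASCII.GetString(newByteArray, 0, i)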

OTHER TIPS

The GNU strings utility has been around forever and does more-or-less exactly this by using a heuristic to yank any data that "looks like a string" from a binary.
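If installing an external tool is not an option, the same kind of heuristic can be roughly approximated in VB.NET. The following is only a sketch, not how GNU strings is actually implemented: it assumes single-byte ASCII text and keeps every run of at least four printable bytes (four is the default minimum length in GNU strings). The module name `StringScraper`, the function name `ExtractAsciiRuns` and the parameter `minLength` are made up for illustration.

```vbnet
Imports System.Collections.Generic
Imports System.Text

Module StringScraper
    ' Hypothetical helper: returns every run of at least minLength
    ' printable ASCII bytes (tab and 32 to 126) found in data.
    Function ExtractAsciiRuns(data As Byte(), Optional minLength As Integer = 4) As List(Of String)
        Dim results As New List(Of String)
        Dim current As New StringBuilder()
        For Each b As Byte In data
            If b = 9 OrElse (b > 31 AndAlso b < 127) Then
                current.Append(ChrW(b))
            Else
                ' A non-printable byte ends the current run;
                ' keep the run only if it is long enough.
                If current.Length >= minLength Then results.Add(current.ToString())
                current.Clear()
            End If
        Next
        ' Flush a run that reaches the end of the data.
        If current.Length >= minLength Then results.Add(current.ToString())
        Return results
    End Function
End Module
```

It could be called with something like `ExtractAsciiRuns(System.IO.File.ReadAllBytes(path))`, where `path` points at the binary file. Unlike the byte filter in the accepted answer, short accidental matches (one or two printable bytes in the middle of binary data) are dropped, at the cost of also losing genuine strings shorter than `minLength`.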

Grab the GNU binutils (including strings) for Win32 from MinGW: http://sourceforge.net/projects/mingw/files/.

This is not as trivial as it may seem at first. A string can be encoded in many ways. It depends on what you consider "readable text" and on what the unreadable parts look like. Say it looks like this:

 &8)JÓxZZ`\■£ÌS?E?L?E?C?T?*?F?R?O?M?m?y?T?b?l?§ıÍ4¢

then you are lucky, because it is likely encoded in UTF-16 or another multibyte encoding, and these are rather trivial to recognize. But in just about all other cases (UTF-8, ISO-8859-1, Windows-1252) it is next to impossible to tell whether an individual byte is text or non-text, unless you know a fair deal about how a piece of "readable text" starts and how it ends.
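For the "lucky" case, where the sample shows every letter followed by a filler byte, a scan along these lines might work. It is a sketch under two assumptions that may not hold for your file: the text is UTF-16 little-endian, and it only contains ASCII-range characters, so each printable byte is followed by a zero byte. The names `Utf16Scraper`, `ExtractUtf16Runs` and `minLength` are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports System.Text

Module Utf16Scraper
    ' Hypothetical helper: finds runs of printable ASCII characters stored
    ' as UTF-16LE, i.e. each printable byte followed by a zero byte.
    Function ExtractUtf16Runs(data As Byte(), Optional minLength As Integer = 4) As List(Of String)
        Dim results As New List(Of String)
        Dim current As New StringBuilder()
        Dim i As Integer = 0
        While i < data.Length - 1
            Dim low As Byte = data(i)
            If (low = 9 OrElse (low > 31 AndAlso low < 127)) AndAlso data(i + 1) = 0 Then
                ' Printable byte plus zero byte: one UTF-16LE character.
                current.Append(ChrW(low))
                i += 2
            Else
                ' Not a character pair: close the current run and move on
                ' by a single byte, so odd alignment is handled too.
                If current.Length >= minLength Then results.Add(current.ToString())
                current.Clear()
                i += 1
            End If
        End While
        ' Flush a run that reaches the end of the data.
        If current.Length >= minLength Then results.Add(current.ToString())
        Return results
    End Function
End Module
```

Advancing one byte at a time after a mismatch means the string does not have to start at an even offset in the file. Text outside the ASCII range would be skipped by this sketch.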

The point is: anything is allowed and considered readable text. UTF-8, ASCII and Windows-1252 even allow NUL characters (while some programming languages don't). Here's a thread that gives a VB example of how you can proceed; it might give you some hints.

PS: analyzing this type of data can be hard; it would help a great deal if you could upload your file somewhere so we can have a look.

Licensed under: CC-BY-SA with attribution