I am working on a project where I am have image files that have been malformed (fuzzed i.e their image data have been altered). These files when rendered on various platforms lead to warning/crash/pass report from the platform.

I am trying to build a shield using unsupervised machine learning that will help me identify/classify these images as malicious or not. I have the binary data of these files, but I have no clue of what featureSet/patterns I can identify from this, because visually these images could be anything. (I need to be able to find feature set from the binary data)

I need some advise on the tools/methods I could use for automatic feature extraction from this binary data; feature sets which I can use with unsupervised learning algorithms such as Kohenen's SOM etc.

I am new to this, any help would be great!

有帮助吗?

解决方案

I do not think this is feasible.

The problem is that these are old exploits, and training on them will not tell you much about future exploits. Because this is an extremely unbalanced problem: no exploit uses the same thing as another. So even if you generate multiple files of the same type, you will in the end have likely a relevant single training case for example for each exploit.

Nevertheless, what you need to do is to extract features from the file meta data. This is where the exploits are, not in the actual image. As such, parsing the files is already much the area where the problem is, and your detection tool may become vulnerable to exactly such an exploit.

As the data may be compressed, a naive binary feature thing will not work, either.

其他提示

You probably don't want to look at the actual pixel data at all since the corruption most (almost certain) lay in the file header with it's different "chunks" (example for png, works differently but in the same way for other formats):

http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header

It should be straight forward to choose features, make a program that reads all the header information from the file and if the information is missing and use this information as features. Still will be much smaller then the unnecessary raw image data.

Oh, and always start out with simpler algorithms like pca together with kmeans or something, and if they fail you should bring out the big guns.

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top