How to derive KDD99 Features from DARPA pcap file? [closed]

Question

Be careful with this data set.

http://www.kdnuggets.com/news/2007/n18/4i.html

Some excerpts:

the artificial data was generated using a closed network, some proprietary network traffic generators, and hand-injected attacks

Among the issues raised, the most important seemed to be that no validation was ever performed to show that the DARPA dataset actually looked like real network traffic.

In 2003, Mahoney and Chan built a trivial intrusion detection system and ran it against the DARPA tcpdump data. They found numerous irregularities, including that -- due to the way the data was generated -- all the malicious packets had a TTL of 126 or 253 whereas almost all the benign packets had a TTL of 127 or 254.

the DARPA dataset (and by extension, the KDD Cup '99 dataset) was fundamentally broken, and one could not draw any conclusions from any experiments run using them

we strongly recommend that (1) all researchers stop using the KDD Cup '99 dataset
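To illustrate the TTL irregularity in the Mahoney and Chan excerpt, here is a minimal sketch of a "detector" that looks only at the IP TTL field. The TTL values come from the excerpt above. The function name `classify_by_ttl` and the sample packets are made up for illustration and are not taken from their paper.

```python
# TTL values the excerpt reports for the DARPA tcpdump data.
MALICIOUS_TTLS = {126, 253}
BENIGN_TTLS = {127, 254}


def classify_by_ttl(ttl: int) -> str:
    """Label a packet using only its IP TTL.

    This exploits a simulation artifact; it says nothing about real attacks.
    """
    if ttl in MALICIOUS_TTLS:
        return "attack"
    if ttl in BENIGN_TTLS:
        return "normal"
    return "unknown"


# Hypothetical (ttl, true_label) pairs, invented for this example.
packets = [(126, "attack"), (127, "normal"), (253, "attack"), (254, "normal")]
predictions = [classify_by_ttl(ttl) for ttl, _ in packets]
correct = sum(pred == label for pred, (_, label) in zip(predictions, packets))
```

Any classifier trained on this data can pick up such an artifact instead of actual attack behaviour, which is why results obtained on it say little about real traffic.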

As for the feature extraction used: IIRC, the majority of features were simply attributes of the parsed IP/TCP/UDP headers, such as port number, last octet of the IP address, and some packet flags.
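As a rough sketch of what pulling such header attributes out of a packet could look like, the following uses only Python's `struct` module on a raw IPv4 packet. It is not the original KDD99 extraction pipeline. The function name `extract_header_features`, the feature names, and the sample packet are assumptions for illustration. It also assumes the link-layer header (e.g. Ethernet) has already been stripped, which is not the case for records read straight from a pcap file.

```python
import struct

IP_FMT = "!BBHHHBBH4s4s"   # fixed 20-byte IPv4 header
TCP_FMT = "!HHLLBBHHH"     # fixed 20-byte TCP header


def extract_header_features(packet: bytes) -> dict:
    """Return a few header-level attributes of a raw IPv4 packet."""
    (version_ihl, _tos, _length, _ident, _frag,
     ttl, proto, _checksum, src, dst) = struct.unpack(IP_FMT, packet[:20])
    ip_header_len = (version_ihl & 0x0F) * 4  # IHL is counted in 32-bit words

    features = {
        "ttl": ttl,
        "protocol": proto,            # 6 = TCP, 17 = UDP
        "src_last_octet": src[3],
        "dst_last_octet": dst[3],
    }

    payload = packet[ip_header_len:]
    if proto == 6 and len(payload) >= 20:
        (sport, dport, _seq, _ack, _offset,
         flags, _window, _tcp_checksum, _urgent) = struct.unpack(TCP_FMT, payload[:20])
        features.update(
            src_port=sport,
            dst_port=dport,
            fin=bool(flags & 0x01),
            syn=bool(flags & 0x02),
            rst=bool(flags & 0x04),
            ack=bool(flags & 0x10),
        )
    elif proto == 17 and len(payload) >= 8:
        sport, dport = struct.unpack("!HH", payload[:4])
        features.update(src_port=sport, dst_port=dport)
    return features


# Hypothetical TCP SYN packet to port 80; checksums are left at zero.
ip_header = struct.pack(IP_FMT, 0x45, 0, 40, 1, 0, 127, 6, 0,
                        bytes([172, 16, 112, 50]), bytes([192, 168, 1, 10]))
tcp_header = struct.pack(TCP_FMT, 1234, 80, 0, 0, 5 << 4, 0x02, 8192, 0, 0)
features = extract_header_features(ip_header + tcp_header)
```

For real capture files, a pcap reader would have to supply the packet bytes, and any per-connection statistics would need to be aggregated on top of these per-packet values.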

In any case, these features no longer reflect realistic attacks. Today's TCP/IP stacks are much more robust than they were when the data set was created, when a "ping of death" would instantly lock up a Windows host. By now, every developer of a TCP/IP stack should be aware of the risk of such malformed packets and stress-test the stack against them.

As a result, these features have become pretty much meaningless. Incorrectly set SYN flags and the like are no longer used in network attacks. Modern attacks are much more sophisticated and most likely no longer target the TCP/IP stack, but the services running on the layer above it. So I would not bother finding out which low-level packet flags were used in that flawed '99 simulation of attacks that worked in the early '90s.