Question

Not sure if this is even possible, but I have been looking at using Regex to find an email address that is encoded in Hex. This is to build up some of my automated forensic tools, but I am having problems writing a suitable Regex pattern.

Regex for email: /^([a-z0-9_.-]+)@([\da-z.-]+).([a-z.]{2,6})$/

Hex values:

@ = 40
. = 2E
.com = 636f6d
_ = 5f
A/a = 41/61 [1]
Z/z = 5a/7a
- = 2d

This is what I have got at the moment (it only takes into account lower case and .com). But it doesn't work! Have I messed something simple up?

"/^([61-7a]+)40([61-7a]+)23(636f6d)$/"

[1] I know email can only be lower case but I need to take uppercase into account too.


Solution

define classes

@ = 40
. = 2E
com = 636f6d
_ = 5f
A-Z = (4[1-9a-f]|5[0-9a])
a-z = (6[1-9a-f]|7[0-9a])
0-9 = (3[0-9])
- = 2d

substitute into your regex

/^([a-z0-9_.-]+)@([\da-z.-]+).([a-z.]{2,6})$/

/^(((4[1-9a-f]|5[0-9a])|(6[1-9a-f]|7[0-9a])|3[0-9]|5f|2E|2d)+)40((3[0-9]|(4[1-9a-f]|5[0-9a])|(6[1-9a-f]|7[0-9a])|2E|2d)+)2E(((4[1-9a-f]|5[0-9a])|(6[1-9a-f]|7[0-9a])){2,6})$/i

(the i flag lets the hex digits match in either case, so 2E and 2e are both accepted)

breaks down to...

/^
(
    (
        (4[1-9a-f]|5[0-9a])     // A-Z (hex 41-5a)
        |(6[1-9a-f]|7[0-9a])    // a-z (hex 61-7a)
        |3[0-9]                 // 0-9 (hex 30-39)
        |5f                     // _
        |2E                     // .
        |2d                     // -
    )+ // 1 or more times
) 
40                              // @
(
    (
        3[0-9]                  // 0-9 (hex 30-39)
        |(4[1-9a-f]|5[0-9a])    // A-Z (hex 41-5a)
        |(6[1-9a-f]|7[0-9a])    // a-z (hex 61-7a)
        |2E                     // .
        |2d                     // -
    )+ // 1 or more times
)
2E                              // .
(
    (
        (4[1-9a-f]|5[0-9a])     // A-Z (hex 41-5a)
        |(6[1-9a-f]|7[0-9a])    // a-z (hex 61-7a)
    ){2,6} // between 2 and 6 times
)$/i
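
For what it's worth, the same substitution can be built up from named pieces in Python rather than typed out by hand. This is only a sketch: the names UPPER, LOWER, DIGIT, to_hex and is_hex_email are made up for illustration, and it assumes the input is a plain string of hex digits with digits encoded as 30-39 and letters as 41-5a / 61-7a.

```python
import re

# Each alternative consumes exactly two hex digits, i.e. one encoded byte.
UPPER = r"4[1-9a-f]|5[0-9a]"   # A-Z is hex 41-5a
LOWER = r"6[1-9a-f]|7[0-9a]"   # a-z is hex 61-7a
DIGIT = r"3[0-9]"              # 0-9 is hex 30-39
LETTER = f"{UPPER}|{LOWER}"

LOCAL = f"(?:{LETTER}|{DIGIT}|5f|2e|2d)+"   # letters, digits, _ . -
DOMAIN = f"(?:{LETTER}|{DIGIT}|2e|2d)+"     # letters, digits, . -
TLD = f"(?:{LETTER}){{2,6}}"                # 2 to 6 letters

# IGNORECASE so that upper-case hex digits (2E, 6F, ...) match as well.
HEX_EMAIL = re.compile(f"^({LOCAL})40({DOMAIN})2e({TLD})$", re.IGNORECASE)


def to_hex(text):
    """Hypothetical helper: encode ASCII text as a lower-case hex string."""
    return text.encode("ascii").hex()


def is_hex_email(hex_string):
    """Return True if the whole hex string decodes to something email-shaped."""
    return HEX_EMAIL.match(hex_string) is not None


print(is_hex_email(to_hex("john.doe@example.com")))
print(is_hex_email(to_hex("not an email")))
```

Because every alternative is two hex digits wide and the pattern is anchored at the start, the match stays aligned on byte boundaries.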

OTHER TIPS

I think that you are approaching the problem wrong. Assuming that you are using the standard hex-char equivalencies, you should convert the email out of hex first, then use the email regex. This can be done by simply processing the hex string two chars at a time, and using chr(int(piece, 16)) on each piece.
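
A minimal sketch of that decode-first approach, assuming the hex is a plain string of ASCII byte values. The function names are made up for illustration, and the email pattern is the one from the question with two adjustments that are mine, not the asker's: the dot before the TLD is escaped and matching is case-insensitive.

```python
import re

# The question's email regex, with the final dot escaped and IGNORECASE added.
EMAIL = re.compile(r"^([a-z0-9_.-]+)@([\da-z.-]+)\.([a-z.]{2,6})$", re.IGNORECASE)


def decode_hex(hex_string):
    """Decode a hex string to text.

    Same result as "".join(chr(int(piece, 16)) for each two-char piece)
    when every byte is ASCII.
    """
    return bytes.fromhex(hex_string).decode("ascii")


def match_hex_email(hex_string):
    """Return the decoded email if the hex string holds one, else None."""
    try:
        text = decode_hex(hex_string)
    except ValueError:  # bad hex digits or non-ASCII bytes
        return None
    m = EMAIL.match(text)
    return m.group(0) if m else None


print(match_hex_email("john@example.com".encode("ascii").hex()))
```

This keeps the regex readable, and the hex handling is left to bytes.fromhex.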

I think that you need to look at the documentation for regular expressions in Python (http://docs.python.org/2/library/re.html).

For example, [61-7a] will match any of 6, 1-7 or a.
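
A quick way to see this for yourself (a small sketch, with variable names chosen just for this example): a character class only ever matches one character, so a two-digit hex range has to be written as an alternation of two-character patterns instead.

```python
import re

# [61-7a] is a set of single characters: '6', the range '1'-'7', and 'a'.
char_class = re.compile(r"[61-7a]")
matched = [c for c in "0123456789abcdef" if char_class.fullmatch(c)]
print(matched)

# It can never match a two-character pair such as "6f".
print(char_class.fullmatch("6f"))

# The hex range 61-7a written as pairs: 61-6f, then 70-7a.
pair = re.compile(r"6[1-9a-f]|7[0-9a]")
print(bool(pair.fullmatch("6f")), bool(pair.fullmatch("7b")))
```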

Slightly away from what you're looking to achieve but take a look at Bulk Extractor which parses through a drive and carves out email addresses and lists them in popularity order.

I can't post a comment anywhere since this question is already answered, I guess, but I think this needs to be said.

The approach you are taking is actually worse than converting each individual character into its ASCII equivalent. You are actually converting each byte into 2 ASCII characters.

Just to reference a part of the REGEX pattern that you posted as a final/working pattern: 4[0-9a-fA-F] You are attempting to find chars @ABCDEFGHIJKLMNO. You have a-f and A-F because you are trying to account for the Hex code being stored in upper case or lower case. Hex code on a hard drive does not get stored in upper or lower case (it doesn't even get stored in hex codes). You are accounting for whatever tool is presenting this data to you - in ASCII.

What tool are you using to access this data?

If you are using python to read a dd image file, then you need to use a regex that goes after the raw data. That would be something like [\x40-\x4f] to replicate the above. This is all unnecessary though because [@-O] would accomplish the same thing.
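
As a rough sketch of what going after the raw data could look like in Python: the pattern below is a bytes pattern applied directly to the bytes read from the image. The sample buffer, the function name and the simplified email pattern are assumptions for illustration only; a real image would be read from the file in binary mode.

```python
import re

# Bytes pattern: letters, digits, _ . - around a literal @ (simplified).
RAW_EMAIL = re.compile(rb"[A-Za-z0-9_.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}")


def find_emails(data):
    """Return every email-shaped run of bytes found in a raw buffer."""
    return [m.group(0).decode("ascii") for m in RAW_EMAIL.finditer(data)]


# Made-up stand-in for a chunk of a dd image.
sample = b"\x00\x01junk\xffjohn.doe@example.com\x00\x00more\x10Admin@Host.org\xfe"
print(find_emails(sample))

# [\x40-\x4f] and [@-O] describe the same set of byte values.
by_hex = re.compile(rb"[\x40-\x4f]")
by_char = re.compile(rb"[@-O]")
same = all(
    bool(by_hex.fullmatch(bytes([b]))) == bool(by_char.fullmatch(bytes([b])))
    for b in range(256)
)
print(same)
```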

I am not sure exactly how you are testing this, but I suspect that you are pasting hex codes into an online REGEX testing engine. That testing engine is then interpreting those hex codes as 2 individual characters, not as a pair of nibbles from a byte.

Licensed under: CC-BY-SA with attribution