سؤال

EDIT: I should note that I want a general case for any hex array, not just the google one I provided.

EDIT BACKGROUND: Background is networking: I'm parsing a DNS packet and trying to get its QNAME. I'm taking in the whole packet as a string, and every character represents a byte. Apparently this problem looks like a Pascal string problem, and using the struct module seems like the way to go.

I have a char array in Python 2.7 which includes octal values. For example, let's say I have an array

DNS = "\03www\06google\03com\0"

I want to get:

www.google.com

What's an efficient way to do this? My first thought would be iterating through the DNS char array and adding chars to my new array answer. Every time i see a '\' char, I would ignore the '\' and two chars after it. Is there a way to get the resulting www.google.com without using a new array?

my disgusting implementation (my answer is an array of chars, which is not what i want, i want just the string www.google.com:

DNS = "\\03www\\06google\\03com\\0"
answer = []
i = 0
while i < len(DNS):
    if DNS[i] == '\\' and DNS[i+1] != 0:
        i += 3    
    elif DNS[i] == '\\' and DNS[i+1] == 0:
        break
    else:
        answer.append(DNS[i])
        i += 1
هل كانت مفيدة؟

المحلول

Now that you've explained your real problem, none of the answers you've gotten so far will work. Why? Because they're all ways to remove sequences like \03 from a string. But you don't have sequences like \03, you have single control characters.

You could, of course, do something similar, just replacing any control character with a dot.

But what you're really trying to do is not replace control characters with dots, but parse DNS packets.

DNS is defined by RFC 1035. The QNAME in a DNS packet is:

a domain name represented as a sequence of labels, where each label consists of a length octet followed by that number of octets. The domain name terminates with the zero length octet for the null label of the root. Note that this field may be an odd number of octets; no padding is used.

So, let's parse that. If you understand how "labels consisting of "a length octet followed by that number of octets" relates to "Pascal strings", there's a quicker way. Also, you could write this more cleanly and less verbosely as a generator. But let's do it the dead-simple way:

def parse_qname(packet):
    components = []
    offset = 0
    while True:
        length, = struct.unpack_from('B', packet, offset)
        offset += 1
        if not length:
            break
        component = struct.unpack_from('{}s'.format(length), packet, offset)
        offset += length
        components.append(component)
    return components, offset

نصائح أخرى

import re
DNS = "\\03www\\06google\\03com\\0"
m = re.sub("\\\\([0-9,a-f]){2}", "", DNS)
print(m)

Maybe something like this?

#!/usr/bin/python3

import re

def convert(adorned_hostname):
    result1 = re.sub(r'^\\03', '', adorned_hostname )
    result2 = re.sub(r'\\0[36]', '.', result1)
    result3 = re.sub(r'\\0$', '', result2)
    return result3

def main():
    adorned_hostname = r"\03www\06google\03com\0"
    expected_result = 'www.google.com'
    actual_result = convert(adorned_hostname)
    print(actual_result, expected_result)
    assert actual_result == expected_result

main()

For the question as originally asked, replacing the backslash-hex sequences in strings like "\\03www\\06google\\03com\\0" with dots…

If you want to do this with a regular expression:

  • \\ matches a backslash.
  • [0-9A-Fa-f] matches any hex digit.
  • [0-9A-Fa-f]+ matches one or more hex digits.
  • \\[0-9A-Fa-f]+ matches a backslash followed by one or more hex digits.

You want to find each such sequence, and replace it with a dot, right? If you look through the re docs, you'll find a function called sub which is used for replacing a pattern with a replacement string:

re.sub(r'\\[0-9A-Fa-f]+', '.', DNS)

I suspect these may actually be octal, not hex, in which case you want [0-7] rather than [0-9A-Fa-f], but nothing else would change.


A different way to do this is to recognize that these are valid Python escape sequences. And, if we unescape them back to where they came from (e.g., with DNS.decode('string_escape')), this turns into a sequence of length-prefixed (aka "Pascal") strings, a standard format that you can parse in any number of ways, including the stdlib struct module. This has the advantage of validating the data as you read it, and not being thrown off by any false positives that could show up if one of the string components, say, had a backslash in the middle of it.

Of course that's presuming more about the data. It seems likely that the real meaning of this is "a sequence of length-prefixed strings, concatenated, then backslash-escaped", in which case you should parse it as such. But it could be just a coincidence that it looks like that, in which case it would be a very bad idea to parse it as such.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top