Python - parse IPv4 addresses from string (even when censored)

Question 1

The code below will...

find IPs in strings even when censored (ex: 192.168.1[dot]20 or 10.10.10 .21)
place them into a list
clean them of the censorship (spaces/brackets/parentheses)
and replace the uncleaned list entry with the cleaned one.

Caveat: The code below does not reject invalid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it drops the trailing part (the 6 and the .3 in those examples) and returns the rest as if it were a valid address. If the first octet is invalid (ex: 256.10.10.10), it drops the leading digit (resulting in 56.10.10.10).


import re

def extractIPs(fileContent):
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    # findall returns one tuple per match; its first element is the outermost group (the whole IP).
    ips = [each[0] for each in re.findall(pattern, fileContent)]
    cleaned = []
    for item in ips:
        # Strip the censoring characters, then turn "dot" back into ".".
        ip = re.sub(r"[ ()\[\]]", "", item)
        ip = re.sub("dot", ".", ip)
        cleaned.append(ip)
    return cleaned


# The with block closes the file once it has been read.
with open('***INSERT FILE PATH HERE***') as myFile:
    fileContent = myFile.read()

IPs = extractIPs(fileContent)
print("Original file content:\n{0}".format(fileContent))
print("--------------------------------")
print("Parsed results:\n{0}".format(IPs))
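
One way to address the caveat above is to add boundary checks around the pattern, so that a bad candidate is skipped entirely instead of being trimmed. This is only a sketch: the names OCTET, SEP, STRICT_IP and extract_strict_ips are made up for illustration, and the lookarounds reflect an assumption about what should count as invalid (a digit or dot directly before the match, or a digit or an extra .N group directly after it).

```python
import re

# Hypothetical building blocks, reusing the octet and separator from the pattern above.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"

STRICT_IP = re.compile(
    r"(?<![\d.])"                              # no digit or dot right before the match
    + OCTET + r"(?:" + SEP + OCTET + r"){3}"   # four octets joined by three separators
    + r"(?!\d|\.\d)"                           # no extra digit or ".N" group right after
)


def extract_strict_ips(text):
    """Return cleaned IPs, skipping candidates that fail the boundary checks."""
    cleaned = []
    for match in STRICT_IP.finditer(text):
        ip = re.sub(r"[ ()\[\]]", "", match.group(0))  # strip censoring characters
        cleaned.append(ip.replace("dot", "."))
    return cleaned
```

With this pattern, 192.168.0.256, 192.168.1.2.3 and 256.10.10.10 produce no match at all, while an address followed by a sentence period (as in 8.8.8.8.) is still accepted.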

Question 2

Here is a regex that works:

import re
pattern = r"(((25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])[ (\[]?(\.|dot)[ )\]]?){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9]))"
text = "The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. "
ips = [match[0] for match in re.findall(pattern, text)]
print(ips)

# output: ['192.168.1.1', '8.8.8.8', '101.099.098.000', '192.168.1[.]1', '192.168.1(.)1', '192.168.1[dot]1', '192.168.1(dot)1', '192 .168 .1 .1', '192. 168. 1. 1']

The regex has a few main parts, which I will explain here:

(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])
This matches the numerical parts of the IP address. | means "or". The first two cases handle numbers from 250 to 255 and from 200 to 249. The last case handles numbers from 0 to 199, with or without leading zeroes. The order matters: alternatives are tried left to right, so if the 0 to 199 case came first, a final octet such as 222 would be cut short to 22.
[ (\[]?(\.|dot)[ )\]]?
This matches the "dot" parts. There are three sub-components:
- [ (\[]? The "prefix" for the dot. Either a space, an open paren, or an open square bracket. The trailing ? means that this part is optional.
- (\.|dot) Either "dot" or a period.
- [ )\]]? The "suffix". Same logic as the prefix.
{3} means repeat the previous component 3 times.
The final element is another number, which is the same as the first, except it is not followed by a dot.
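
To make the breakdown above concrete, here is a small sketch that builds the same kind of pattern from its parts. The names OCTET, DOT and PATTERN and the sample string are mine, not from the answer, and the groups are written as non-capturing so that finditer can return the whole match directly. The octet alternatives are ordered most specific first so that a final octet such as 222 is not cut short.

```python
import re

# The numerical part: 250-255, then 200-249, then 0-199 (leading zeroes allowed).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"

# The "dot" part: optional prefix, a period or the word "dot", optional suffix.
DOT = r"[ (\[]?(?:\.|dot)[ )\]]?"

# Three "octet + dot" repetitions, then a final octet with no dot after it.
PATTERN = re.compile(r"(?:" + OCTET + DOT + r"){3}" + OCTET)

sample = "Seen as 192.168.1[.]222, 10.0.0(dot)7 and 8.8.8.8 today."
found = [m.group(0) for m in PATTERN.finditer(sample)]
print(found)  # ['192.168.1[.]222', '10.0.0(dot)7', '8.8.8.8']
```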

Question 3

Description

This regex will match each of the four octets of what looks like an IP address. Each octet is placed into its own capture group for collection. The 25[0-5] alternative is listed first so that a final octet such as 255 is not cut short to 25.

(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])\D{1,5}(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])\D{1,5}(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])\D{1,5}(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])


Given the following sample text, this regex will match all 10 embedded IP strings in their entirety, including the first one. Working example: http://www.rubular.com/r/1MbGZOhuj5

The following are IP addresses: 192.168.1.222, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).

The resulting matches could be iterated over and a properly formatted IP string could be constructed by joining the 4 capture groups with a dot.
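
A minimal sketch of that last step in Python, assuming the four-capture-group approach described above. The names OCTET, PATTERN and rebuild_ips are hypothetical and only used for illustration.

```python
import re

# One capture group per octet, most specific alternative first.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"

# Four octets separated by one to five non-digit characters.
PATTERN = re.compile(r"\D{1,5}".join([OCTET] * 4))


def rebuild_ips(text):
    """Join the four captured octets of every match with a plain dot."""
    return [".".join(match.groups()) for match in PATTERN.finditer(text)]


sample = "192.168.1[.]1 or 192 .168 .1 .1 or 192[dot]168(dot)1.255"
print(rebuild_ips(sample))  # ['192.168.1.1', '192.168.1.1', '192.168.1.255']
```

Keep in mind that \D{1,5} is deliberately loose: a plain list of numbers such as "1, 2, 3, 4" would also be rebuilt into 1.2.3.4, so some extra filtering may be needed depending on the input.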

Question 4

Extract and Categorize IPv4 Addresses (Even When Censored)

Note: This is just an implementation of a class I wrote for extracting IPv4 Addresses. I will likely update my class with a method for this functionality in the future. You can find it on my GitHub page.

What I'm demonstrating below is the following:

Cleaning up your string content example
Bringing your string data into a list
Using the ExtractIPs() class to parse and categorize IPv4 Addresses
- This class returns a dictionary containing 4 lists:
  - Valid IPv4 Addresses
  - Public IPv4 Addresses
  - Private IPv4 Addresses
  - Invalid IPv4 Addresses

ExtractIPs class

#!/usr/bin/env python

"""Extract and Classify IP Addresses."""

import re  # Use Regular Expressions.


__program__ = "IPAddresses.py"
__author__ = "Johnny C. Wachter"
__copyright__ = "Copyright (C) 2014 Johnny C. Wachter"
__license__ = "MIT"
__version__ = "0.0.1"
__maintainer__ = "Johnny C. Wachter"
__contact__ = "wachter.johnny@gmail.com"
__status__ = "Development"


class ExtractIPs(object):

    """Extract and Classify IP Addresses From Input Data."""

    def __init__(self, input_data):
        """Instantiate the Class."""

        self.input_data = input_data

        self.ipv4_results = {
            'valid_ips': [],  # Store all valid IP Addresses.
            'invalid_ips': [],  # Store all invalid IP Addresses.
            'private_ips': [],  # Store all Private IP Addresses.
            'public_ips': []  # Store all Public IP Addresses.
        }

    def extract_ipv4_like(self):
        """Extract IP-like strings from input data.
        :rtype : list
        """

        ipv4_like_list = []

        # Anchor the end so entries with trailing junk (ex: 1.2.3.4abc) are skipped.
        ip_like_pattern = re.compile(r'^([0-9]{1,3}\.){3}([0-9]{1,3})$')

        for entry in self.input_data:

            if re.match(ip_like_pattern, entry):

                if len(entry.split('.')) == 4:

                    ipv4_like_list.append(entry)

        return ipv4_like_list

    def validate_ipv4_like(self):
        """Validate that IP-like entries fall within the appropriate range."""

        if self.extract_ipv4_like():

            # We're gonna want to ignore the below two addresses.
            ignore_list = ['0.0.0.0', '255.255.255.255']

            # Separate the Valid from Invalid IP Addresses.
            for ipv4_like in self.extract_ipv4_like():

                # Split the 'IP' into parts so each part can be validated.
                parts = ipv4_like.split('.')

                # All part values should be between 0 and 255.
                if all(0 <= int(part) < 256 for part in parts):

                    if ipv4_like not in ignore_list:

                        self.ipv4_results['valid_ips'].append(ipv4_like)

                else:

                    self.ipv4_results['invalid_ips'].append(ipv4_like)

        else:
            pass

    def classify_ipv4_addresses(self):
        """Classify Valid IP Addresses."""

        if self.ipv4_results['valid_ips']:

            # Now we will classify the Valid IP Addresses.
            for valid_ip in self.ipv4_results['valid_ips']:

                private_ip_pattern = re.findall(

                    r"""
                    (^127\.0\.0\.1)|  # Loopback

                    (^10\.(\d{1,3}\.){2}\d{1,3})|  # 10/8 Range

                    # Matching the 172.16/12 Range takes several matches
                    (^172\.1[6-9]\.\d{1,3}\.\d{1,3})|
                    (^172\.2[0-9]\.\d{1,3}\.\d{1,3})|
                    (^172\.3[0-1]\.\d{1,3}\.\d{1,3})|

                    (^192\.168\.\d{1,3}\.\d{1,3})|  # 192.168/16 Range

                    # Match APIPA Range.
                    (^169\.254\.\d{1,3}\.\d{1,3})

                    # VERBOSE for a clean look of this RegEx.
                    """, valid_ip, re.VERBOSE
                )

                if private_ip_pattern:

                    self.ipv4_results['private_ips'].append(valid_ip)

                else:
                    self.ipv4_results['public_ips'].append(valid_ip)

        else:
            pass

    def get_ipv4_results(self):
        """Extract and classify all valid and invalid IP-like strings.
        :returns : dict
        """

        self.extract_ipv4_like()
        self.validate_ipv4_like()
        self.classify_ipv4_addresses()

        return self.ipv4_results
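
As a cross-check for the range validation and the private/public split, the standard library's ipaddress module can do a similar job. This is a different technique from the regex-based class above, not part of it, and the sketch assumes Python 3. The function name classify and the shape of the returned dict are my own choices.

```python
import ipaddress


def classify(ip_strings):
    """Sort IP-like strings into private, public and invalid lists."""
    results = {"private_ips": [], "public_ips": [], "invalid_ips": []}
    for text in ip_strings:
        try:
            address = ipaddress.IPv4Address(text)
        except ValueError:
            # Raised for anything that is not a well-formed IPv4 address.
            results["invalid_ips"].append(text)
            continue
        key = "private_ips" if address.is_private else "public_ips"
        results[key].append(text)
    return results


print(classify(["192.168.1.1", "8.8.8.8", "192.168.0.256"]))
```

Note that is_private covers more reserved ranges than the regex in the class, so the two approaches can disagree at the edges. Newer Python releases also reject octets with leading zeroes, so an entry like 101.099.098.000 may end up in the invalid list.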

Example Extraction With Censorship

censored = re.compile(
    r"""

    \(\.\)|
    \(dot\)|
    \[\.\]|
    \[dot\]|
    ([ ]\.)  # Space before the dot; written as [ ] because VERBOSE ignores bare spaces.

    """, re.VERBOSE | re.IGNORECASE
)

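# input_string is assumed to hold the raw text. Undo the censorship before splitting.
data_list = censored.sub('.', input_string).split()  # Bring your input string to a list.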

clean_list = []  # List to store the cleaned up input.

for entry in data_list:

    # Remove undesired leading and trailing characters.
    clean_entry = entry.strip(' .,<>?/[]\\{}"\'|`~!@#$%^&*()_+-=')

    clean_list.append(clean_entry)  # Add the entry to the clean list.

clean_unique_list = list(set(clean_list))  # Remove duplicates in list.

# Now we can go ahead and extract IPv4 Addresses. Note that this will be a dict.
results = ExtractIPs(clean_unique_list).get_ipv4_results()

for k, v in results.items():

    # After all that work, make sure the results are nicely presented!
    print("\n%s: %s" % (k, v))

Results:

public_ips: ['8.8.8.8', '101.099.098.000']

valid_ips: ['192.168.1.1', '8.8.8.8', '101.099.098.000']

invalid_ips: []

private_ips: ['192.168.1.1']
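
To see what the censorship pattern does in isolation, here is a small sketch. It uses a one-line, non-VERBOSE version of the pattern so that the literal space before the dot is kept, and the helper name uncensor is hypothetical.

```python
import re

# Same alternatives as the censored pattern above, written without VERBOSE.
censored = re.compile(r"\(\.\)|\(dot\)|\[\.\]|\[dot\]| \.", re.IGNORECASE)


def uncensor(text):
    """Replace every censored dot variant with a plain dot."""
    return censored.sub(".", text)


print(uncensor("192.168.1[dot]1"))   # 192.168.1.1
print(uncensor("192 .168 .1 .1"))    # 192.168.1.1
print(uncensor("192. 168. 1. 1"))    # unchanged: a space after the dot is not covered
```

The last case shows a limit of this approach: a space after the dot is left alone, so that form is later split into separate entries and never recognized as an IP.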