Question

I need to filter a pandas DataFrame by a range of IP addresses. Is it possible without regular expressions?

Example: from 61.245.160.0 to 61.245.175.255

Solution

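Strings are orderable in Python, so you may be able to get away with just that. Note that the comparison is lexicographic, not numeric, so it is only reliable when the octets being compared have the same number of digits (for instance, '61.245.175.3' sorts after '61.245.175.255'):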

In [11]: '61.245.160.0' < '61.245.175.255'
Out[11]: True

Either use a boolean mask (the strict comparisons exclude both endpoints; use <= to include them):

In [12]: df[('61.245.160.0' < df.ip) & (df.ip < '61.245.175.255')]

or take a slice (if ip were the index):

In [13]: df.loc['61.245.160.0':'61.245.175.255']
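Because string comparison can misorder addresses whose octets differ in length, a safer variant is to compare integer values. The following is a minimal sketch, assuming the standard-library ipaddress module and a hypothetical sample frame with an `ip` column; the helper name `to_int` is made up for illustration.

```python
import ipaddress

import pandas as pd

# Hypothetical sample data; the column name `ip` follows the answer above.
df = pd.DataFrame({"ip": ["61.245.159.7", "61.245.160.0", "61.245.17.5",
                          "61.245.175.3", "61.245.175.255", "61.245.176.1"]})


def to_int(addr):
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(addr))


lo = to_int("61.245.160.0")
hi = to_int("61.245.175.255")

# Series.between is inclusive on both ends by default.
mask = df["ip"].map(to_int).between(lo, hi)
in_range = df[mask]
```

In this sample, '61.245.17.5' is rejected and '61.245.175.3' is kept, which is the opposite of what the plain string comparison would do.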

OTHER TIPS

Here is an approach using the standard-library ipaddress module (imported below as ip).

For example, I want to know whether host0 = 10.2.23.5 belongs to any of the following networks: NETS = ['10.2.48.0/25','10.2.23.0/25','10.2.154.0/24'].

>>> import ipaddress as ip
>>> host0 = ip.IPv4Address('10.2.23.5')
>>> NETS = ['10.2.48.0/25','10.2.23.0/25','10.2.154.0/24']
>>> nets  = [ip.IPv4Network(x) for x in NETS]
>>> [x for x in nets if (host0 >= x.network_address and host0 <= x.broadcast_address)]
[IPv4Network('10.2.23.0/25')]
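The explicit comparison against network_address and broadcast_address can also be written with the `in` operator, which the ipaddress network objects support. A short sketch of the same check, with the variable name `matches` introduced here for illustration:

```python
import ipaddress

host0 = ipaddress.IPv4Address("10.2.23.5")
nets = [ipaddress.IPv4Network(n)
        for n in ["10.2.48.0/25", "10.2.23.0/25", "10.2.154.0/24"]]

# `address in network` is True when the address lies inside the network's range.
matches = [net for net in nets if host0 in net]
```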

To combine this approach with pandas, write a function and apply it to each value of the IP column.

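def fnc(row):
    # row is a single IP string from the column; nets is the list of
    # IPv4Network objects built above
    host = ip.IPv4Address(row)
    vec = [x for x in nets if (host >= x.network_address and host <= x.broadcast_address)]

    # '1' when no network matches, '-1' when at least one does
    if len(vec) == 0:
        return '1'
    else:
        return '-1'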

Then apply it to the column:

df['newCol'] = df['IP'].apply(fnc)

This creates a new column newCol whose values are the strings '1' or '-1'. As the function is written, '1' means the address matched none of the networks and '-1' means it belongs to at least one of them.
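If the goal is to filter rows rather than label them, returning a boolean is usually more convenient, since the result can be used directly as a mask. A self-contained sketch under that assumption; the sample addresses, the helper `in_any_net`, and the column `in_nets` are hypothetical:

```python
import ipaddress

import pandas as pd

NETS = ["10.2.48.0/25", "10.2.23.0/25", "10.2.154.0/24"]
nets = [ipaddress.IPv4Network(n) for n in NETS]


def in_any_net(addr):
    """Return True if the IPv4 string `addr` lies in at least one network."""
    host = ipaddress.IPv4Address(addr)
    return any(host in net for net in nets)


# Hypothetical sample data.
df = pd.DataFrame({"IP": ["10.2.23.5", "10.2.23.200", "10.2.154.9", "192.168.0.1"]})

df["in_nets"] = df["IP"].apply(in_any_net)

# Keep only the rows whose address belongs to one of the networks.
filtered = df[df["in_nets"]]
```

Note that 10.2.23.200 is not matched, because a /25 network starting at 10.2.23.0 only covers 10.2.23.0 through 10.2.23.127.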

Assuming you have the following DF:

In [48]: df
Out[48]:
               ip
0    61.245.160.1
1  61.245.160.100
2  61.245.160.200
3  61.245.160.254

Let's find all IPs strictly between 61.245.160.99 and 61.245.160.254:

In [49]: ip_from = '61.245.160.99'

In [50]: ip_to = '61.245.160.254'

If we compare the IPs as strings, they are compared lexicographically, so the result is wrong, as @adele has pointed out:

In [51]: df.query("'61.245.160.99' < ip < '61.245.160.254'")
Out[51]:
Empty DataFrame
Columns: [ip]
Index: []

In [52]: df.query('@ip_from < ip < @ip_to')
Out[52]:
Empty DataFrame
Columns: [ip]
Index: []
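The empty result follows from how strings are ordered: the comparison is decided at the first differing character. A small sketch of that effect, using the standard-library ipaddress module for the numeric side (the variable names are illustrative only):

```python
import ipaddress

a, b = "61.245.160.100", "61.245.160.99"

# As strings, the first differing character decides:
# '1' (from "100") sorts before '9' (from "99").
string_says_a_smaller = a < b

# As addresses, the whole last octet is compared numerically: 100 > 99.
address_says_a_smaller = ipaddress.IPv4Address(a) < ipaddress.IPv4Address(b)
```

So every address in the frame that should lie above 61.245.160.99 is treated as lying below it, and no row passes the lower bound.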

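We can use the numerical representation of each IP instead (IPAddress below is assumed to come from the netaddr package, i.e. from netaddr import IPAddress):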

In [53]: df[df.ip.apply(lambda x: int(IPAddress(x)))
   ....:      .to_frame('ip')
   ....:      .eval('{} < ip < {}'.format(int(IPAddress(ip_from)),
   ....:                                  int(IPAddress(ip_to)))
   ....:       )
   ....: ]
Out[53]:
               ip
1  61.245.160.100
2  61.245.160.200

Explanation:

In [66]: df.ip.apply(lambda x: int(IPAddress(x)))
Out[66]:
0    1039507457
1    1039507556
2    1039507656
3    1039507710
Name: ip, dtype: int64

In [67]: df.ip.apply(lambda x: int(IPAddress(x))).to_frame('ip')
Out[67]:
           ip
0  1039507457
1  1039507556
2  1039507656
3  1039507710

In [68]: (df.ip.apply(lambda x: int(IPAddress(x)))
   ....:    .to_frame('ip')
   ....:    .eval('{} < ip < {}'.format(int(IPAddress(ip_from)),
   ....:                               int(IPAddress(ip_to))))
   ....: )
Out[68]:
0    False
1     True
2     True
3    False
dtype: bool

PS: here is a faster (vectorized) function that returns the numerical IP representation:

import numpy as np

def ip_to_int(ip_ser):
    # split 'a.b.c.d' into four integer columns (int64, so the shifts cannot overflow)
    ips = ip_ser.str.split('.', expand=True).astype(np.int64).values
    # shift each octet to its byte position (broadcast per column) and sum
    return np.sum(np.left_shift(ips, np.array([24, 16, 8, 0])), axis=1)

Demo:

In [78]: df['int_ip'] = ip_to_int(df.ip)

In [79]: df
Out[79]:
               ip      int_ip
0    61.245.160.1  1039507457
1  61.245.160.100  1039507556
2  61.245.160.200  1039507656
3  61.245.160.254  1039507710

check:

In [80]: (df.ip.apply(lambda x: int(IPAddress(x))) == ip_to_int(df.ip)).all()
Out[80]: True
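To tie the numeric column back to the filtering task, the bounds can be converted the same way and compared directly. This is a sketch that does not need netaddr; it uses a matrix product with the place values of the four octets as an alternative to the bit shifts, and the names `lo`, `hi` and `result` are introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ip": ["61.245.160.1", "61.245.160.100",
                          "61.245.160.200", "61.245.160.254"]})


def ip_to_int(ip_ser):
    """Vectorized conversion of dotted-quad strings to integers."""
    octets = ip_ser.str.split(".", expand=True).astype(np.int64).to_numpy()
    # Multiply each octet by its place value and sum across the row.
    return octets @ np.array([256**3, 256**2, 256, 1], dtype=np.int64)


df["int_ip"] = ip_to_int(df["ip"])

# Convert both bounds in one call, then filter with strict inequalities.
lo, hi = ip_to_int(pd.Series(["61.245.160.99", "61.245.160.254"]))
result = df[(df["int_ip"] > lo) & (df["int_ip"] < hi)]
```

With the sample frame above, `result` holds the rows for 61.245.160.100 and 61.245.160.200, matching the netaddr-based output.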
Licensed under: CC-BY-SA with attribution