Question

I have IP addresses as feature and I would like to know how much two IP addresses are similar to each other to use the difference in an Euclidean distance measure (in order to quantify the similarities of my data points). What tactic can I use for this?

Solution

If I understood them correctly, Jeremy's and Edmund's (first) solutions are the same, namely, plain Euclidean distance in a 4-dimensional space of IP addresses. By the way, I think a very fast alternative to Euclidean distance would be to calculate a bit-wise Hamming distance.
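A minimal Python sketch of the bit-wise Hamming idea, assuming dotted IPv4 strings and the standard `ipaddress` module (the function name `hamming_distance` is made up for illustration):

```python
import ipaddress


def hamming_distance(ip1: str, ip2: str) -> int:
    """Count the bits that differ between two IPv4 addresses."""
    a = int(ipaddress.IPv4Address(ip1))
    b = int(ipaddress.IPv4Address(ip2))
    # XOR leaves a 1 wherever the two 32-bit values disagree.
    return bin(a ^ b).count("1")


print(hamming_distance("192.168.1.1", "192.168.1.2"))  # 2
print(hamming_distance("192.168.1.1", "191.168.1.1"))  # 7
```

Note that every differing bit counts the same here, wherever it sits in the address.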

Edmund's first update would be better than his second. The reason is simple to state: his 2nd update tries to define a distance measure by considering a non-linear function of the coordinates of a 4D vector. That however will most likely destroy the key properties that it needs to satisfy in order to be a metric, namely

  1. Identity of indiscernibles: $d(IP_1,IP_2)=0 \iff IP_1=IP_2$,
  2. Symmetry: $d(IP_1,IP_2)=d(IP_2,IP_1)$, and
  3. Triangle inequality: $d(IP_1,IP_2)\leq d(IP_1,IP_3)+d(IP_3,IP_2)\,\forall IP_3$.

The latter is key for later interpreting small distances as close points in IP space. One would need a linear (in the coordinates) distance function. However, simple Euclidean distance is not enough, as you saw.

Physics (well, differential geometry actually) could lead to a nice solution to this problem: define a metric tensor $g$. In plain English, give a weight to each pair of coordinates, multiply the two coordinate differences of that pair by each other and by the weight (for a coordinate paired with itself this is just the squared difference times its weight), and then add those products. Take the square root of that sum and define it as your distance.

For the sake of simplicity, one could start trying with a diagonal metric tensor.

Example: Say you take $g=\begin{pmatrix}1000 &0 &0 &0 \\0 &100&0&0\\0&0&10&0\\0&0&0&1\end{pmatrix}$, $IP_1=(x_1,x_2,x_3,x_4)$ and $IP_2=(y_1,y_2,y_3,y_4)$. Then the square of the distance is given by $$d(IP_1,IP_2)^2=1000\,(x_1-y_1)^2+100\,(x_2-y_2)^2+10\,(x_3-y_3)^2+(x_4-y_4)^2$$ For $IP_1=192.168.1.1,\,IP_2=192.168.1.2$ the distance is clearly 1. However, for $192.168.1.1$ and $191.168.1.1$ the distance is $\sqrt{1000}\approx 32$.

Eventually you could play around with different weights and set a kind of normalization where you could fix the value of the maximal distance $d(0.0.0.0,FF.FF.FF.FF)$.
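The diagonal example and the normalization idea can be sketched in Python roughly as follows. The weights are the ones from the example. The function names are hypothetical, and the maximal distance is taken between 0.0.0.0 and 255.255.255.255.

```python
import math

# Diagonal of the example metric tensor g, one weight per octet.
WEIGHTS = (1000.0, 100.0, 10.0, 1.0)


def octets(ip: str) -> list:
    """Split a dotted IPv4 string into its four integer coordinates."""
    return [int(part) for part in ip.split(".")]


def weighted_distance(ip1: str, ip2: str, weights=WEIGHTS) -> float:
    """sqrt(sum_i g_ii * (x_i - y_i)^2) for a diagonal metric."""
    return math.sqrt(
        sum(w * (x - y) ** 2 for w, x, y in zip(weights, octets(ip1), octets(ip2)))
    )


def normalized_distance(ip1: str, ip2: str, weights=WEIGHTS) -> float:
    """Rescale so that the two extreme addresses are at distance 1."""
    max_d = weighted_distance("0.0.0.0", "255.255.255.255", weights)
    return weighted_distance(ip1, ip2, weights) / max_d


print(weighted_distance("192.168.1.1", "192.168.1.2"))  # 1.0
print(weighted_distance("192.168.1.1", "191.168.1.1"))  # sqrt(1000), about 31.6
```

Different weight tuples can be passed in to experiment. As long as all weights are positive, this stays a proper metric.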

Furthermore, this setup allows for more complex descriptions of your data where the relevant distance would contain "cross-products" of coordinates such as $g_{13}*(x_1-y_1)*(x_3-y_3)$.

EDIT: While this would be a better "weighting" method than the others I addressed, I realize now it is actually meaningless: as Anony-Mousse and Phillip mention, IP addresses are indeed 32-dimensional. This means in particular that giving the same weight to all bits in, say, the 2nd group is in general not sound: one bit could be part of the netmask while another is not. See Anony-Mousse's answer for additional objections.

OTHER TIPS

That's a very interesting question. Similarity here should be computed component-wise, but from a "business logic" perspective, the similarity of the last group of three digits doesn't matter if the other three groups are not the same. Keeping that in mind, I would probably do something like the following (there is probably a more elegant way of doing it, and I don't have much time to think about it, so forgive me if it doesn't answer your question, and for the poor formatting).

Assuming IPv4 of the form aaa.bbb.ccc.ddd, I would do something like:

If aaa_1 == aaa_2:
    If bbb_1 == bbb_2:
        If ccc_1 == ccc_2:
            If ddd_1 == ddd_2:
                Dist = 1;
            Else:
                Dist = (3 + distance(ddd_1,ddd_2))/4;
            End if;
        Else:
            Dist = (2 + distance(ccc_1,ccc_2))/4;
        End if;
    Else:
        Dist = (1 + distance(bbb_1,bbb_2))/4;
    End if;
Else:
    Dist = distance(aaa_1,aaa_2);
End if;
Return 1/Dist;
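One possible reading of the pseudocode above, as a runnable Python sketch. It rests on several assumptions that are mine, not the pseudocode's. `distance()` is taken to be a normalized per-group similarity in [0, 1], the first-group case is also divided by 4 so that all cases share one scale, and the function returns the similarity score itself rather than `1/Dist`. All names are made up.

```python
import ipaddress


def group_similarity(a: int, b: int) -> float:
    """Assumed stand-in for distance(): 1.0 if equal, 0.0 if 255 apart."""
    return 1.0 - abs(a - b) / 255.0


def hierarchical_similarity(ip1: str, ip2: str) -> float:
    """Score in [0, 1]: matching leading groups, plus partial credit
    for the first group that differs, divided by 4."""
    groups1 = ipaddress.IPv4Address(ip1).packed  # four bytes, aaa first
    groups2 = ipaddress.IPv4Address(ip2).packed
    for matched, (a, b) in enumerate(zip(groups1, groups2)):
        if a != b:
            # Later groups are ignored once a mismatch is found.
            return (matched + group_similarity(a, b)) / 4.0
    return 1.0  # all four groups are equal


print(hierarchical_similarity("192.168.1.1", "192.168.1.1"))  # 1.0
```

With this reading, `1 - hierarchical_similarity(...)` could serve as the distance, which avoids dividing by a score of zero.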

IPv4 addresses are 32-bit integers, which trivially gives you a metric. However, it may not be a particularly useful metric: 10.255.255.255 and 11.0.0.0 are almost certainly significantly more different than 192.168.1.1 and 192.168.1.2.
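To make that concrete, here is a small Python sketch of the plain integer metric, using the standard `ipaddress` module (the function name is hypothetical). Both pairs from the paragraph above come out at the same distance:

```python
import ipaddress


def int_distance(ip1: str, ip2: str) -> int:
    """Absolute difference of the two addresses read as 32-bit integers."""
    return abs(int(ipaddress.IPv4Address(ip1)) - int(ipaddress.IPv4Address(ip2)))


print(int_distance("10.255.255.255", "11.0.0.0"))   # 1
print(int_distance("192.168.1.1", "192.168.1.2"))   # 1
```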

20 years ago, I would have suggested using the length of the shared prefix as a similarity measure.

So you take two IPs, in their 32-bit representation: not the "pretty printed" x.y.z.w form, but the real "int" representation your network stack uses. Then XOR them, count the leading zeros, and you get

distance = 32 - leadingZeros(ip1 XOR ip2)
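In Python this could look roughly like the sketch below (the function name is made up). For a 32-bit value, `32 - leadingZeros(x)` is the same as `x.bit_length()`, so no explicit zero counting is needed:

```python
import ipaddress


def prefix_distance(ip1: str, ip2: str) -> int:
    """32 minus the length of the shared bit prefix of two IPv4 addresses."""
    xor = int(ipaddress.IPv4Address(ip1)) ^ int(ipaddress.IPv4Address(ip2))
    # bit_length() is the position of the highest differing bit.
    return xor.bit_length()


print(prefix_distance("192.168.1.1", "192.168.1.2"))   # 2
print(prefix_distance("10.255.255.255", "11.0.0.0"))   # 25
print(prefix_distance("10.2.3.4", "10.2.3.4"))         # 0
```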

However, we exhausted the IPv4 address space long ago. Over the last 10 years, the few remaining netblocks have been distributed more or less "randomly" (at least from a similarity perspective). IP ranges have been relocated, and so on.

A long time ago, people would have told you that routing happens on trees, based on the prefix. So if you wanted to reach the IP 10.2.3.4, it would go to 10.0.0.0, then 10.2.0.0, then 10.2.3.0. But that was just the theory. If you manually configured your router, that is what you would do: 10.2 is the second building, 10.2.3 is the third-floor router. History. Within networks, IPs are assigned by DHCP, often first-come-first-served. Within an intranet, you mostly have switches, not routers. And on the global level, BGP is responsible for taking care of the big mess of today's routing tables.

In other words: use some database like GeoIP to map the IPs to (approximate) coordinates. Best you can do. IP based similarity is mostly useful on a /24 prefix, but a binary yes/no similarity won't make you happy I guess.
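The binary /24 check mentioned above could be sketched like this with the standard `ipaddress` module (the function name and the `prefix_len` parameter are my own). The GeoIP lookup is not shown, since it depends on whichever database you pick.

```python
import ipaddress


def same_prefix(ip1: str, ip2: str, prefix_len: int = 24) -> bool:
    """True if both addresses fall into the same /prefix_len block."""
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{ip1}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip2) in network


print(same_prefix("192.168.1.1", "192.168.1.200"))  # True
print(same_prefix("192.168.1.1", "192.168.2.1"))    # False
```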

Like someone else mentioned, treating IPs as ints automatically gives higher bits higher weights. I've used the standard deviation of the IPs, log scaled.

math.log(np.std([IP(ip).int() for ip in ips]))
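The one-liner above relies on numpy and on an `IP` class with an `int()` method (possibly the one from the IPy package, but that is a guess). A standard-library-only sketch of the same idea, with a made-up function name, might be:

```python
import ipaddress
import math
import statistics


def log_ip_spread(ips) -> float:
    """Log of the population standard deviation of the IPs as integers."""
    values = [int(ipaddress.IPv4Address(ip)) for ip in ips]
    # pstdev is the population standard deviation, like np.std's default.
    return math.log(statistics.pstdev(values))


print(log_ip_spread(["10.0.0.1", "10.0.3.7", "192.168.1.1"]))
```

If all addresses are identical, the standard deviation is 0 and `math.log` raises a `ValueError`, so that case needs separate handling.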

2nd Update

The below can be improved as it does not consider the hierarchical structure of an IP address. To account for this the elements of the IP vectors can be non-linearly scaled before computing the distance vector and its norm. This gives more weight to the elements higher in the hierarchy.

Mathematica code

Once we have the 4D vectors from the 1st update each element is scaled based on its position $[x^{2}_{1},x^{\frac{3}{2}}_{2},x^{1}_{3},x^{\frac{1}{2}}_{4}]$.

Subtract @@ (MapIndexed[
       Function[{value, index}, 
        value^((5 - First@index)/2)], #] & /@ {ip1, ip2}) // Norm // N
(* 2209.17 *)
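For readers without Mathematica, a rough Python equivalent of the scaling step, using the same two example addresses (the function name is hypothetical, and `math.dist` needs Python 3.8 or later):

```python
import math

# Exponents 2, 3/2, 1, 1/2 for the first to the fourth element.
EXPONENTS = (2.0, 1.5, 1.0, 0.5)


def scaled_distance(ip1: str, ip2: str) -> float:
    """Euclidean distance after raising each element to its exponent."""
    v1 = [int(p) ** e for p, e in zip(ip1.split("."), EXPONENTS)]
    v2 = [int(p) ** e for p, e in zip(ip2.split("."), EXPONENTS)]
    return math.dist(v1, v2)


print(round(scaled_distance("50.229.29.146", "27.167.216.58"), 2))  # 2209.17
```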

There is information lost in collapsing from 4D down to 1D but this can't be helped if you are looking for a 1D distance metric.


1st Update

An IP address is made up of 4 numbers. Take these as vectors in 4D and calculate the distance between them (distance in Euclidean space).

Mathematica code

(* Make some IP addresses *)
{ip1, ip2} = 
 StringRiffle[#, "."] & /@ 
  Map[ToString, RandomInteger[{1, 255}, {2, 4}], {2}]
(* {"50.229.29.146", "27.167.216.58"} *)

(* Extract 4D vector *)
{ip1, ip2} = Map[FromDigits, StringSplit[#, "."] & /@ {ip1, ip2}, {2}]
(* {{50, 229, 29, 146}, {27, 167, 216, 58}} *)

(* Calculate distance *)
Norm[ip1 - ip2] // N
(* 216.993 *)
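The same calculation as a short Python sketch, reusing the two example addresses from the Mathematica output (the function name is made up, and `math.dist` needs Python 3.8 or later):

```python
import math


def euclidean_ip_distance(ip1: str, ip2: str) -> float:
    """Treat each dotted IPv4 string as a 4D vector and take the norm of the difference."""
    v1 = [int(part) for part in ip1.split(".")]
    v2 = [int(part) for part in ip2.split(".")]
    return math.dist(v1, v2)


print(round(euclidean_ip_distance("50.229.29.146", "27.167.216.58"), 3))  # 216.993
```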

Consider an IP address as a 4D vector. Subtract and calculate the norm.

First you need to distinguish private and public addresses (check Wikipedia, IP_address#Private_addresses).

Private IPs: the best you can do is compute whether 2 addresses COULD be on the same network or not; then you need clues from other features to KNOW if it is the case or not.
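A minimal sketch of such a "could be on the same network" check, assuming the three RFC 1918 private blocks are the ones of interest (the function and constant names are made up):

```python
import ipaddress

# The three private IPv4 ranges defined in RFC 1918.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def could_share_private_network(ip1: str, ip2: str) -> bool:
    """True if both addresses lie in the same private block.
    This only says they COULD be on the same network, not that they are."""
    a = ipaddress.ip_address(ip1)
    b = ipaddress.ip_address(ip2)
    return any(a in block and b in block for block in PRIVATE_BLOCKS)


print(could_share_private_network("192.168.1.1", "192.168.77.5"))  # True
print(could_share_private_network("192.168.1.1", "10.0.0.1"))      # False
```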

Public IPs: for geographic distance, you may want to check web services/APIs that try to map IPs to geographical locations (a quick Google search turns some up).

Another point which could be interesting is the "organisational distance": from the IP address you can try to identify the owner of the address (the ISP). Check ARIN for instance: http://whois.arin.net/rest/net/NET-8-8-8-0-1/pft?s=8.8.8.8.

From there you can try to figure out if two addresses belong to the same organisation or not. You will have to find some way to tell if the organisation is an ISP with private customers or a company with its own network. Please also be aware that some organisations have started to resell blocks of their IPv4 addresses to other companies, with the effect that addresses that used to be in the same organisation/location can now be thousands of miles apart, in different companies.

I think it would be wise to treat this information as probabilistic only.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange