What is the current best state of the art algorithm for graph embedding of directed weighted graphs for binary classification?

https://datascience.stackexchange.com/questions/70163

10-12-2020
|

Pergunta

Sorry if this is a newbee question, I'm not an expert in data science, so this is the problem:

We have a directed and weighted graph, which higher or lower weight values does not imply the importance of the edge (so preferably the embedding algorithm shouldn't consider higher weights as more important), they are just used to imply the timing of the events which connect the nodes, so the higher weighted edges are events that have happened after the lower ones.

There can be multiple edges between nodes, and I want to do a binary classification using deep learning, meaning that my model gets a embedded graph vector as input and decides if its malicious(1) or not(0).

So what is the best state of the art graph embedding algorithm for this task that can capture as much information as possible from the graph? i read some graph embedding papers but couldn't find any good comparison of them since there are so many new ones.

IMPORTANT NOTE: One problem i have seen with some of the graph embedding algorithms, is that they try to have a small vector dimension, since i guess they are used in fields which there are a LOT of nodes so they need to do this, but in this task its not really important, the nodes are the functions in the program, and they very rarely go above 2000 functions, so even if the algorithm creates a 20k dimensions its no problem, I'm saying this because some of the algorithms that I'm reading will produce a vector that even has lower dimensions compared to number of the nodes in the graph! and that causes loss of information in my opinion. so to sum up, the performance and large vector size is not a problem in my task. so preferably the algorithm should gather as much information as possible from the graph.

Solução

I'm no expert myself, but recently (i.e. this is true to 2019), I've heard (at a meetup from an expert) that Node2Vec is the SOTA.

Here's a link to a post on Medium explaining it - basically, Node2Vec generates random walks on the graph (with hyper-parameters relating to walk length, etc), and embeds nodes in walks the same way that Word2Vec embeds words in a sentence.

Note that since random walks are generated on the graph, intuitively you don't have to use edge weights, and can use multiple edges.

P.S. There's an older (2014) algorithm for this, called "DeepWalk", which I don't know much about but is supposed to be similar, simpler, and not as performative. I'm just adding this in case you want to search the term.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a datascience.stackexchange