Question

I am new to natural language processing and I have not heard of a problem similar to mine yet. I was wondering if anyone could refer me to a method for solving my problem, or tell me how this problem is referred to in the academic literature, so that I can look for resources online.

Here is the problem: from some text (Wikipedia articles, for example), I would like to extract the hierarchy of the different concepts found in it. By hierarchy I mean a tree in which A is a descendant of B if A, or one of A's ancestors, is defined by B. For instance, normal distribution would be a descendant of probability (since the normal distribution is defined using probabilities), and probability would be a descendant (or child) of mathematics. Since the relation is transitive, normal distribution would also be a descendant of mathematics.

One way I thought about solving this is to count three things: the number of times a word A is used alone (call it A), the number of times the words A and B are used together (call it A AND B, where 'together' could mean in the same article, the same paragraph, or the same sentence), and the number of times the word B is used alone (call it B). Let A be mathematics and B be probability. If the ratios (A AND B)/A and (A AND B)/B are both low, that could imply there is no direct link between A and B (though a link could still exist through transitivity). Otherwise, if A is much bigger than B, A is the bigger concept. If A and B are almost the same, they are probably siblings (children of the same parent).

Let's take 3 examples:

  • Mathematics (A) and carrot (B). A AND B is really low compared to A and B, so there is no direct link between them (or only an indirect link by transitivity).
  • Mathematics (A) and probabilities (B). A AND B is quite high compared to B, and A is much bigger than B, so B should be a child of A (probabilities is a child of mathematics).
  • Topology (A) and probabilities (B). A AND B is relatively high (texts that present the different areas of mathematics will likely mention both), and A and B are of about the same order of magnitude, so A and B should be children of the same parent. Indeed, topology and probabilities are both children of mathematics.
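
The three examples above can be sketched in a few lines of Python. This is only a rough illustration of the heuristic, under several assumptions that are not part of the description: counts are document-level (a term counts once per document that contains it, whether or not other terms appear there), "together" means "in the same document", and the names `count_occurrences`, `relate`, `link_threshold` and `size_ratio`, the threshold values, and the toy corpus are all made up for the example.

```python
from collections import Counter
from itertools import combinations


def count_occurrences(documents, terms):
    """Count how many documents contain each term and each pair of terms."""
    single = Counter()
    joint = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        # Sort so that each pair is always stored in the same order.
        present = sorted(t for t in terms if t in words)
        single.update(present)
        joint.update(combinations(present, 2))
    return single, joint


def relate(a, b, single, joint, link_threshold=0.3, size_ratio=2.0):
    """Apply the ratio heuristic to terms a and b (thresholds are arbitrary)."""
    n_a, n_b = single[a], single[b]
    n_ab = joint[tuple(sorted((a, b)))]
    if n_a == 0 or n_b == 0:
        return "unknown"
    # Both (A AND B)/A and (A AND B)/B are low: no direct link.
    if max(n_ab / n_a, n_ab / n_b) < link_threshold:
        return "no direct link"
    # One term is much more frequent: treat it as the parent concept.
    if n_a >= size_ratio * n_b:
        return f"{b} is a child of {a}"
    if n_b >= size_ratio * n_a:
        return f"{a} is a child of {b}"
    # Linked and of similar size: treat as siblings.
    return "siblings"


# Hypothetical toy corpus, one string per "document".
documents = [
    "mathematics includes probability and topology",
    "mathematics studies numbers and shapes",
    "mathematics uses proofs",
    "mathematics is old",
    "probability is part of mathematics",
    "topology is part of mathematics",
    "carrot soup recipe",
    "carrot cake recipe",
]
terms = {"mathematics", "probability", "topology", "carrot"}
single, joint = count_occurrences(documents, terms)

print(relate("mathematics", "carrot", single, joint))       # no direct link
print(relate("mathematics", "probability", single, joint))  # probability is a child of mathematics
print(relate("topology", "probability", single, joint))     # siblings
```

On a real corpus the two thresholds would need tuning, and the whitespace tokenization here would have to be replaced by something that handles punctuation and multi-word terms such as "normal distribution".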

This way of solving the problem is far from perfect. For instance, 'the' (A) and 'probability' (B) would probably end up implying that probability is a child of 'the' (because A AND B is huge and A is much bigger than B).

If anyone knows some papers on this or has any ideas on how I might solve this problem, I would appreciate some direction. Also, does my solution seem viable? How could it be improved?

Solution

Look up taxonomy/ontology construction/induction. Relevant papers:

  • Automatic Taxonomy Construction from Keywords via Scalable Bayesian Rose Trees
  • Topic Models for Taxonomies
  • OntoLearn Reloaded. A Graph-Based Algorithm for Taxonomy Induction
  • Ontology Population and Enrichment: State of the Art
  • Probabilistic Topic Models for Learning Terminological Ontologies
Licensed under: CC-BY-SA with attribution