Question

I have read many times around the web about this question:

How do you extract the meaning of a page.

And I know that I am not experienced enough to even try to suggest a solution. To me this is the holy grail of web programming, or maybe even of computer technology as a whole.

But through the power of imagination, let us assume that I have written the ultimate script that does exactly that. For example, I enter this text:

Imagination has brought mankind through the dark ages to its present state of civilization. Imagination led Columbus to discover America. Imagination led Franklin to discover electricity.

and my powerful script extracts the meaning and says this:

The ability of human beings to think leads them to discover new things.

For the purpose of this example, I used a "String" to express the meaning of the text. But if I had to store this in a database, an array, or any other sort of storage, what datatype would I be using?

Note that I can have another text that uses a different analogy but still has the same meaning worded differently, for example:

Imagination helps human kind advance.

Now I can enter a search query about the importance of imagination and these 2 results appear. But how will they be matched? Will it be a String comparison? Some integers, floating points? Maybe even binary?

What will the meaning be saved under? I would like to hear from you.

Update: Let me restate the question simply.

How do you represent Meaning in data?

No correct solution

OTHER TIPS

Assuming that our brains do not have access to a metaphysical cloud server, meaning is represented as a configuration of neuronal connections, hormonal levels, electrical activity -- maybe even quantum fluctuations -- and the interaction between all of these and the outer world and other brains. So this is good news: at least we know that there is -- at least -- one answer to your question (meaning is represented somewhere, somehow). The bad news is that most of us have no idea how this works, and those who think they understand haven't been able to convince the others or each other. Being one of the clueless people, I can't give the answer to your question, but I can provide a list of the answers that I have come across to smaller, degenerate versions of the grand problem.

If you want to represent the meaning of lexical entities (e.g., concepts, actions) you can use distributed models such as vector space models. In these models, meaning usually has a geometric component. Each concept is represented as a vector, and you place the concepts in a space in such a way that similar concepts are closer to each other. A very common way to construct such a space is to pick a set of commonly used words (basis words) as the dimensions of the space and simply count the number of times a target concept is observed together in speech/text with these basis words. Similar concepts are used in similar contexts; thus, their vectors will point in similar directions. On top of that you can apply a variety of weighting, normalization, dimensionality reduction, and recombination techniques (e.g., tf-idf, http://en.wikipedia.org/wiki/Pointwise_mutual_information, SVD). A slightly related, but probabilistic -- rather than geometric -- approach is latent Dirichlet allocation and other generative/Bayesian models, which are already mentioned in another answer.
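
Here is a minimal sketch of that counting idea in Python; the basis words, the toy corpus, and the window size are all invented for illustration, and a real system would use far larger vocabularies and corpora:

from collections import Counter

# Basis words acting as the dimensions of the space (invented).
BASIS = ["discover", "advance", "think", "build", "eat"]

def context_vector(target, sentences, window=4):
    """Count how often each basis word appears near `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            if w == target:
                context = words[max(0, i - window):i + window + 1]
                for basis in BASIS:
                    counts[basis] += context.count(basis)
    return [counts[b] for b in BASIS]

def cosine(u, v):
    """Vectors pointing in similar directions give values near 1.0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm if norm else 0.0

corpus = [
    "imagination led columbus to discover america",
    "imagination helps human kind advance",
    "curiosity led franklin to discover electricity",
]
v1 = context_vector("imagination", corpus)
v2 = context_vector("curiosity", corpus)
print(cosine(v1, v2))  # higher value = more similar contexts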

The vector space model approach is good for discriminative purposes. You can decide whether two given phrases are semantically related (for example, matching queries to documents, or finding similar search query pairs to help the user expand a query). But it is not very straightforward to incorporate syntax into these models, and I can't see very clearly how you could represent the meaning of a whole sentence as a vector.

Grammar formalisms could help to incorporate syntax and bring structure to meaning and to the relations between concepts (e.g., head-driven phrase structure grammar). If you build two agents that share a vocabulary and a grammar and make them communicate (i.e., transfer information from one to the other) via these mechanisms, you could say they represent meaning. It is rather a philosophical question where and how the meaning is represented when a robot tells another to pick the "red circle above the black box" via a built-in or emergent grammar and vocabulary, and the other one successfully picks the intended object (see this very interesting experiment on grounding vocabulary: Talking Heads).

Another way to capture meaning is to use networks: represent each concept as a node in a graph and the relations between concepts as edges between the nodes, and you have a practical representation of meaning. Concept Net is a project that aims to represent common sense, and it can be viewed as a semantic network of commonsense concepts. In a way, the meaning of a certain concept is represented via its location relative to other concepts in the network.
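
As a toy illustration of the network idea, here is a sketch in Python; the concepts, relation labels, and the hop-count similarity measure are all invented, and Concept Net itself is far richer:

from collections import deque

# Invented mini semantic network: (source, relation, target) triples.
edges = {
    ("imagination", "leads_to", "discovery"),
    ("discovery", "produces", "knowledge"),
    ("knowledge", "enables", "civilization"),
    ("thinking", "related_to", "imagination"),
}

# Build an undirected adjacency list, ignoring relation labels.
graph = {}
for src, _rel, dst in edges:
    graph.setdefault(src, set()).add(dst)
    graph.setdefault(dst, set()).add(src)

def distance(a, b):
    """Breadth-first search: fewer hops = more closely related."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # not connected

print(distance("imagination", "civilization"))  # 3 hops in this toy graph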

Speaking of common sense, Cyc is another ambitious example of a project that tries to capture commonsense knowledge, but it does so in a very different way from Concept Net. Cyc uses a well-defined symbolic language to represent the attributes of objects and the relations between objects in an unambiguous way. By employing a very large set of rules and concepts, and an inference engine, one can derive deductions about the world and answer questions like "Can horses be sick?" or handle requests like "Bring me a picture of a sad person."
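
Cyc's actual language (CycL) and inference engine are far more sophisticated, but a toy forward-chaining sketch in Python conveys the flavor; the facts and the single hard-coded rule are invented:

# Facts are (predicate, subject, object) triples; the rule says:
# if X isa Animal, then X canBeSick.
facts = {("isa", "horse", "Animal"), ("isa", "Animal", "LivingThing")}

def infer(facts):
    """Apply the rule repeatedly until no new facts appear."""
    changed = True
    while changed:
        changed = False
        for f in list(facts):
            if f[0] == "isa" and f[2] == "Animal":
                new = ("canBeSick", f[1])
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

print(("canBeSick", "horse") in infer(facts))  # True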

At a previous company, I worked on a system that attempted to do this. We were more focused on "what unstructured documents are most similar to this unstructured document", but the relevant part was how we determined the "meaning" of the document.

We used two different algorithms, PLSA (Probabilistic Latent Semantic Analysis) and PSVM (Probabilistic Support Vector Machines). Both extract topics that are significantly more prevalent in the document being analyzed than in other documents in the collection.

The topics themselves had numerical IDs, and there was an xref (cross-reference) table from document to topic. To determine how close two documents were, we would look at the percentage of topics the documents had in common.

Presuming your super script could produce topics from the query entered, you could use a similar structure. It has the added advantage that the xref table contains only integers, so you're comparing integers rather than doing string operations.
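
A sketch of that structure in Python; the document names and topic IDs are invented, and I've used the Jaccard index as one reasonable way to turn "percentage of topics in common" into a number:

# Each document maps to a set of integer topic IDs (values invented).
doc_topics = {
    "doc_a": {3, 17, 42, 101},
    "doc_b": {3, 42, 55},
    "doc_c": {7, 8, 9},
}

def topic_overlap(a, b):
    """Fraction of topics the two documents share (Jaccard index)."""
    ta, tb = doc_topics[a], doc_topics[b]
    return len(ta & tb) / len(ta | tb)

print(topic_overlap("doc_a", "doc_b"))  # 0.4 -> fairly similar
print(topic_overlap("doc_a", "doc_c"))  # 0.0 -> unrelated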

Semantics is a wide and deep field, and there are many models, all of them with advantages and problems from an AI implementation point of view. With this scarce amount of background, one can hardly make a recommendation beyond "study the literature, and pick a theory which resonates with your intuition (and if you are at all successful in this, replace it with a better theory of your own, and score academic points)". Having said that, the freshman course material I can vaguely recollect had nice things to say about a recursive structure called a "frame", but that must have been 15 years ago.

In general, a meaning is an abstract concept whose internal representation is a black-box data structure that depends on the chosen algorithm. But this is not the interesting part. If you do semantic analysis, the interesting questions concern differences in meaning: e.g., whether two documents talk about the same topic, how different two documents are, or how to group documents with similar meanings.

If you use a vector space model, the meaning/semantics can be represented by a collection of vectors that represent specific topics. One way to extract such patterns is http://en.wikipedia.org/wiki/Latent_semantic_analysis or http://en.wikipedia.org/wiki/Nonnegative_matrix_factorization. But there are also more elaborate statistical models which represent semantics through the parameters of certain probability distributions. A recent method is http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation.
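
For instance, here is a minimal LDA sketch using scikit-learn (assuming the library is installed; the tiny corpus and the choice of two topics are invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "imagination led columbus to discover america",
    "imagination helps human kind advance",
    "the recipe needs flour butter and sugar",
]

# Word counts per document, then a two-topic LDA model.
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is a document's "meaning" as a distribution over topics.
print(lda.fit_transform(X))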

I will talk about the Semantic Web because I think it offers the most advanced studies and language implementations on the subject.

Resource Description Framework (RDF) is one of the many data models available in the Semantic Web for describing information.

RDF is an abstract model with several serialization formats (i.e., file formats), and so the particular way in which a resource or triple is encoded varies from format to format.

and

However, in practice, RDF data is often persisted in relational databases or native representations also called Triplestores, or Quad stores if the context (i.e. the named graph) is also persisted for each RDF triple.

RDF content can be retrieved using RDF query languages such as SPARQL.
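
As a sketch of what this looks like in practice, here is a small example using the rdflib Python library (assuming it is installed): store two triples, then retrieve one with a SPARQL query.

from rdflib import Graph, URIRef, Literal, Namespace

DC = Namespace("http://purl.org/dc/elements/1.1/")
page = URIRef("http://en.wikipedia.org/wiki/Tony_Benn")

g = Graph()
g.add((page, DC.title, Literal("Tony Benn")))
g.add((page, DC.publisher, Literal("Wikipedia")))

query = """
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?title WHERE { ?page dc:title ?title }
"""
for row in g.query(query):
    print(row.title)  # -> Tony Benn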


Topic Maps is another model for storing and representing knowledge.

Topic Maps is a standard for the representation and interchange of knowledge, with an emphasis on the findability of information.

and

In the year 2000, Topic Maps was defined in an XML syntax, XTM. This is now commonly known as "XTM 1.0" and is still in fairly common use.

From the official Topic Maps Data Model:

The only atomic fundamental types defined in this part of ISO/IEC 13250 (in 4.3) are strings and null. Through the concept of datatypes, data of any type can be represented in this model. All datatypes used shall have a string representation of their value space and this string representation is what is stored in the topic map. The information about which datatype the value belongs to is stored separately, in the form of a locator identifying the datatype.
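
In code, that convention amounts to storing every value as a string alongside a locator naming its datatype. A minimal sketch in Python (the class is invented; the locators shown are the standard XML Schema datatype URIs):

from dataclasses import dataclass

@dataclass
class TypedValue:
    value: str      # string representation of the value
    datatype: str   # locator identifying the datatype

year = TypedValue("1925", "http://www.w3.org/2001/XMLSchema#integer")
name = TypedValue("Tony Benn", "http://www.w3.org/2001/XMLSchema#string")

def decode(tv):
    """Interpret the stored string according to its datatype locator."""
    if tv.datatype.endswith("#integer"):
        return int(tv.value)
    return tv.value

print(decode(year) + 1)  # 1926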

There are many other proposed formats; you can take a look at this article for more information.

I also want to link to a recent answer I wrote about a similar topic, with a lot of useful links.


After reading various articles, I think the common direction all these methods take is storing data in a text format. The relevant information can be stored in a database directly as text.

Having the data in an understandable text format has several benefits, and arguably more benefits than disadvantages.

Other Semantic Web methods such as Notation 3 (N3) or Turtle syntax use slightly different formats, but still plain text.

An N3 example:

@prefix dc: <http://purl.org/dc/elements/1.1/>.

<http://en.wikipedia.org/wiki/Tony_Benn>
  dc:title "Tony Benn";
  dc:publisher "Wikipedia".
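
The snippet above can be loaded back into structured triples; for example, with the rdflib Python library (assuming it is installed):

from rdflib import Graph

n3 = """
@prefix dc: <http://purl.org/dc/elements/1.1/>.
<http://en.wikipedia.org/wiki/Tony_Benn>
  dc:title "Tony Benn";
  dc:publisher "Wikipedia".
"""

g = Graph()
g.parse(data=n3, format="n3")
for subject, predicate, obj in g:
    print(subject, predicate, obj)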

Finally, I would like to link to a useful article you should read: Standardization of Unstructured Textual Data into Semantic Web Format.

Let's assume that you have found the ultimate algorithm that can provide the meaning of a text. In particular, you chose a string representation; but if your algorithm found the meaning correctly, then that meaning can be uniquely identified by the algorithm. Right?

So, for simplicity, let's assume there is only one meaning for that particular text. In this case it is uniquely identified even before the algorithm outputs a phrase describing it.

So, basically, in order to store a meaning we first need a unique identifier.
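
A minimal sketch of that idea in Python: canonicalize the phrase the algorithm outputs and hash it into a stable identifier. The whitespace/lowercase canonicalization here is invented and obviously far too naive; a real system would need genuine semantic equivalence:

import hashlib

def meaning_id(phrase):
    """Map trivially different wordings to the same identifier."""
    canonical = " ".join(phrase.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

a = meaning_id("Imagination helps human kind advance.")
b = meaning_id("imagination  HELPS human kind advance.")
print(a == b)  # True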

A meaning can only exist in relation to a subject; it is the meaning of a subject. In order for that subject to have a meaning, we must know something about it. In order for a subject to have a unique meaning, it must be represented unambiguously to the observer (that is, the algorithm). For example, the statement "2 = 3" has the meaning of false because of the standardization of mathematical symbols. But a text written in a foreign language will have no meaning for us, and neither will anything that we can't understand, for example "what is the meaning of life?"

In conclusion, in order to build an algorithm that can extract the absolute meaning from any random text, we, as humans, must first know the absolute meaning of anything. :)

In practice, you can only extract the meaning of a known text, written in a known language, in a known format. And for this, there are tools and research in the fields of neural networks, natural language processing and so on...

Try making it a char* (a C-style string); it is easily stored in databases and easy to use. Make it of length 50 (10 words) or 75 (15 words).

EDIT: Put both meanings under the same word (imagination), then check for similar indexes and assign them to the same word.

Use:

SELECT * FROM Dictionary WHERE "Index" = 'Imagination';

Sorry, I'm not too experienced with SQL.
