How to calculate the similarity between two sentences (syntactic and semantic)

https://stackoverflow.com/questions/3655612

01-10-2019

Question

I am supposed to take two sentences at a time and compute whether they are similar. By similar I mean both syntactically and semantically.

INPUT1: Obama signs the law.         A new law is signed by Obama.

INPUT2: A bus is stopped here.         A vehicle stops here.

INPUT3: Fire in NY.         NY is burnt down.

INPUT4: Fire in NY.         50 died in NY fire.

I don't want to rely on an ontology tree alone. I wrote code to compute the Levenshtein distance (LD) between the sentences and then decide whether the second sentence:

can be ignored (INPUT1 and INPUT2),
should replace the first sentence (INPUT3), or
should be stored along with the first sentence (INPUT4).

I am not happy with the code, because LD only captures syntactic (character-level) similarity. What other methods are there? And how can semantics be incorporated (for example, that a bus is a kind of vehicle)?

The code goes here:

%# Once the difference is computed, a decision is made on whether the new
%# event (string 2) is ignored, replaces the existing event (string 1), or is
%# stored separately. The higher the LD metric, the greater the difference
%# between the two strings. A low difference indicates identical or similar
%# events, while a high difference indicates that the new event is a fresh
%# event.

%#.........................................................................
%# Calculating the LD between two strings of events.
%#.........................................................................
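%# str1 and str2 are assumed to be character row vectors defined earlier.
L1=length(str1)+1;   %# rows: one per character of str1, plus the empty prefix
L2=length(str2)+1;   %# columns: one per character of str2, plus the empty prefix
L=zeros(L1,L2);      %# L(i,j) = LD between the first i-1 chars of str1 and the first j-1 chars of str2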

g=+1;             %# gap (insertion/deletion) cost
m=+0;             %# a match costs nothing; we seek to minimize
d=+1;             %# a mismatch (substitution) is more costly

%# Boundary conditions: reaching the empty string costs one gap per character.
L(:,1)=((0:L1-1)*g)';
L(1,:)=(0:L2-1)*g;

%# Calculating required edits.
for idx=2:L1
    for idy=2:L2
        if(str1(idx-1)==str2(idy-1))
            score=m;
        else
            score=d;
        end
        m1=L(idx-1,idy-1) + score;    %# substitution (or free match)
        m2=L(idx-1,idy) + g;          %# deletion from str1
        m3=L(idx,idy-1) + g;          %# insertion into str1
        L(idx,idy)=min([m1,m2,m3]);   %# keep the cheapest of the three edits
    end
end
%# The LD between two strings.
D=L(L1,L2);

%#....................................................................
%# Making decision on what to do with the new event (string 2).
%#...................................................................
if (D<=4)     %# Distance is so small that string 2 seems identical to string 1.
    store=str1;        %# Hence string 2 is ignored. String 1 remains stored.
elseif (D>=5 && D<=15) %# Distance is too large for identical strings, but not
    %# large enough to make string 2 an individual event.
    store=str2;        %# String 2 is somewhat similar to string 1.
                       %# So, string 1 is replaced with string 2 and stored.
else
    %# For all other distances, string 2 is stored along with string 1.
    store={str1; str2};
end

Any help is appreciated.

Solution

"Semantically": I know of no simple textbook algorithm for that. Natural language (especially English) is a very complicated and fickle beast. Let's look at (just a small part of) the provided cases:

INPUT1: Obama signs the law. A new law is signed by Obama.

Signing a law makes it a 'new' law.
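To make the difficulty with this pair concrete, here is a minimal sketch of a word-level comparison. It is not part of the question's code and not something this answer proposes as a solution; it swaps in a plain Jaccard overlap of word sets, and the names `s1`, `s2`, `w1`, `w2`, `shared` and `jaccard` are assumptions chosen for illustration.

```matlab
%# Assumed example inputs (INPUT1 from the question).
s1 = 'Obama signs the law.';
s2 = 'A new law is signed by Obama.';

%# Lowercase, strip punctuation, split on spaces, and keep the unique words.
w1 = unique(strsplit(strtrim(regexprep(lower(s1), '[^a-z0-9 ]', ''))));
w2 = unique(strsplit(strtrim(regexprep(lower(s2), '[^a-z0-9 ]', ''))));

%# Jaccard overlap: number of shared words divided by number of distinct words.
shared  = intersect(w1, w2);
jaccard = numel(shared) / numel(union(w1, w2));
```

Only `law` and `obama` are shared here, so the overlap is 2 out of 9 distinct words. The comparison does not know that `signs` and `signed` are the same verb, and it says nothing about why the law counts as 'new'; that knowledge has to come from outside the strings.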

INPUT2: A Bus is stopped here. A vehicle stops here.

This requires knowing that a bus is a type of vehicle, as well as some kind of tense relationship. Also, what if the bus is not stopped right now but normally stops here, or has already stopped? The relationship can take several forms.
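The "a bus is a kind of vehicle" knowledge has to be supplied from somewhere. As a minimal sketch, assuming a tiny hand-written is-a table (the `parentOf` map, its entries, and the variables `x`, `y`, `cur` and `related` are all hypothetical; a real system would need a proper lexical resource), one word could be treated as a kind of another when the second is reachable by walking up the is-a links:

```matlab
%# Hypothetical toy is-a table: each key maps to its direct parent concept.
parentOf = containers.Map({'bus','car','vehicle'}, {'vehicle','vehicle','object'});

x = 'bus';        %# word from sentence 1 (assumed example input)
y = 'vehicle';    %# word from sentence 2 (assumed example input)

%# Walk up the is-a chain from x and check whether y is reached.
related = strcmp(x, y);
cur = x;
while ~related && isKey(parentOf, cur)
    cur = parentOf(cur);         %# step to the parent concept
    related = strcmp(cur, y);
end
```

The check is one-directional (is `x` a kind of `y`?), and it does nothing about the tense problem: "is stopped" versus "stops" still looks like a plain mismatch.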

INPUT3: Fire in NY. NY is burnt down.

This requires knowing that fires can burn things down.

INPUT4: Fire in NY. 50 died in NY fire.

This requires knowing that fires can kill things (see next). It also requires associating the "headline news" style (50 WHAT?) with people. The brain can do this almost trivially. Computer programs are not brains.

And I'm not an English major :-)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow