Question

Using Levenshtein distance, how can "how are you", "hw r u", "how are u", and "hw ar you" be treated as the same phrase?

Is there any way I can achieve this?

Suppose I have phrases like these.

phrase

hi, my name is john doe. I live in new york. What is your name?

phrase

My name is Bruce. wht's your name

key phrase

What is your name

response

my name is batman.

I'm getting the input from the user. I have a table with a list of possible requests and their responses. For example, the user will ask for 'its name'. Is there a way I can check whether a sentence contains a key phrase like "What is your name" and, if it is found, return the matching response?

For example:

$phrase = ' hi, my name is john doe. I live in new york. What is your name?';
 
//I know this one will work
if (strpos($phrase,"What is your name") !== false) {
    return $response;
}

//but what if the user mistypes it?
if (strpos($phrase,"Wht's your name") !== false) {
    return $response;
}

Is there a way to achieve this? Levenshtein works well only when the input is not much longer than the string it is compared with.

For example, this short input works:

hi,wht's your name

my name is batman.

But with a long input like this:

hi, my name is john doe. I live in new york. What is your name?

it does not work well. If the table contains shorter phrases, it picks a shorter phrase that happens to have a smaller distance and returns the wrong response.

I was thinking that another approach would be to check for key phrases. Any idea how to achieve this?

I was working on something like this, but I think there may be a better and more proper way:

$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';

$keyPhrase = 'What is your name';
  1. Get the first character of keyPhrase. That would be 'W'.
  2. Iterate through the characters of $samplePhrase (h, i, , i, m, , s, p, etc.) and compare each one to the first character of keyPhrase.
  3. If keyPhrase.char = samplePhrase.currentChar:
  4. get keyPhrase.length,
  5. get the index of samplePhrase.currentChar,
  6. take the substring of samplePhrase that starts at the currentChar index and has length keyPhrase.length.
  7. The first substring it gets would be "work at krabby pa".
  8. Compare "work at krabby pa" to $keyPhrase ('What is your name') using Levenshtein distance,
  9. and to check it better, use similar_text.
  10. If they are not equal and the distance is too big, repeat the process.

Solution

My suggestion would be to generate a list of n-grams from each input phrase and calculate the edit distance between each n-gram and the key phrase.

Example:

key phrase: "What is your name"
phrase 1: "hi, my name is john doe. I live in new york. What is your name?"
phrase 2: "My name is Bruce. wht's your name"

A matching n-gram would be between 3 and 4 words long, so we create all 3-grams and 4-grams for each phrase. We should also normalize the string by removing punctuation and lowercasing everything.

phrase 1 3-grams:
"hi my name", "my name is", "name is john", "is john doe", "john doe i", "doe i live"... "what is your", "is your name"
phrase 1 4-grams:
"hi my name is", "my name is john", "name is john doe", "is john doe i"... "what is your name"

phrase 2 3-grams:
"my name is", "name is bruce", "is bruce wht's", "bruce wht's your", "wht's your name"
phrase 2 4-grams:
"my name is bruce", "name is bruce wht's", "is bruce wht's your", "bruce wht's your name"

Next, compute the Levenshtein distance between each n-gram and the key phrase; this should solve the use case you presented above. If you need to normalize each word further, you can use phonetic encoders such as Double Metaphone or NYSIIS. However, I ran a test with all the "common" phonetic encoders and in your case they didn't show a significant improvement; phonetic encoders are more suitable for names.

I have limited experience with PHP but here is a code example:

<?php
function extract_ngrams($phrase, $min_words, $max_words) {
    echo "Calculating N-Grams for phrase: $phrase\n";
    $ngrams = array();
    $words  = str_word_count(strtolower($phrase), 1);
    $word_count = count($words);

    for ($i = 0; $i <= $word_count - $min_words; $i++) {
        for ($j = $min_words; $j <= $max_words && ($j + $i) <= $word_count; $j++) {
            $ngrams[] = implode(' ',array_slice($words, $i, $j));
        }
    }
    return array_unique($ngrams);
}

function contains_key_phrase($ngrams, $key) {
    foreach ($ngrams as $ngram) {
        if (levenshtein($key, $ngram) < 5) {
            echo "found match: $ngram\n";
            return true;
        }
    }
    return false;
}

$key_phrase = "what is your name";
$phrases = array(
        "hi, my name is john doe. I live in new york. What is your name?",
        "My name is Bruce. wht's your name"
        );
$min_words = 3;
$max_words = 4;

foreach ($phrases as $phrase) {
    $ngrams = extract_ngrams($phrase, $min_words, $max_words);
    if (contains_key_phrase($ngrams,$key_phrase)) {
        echo "Phrase [$phrase] contains the key phrase [$key_phrase]\n";
    }
}
?>

And the output is something like this:

Calculating N-Grams for phrase: hi, my name is john doe. I live in new york. What is your name?
found match: what is your name
Phrase [hi, my name is john doe. I live in new york. What is your name?] contains the key phrase [what is your name]
Calculating N-Grams for phrase: My name is Bruce. wht's your name
found match: wht's your name
Phrase [My name is Bruce. wht's your name] contains the key phrase [what is your name]
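
The question also mentions a table of several key phrases, where a short key phrase can win just because its absolute distance is small. One possible way around that, sketched below, is to divide the distance by the length of the key phrase and keep the best-scoring entry. The function names, the 0.3 cutoff and the 'how are you' table entry are assumptions made for illustration, not part of the code above.

```php
<?php
// Hypothetical helper: all word n-grams of $min_words..$max_words words.
function word_ngrams(string $phrase, int $min_words, int $max_words): array {
    $words = str_word_count(strtolower($phrase), 1);
    $count = count($words);
    $ngrams = array();
    for ($i = 0; $i < $count; $i++) {
        for ($n = $min_words; $n <= $max_words && $i + $n <= $count; $n++) {
            $ngrams[] = implode(' ', array_slice($words, $i, $n));
        }
    }
    return array_unique($ngrams);
}

// Hypothetical helper: returns the response of the key phrase with the lowest
// relative distance (distance / key length), or null if no n-gram scores
// below $max_ratio. The 0.3 default is an arbitrary assumption.
function best_response(string $phrase, array $table, float $max_ratio = 0.3): ?string {
    $best = null;
    $best_ratio = $max_ratio;
    foreach ($table as $key => $response) {
        $key = strtolower($key);
        $key_words = str_word_count($key);
        // Allow one word more or less than the key phrase ("wht's" vs "what is").
        $ngrams = word_ngrams($phrase, max(1, $key_words - 1), $key_words + 1);
        foreach ($ngrams as $ngram) {
            $ratio = levenshtein($key, $ngram) / strlen($key);
            if ($ratio < $best_ratio) {
                $best_ratio = $ratio;
                $best = $response;
            }
        }
    }
    return $best;
}

// Example table; the second entry is made up for this sketch.
$table = array(
    'what is your name' => 'my name is batman.',
    'how are you'       => 'I am fine.',
);

var_dump(best_response("My name is Bruce. wht's your name", $table));
var_dump(best_response("I like pizza", $table));
```

Because the score is relative, a distance of 3 against a 17-character key counts as a closer match than a distance of 3 against an 11-character key, which is what keeps the shorter key phrases from taking over.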

EDIT: I noticed some suggestions to add phonetic encoding to each word in the generated n-grams. I'm not sure phonetic encoding is the best answer to this problem, as the encoders are mostly tuned to stemming names (American, German or French, depending on the algorithm) and are not very good at stemming plain words.

I actually wrote a test in Java to validate this (as the encoders are more readily available there). Here is the output:

===========================
Created new phonetic matcher
    Engine: Caverphone2
    Key Phrase: what is your name
    Encoded Key Phrase: WT11111111 AS11111111 YA11111111 NM11111111
Found match: [What is your name?] Encoded: WT11111111 AS11111111 YA11111111 NM11111111
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
    Engine: DoubleMetaphone
    Key Phrase: what is your name
    Encoded Key Phrase: AT AS AR NM
Found match: [What is your] Encoded: AT AS AR
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: ATS AR NM
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
    Engine: Nysiis
    Key Phrase: what is your name
    Encoded Key Phrase: WAT I YAR NAN
Found match: [What is your name?] Encoded: WAT I YAR NAN
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: WT YAR NAN
Phrase: [My name is Bruce. wht's your name] MATCH: true
===========================
Created new phonetic matcher
    Engine: Soundex
    Key Phrase: what is your name
    Encoded Key Phrase: W300 I200 Y600 N500
Found match: [What is your name?] Encoded: W300 I200 Y600 N500
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
===========================
Created new phonetic matcher
    Engine: RefinedSoundex
    Key Phrase: what is your name
    Encoded Key Phrase: W06 I03 Y09 N8080
Found match: [What is your name?] Encoded: W06 I03 Y09 N8080
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: W063 Y09 N8080
Phrase: [My name is Bruce. wht's your name] MATCH: true

I used a Levenshtein distance of 4 when running these tests, but I am pretty sure you can find multiple edge cases where using a phonetic encoder will fail to match correctly. Looking at the example, you can see that because of the stemming done by the encoders you are actually more likely to get false positives when using them in this way. Keep in mind that these algorithms were originally intended to find people in a population census who have the same name, not to decide which English words 'sound' the same.

Other tips

What you are trying to achieve is quite a complex natural language processing task, and it usually requires parsing, among other things.

What I am going to suggest is to create a sentence tokenizer that will split the phrase into sentences. Then tokenize each sentence, splitting on whitespace and punctuation, and probably also rewriting some abbreviations to a more normal form.

Then you can create custom logic that traverses the token list of each sentence looking for a specific meaning. For example, ['...','what','...','...','your','name','...','...','?'] can also mean "what is your name". The sentence could be "So, what is your name really?" or "What could your name be?"

I am adding code as an example. I am not saying you should use something that simple. The code below uses NlpTools, a natural language processing library for PHP (I am involved in the library, so feel free to assume I am biased).

 <?php

 include('vendor/autoload.php');

 use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
 use \NlpTools\Classifiers\Classifier;
 use \NlpTools\Tokenizers\WhitespaceTokenizer;
 use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
 use \NlpTools\Documents\Document;

 class EndOfSentence implements Classifier
 {
     public function classify(array $classes, Document $d)
     {
         list($token, $before, $after) = $d->getDocumentData();

         $lastchar = substr($token, -1);
         $dotcnt = count(explode('.',$token))-1;

         if (count($after)==0)
             return 'EOW';

         // for some abbreviations
         if ($dotcnt>1)
             return 'O';

         if (in_array($lastchar, array(".","?","!")))
             return 'EOW';

         // any other token does not end a sentence
         return 'O';
     }
 }

 function normalize($s) {
     // get this somewhere static
     $hash_table = array(
         'whats'=>'what is',
         'whts'=>'what is',
         'what\'s'=>'what is',
         '\'s'=>'is',
         'n\'t'=>'not',
         'ur'=>'your'
         // .... more ....
     );

     $s = mb_strtolower($s,'utf-8');
     if (isset($hash_table[$s]))
         return $hash_table[$s];
     return $s;
 }

 $whitespace_tok = new WhitespaceTokenizer();
 $punct_tok = new WhitespaceAndPunctuationTokenizer();
 $sentence_tok = new ClassifierBasedTokenizer(
     new EndOfSentence(),
     $whitespace_tok
 );

 $text = 'hi, my name is john doe. I live in new york. What\'s your name? whts ur name';

 foreach ($sentence_tok->tokenize($text) as $sentence) {
     $words = $whitespace_tok->tokenize($sentence);
     $words = array_map(
         'normalize',
         $words
     );
     $words = call_user_func_array(
         'array_merge',
         array_map(
             array($punct_tok,'tokenize'),
             $words
         )
     );

     // decide what this sequence of tokens is
     print_r($words);
 }
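
The "decide what this sequence of tokens is" step is left open above. One very simple way to fill it in, sketched here under the assumption that a meaning can be described as a list of keywords that must appear in order, is to scan the token list and allow any number of other tokens in between. The helper name contains_in_order is made up for this example and is not part of NlpTools.

```php
<?php
// Hypothetical helper: true if every keyword occurs in $tokens in the given
// order, with any number of other tokens allowed in between.
function contains_in_order(array $tokens, array $keywords): bool {
    $next = 0;                    // index of the keyword we are waiting for
    $needed = count($keywords);
    foreach ($tokens as $token) {
        if ($next < $needed && $token === $keywords[$next]) {
            $next++;
        }
    }
    return $next === $needed;
}

$pattern = array('what', 'your', 'name');

// Tokens as they might come out of the loop above.
$sentence = array('so', ',', 'what', 'is', 'your', 'name', 'really', '?');
var_dump(contains_in_order($sentence, $pattern));
```

A pattern this loose also accepts sentences such as "what is the name of your dog" only if the keywords happen to appear in the same order, and it says nothing about meaning, so it is a starting point rather than a complete rule.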

You could use the soundex function to convert the input string into a phonetically equivalent form, and then run your search on that.
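
A minimal sketch of that idea, assuming PHP's built-in soundex() is applied word by word and the encoded key phrase is compared against each window of the same number of words. The helper names and the default distance of 0 are assumptions for illustration.

```php
<?php
// Hypothetical helper: encode every word of a phrase with soundex().
function soundex_phrase(string $phrase): string {
    $words = str_word_count(strtolower($phrase), 1);
    return implode(' ', array_map('soundex', $words));
}

// Hypothetical helper: slide a window with as many words as the key phrase
// over the input and compare the Soundex encodings with levenshtein().
function phonetic_contains(string $phrase, string $key, int $max_distance = 0): bool {
    $key_code = soundex_phrase($key);
    $words = str_word_count(strtolower($phrase), 1);
    $n = str_word_count($key);
    for ($i = 0; $i + $n <= count($words); $i++) {
        $window = implode(' ', array_slice($words, $i, $n));
        if (levenshtein($key_code, soundex_phrase($window)) <= $max_distance) {
            return true;
        }
    }
    return false;
}

echo soundex_phrase('What is your name?'), "\n";   // W300 I200 Y600 N500
var_dump(phonetic_contains('hi, whatt iz yur naem', 'what is your name'));
```

This catches misspellings that keep the consonant structure of each word, but a contraction such as "wht's" changes the word count and the code of the word, so with a distance of 0 it is not matched.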

First of all, expand all the short forms, for example wht's instead of what's:

$txt=$_POST['txt'];
$txt=str_ireplace("hw r u","how are You",$txt);
$txt=str_ireplace(" hw "," how ",$txt);//remember that a space before and after the short form is required, otherwise it will replace every occurrence of hw (even inside a word, if hw exists there).
$txt=str_ireplace(" r "," are ",$txt);
$txt=str_ireplace(" u "," you ",$txt);
$txt=str_ireplace(" wht's "," What is ",$txt);

Similarly, add as many phrases as you want. Now just check this text for all possible questions and get their positions:

if (strpos($txt,"What is your name") !== false) {//the strict "!== false" is needed: strpos returns 0 for a match at the very start
    return $response;
}
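
The space-padded replacements above miss a short form at the very start or end of the input (for example "hw r u" on its own only works because it is listed as a whole phrase). A possible alternative, sketched here with a made-up helper name and a sample map, is to use preg_replace with word boundaries:

```php
<?php
// Hypothetical helper: replace each short form only where it stands as a
// whole word, using \b word boundaries instead of surrounding spaces.
function expand_abbreviations(string $text, array $map): string {
    foreach ($map as $short => $long) {
        $pattern = '/\b' . preg_quote($short, '/') . '\b/i';
        $text = preg_replace($pattern, $long, $text);
    }
    return $text;
}

// Sample map for illustration; extend it with your own short forms.
$map = array(
    "wht's" => 'what is',
    'hw'    => 'how',
    'ur'    => 'your',
    'r'     => 'are',
    'u'     => 'you',
);

echo expand_abbreviations('hw r u', $map), "\n";
echo expand_abbreviations("Wht's ur name", $map), "\n";
```

Since the replacements are lowercase, a case-insensitive search such as stripos is the safer choice for the key phrase lookup afterwards.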
Licensed under: CC-BY-SA with attribution