Schinke Latin stemming algorithm in PHP

https://stackoverflow.com/questions/3928922

30-09-2019
|

Pergunta

This website offers the "Schinke Latin stemming algorithm" for download to use it in the Snowball stemming system.

I want to use this algorithm, but I don't want to use Snowball.

The good thing: There's some pseudocode on that page which you could translate to a PHP function. This is what I've tried:

<?php
function stemLatin($word) {
    // output = array(NOUN-BASED STEM, VERB-BASED STEM)
    // DEFINE CLASSES BEGIN
    $queWords = array('atque', 'quoque', 'neque', 'itaque', 'absque', 'apsque', 'abusque', 'adaeque', 'adusque', 'denique', 'deque', 'susque', 'oblique', 'peraeque', 'plenisque', 'quandoque', 'quisque', 'quaeque', 'cuiusque', 'cuique', 'quemque', 'quamque', 'quaque', 'quique', 'quorumque', 'quarumque', 'quibusque', 'quosque', 'quasque', 'quotusquisque', 'quousque', 'ubique', 'undique', 'usque', 'uterque', 'utique', 'utroque', 'utribique', 'torque', 'coque', 'concoque', 'contorque', 'detorque', 'decoque', 'excoque', 'extorque', 'obtorque', 'optorque', 'retorque', 'recoque', 'attorque', 'incoque', 'intorque', 'praetorque');
    $suffixesA = array('ibus, 'ius, 'ae, 'am, 'as, 'em', 'es', ia', 'is', 'nt', 'os', 'ud', 'um', 'us', 'a', 'e', 'i', 'o', 'u');
    $suffixesB = array('iuntur', 'beris', 'erunt', 'untur', 'iunt', 'mini', 'ntur', 'stis', 'bor', 'ero', 'mur', 'mus', 'ris', 'sti', 'tis', 'tur', 'unt', 'bo', 'ns', 'nt', 'ri', 'm', 'r', 's', 't');
    // DEFINE CLASSES END
    $word = strtolower(trim($word)); // make string lowercase + remove white spaces before and behind
    $word = str_replace('j', 'i', $word); // replace all <j> by <i>
    $word = str_replace('v', 'u', $word); // replace all <v> by <u>
    if (substr($word, -3) == 'que') { // if word ends with -que
        if (in_array($word, $queWords)) { // if word is a queWord
            return array($word, $word); // output queWord as both noun-based and verb-based stem
        }
        else {
            $word = substr($word, 0, -3); // remove the -que
        }
    }
    foreach ($suffixesA as $suffixA) { // remove suffixes for noun-based forms (list A)
        if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
            $word = substr($word, 0, -strlen($suffixA)); // remove the suffix
            break; // remove only one suffix
        }
    }
    if (strlen($word) >= 2) { $nounBased = $word; } else { $nounBased = ''; } // add only if word contains two or more characters
    foreach ($suffixesB as $suffixB) { // remove suffixes for verb-based forms (list B)
        if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
            switch ($suffixB) {
                case 'iuntur', 'erunt', 'untur', 'iunt', 'unt': $word = substr($word, 0, -strlen($suffixB)).'i'; break; // replace suffix by <i>
                case 'beris', 'bor', 'bo': $word = substr($word, 0, -strlen($suffixB)).'bi'; break; // replace suffix by <bi>
                case 'ero': $word = substr($word, 0, -strlen($suffixB)).'eri'; break; // replace suffix by <eri>
                default: $word = substr($word, 0, -strlen($suffixB)); break; // remove the suffix
            }
            break; // remove only one suffix
        }
    }
    if (strlen($word) >= 2) { $verbBased = $word; } else { $verbBased = ''; } // add only if word contains two or more characters
    return array($nounBased, $verbBased);
}
?>

My questions:

1) Will this code work correctly? Does it follow the algorithm's rules?

2) How could you improve the code (performance)?

Thank you very much in advance!

Solução

No, your function will not work, it contains syntax errors. For example you have unclosed quotes and you use a wrong switch syntax.

Here is my rewrite of the function. As the pseudoalgorithm on that page isn't really precise I had to do some interpreting. I interpreted it in a way that the examples mentioned in this article work.

I also did some optimizations. The first one is that I define the word and suffix arrays static. Thus all calls to this function share the same arrays which should be good fore performance ;)

Furthermore I adjusted the arrays so they can be used more effective. I changed the $queWords array so it can be used for a fast hash-table lookup, not a slow in_array. Furthermore I have saved the lengths for the suffixes in the array. Thus you don't need to compute them at runtime (which is really, really slow). I may have made more minor optimizations.

I don't know how much faster this code is, but it should be much faster. Furthermore it now works on the examples provided.

Here is the code:

<?php
    function stemLatin($word) {
        static $queWords = array(
            'atque'         => 1,
            'quoque'        => 1,
            'neque'         => 1,
            'itaque'        => 1,
            'absque'        => 1,
            'apsque'        => 1,
            'abusque'       => 1,
            'adaeque'       => 1,
            'adusque'       => 1,
            'denique'       => 1,
            'deque'         => 1,
            'susque'        => 1,
            'oblique'       => 1,
            'peraeque'      => 1,
            'plenisque'     => 1,
            'quandoque'     => 1,
            'quisque'       => 1,
            'quaeque'       => 1,
            'cuiusque'      => 1,
            'cuique'        => 1,
            'quemque'       => 1,
            'quamque'       => 1,
            'quaque'        => 1,
            'quique'        => 1,
            'quorumque'     => 1,
            'quarumque'     => 1,
            'quibusque'     => 1,
            'quosque'       => 1,
            'quasque'       => 1,
            'quotusquisque' => 1,
            'quousque'      => 1,
            'ubique'        => 1,
            'undique'       => 1,
            'usque'         => 1,
            'uterque'       => 1,
            'utique'        => 1,
            'utroque'       => 1,
            'utribique'     => 1,
            'torque'        => 1,
            'coque'         => 1,
            'concoque'      => 1,
            'contorque'     => 1,
            'detorque'      => 1,
            'decoque'       => 1,
            'excoque'       => 1,
            'extorque'      => 1,
            'obtorque'      => 1,
            'optorque'      => 1,
            'retorque'      => 1,
            'recoque'       => 1,
            'attorque'      => 1,
            'incoque'       => 1,
            'intorque'      => 1,
            'praetorque'    => 1,
        );
        static $suffixesNoun = array(
            'ibus' => 4,
            'ius'  => 3,
            'ae'   => 2,
            'am'   => 2,
            'as'   => 2,
            'em'   => 2,
            'es'   => 2,
            'ia'   => 2,
            'is'   => 2,
            'nt'   => 2,
            'os'   => 2,
            'ud'   => 2,
            'um'   => 2,
            'us'   => 2,
            'a'    => 1,
            'e'    => 1,
            'i'    => 1,
            'o'    => 1,
            'u'    => 1,
        );
        static $suffixesVerb = array(
            'iuntur' => 6,
            'beris'  => 5,
            'erunt'  => 5,
            'untur'  => 5,
            'iunt'   => 4,
            'mini'   => 4,
            'ntur'   => 4,
            'stis'   => 4,
            'bor'    => 3,
            'ero'    => 3,
            'mur'    => 3,
            'mus'    => 3,
            'ris'    => 3,
            'sti'    => 3,
            'tis'    => 3,
            'tur'    => 3,
            'unt'    => 3,
            'bo'     => 2,
            'ns'     => 2,
            'nt'     => 2,
            'ri'     => 2,
            'm'      => 1,
            'r'      => 1,
            's'      => 1,
            't'      => 1,
        );

        $stems = array($word, $word);

        $word = strtr(strtolower(trim($word)), 'jv', 'iu'); // trim, lowercase and j => i, v => u

        if (substr($word, -3) == 'que') {
            if (isset($queWords[$word])) {
                return array($word, $word);
            }
            $word = substr($word, 0, -3);
        }

        foreach ($suffixesNoun as $suffix => $length) {
            if (substr($word, -$length) == $suffix) {
                $tmp = substr($word, 0, -$length);

                if (isset($tmp[1]))
                    $stems[0] = $tmp;
                break;
            }
        }

        foreach ($suffixesVerb as $suffix => $length) {
            if (substr($word, -$length) == $suffix) {
                switch ($suffix) {
                    case 'iuntur':
                    case 'erunt':
                    case 'untur':
                    case 'iunt':
                    case 'unt':
                        $tmp = substr_replace($word, 'i', -$length, $length);
                    break;
                    case 'beris':
                    case 'bor':
                    case 'bo':
                        $tmp = substr_replace($word, 'bi', -$length, $length);
                    break;
                    case 'ero':
                        $tmp = substr_replace($word, 'eri', -$length, $length);
                    break;
                    default:
                        $tmp = substr($word, 0, -$length);
                }

                if (isset($tmp[1]))
                    $stems[1] = $tmp;
                break;
            }
        }

        return $stems;
    }

    var_dump(stemLatin('aquila'));
    var_dump(stemLatin('portat'));
    var_dump(stemLatin('portis'));

Outras dicas

As far as I can tell, this follows the algorithm described in your link, and should work correctly. (Apart from the syntax error you have in the definition of $suffixesA - you're missing a couple of apostrophes.)

Performance-wise, it doesn't look like there's much to gain here, but there are a few things that come to mind.

If this is going to get called many times during a single execution of the script, there might be something gained by defining these arrays outside of the function - I don't think PHP is smart enough to cache those arrays between calls to the function.

You can also combine those two str_replaces into one: $word = str_replace(array('j','v'), array('i','u'), $word);, or, since you're replacing single characters with single characters, you can use $word = strtr($word,'jv','iu'); - but I don't think that will make much difference in practice. You'll have to try it out to be certain.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow