How can I reduce the quantity of similar phrases contained in an array using PHP?

StackOverflow https://stackoverflow.com/questions/7061512

  •  17-12-2020

Question

I have an array containing phrases (a few to hundreds).

Example:

adhesive materials
adhesive material
material adhesive
adhesive applicator
adhesive applicators
adhesive applications
adhesive application
adhesives applications
adhesive application systems
adhesive application system

Using PHP, I'd like to reduce the list above to the list below with something like word stemming. Some variation is acceptable: for example, adhesive applicator and adhesive application may be difficult to tell apart because they share a stem.

adhesive material
material adhesive
adhesive applicator
adhesive application
adhesive application system

What is the best way to do this?


Solution

Decide on a threshold (a maximum edit distance), then use PHP's levenshtein function to measure how far apart two words are. Words whose distance falls below the threshold are treated as variants of the same word.
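To make the threshold idea concrete, here is a minimal sketch of a word-level check. The helper name wordsAreSimilar and the default threshold of 3 are assumptions chosen for illustration; only levenshtein itself is a built-in PHP function.

```php
<?php
// Hypothetical helper: two words count as "similar" when their
// Levenshtein (edit) distance is below the chosen threshold.
function wordsAreSimilar( string $a, string $b, int $threshold = 3 ): bool
{
    return levenshtein( $a, $b ) < $threshold;
}

// A few word pairs taken from the example list in the question.
$pairs = array(
    array( 'material', 'materials' ),
    array( 'applicator', 'application' ),
    array( 'applicator', 'applications' ),
);

foreach( $pairs as $pair )
{
    $distance = levenshtein( $pair[ 0 ], $pair[ 1 ] );
    $verdict  = wordsAreSimilar( $pair[ 0 ], $pair[ 1 ] ) ? 'similar' : 'different';
    echo $pair[ 0 ], ' / ', $pair[ 1 ], ': ', $distance, ' (', $verdict, ")\n";
}
```

With a threshold of 3, applicator and application (distance 2) are merged, which is the stem ambiguity the question already anticipates. Lowering the threshold would keep them apart, at the cost of missing some longer plural or suffix variants.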

It looks like you'd more or less be doing this:

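$origs = array();
// Assuming $setList already holds your example phrases.
foreach( $setList as $set )
{
    $pieces = explode( ' ', $set );
    $add = true;
    foreach( $origs as $keySet )
    {
        // Only phrases with the same number of words can be near-duplicates.
        if( count( $pieces ) !== count( $keySet ) ) continue;

        // Treat the phrase as a duplicate only if every word is within
        // the threshold of the word in the same position.
        $near = true;
        foreach( $pieces as $i => $word )
        {
            if( levenshtein( $word, $keySet[ $i ] ) >= 3 )
            {
                $near = false;
                break;
            }
        }
        if( $near )
        {
            $add = false;
            break;
        }
    }

    // $origs stores each kept phrase as an array of its words.
    if( $add ) $origs[] = $pieces;
}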

You'll be left with a list similar to your desired output. Because the first variant encountered is the one that gets kept, you will need some modifications if you would rather keep the shortest variant of each phrase, but you get the idea.
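One possible way to prefer the shortest variant is sketched below. It is not part of the original answer: the function names phraseDistance and reducePhrases are hypothetical, the threshold of 3 is carried over as an assumption, and the nullable return type assumes PHP 7.1 or later. Each incoming phrase is matched against the closest phrase kept so far, and replaces it when it is shorter.

```php
<?php
// Hypothetical helper: summed per-word edit distance between two phrases,
// or null when they differ in word count or any word pair is too far apart.
function phraseDistance( array $a, array $b, int $threshold ): ?int
{
    if( count( $a ) !== count( $b ) ) return null;

    $total = 0;
    foreach( $a as $i => $word )
    {
        $distance = levenshtein( $word, $b[ $i ] );
        if( $distance >= $threshold ) return null;
        $total += $distance;
    }
    return $total;
}

// Hypothetical helper: collapse near-duplicates, keeping the shortest variant.
function reducePhrases( array $phrases, int $threshold = 3 ): array
{
    $kept = array();
    foreach( $phrases as $phrase )
    {
        $pieces = explode( ' ', $phrase );

        // Find the closest phrase kept so far, if any is close enough.
        $bestKey = null;
        $bestDistance = null;
        foreach( $kept as $key => $candidate )
        {
            $distance = phraseDistance( $pieces, explode( ' ', $candidate ), $threshold );
            if( $distance !== null && ( $bestDistance === null || $distance < $bestDistance ) )
            {
                $bestKey = $key;
                $bestDistance = $distance;
            }
        }

        if( $bestKey === null )
        {
            $kept[] = $phrase;             // nothing similar yet: keep it
        }
        elseif( strlen( $phrase ) < strlen( $kept[ $bestKey ] ) )
        {
            $kept[ $bestKey ] = $phrase;   // shorter variant replaces the longer one
        }
    }
    return $kept;
}
```

Calling reducePhrases( $setList ) on the example list keeps each group's position in the original order while swapping in shorter variants as they appear. Note that "similar" is not transitive, so results can depend on input order for larger or noisier lists.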

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow