Question

Suppose I have a list of two-word pairs in a column in Excel. The words are delimited by a space, so a typical pair might look like "extreme happiness". The goal is to search for these 'bigrams' in a larger string located in another column. The issue is that a bigram is only found if the two words appear together, separated by a single space. I would prefer Excel to look for both words anywhere in the larger string. It is crucial that each bigram occupies one cell, since a score is assigned to each bigram and the function uses VLOOKUP to fetch that score based on the bigram cell value. Would it make sense to change the space between the two words to a - or some other character? Is there a way to have Excel look up each word one at a time (perhaps by recognizing this character and passing through the larger string twice, that is, once for each word)?

Example: "The weather last night was extremely cold, but the warm fire gave me some happiness."

Here we would like to find both the word 'extreme' within the word extremely and the word happiness. Currently Excel would not be successful in doing this since it would just look for "extreme happiness" and determine that no such string exists.

If the bigram in the row below "extreme happiness" reads "weather gave" (for some reason) Excel will go check whether that bigram exists in the larger string and return a second score. This is done so that at the end every score can be added together.


Solution

This is pretty easy with a couple of formulas.

The logic is simple. Assuming your bigram is in B2 and the larger string is in A2, we can input the following in C2. This will replace the space with *, which is Excel's wildcard for any sequence of characters.

=SUBSTITUTE(B2," ","*")

Then we concatenate it to give us a wildcarded beginning and end.

=CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*")

We then use a simple COUNTIF against the statement (here in A2) to count the matches. Since the range is a single cell, the result is 1 if the pattern matches and 0 if it does not.

=COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))
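As a quick check, here is how the formula evaluates if A2 holds the example sentence from the question and B2 holds "extreme happiness" (the cell contents are assumed for illustration). The SUBSTITUTE step yields extreme*happiness, the CONCATENATE step yields *extreme*happiness*, so the COUNTIF is equivalent to:

=COUNTIF(A2,"*extreme*happiness*")

This returns 1, because "extreme" (inside "extremely") is followed later in the sentence by "happiness". The bigram "weather gave" matches the same sentence in the same way.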

A simple IF check enclosing the above, with condition >0, can be used to give us either Yes or No.

=IF(COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))>0,"Yes","No")
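Note that the wildcard pattern above only matches when the two words appear in the same order as in the bigram. If the order should not matter, one possible variant is to test each word separately. This is a sketch that assumes B2 contains exactly one space (FIND returns an error otherwise); LEFT and MID split the bigram at that space:

=IF(AND(COUNTIF(A2,"*"&LEFT(B2,FIND(" ",B2)-1)&"*")>0,COUNTIF(A2,"*"&MID(B2,FIND(" ",B2)+1,LEN(B2))&"*")>0),"Yes","No")

With the example sentence in A2, a bigram of "happiness extreme" gives Yes here, whereas the single-pattern formula gives No.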

Let us know if this helps.

Licensed under: CC-BY-SA with attribution