Question

Is there any way to code a binary variable dependent on keywords being present in a given string variable? Simple example:

I have a string variable that describes various meals and a dummy variable that denotes if a given meal is breakfast or not. Is there any way to code

breakfast = 1 if meal== [then something saying contains eggs, bacon, etc.]

This is a silly example, but I am more interested in identifying a shortcut to coding binary variables, based on information found in string data.

OTHER TIPS

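The built-in strpos() returns the position at which one string first occurs inside another, and 0 if it does not occur at all, so the result is positive exactly when the substring is found. Building on that: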

 gen breakfast = strpos(meal, "bacon") | strpos(meal, "eggs") 

and so forth. In practice, converting the string to lower case first will often help, and may be essential. If you have a long list of keywords, you may prefer a loop:

 gen breakfast = 0 
 quietly foreach thing in bacon eggs cereal "orange juice" { 
       replace breakfast = breakfast | strpos(lower(meal), `"`thing'"') 
 } 

The principle here is using | (or) as a logical operator, yielding 1 (true) if any argument is non-zero. Note that lower() is included to compare with a lower case version of the original.

This technique is naturally not robust to spelling mistakes or small variations in wording.
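
To see both the loop and its limits in action, here is a minimal sketch on toy data. The meal descriptions are invented for illustration and are not from the question. Because the match is on exact substrings, "egg salad sandwich" should not match the keyword "eggs", and the misspelt "bacn roll" should not match "bacon".

```stata
* Toy data: hypothetical meal descriptions, invented for illustration
clear
input str40 meal
"Bacon and EGGS"
"egg salad sandwich"
"Cereal with orange juice"
"bacn roll"
"Steak and chips"
end

* Start at 0, then switch to 1 whenever any keyword is found
gen byte breakfast = 0
quietly foreach thing in bacon eggs cereal "orange juice" {
    * strpos() > 0 counts as true; lower() makes the match case-insensitive
    replace breakfast = breakfast | strpos(lower(meal), `"`thing'"')
}

list meal breakfast, noobs
```

Only the first and third observations contain one of the keywords exactly, so only those should be flagged. Catching "egg" as well as "eggs" means listing the shorter keyword, at the cost of more false positives.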

You can use the incss function of the egenmore package for this.

ssc install egenmore
egen bacon = incss(meal), sub(bacon) insensitive

This gives you a dummy equal to one if, for a given observation, the string variable "meal" contains the substring bacon, and zero otherwise. The option insensitive tells Stata to ignore case (otherwise Bacon is treated as different from bacon). As far as I know you can only search for one substring at a time, but you can easily write a loop for this:

foreach word in bacon eggs cheese {
    egen `word' = incss(meal), sub(`word') insensitive
}
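
The loop above creates one dummy per keyword rather than a single breakfast indicator. A minimal sketch of combining them follows. It assumes egenmore is installed and that incss() behaves as described in this answer. The toy data are invented for illustration.

```stata
* Toy data: hypothetical meal descriptions, invented for illustration
clear
input str40 meal
"Bacon sandwich"
"Cheese omelette"
"Steak and chips"
end

* incss() comes from egenmore (SSC); install once if needed:
* ssc install egenmore

* One dummy per keyword, named after the keyword itself
foreach word in bacon eggs cheese {
    egen `word' = incss(meal), sub(`word') insensitive
}

* Collapse the per-keyword dummies into a single indicator
gen byte breakfast = bacon | eggs | cheese
```

Note that egen will stop with an error if a variable named bacon, eggs, or cheese already exists, so pick different names for the new variables in that case.
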
Licensed under: CC-BY-SA with attribution