Matching variable names with their corresponding values over different databases

StackOverflow https://stackoverflow.com/questions/22774562

  •  25-06-2023

Question

I am trying to import the variable names from an external dataset using a macro, match these names with their corresponding values in the main file and then export results of a looping principal component analysis using esttab.

My code looks like this.

preserve

forvalue file = 537(3)647 {

    import excel "C:\Users\M\Dropbox\Masterarbeit\Stata12\test/`file'.xls", sheet("Sheet1") firstrow clear

    local x ""
    foreach var of varlist *SA {
        local x `x' `var'
    }

    clear
    restore

    forvalue z = 537(3)647 {
        pca `x' if rMonth < `z'+3, comp(1)
        esttab e(L) using pc`z'.csv, replace
    }
}

The command is supposed to cycle through the files defined in the first loop, pick up the variable names in each file, match them with the variables and their values in the main file (the variable names are the same), and then perform the pca. Afterwards, it is supposed to build a new list of variable names from the next Excel file and use those variables in the pca. As it stands, the code only works if the values are also present in the external datasets.

The problem is that I can't find a way to match the variable names in the external files with the ones in the main file. I only get the error "no variables defined", as the external files contain only the names of the variables, not the values.

Any advice on how I can tell Stata to look up the variable names from the external files and use their values from the main file for the pca?

Edit: before the preserve, my code generates the variables, regresses them on a dependent variable, ranks them according to the t-value and exports them to the files I use to grab the varlist. The code looks like this:

set excelxlsxlargefile on

cd C:\Users\M\Dropbox\Masterarbeit\Stata12\sentiment_6m

import excel "C:\Users\M\Dropbox\Masterarbeit\Daten\Dataimport\sentiments\Google Query CDX.xlsx", sheet("Tabelle1") firstrow

set more off

gen Month = month( Date)

gen     January     =   1   if  Month   ==  1
gen     February    =   1   if  Month   ==  2
gen     March   =   1   if  Month   ==  3
gen     April   =   1   if  Month   ==  4
gen     May =   1   if  Month   ==  5
gen     June    =   1   if  Month   ==  6
gen     July    =   1   if  Month   ==  7
gen     August  =   1   if  Month   ==  8
gen     September   =   1   if  Month   ==  9
gen     October =   1   if  Month   ==  10
gen     November    =   1   if  Month   ==  11
gen     December    =   1   if  Month   ==  12
replace     January     =   0   if  January     ==  .
replace     February    =   0   if  February    ==  .
replace     March   =   0   if  March   ==  .
replace     April   =   0   if  April   ==  .
replace     May =   0   if  May ==  .
replace     June    =   0   if  June    ==  .
replace     July    =   0   if  July    ==  .
replace     August  =   0   if  August  ==  .
replace     September   =   0   if  September   ==  .
replace     October =   0   if  October ==  .
replace     November    =   0   if  November    ==  .
replace     December    =   0   if  December    ==  .


foreach var of varlist *_qry{  
sum `var', meanonly
local mu =r(mean)
reg `var' January  February March April May June July August September October November December, nocons
predict double `var'SA, residual
replace `var'SA=`var'SA+`mu'
egen sd = sd(`var'SA)
replace `var'SA=`var'SA/sd
drop sd
drop `var'
}



* BIG LOOP *

generate double rMonth = mofd( Date)
global tflist ""

forvalue y = 537(3)647{


foreach var of varlist *SA{
reg MidCDX `var' if rMonth<=`y'
tempfile tfcur
parmest, idstr("`var'") saving(`"`tfcur'"', replace) flis(tflist) 
}


* Concatenate files into memory (REPLACING THE OLD DATA) *
preserve
clear
append using $tflist
sencode idstr, gene(xvar)
lab var xvar "X-variable"
keybygen xvar, gene(parmseq)
drop if parm=="_cons"
egen rank = rank (-t)
gsort -t
drop if rank>40
save `y', replace
export excel xvar t using `y', firstrow(variables) replace
foreach TF in $tflist {
erase `"`TF'"'
}
global tflist ""
restore

}

Solution

Maybe this example helps:

clear all
set more off

/*
load two example MS Excel files with var names only and accumulate var names in a local.
files are named varfile.xls and varfile2.xls
*/

foreach i in "" "2" {

    import excel "/home/roberto/Desktop/stata_tests/varfile`i'.xls", firstrow clear

    * get var names
    quietly ds

    * save var names in local
    local myvars `myvars' `r(varlist)'
}

* load database that contains vars and values
sysuse auto, clear

* do pca
pca `myvars'

/*
varfile.xls contains variables "weight" and "price"
varfile2.xls contains variables "mpg" and "length"
*/

ds does the trick here: it picks up the names of the variables read from the MS Excel sheet and leaves them in r(varlist). See help ds and help saved results (or help stored results). Afterwards, we load a "complete" database and use the stored variable names with pca.
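
Before running pca, it may be worth checking that every collected name actually exists in the complete database. The lines below are only a sketch: the list in myvars is typed in by hand for illustration (it mirrors the contents of the two example files), and the auto dataset stands in for the main file.

* illustration only: names typed in by hand instead of read from Excel
sysuse auto, clear
local myvars weight price mpg length

foreach v of local myvars {
    * confirm variable returns a nonzero code when the name is missing
    capture confirm variable `v'
    if _rc {
        display as error "`v' not found in the main dataset"
    }
}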

The MS Excel files look like this (screenshot not reproduced): each sheet holds only a header row with the variable names listed in the comment above.

This, I think, answers the specific question you pose.

Edit

Looking closer at your code, I'm not sure the problem is related to matching variable names in the complete database, but rather some problem with the way you set up preserve and restore. Instead of using that set of commands, try simply loading the complete database when you need it (with use).
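
A minimal sketch of that suggestion, under assumptions the question does not confirm: the main dataset is taken to have been saved beforehand under the hypothetical name main.dta in the working directory, the name files are assumed to carry the *SA names in their first row, and the pca condition is copied from the question with the file number standing in for z.

* sketch only; main.dta is a hypothetical file name
forvalues file = 537(3)647 {
    * read the sheet that holds the variable names
    import excel "C:\Users\M\Dropbox\Masterarbeit\Stata12\test/`file'.xls", sheet("Sheet1") firstrow clear
    quietly ds *SA
    local x `r(varlist)'

    * reload the full data instead of relying on preserve and restore
    use main, clear
    pca `x' if rMonth < `file' + 3, comp(1)
}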

What do you have before the preserve? Where does your error appear? Please post more code. A reproducible example would help.

Edit 2

My conjecture now is that you have nothing before the preserve, so when you restore, you're just setting the slate clean; you are restoring a blank database. Therefore, trying pca <somevar> gives you:

no variables defined
r(111);

preserve preserves the data as it is just before the command is issued.
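
A related detail, offered as a guess about the loop in the question rather than a diagnosis: a plain restore also discards the preserved copy, so a restore inside a loop has nothing left to restore on the second pass. The restore, preserve form brings the data back and keeps the copy. The sketch below uses the auto dataset shipped with Stata purely for illustration.

sysuse auto, clear
preserve

forvalues i = 1/2 {
    * wipe the data in memory, as the import step would
    clear
    * bring the preserved data back and keep the copy for the next pass
    restore, preserve
    summarize price, meanonly
    display r(mean)
}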

OTHER TIPS

Personal comment: There is too much code here for me to want to try and absorb what you are trying to do. I comment only on some details of technique.

This block of code

gen January = 1 if Month == 1
gen February = 1 if Month == 2
gen March = 1 if Month == 3
gen April = 1 if Month == 4
gen May = 1 if Month == 5
gen June = 1 if Month == 6
gen July = 1 if Month == 7
gen August = 1 if Month == 8
gen September = 1 if Month == 9
gen October = 1 if Month == 10
gen November = 1 if Month == 11
gen December = 1 if Month == 12
replace January = 0 if January == .
replace February = 0 if February == .
replace March = 0 if March == .
replace April = 0 if April == .
replace May = 0 if May == .
replace June = 0 if June == .
replace July = 0 if July == .
replace August = 0 if August == .
replace September = 0 if September == .
replace October = 0 if October == .
replace November = 0 if November == .
replace December = 0 if December == .

can be rewritten like this

tokenize "`c(Months)'"
forval j = 1/12 { 
    gen ``j'' = Month == `j' 
}

The month names January to December are wired into c(Months).
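
To see why ``j'' works: tokenize splits the string into the local macros 1, 2, ..., 12, so `j' is first replaced by a number and the outer quotes then pick up the matching month name. A quick check, which should display January December:

tokenize "`c(Months)'"
local j 12
* `1' is the first token; ``j'' resolves to `12', the last token
display "`1' ``j''"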

sum `var', meanonly
local mu =r(mean)
reg `var' January  February March April May June July August September October November December, nocons
predict double `var'SA, residual
replace `var'SA=`var'SA+`mu'
egen sd = sd(`var'SA)
replace `var'SA=`var'SA/sd
drop sd

can be shortened to

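reg `var' January-December, nocons
predict double `var'SA, residual
sum `var', meanonly
replace `var'SA = `var'SA + r(mean)
* the SD must be that of the shifted residuals, as in the original egen step
sum `var'SA
replace `var'SA = `var'SA / r(sd)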

Note that it is not a good idea to create an entire variable holding just the SD. That cancels out any time savings from using summarize, meanonly.

I don't comment here on what you are trying to do statistically, adding the mean and then dividing by the SD.

@Roberto Ferrer is addressing your main problem, which hinges on comparing variable names across files. I add a detail on the use of local macros and wildcard syntax.

local x ""
foreach var of varlist *SA {
    local x `x' `var'
}

is a long way to get

unab x : *SA 
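
A small illustration using the auto dataset shipped with Stata (not the data in the question). unab stops with an error when nothing matches the pattern, so capture is used here as one possible way to handle that case.

sysuse auto, clear

* expand the wildcard into a list of variable names
capture unab x : m*
if _rc == 0 {
    display "`x'"
}
else {
    display as error "no variables match the pattern"
}
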
Licensed under: CC-BY-SA with attribution