Question

I have a large data set. I have to subset the data set (Big_data) by using values stored in other dta file (Criteria_data). I will show you the problem first:

   **Big_data**                           **Criteria_data**
====================      ================================================
  lon        lat             4_digit_id   minlon  maxlon  minlat  maxlat
-76.22      44.27              0765       -78.44  -77.22  34.324  35.011
-67.55      33.19              6161       -66.11  -65.93  40.32   41.88
    .......                                   ........
 (over 1 million obs)                    (271 observations)        
====================      ================================================

I have to subset the bid data as follows:

use Big_data

preserve
keep if (-78.44<lon<-77.22) & (34.324<lat<35.011)
save data_0765, replace
restore

preserve
keep if (-66.11<lon<-65.93) & (40.32<lat<41.88)
save data_6161, replace
restore

....

(1) What should be the efficient programming for the subsetting in Stata? (2) Are the inequality expressions correctly written?

Was it helpful?

Solution

1) Subsetting data

With 400,000 observations in the main file and 300 in the reference file, it takes about 1.5 minutes. I can't test this with double the observations in the main file because the lack of RAM takes my computer to a crawl.

The strategy involves creating as many variables as needed to hold the reference latitudes and longitudes (271*4 = 1084 in the OP's case; Stata IC and up can handle this. See help limits). This requires some reshaping and appending. Then we check for those observations of the big data file that meet the conditions.

clear all
set more off

*----- create example databases -----

tempfile bigdata reference

input ///
lon        lat   
-76.22      44.27
-66.0      40.85 // meets conditions
-77.10     34.8 // meets conditions
-66.00    42.0 
end

expand 100000

save "`bigdata'"
*list

clear all

input ///
str4 id   minlon  maxlon  minlat  maxlat
"0765"       -78.44  -75.22  34.324  35.011
"6161"       -66.11  -65.93  40.32   41.88
end

drop id
expand 150
gen id = _n

save "`reference'"
*list


*----- reshape original reference file -----

use "`reference'", clear

tempfile reference2

destring id, replace
levelsof id, local(lev)

gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id) 

gen lat = .
gen lon = .

save "`reference2'"


*----- create working database -----

use "`bigdata'"

timer on 1
quietly {
    forvalues num = 1/300 {
        gen minlon`num' = .
        gen maxlon`num' = .
        gen minlat`num' = .
        gen maxlat`num' = .
    }
}
timer off 1

timer on 2
append using "`reference2'"
drop i
timer off 2

*----- flag observations for which conditions are met -----

timer on 3
gen byte flag = 0
foreach le of local lev {
    quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3

*keep if flag
*keep lon lat

*list

timer list

The inrange() function implies that the minimums and maximums must be adjusted beforehand to satisfy the OP's strict inequalities (the function tests <=, >=).

Probably some expansion using expand, use of correlatives and by (so data is in long form) could speed things up. It's not totally clear for me right now. I'm sure there are better ways in plain Stata mode. Mata may be even better.

(joinby was also tested but again RAM was a problem.)

Edit

Doing computations in chunks rather than for the complete database, significantly improves the RAM issue. Using a main file with 1.2 million observations and a reference file with 300 observations, the following code does all the work in about 1.5 minutes:

set more off

*----- create example big data -----

clear all

set obs 1200000
set seed 13056

gen lat = runiform()*100
gen lon = runiform()*100

local sizebd `=_N' // to be used in computations

tempfile bigdata
save "`bigdata'"

*----- create example reference data -----

clear all

set obs 300
set seed 97532

gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5

gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5

gen id = _n

tempfile reference
save "`reference'"


*----- reshape original reference file -----

use "`reference'", clear

destring id, replace
levelsof id, local(lev)

gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id) 
drop i

tempfile reference2
save "`reference2'"


*----- create file to save results -----

tempfile results
clear all
set obs 0

gen lon = .
gen lat = .

save "`results'"


*----- start computations -----

clear all

* local that controls # of observations in intermediate files
local step = 5000 // can't be larger than sizedb

timer clear

timer on 99
forvalues en = `step'(`step')`sizebd' {

    * load observations and join with references
    timer on 1
    local start = `en' - (`step' - 1)
    use in `start'/`en' using "`bigdata'", clear
    timer off 1

    timer on 2
    append using "`reference2'"
    timer off 2

    * flag observations that meet conditions
    timer on 3
    gen byte flag = 0
    foreach le of local lev {
        quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
    }
    timer off 3

    * append to result database
    timer on 4
    quietly {
        keep if flag
        keep lon lat
        append using "`results'"
        save "`results'", replace
    }
    timer off 4

}
timer off 99

timer list
display "total time is " `r(t99)'/60 " minutes"

use "`results'"
browse

2) Inequalities

You ask if your inequalities are correct. They are in fact legal, meaning that Stata will not complain, but the result is probably unexpected.

The following result may seem surprising:

. display  (66.11 < 100 < 67.93)
1

How is it the case that the expression evaluates to true (i.e. 1) ? Stata first evaluates 66.11 < 100 which is true, and then sees 1 < 67.93 which is also true, of course.

The intended expression was (and Stata will now do what you want):

. display  (66.11 < 100) & (100 < 67.93)
0

You can also rely on the function inrange().

The following example is consistent with the previous explanation:

. display  (66.11 < 100 < 0)
0

Stata sees 66.11 < 100 which is true (i.e. 1) and follows up with 1 < 0, which is false (i.e. 0).

OTHER TIPS

This uses Roberto's data setup:

clear all

set obs 1200000
set seed 13056

gen lat = runiform()*100
gen lon = runiform()*100

local sizebd `=_N' // to be used in computations

tempfile bigdata
save "`bigdata'"

*----- create example reference data -----

clear all

set obs 300
set seed 97532

gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5

gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5

gen id = _n

tempfile reference
save "`reference'"


timer on 1
levelsof id, local(id_list)

foreach id of local id_list {
    sum minlat if id==`id', meanonly
    local minlat = r(min)
    sum maxlat if id==`id', meanonly
    local maxlat = r(max)

    sum minlon if id==`id', meanonly
    local minlon = r(min)
    sum maxlon if id==`id', meanonly
    local maxlon = r(max)

    preserve
        use if (inrange(lon,`minlon',`maxlon') & inrange(lat,`minlat',`maxlat')) using "`bigdata'", clear
        qui save data_`id', replace
    restore
}

timer off 1

I would try to avoid preserveing and restoreing the "big" file, and doing so is possible, but at the expense of losing Stata format.

Using the same set up as Roberto and Dimitriy did,

set more off

use `bigdata', clear
merge 1:1 _n using `reference'

* check for data consistency: 
* minlat, maxlat, minlon, maxlon are either all defined or all missing
assert inlist( mi(minlat) + mi(maxlat) + mi(minlon) + mi(maxlon), 0, 4)

* this will come handy later
gen byte touse = 0

* set up and cycle over the reference data
count if !missing(minlat)
forvalues n=1/`=r(N)' {
    replace touse = inrange(lat,minlat[`n'],maxlat[`n']) & inrange(lon,minlon[`n'],maxlon[`n'])
    local thisid = id[`n']
    outfile lat lon if touse using data_`thisid'.csv, replace comma
}

Time it on your machine. You could avoid touse and thisid and only have the single outfile within the cycle, but it would be less readable.

You can then infile lat lon using data_###.csv, clear later. If you really need the Stata files proper, you can convert that swarm of CSV files with

clear
local allcsv : dir . files "*.csv"
foreach f of local allcsv {
   * change the filename
   local dtaname = subinstr(`"`f'"',".csv",".dta",.)
   infile lat lon using `"`f'"', clear
   if _N>0 save `"`dtaname'"', replace
}

Time it, too. I protected the save as some of the simulated data sets were empty. I think this was faster than 1.5 min on my machine, including the conversion.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top