Question

The following is the format of two sample variables from the codebook of the US Consumer Expenditure Survey (2011), p. 62.

VARIABLE_NAME | VARIABLE_DESCRIPTION | Format | Note
FEDRFNDX | During the past 12 months, what was the total amount of refund received from Federal income tax by ALL CU members? | NUM(8) |
FEDTAXX | During the past 12 months, what was the total amount PAID for Federal income tax, in addition to that withheld from earnings, by ALL CU members? | NUM(8) |

where CU means consumer unit (or household). The Stata datafile shows the following format for the FEDRFNDX and FEDTAXX variables:

FEDRFNDX    int     %8.0g
FEDTAXX     long    %12.0g

My question is why the Stata format for these variables differs although they are both NUM(8) in the codebook and both refer to an amount. As an end user of survey data, how can we be sure that we have the right format (for example, if we are given only a codebook like the one above, which says NUM(8) and gives the starting position of each variable, plus the ASCII data and not the Stata data)?

I apologize if this question is too localized.


Solution

The format only says something about how the data are to be displayed, not how they are stored. In this case the formats are the defaults for the different storage types: FEDRFNDX is stored as an int, while FEDTAXX is stored as a long. You can find out more about the differences by typing help data_types in Stata.

My guess would be that

  1. either both can safely be stored as int without loss of information

  2. or FEDRFNDX only has integer values no larger than 32,740, which means it does not use the full 8 digits that the codebook reserved for it, while FEDTAXX contains at least one integer larger than 32,740. 32,740 is the largest number that can be stored in a (2 byte) int, while 2,147,483,620 is the limit for a (4 byte) long.

A safe way to check which of these is true is to type compress after loading your dataset. This will change the storage type of each variable to the smallest type possible without loss of information. So, if my first guess is true, it will change the storage type of FEDTAXX to int (or smaller), while if my second guess is true it will leave the storage type unchanged.
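As a minimal sketch of that check, assuming the survey file is already in memory and contains both variables, one could compare the storage types before and after compress:

```stata
* Record the storage types before compressing
describe FEDRFNDX FEDTAXX

* Demote each listed variable to the smallest storage type
* that still holds all of its values exactly
compress FEDRFNDX FEDTAXX

* A variable whose type is unchanged here needs the wider type
describe FEDRFNDX FEDTAXX
```

Note that compress changes the data in memory only; the file on disk is untouched unless you save afterwards.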

After that it is always a good idea to just type tab FEDTAXX and look at the values. I like the user-written command fre for that, as it displays both the values and the value labels. You can get it by typing ssc install fre in Stata.
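For a variable with many distinct amounts a full tabulation can be long, so a quick numeric check may be handy as well. The following is only a sketch, assuming the survey data are in memory; c(maxint) is Stata's stored value for the largest int (32,740):

```stata
* Range of each variable: compare the maxima with the int limit
summarize FEDRFNDX FEDTAXX

* Count non-missing values that would not fit in an int.
* Missing values sort above every number, so exclude them explicitly.
count if FEDTAXX > c(maxint) & !missing(FEDTAXX)
```

A nonzero count for FEDTAXX would be consistent with the second guess above.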

Other tips

@Maarten Buis gave an excellent specific answer. The following more general remarks are too long for a comment.

What "format" is and is not in Stata is the subject of several misunderstandings. The likeliest reason for that is the loose, shifting meaning of "format" across computing. Whatever the reason, format in the specific sense here refers in Stata only to display format. The main way to change the format associated with a variable is through the format command, and the help for that command is a good place to start.

Stata evidently surprises many users by making its data types storage types, making them fairly visible to the user and giving some considerable responsibility to the user over choice of storage type. But the connection between storage type and format is at best loose, namely that different storage types have different default formats.
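When only the codebook and the ASCII file are available, that responsibility falls on whoever reads the data in. The sketch below is one way to do it with infix; the file name and the column positions are hypothetical placeholders, and the real positions must come from the codebook. Since NUM(8) allows at most 8 digits (99,999,999), a long is wide enough for whole-number amounts; double would be the cautious choice if decimals are possible.

```stata
* Hypothetical file name and column positions, for illustration only
infix long FEDRFNDX 1-8 long FEDTAXX 9-16 using "cex_sample.txt", clear

* Let Stata shrink each variable to the smallest type that loses nothing
compress

* Inspect the resulting storage types and default display formats
describe
```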

It's crucial to grasp that changing the format in Stata does not change what is being stored.
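A small sketch of this point, assuming FEDTAXX is in memory and the dataset has at least five observations:

```stata
* Storage type and display format before the change
describe FEDTAXX

* Change only the display format: fixed notation with commas
format FEDTAXX %12.0fc

* The storage type column is unchanged; only the format column differs
describe FEDTAXX

* The values are the same numbers, merely shown differently
list FEDTAXX in 1/5
```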

A test of understanding for intermediate and/or long-term users is to be able to explain what is happening here:

. set obs 1
obs was 0, now 1

. gen foo = 2000000001

. di %12.0f foo[1]
2000000000

Why did Stata (appear to) round that large integer? (Clue: This is not a bug, but just Stata following your tacit instructions on storage type.)

Licensed under: CC-BY-SA with attribution