Problem

I work in healthcare IT, reviewing the data management processes of various observational studies. One problem I have repeatedly faced is poorly encoded data, especially when some values are missing.


Background, skip if you like

I am currently reviewing a study which collects data on patients suffering from a specific disease. Patients usually join the study once the disease is confirmed (a few months after the outbreak). One of the important parameters is the result of a certain blood test in the acute phase of the disease, i.e. hopefully as close to the outbreak as possible. Sometimes, however, that test was not performed because there was no indication for it, or the result is unavailable because the patient forgot to bring a copy, etc.

An important aspect here is that all the data in the DB is generated from (tons of unsorted) paper files, and people often do not have the time to skim through all of it. As a result, a blood test result may be missing from the DB simply because nobody has gotten to it yet.


In order to "somehow" encode reasons for a value being absent, I have seen various schemes:

  1. Use a TEXT(3) field for booleans, with "n/a" to express "the data really isn't available" (e.g. because the blood test wasn't performed) and NULL to express "maybe the data is somewhere, I just haven't looked for it yet".

  2. Use an additional field in the same table, e.g. a boolean field "bloodtest_perf" where "perf" stands for "actually performed".

I dislike the first approach because it allows you to enter "yes", "y", "ja", "YES", etc., and you end up spending most of the time cleaning instead of analyzing data. The second approach isn't much better either, because you end up with dummy data at best and inconsistent data at worst:

__ TBL_TEST1 __________________________________
| patID | test1_perf | test1_date | test1_res |
+---------------------------------------------+
| 12345 | no         | NULL       | NULL      |
| 12345 | yes        | 2011-05-13 | 20.0      |
+---------------------------------------------+
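To make the "inconsistent data at worst" case concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than MS Access. The table layout follows TBL_TEST1 above; the patient IDs and the contradictory third row are hypothetical, made up for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same columns as TBL_TEST1 above; no constraint ties the flag to the values.
conn.execute("""
    CREATE TABLE tbl_test1 (
        patID      INTEGER,
        test1_perf TEXT,
        test1_date TEXT,
        test1_res  REAL
    )
""")

# Hypothetical rows for illustration only.
rows = [
    (12345, "no", None, None),           # dummy NULLs: harmless but redundant
    (12346, "yes", "2011-05-13", 20.0),  # consistent
    (12347, "no", "2011-06-01", 17.5),   # contradiction: "not performed" with a result
]
conn.executemany("INSERT INTO tbl_test1 VALUES (?, ?, ?, ?)", rows)

# The database accepted all three rows, so the contradiction has to be
# hunted down after the fact with a query like this one.
inconsistent_ids = [
    pat_id
    for (pat_id,) in conn.execute("""
        SELECT patID
        FROM tbl_test1
        WHERE test1_perf = 'no'
          AND (test1_date IS NOT NULL OR test1_res IS NOT NULL)
        ORDER BY patID
    """)
]
print(inconsistent_ids)
```

Nothing in the schema stops the third row from being stored, which is exactly the weakness of the flag column.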

The best solution I could come up with is to create a TBL_TEST1_METADATA table which contains an entry if and only if test1 was not performed, and which then specifies why not. However, the clinicians (with rudimentary MS-Access knowledge) are struggling with this normalization approach.
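For readers who want to see the metadata-table idea spelled out, the following is a rough sketch in Python with sqlite3, not the study's actual schema. The tbl_patient table, the reason codes ('not_indicated', 'result_lost'), the status labels, and the patient IDs are all assumptions introduced here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_patient (patID INTEGER PRIMARY KEY);

    -- A row exists only if the test was performed, so no NULL placeholders.
    CREATE TABLE tbl_test1 (
        patID      INTEGER PRIMARY KEY REFERENCES tbl_patient(patID),
        test1_date TEXT NOT NULL,
        test1_res  REAL NOT NULL
    );

    -- A row exists only if the result is known to be unavailable.
    -- The reason codes are hypothetical.
    CREATE TABLE tbl_test1_metadata (
        patID  INTEGER PRIMARY KEY REFERENCES tbl_patient(patID),
        reason TEXT NOT NULL CHECK (reason IN ('not_indicated', 'result_lost'))
    );
""")

conn.executemany("INSERT INTO tbl_patient VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO tbl_test1 VALUES (1, '2011-05-13', 20.0)")
conn.execute("INSERT INTO tbl_test1_metadata VALUES (2, 'not_indicated')")

# A patient in neither table means "nobody has looked through the files yet".
status = dict(conn.execute("""
    SELECT p.patID,
           CASE
               WHEN t.patID IS NOT NULL THEN 'performed'
               WHEN m.patID IS NOT NULL THEN 'not available: ' || m.reason
               ELSE 'not yet looked up'
           END
    FROM tbl_patient p
    LEFT JOIN tbl_test1 t ON t.patID = p.patID
    LEFT JOIN tbl_test1_metadata m ON m.patID = p.patID
    ORDER BY p.patID
""").fetchall())
print(status)
```

In this sketch the three states (performed, known to be unavailable, not yet looked up) are derived from which table holds a row, so no free-text flag is needed. Note that it does not prevent the same patient from appearing in both tbl_test1 and tbl_test1_metadata; that would still need a separate check.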

What is a pragmatic yet effective solution for this problem?


License: CC-BY-SA with attribution
Not affiliated with softwareengineering.stackexchange