Training on data with inherently non-applicable data cells

https://datascience.stackexchange.com/questions/46203

01-11-2019

Question

I am training a model on a chemical sample dataset to find outliers and perform imputation where it makes sense.

Chemical Dataset

Contains thousands of rows of chemical mixtures with many columns of properties. Example properties: bromine content, density.

Inherently non-applicable data

The chemicals can be gas, liquid, or solid, but some properties apply only to samples in a certain state. Examples could be viscosity for liquids, bond type (ionic, molecular, covalent) for solids, or density for gases.

So far...

...all my research has pointed towards methods of fixing "missing values" via column means, data imputation, or something similar. Imputing the freezing point of a gas makes no sense, though: a gas mixture does not have a freezing point. I am still in the data preparation stage and unsure how to proceed.

I am working in Python, and missing data is stored as NaN values. Perhaps there are models that can handle such NaN values directly.
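To make the distinction concrete, here is a minimal sketch of how "not applicable" NaNs could be kept apart from genuinely missing ones during data preparation. The column names, the toy values, and the `APPLICABLE_STATES` rule are all hypothetical and only illustrate the idea; they are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "state": ["gas", "liquid", "liquid", "solid", "liquid"],
    "viscosity": [np.nan, 1.2, np.nan, np.nan, 0.8],
})

# Assumed applicability rule: the states for which each property is defined.
APPLICABLE_STATES = {"viscosity": {"liquid"}}

for col, states in APPLICABLE_STATES.items():
    applicable = df["state"].isin(states)
    # A NaN in an applicable row is genuinely missing;
    # a NaN elsewhere means "not applicable" and is left alone.
    truly_missing = applicable & df[col].isna()
    # Keep an explicit indicator column so a model can see applicability.
    df[f"{col}_applicable"] = applicable
    # Impute only the genuinely missing cells, using the mean of applicable rows.
    fill_value = df.loc[applicable, col].mean()
    df.loc[truly_missing, col] = fill_value

print(df)
```

In this sketch only the liquid row with a missing viscosity is filled in; the gas and solid rows keep their NaN, and the indicator column records why.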

Side-Note:

The majority of the dataset consists of distillation curve data points (sequential data describing what percentage of a chemical sample evaporates as the temperature is increased). This data is present for all samples.

Follow-up 1: Is there a model that will give me NaN values for the freezing point when I give it something that resembles a gas?
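To illustrate the kind of behaviour Follow-up 1 is asking about, one possible shape is a two-stage ("gated") predictor: a first stage decides whether a sample resembles a gas, and a second stage predicts the freezing point only when it does not. The sketch below uses plain NumPy with a nearest-centroid gate and a least-squares regressor; both are stand-ins chosen for brevity, and the single feature and all values are made up.

```python
import numpy as np

def fit_gated_regressor(X, is_gas, y):
    """Fit a nearest-centroid gate and a linear regressor on applicable rows."""
    gas_centroid = X[is_gas].mean(axis=0)
    other_centroid = X[~is_gas].mean(axis=0)
    # Least-squares fit with an intercept, using only rows where y is defined.
    A = np.column_stack([X[~is_gas], np.ones((~is_gas).sum())])
    coef, *_ = np.linalg.lstsq(A, y[~is_gas], rcond=None)
    return gas_centroid, other_centroid, coef

def predict_gated(model, X):
    """Predict y, returning NaN for samples closer to the gas centroid."""
    gas_centroid, other_centroid, coef = model
    d_gas = np.linalg.norm(X - gas_centroid, axis=1)
    d_other = np.linalg.norm(X - other_centroid, axis=1)
    looks_like_gas = d_gas < d_other
    pred = np.column_stack([X, np.ones(len(X))]) @ coef
    pred[looks_like_gas] = np.nan  # property is not applicable here
    return pred

# Toy data: one hypothetical feature derived from the distillation curve.
X = np.array([[-160.0], [-150.0], [80.0], [100.0], [120.0]])
is_gas = np.array([True, True, False, False, False])
y = np.array([np.nan, np.nan, -10.0, 0.0, 10.0])  # freezing point, NaN for gases

model = fit_gated_regressor(X, is_gas, y)
pred = predict_gated(model, np.array([[-155.0], [90.0]]))
print(pred)
```

The first query sample sits near the gas centroid and gets NaN; the second gets a numeric prediction. Whether such a gate is appropriate for the real data is an open question.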

Follow-up 2: Can this be compared to image object detection where the object is partially obscured, or where part of the image is corrupt?

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange