MATLAB: Using interpolation to replace missing values (NaN)
26-09-2019
Question
I have a cell array in which each cell contains a sequence of values as a row vector. The sequences contain some missing values represented by NaN.
I would like to replace all the NaNs using some sort of interpolation method. How can I do this in MATLAB? I am also open to other suggestions on how to deal with these missing values.
Consider this sample data to illustrate the problem:
seq = {randn(1,10); randn(1,7); randn(1,8)};
for i = 1:numel(seq)
    % simulate some missing values
    ind = rand(size(seq{i})) < 0.2;
    seq{i}(ind) = NaN;
end
The resulting sequences:
seq{1}
ans =
-0.50782 -0.32058 NaN -3.0292 -0.45701 1.2424 NaN 0.93373 NaN -0.029006
seq{2}
ans =
0.18245 -1.5651 -0.084539 1.6039 0.098348 0.041374 -0.73417
seq{3}
ans =
NaN NaN 0.42639 -0.37281 -0.23645 2.0237 -2.2584 2.2294
Edit:
Based on the answers, I think there has been some confusion: obviously I am not working with random data; the code shown above is simply an example of how the data is structured.
The actual data is some form of processed signals. The problem is that during the analysis, my solution would fail if the sequences contain missing values, hence the need for filtering/interpolation. (I have already considered using the mean of each sequence to fill in the blanks, but I am hoping for something more powerful.)
Solution
Well, if you're working with time-series data, then you can use MATLAB's built-in interpolation function, interp1.
Something like this should work for your situation, but you'll need to tailor it a little. For example, if your samples are not equally spaced, you'll need to modify the times line.
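nseq = cell(size(seq));
for i = 1:numel(seq)
    times = 1:length(seq{i});      % sample positions (assumes equal spacing)
    mask = ~isnan(seq{i});         % true where a value is present
    nseq{i} = seq{i};
    % interpolate at the missing positions from the known samples
    nseq{i}(~mask) = interp1(times(mask), seq{i}(mask), times(~mask));
end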
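You'll need to experiment with the options of interp1 to figure out which ones work best for your situation. Note that with the default linear method, interp1 returns NaN for query points outside the range of the known samples, so leading or trailing NaNs (as in seq{3} above) stay NaN unless you request extrapolation.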
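As a rough illustration of what experimenting with the interp1 options might look like, here is a minimal sketch on one made-up row vector. The data and the variable names (yLin, yPchip, yHold) are assumptions chosen for the example, not part of the original answer.

```matlab
y = [NaN NaN 0.4 -0.4 NaN 2.0 -2.3 NaN];   % made-up sequence with gaps
t = 1:numel(y);                            % assumed equally spaced samples
known = ~isnan(y);

% Variant 1: default linear method, no extrapolation.
% Interior gaps are filled; leading/trailing NaNs remain NaN.
yLin = y;
yLin(~known) = interp1(t(known), y(known), t(~known));

% Variant 2: shape-preserving cubic ('pchip') with extrapolation,
% which also fills the ends (extrapolated values can be unreliable).
yPchip = y;
yPchip(~known) = interp1(t(known), y(known), t(~known), 'pchip', 'extrap');

% Variant 3: linear inside, nearest known value at the ends.
yHold = yLin;
stillMissing = isnan(yHold);
yHold(stillMissing) = interp1(t(known), y(known), t(stillMissing), ...
                              'nearest', 'extrap');
```

Which variant is appropriate depends on the signal: holding the nearest value at the ends is conservative, while extrapolating a cubic can overshoot when several samples are missing at a boundary.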
Other tips
I would use inpaint_nans, a MATLAB File Exchange tool designed to replace NaN elements in 1-D or 2-D arrays by interpolation.
seq{1} = [-0.50782 -0.32058 NaN -3.0292 -0.45701 1.2424 NaN 0.93373 NaN -0.029006];
seq{2} = [0.18245 -1.5651 -0.084539 1.6039 0.098348 0.041374 -0.73417];
seq{3} = [NaN NaN 0.42639 -0.37281 -0.23645 2.0237];
for i = 1:3
seq{i} = inpaint_nans(seq{i});
end
seq{:}
ans =
-0.50782 -0.32058 -2.0724 -3.0292 -0.45701 1.2424 1.4528 0.93373 0.44482 -0.029006
ans =
0.18245 -1.5651 -0.084539 1.6039 0.098348 0.041374 -0.73417
ans =
2.0248 1.2256 0.42639 -0.37281 -0.23645 2.0237
If you have access to the System Identification Toolbox, you can use the MISDATA function to estimate missing values. According to the documentation:
This command linearly interpolates missing values to estimate the first model. Then, it uses this model to estimate the missing data as parameters by minimizing the output prediction errors obtained from the reconstructed data.
Basically, the algorithm alternates between estimating missing data and estimating models, in a way similar to the Expectation Maximization (EM) algorithm.
The estimated model can be any of the linear idmodel models (AR/ARX/...); if none is given, a default-order state-space model is used.
Here's how to apply it to your data:
for i = 1:numel(seq)
    dat = misdata( iddata(seq{i}(:)) );
    seq{i} = dat.OutputData.';   % transpose back to a row vector
end
There are also other functions, such as interp1. For smoothly curved data, spline interpolation is often a good way to estimate the missing values.
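To make the spline suggestion concrete, here is a small sketch that removes a few samples from a sine wave and fills them back in with spline and, for comparison, with linear interp1. The test signal and the positions of the gaps are made up for illustration.

```matlab
t = linspace(0, 2*pi, 25);      % made-up sample times
yTrue = sin(t);                 % smooth reference signal
y = yTrue;
y([5 6 12 19]) = NaN;           % knock out a few samples
known = ~isnan(y);

% Cubic spline through the known samples, evaluated at the gaps
ySpline = y;
ySpline(~known) = spline(t(known), y(known), t(~known));

% Linear interpolation for comparison
yLinear = y;
yLinear(~known) = interp1(t(known), y(known), t(~known));

% Worst-case deviation from the reference signal for each method
errSpline = max(abs(ySpline - yTrue));
errLinear = max(abs(yLinear - yTrue));
```

Comparing errSpline and errLinear on data that resembles your own signal gives a quick feel for whether the smoother method is worth it. On noisy or discontinuous data a spline can overshoot, so the comparison may come out differently.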
As JudoWill says, you need to assume some sort of relationship between your data points.
One trivial option would be to compute the mean of the whole series and use it for the missing data. Another trivial option would be to take the mean of the n previous and n next values.
But be very careful with this: if you're missing data, you're generally better off dealing with the missing data explicitly than making up fake data that could distort your analysis.
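The two trivial options mentioned above could be sketched roughly as follows. The sample vector, the window size n, and the choice to ignore NaNs inside the window are all assumptions made for this example.

```matlab
y = [1 2 NaN 4 5 NaN NaN 8];    % made-up row vector with gaps
n = 2;                          % neighbors on each side (illustrative)
missing = isnan(y);

% Option 1: replace every NaN with the mean of the known samples
yGlobal = y;
yGlobal(missing) = mean(y(~missing));

% Option 2: replace each NaN with the mean of the known samples among
% the n previous and n next values (window is clipped at the ends)
yLocal = y;
for k = reshape(find(missing), 1, [])
    first = max(1, k - n);
    last = min(numel(y), k + n);
    window = y(first:last);
    neighbors = window(~isnan(window));
    if ~isempty(neighbors)      % leave NaN if the whole window is missing
        yLocal(k) = mean(neighbors);
    end
end
```

Both variants always use the original y for the neighbor values, so a filled-in estimate never feeds into another estimate.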
Consider the following example:
X = some N-by-1 array, Y = F(X) with some NaNs in it
Then use
valid = ~isnan(Y); X1 = X(valid); Y1 = Y(valid);
Now interpolate over X1 and Y1 to compute the values at all X.
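A minimal end-to-end sketch of this idea, with made-up, unevenly spaced X and a simple linear F so that the result is easy to check by eye (valid and Yfilled are names chosen for the example):

```matlab
X = [0 0.5 1.5 2 3.5 4 6]';     % made-up N-by-1 sample locations
Y = 2*X + 1;                    % stands in for F(X)
Y([3 5]) = NaN;                 % simulate missing values

valid = ~isnan(Y);              % logical mask of known samples
X1 = X(valid);
Y1 = Y(valid);

% Interpolate over the known points and evaluate at every X.
% Known samples are reproduced; missing ones are estimated.
Yfilled = interp1(X1, Y1, X);
```

Because the actual X values are passed to interp1, this works unchanged for unequally spaced samples. As with any default linear interp1 call, a NaN at the first or last position would remain NaN.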