Question

I have 11,719 XY coordinate pairs obtained from 6 different instruments that track ocean currents (they measure UTC time, longitude, latitude, temperature). These instruments drifted along, so there are no repeat measures at the same location. I need to extract a "mean track" from that XY data. Here is the dataset: http://dropbox.com/s/dg0psl3hnmilg44/summerdrifters.csv

I was thinking of a regression line but the mean track i want to get is obviously not linear. I tried using different fitting functions, such as smoothing splines through cftool(lon,lat), but it is not the most convenient way as I need to fraction the data into subsets, then somehow merge the different functions.

Was it helpful?

Solution

The data set has 11719 points. The continuous segments can be detected like this

d = load('summerRunningMean.csv');

% detect continuous parts
s = sqrt(sum(diff(d)' .^ 2));
ind = [0, find(s > 100 * mean(s)), size(d, 1)];

which results in 6 segments with lengths from 534 to 4255 points.

Plotting the tracks superimposed in space

for i = 1 : 6
    x = d(ind(i) + 1 : ind(i + 1), 1);
    y = d(ind(i) + 1 : ind(i + 1), 2);
    plot(x, y)
    hold all
end
xlabel x
ylabel y
axis equal

gives this result

tracks in space

and shows that the tracks have roughly the same shape, and that all but one of them start at the same place.

Superimposition of the time series however shows that the movement along that shape occurs at different speeds:

superimposed time series

The problem with doing statistics for this data is therefore that there is no common reference point to assign data points from different tracks to each other.

In the following I excluded track #6 because it has a different starting point than the others.


One approach to define such a common reference is to compute the curve length along the track:

l = [0 ; cumsum(sqrt(diff(x) .^ 2 + diff(y) .^ 2))];

Plotting over this length instead of over time makes the x and y coordinates more similar and therefore comparable:

coordinates over curve length

From this result, one can compute the mean and other statistics. To actually do it, the data have to be reparametrized by interpolation:

li = 0 :0.001: l(end);
xi = interp1(l, x, li);
yi = interp1(l, y, li);

Now, for each of the tracks we have a common reference and can store the transformed data in common data matrices:

n = numel(li);
xp(1 : n, i) = xi;
yp(1 : n, i) = yi;

where i is the track index. The matrices xp and yp have to be initialized to NaNs because the tracks have different lengths. Then, statistics can be computed, e.g. the mean:

xm = nanmean(xp, 2);
ym = nanmean(yp, 2);

The resulting mean track together with the original tracks:

added mean track


To further improve the agreement between the tracks, one could smooth them in order to reduce random variations (span = 5000 seems to work well):

xs = smooth(xi, span);
ys = smooth(yi, span);

After that, the curve length parameter has to be recomputed:

l = [0 ; cumsum(sqrt(diff(xs) .^ 2 + diff(ys) .^ 2))];

The result:

smoothed coordinates over curve length

The agreement underlying the common reference has clearly been improved. The data have again to be reparametrized

li = 0 :0.001: l(end);
xsi = interp1(l, xs, li);
ysi = interp1(l, ys, li);

the result stored in the common data matrices

n = numel(li);
xp(1 : n, i) = xsi;
yp(1 : n, i) = ysi;

and the mean computed

xm = nanmean(xp, 2);
ym = nanmean(yp, 2);

The resulting mean track together with the original tracks:

added smoothed mean track

The result looks much smoother. However, the smoothing has reduced the circumference of the large loop at the end of two of the tracks, and as a result the "mean track" does no longer lie between the original tracks. The trade-off between smoothness and closeness to the original tracks can be regulated via the value of span.


The full code to generate the last figure is included here; in order to reproduce the unsmoothed version, set span = 1;.

d = load('summerRunningMean.csv');

% detect tracks as continuous segments
s = sqrt(sum(diff(d)' .^ 2));
ind = [0, find(s > 100 * mean(s)), size(d, 1)];
% remove 6th track because itis an outlier
ind = ind(1 : end - 1);
N = size(ind, 2) - 1;

% smoothing parameter
span = 5000;

xp = nan(13000, N);   % I know, hardcoded, not nice.
yp = nan(13000, N);
for i = 1 : N
    % extract data
    x = d(ind(i) + 1 : ind(i + 1), 1);
    y = d(ind(i) + 1 : ind(i + 1), 2);

    % determine length along curve
    % to use as curve parameter
    l = [0 ; cumsum(sqrt(diff(x) .^ 2 + diff(y) .^ 2))];

    % reparametrize by interpolation
    li = 0 :0.001: l(end);
    xi = interp1(l, x, li);
    yi = interp1(l, y, li);

    % smooth to remove small deviations
    xs = smooth(xi, span);
    ys = smooth(yi, span);

    % determine length along smoothed curve
    % as improved curve parameter
    l = [0 ; cumsum(sqrt(diff(xs) .^ 2 + diff(ys) .^ 2))];

    % again, reparametrize by interpolation
    li = 0 :0.001: l(end);
    xsi = interp1(l, xs, li);
    ysi = interp1(l, ys, li);

    % store
    n = numel(li);
    xp(1 : n, i) = xsi;
    yp(1 : n, i) = ysi;

    plot(x, y)
    axis equal
    hold all
end

% compute mean
xm = nanmean(xp, 2);
ym = nanmean(yp, 2);

% plot mean
plot(xm, ym, 'k', 'LineWidth', 2)
xlabel x
ylabel y

OTHER TIPS

You can use scatter(X,Y) to plot your data and cftool(X,Y) to interactively play with different models. Note that cftool plots your data with a superimposed regression line.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top