Creating a curve to fit x-y data where X is categorical

https://stackoverflow.com/questions/16074591

04-04-2022

Question

I've got a dataset of diving behavior from tagged animals, and I'm struggling to fit a curve to the data. I think this is mainly because the X variable in this case is categorical rather than continuous. Let me give a bit of background:

My dataset has 184 observations of 14 variables:

   tagID ddmmyy Hour.GMT. Hour.Local.  X0  X3 X10  X20  X50 X100 X150 X200 X300 X400
1 122097 250912         0           9 0.0 0.0 0.3 12.0 15.3 59.6 12.8  0.0    0    0
2 122097 260912         0           9 0.0 2.4 6.9  5.5 13.7 66.5  5.0  0.0    0    0
3 122097 260912         6          15 0.0 1.9 3.6  4.1 12.7 39.3 34.6  3.8    0    0
4 122097 260912        12          21 0.0 0.2 5.5  8.0 18.1 61.4  6.7  0.0    0    0
5 122097 280912         6          15 2.4 9.3 6.0  3.4  7.6 21.1 50.3  0.0    0    0
6 122097 290912        18           3 0.0 0.2 1.6  6.4 41.4 50.4  0.0  0.0    0    0

The variables I'm interested in are X0:X400. These are depth bins, and the values represent the percentage of the total time for that period of the day that the animal spent in that depth bin. So on the first line, it spent 0% of its time between 0 and 3 meters, 59.6% of its time between 100 and 150 meters, etc. With a bit of help from some answers to my last question here on Stack Overflow, I calculated the mean % time spent in each depth bin by this animal:

diving.means <- colMeans(diving[, -(1:4)])
lowerIntervalBound <- gsub("X", "", names(diving)[-(1:4)])
lowInts <- as.numeric(lowerIntervalBound)
plot(x=factor(lowInts), y=diving.means, xlab="Depth Bin (Meters—Lower Bound)", ylab="% Time Spent")

which provided me with this plot:

enter image description here

Unfortunately, because my data are means (a single value per bin) and not frequencies, I couldn't figure out how to plot them as a histogram. That's neither here nor there, as I can easily just input these as values and make the desired plot if necessary, but this does the trick analytically for now.

Now I've got multiple animals and different time bins that I'd like to compare. I'll eventually work out a system to weight the time spent in bins to get an average depth to compare statistically, but for now I just want to compare them visually and qualitatively, as well as produce plots that I can use in presentations and eventually publications. What I'd like to do is create a density curve representing my 'histogram,' and then plot those curves from multiple scenarios on a single plot to compare. However, I can't seem to make this work with the density() function, as I don't have frequency data. I sort of have densities calculated already, as % time spent in each bin, but they're not represented in raw format in my dataset as frequencies of categories, which I could then turn into histograms and density curves.

This is how my data look:

> diving.means
          X0           X3          X10          X20          X50         X100         X150         X200         X300         X400 
 3.330978261  3.299456522  8.857608696 17.646195652 30.261413043 29.356521739  6.445108696  0.664130435  0.135869565  0.001630435

or:

> df<-data.frame(lowInts, diving.means)
> df
 lowInts diving.means
X0         0  3.330978261
X3         3  3.299456522
X10       10  8.857608696
X20       20 17.646195652
X50       50 30.261413043
X100     100 29.356521739
X150     150  6.445108696
X200     200  0.664130435
X300     300  0.135869565
X400     400  0.001630435

And what I would like to produce is something that looks more or less like this (pulled it randomly from a publication—axes are unrelated to my data):

enter image description here

and then be able to isolate the curves and plot them together.

Thanks for any help you can provide!

Solution

You already have frequencies (the percentage of time per bin) rather than raw observations, so hist cannot be used. You can use plot with spline interpolation to get a density-like curve:

df <- read.table(text=" lowInts diving.means
X0         0  3.330978261
X3         3  3.299456522
X10       10  8.857608696
X20       20 17.646195652
X50       50 30.261413043
X100     100 29.356521739
X150     150  6.445108696
X200     200  0.664130435
X300     300  0.135869565
X400     400  0.001630435")

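library(splines)
# interpolating cubic spline through the bin means
# (df[, 1] is lowInts, df[, 2] is diving.means)
dens <- predict(interpSpline(df[, 1], df[, 2]))
plot(df[, 1], df[, 2], type = "s", ylim = c(0, 40))  # step plot of the bin means
lines(dens, col = "red", lwd = 2)                    # interpolated curve on top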

enter image description here
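To overlay curves for several animals or time bins, the spline step above could be wrapped in a small function. This is only a sketch under stated assumptions: `smooth_profile` is a hypothetical helper name, the vectors repeat the means from the question, and clamping negative values to zero is an added assumption (an interpolating spline can undershoot between bins), not part of the original answer.

```r
library(splines)

# Lower bin bounds (m) and mean % time, copied from the question's data frame
depth <- c(0, 3, 10, 20, 50, 100, 150, 200, 300, 400)
pct   <- c(3.330978261, 3.299456522, 8.857608696, 17.646195652, 30.261413043,
           29.356521739, 6.445108696, 0.664130435, 0.135869565, 0.001630435)

# Hypothetical helper: interpolate one profile on an evenly spaced depth grid
smooth_profile <- function(depth, pct, n = 200) {
  fit  <- interpSpline(depth, pct)
  grid <- seq(min(depth), max(depth), length.out = n)
  pred <- predict(fit, grid)
  # assumption: clamp at zero, since % time cannot be negative
  data.frame(depth = pred$x, pct = pmax(pred$y, 0))
}

curve1 <- smooth_profile(depth, pct)
plot(curve1$depth, curve1$pct, type = "l", col = "red", lwd = 2,
     xlab = "Depth bin lower bound (m)", ylab = "% time spent")
```

A second animal or time bin would then be added to the same plot with something like `lines(smooth_profile(depth, other_pct), col = "blue", lwd = 2)`, where `other_pct` stands for that profile's bin means.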

OTHER TIPS

I think a step function is what you want.

You could use stepfun to create this function.
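As a minimal sketch of what stepfun does with the bin means from the question (the vector names `depth` and `pct` are made up for illustration): the breakpoints are the lower bin bounds, and stepfun needs one more value than breakpoints, so a leading 0 is supplied as the value to the left of the first bound.

```r
# Lower bin bounds (m) and mean % time, copied from the question
depth <- c(0, 3, 10, 20, 50, 100, 150, 200, 300, 400)
pct   <- c(3.330978261, 3.299456522, 8.857608696, 17.646195652, 30.261413043,
           29.356521739, 6.445108696, 0.664130435, 0.135869565, 0.001630435)

# length(y) must be length(x) + 1; the leading 0 applies to depths below 0
f <- stepfun(depth, c(0, pct))

f(75)   # 30.26141, the mean for the 50-100 m bin
plot(f, do.points = FALSE, xlim = c(0, 500),
     xlab = "Depth (m)", ylab = "% time spent", main = "mean time spent")
```

The result is an ordinary R function, so it can be evaluated at any depth or passed to plot and lines.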

I would work in long format; you can then create step functions for the median or mean:

# assuming your data is called `diving`
library(data.table)
# wide -> long: one row per observation and depth bin.
# Note: despite its name, `hours` holds the lower bound of the depth bin (m).
DTlong <- as.data.table(reshape(as.data.frame(diving), varying = list(5:14), direction = 'long',
  times = c(0,3,10,20,50,100,150,200,300,400),
  v.names = 'time.spent', timevar = 'hours'))

# mean and five-number summary of time spent, per depth bin
DTsummary <- DTlong[, c(mean.d = mean(time.spent),
          setattr(as.list(fivenum(time.spent)), 'names', c('min','lhinge','median','uhinge','max'))),
       by = hours]

Base R step fun

f.median <- DTsummary[, stepfun(hours, c(0,median))]
f.uhinge <- DTsummary[, stepfun(hours, c(0,uhinge))]
f.lhinge <- DTsummary[, stepfun(hours, c(0,lhinge))]


plot(f.median, main = 'median time spent', xlim = c(0,500), do.points = FALSE)

enter image description here
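The hinge step functions can be drawn on the same plot with lines. A self-contained sketch in base R, using only the six sample rows printed in the question (so the numbers will differ from the full 184-observation dataset); the names `bins`, `sample_rows` and `five` are made up for this illustration:

```r
# Depth-bin columns (X0 ... X400) of the six sample rows shown in the question
bins <- c(0, 3, 10, 20, 50, 100, 150, 200, 300, 400)
sample_rows <- rbind(
  c(0.0, 0.0, 0.3, 12.0, 15.3, 59.6, 12.8, 0.0, 0, 0),
  c(0.0, 2.4, 6.9,  5.5, 13.7, 66.5,  5.0, 0.0, 0, 0),
  c(0.0, 1.9, 3.6,  4.1, 12.7, 39.3, 34.6, 3.8, 0, 0),
  c(0.0, 0.2, 5.5,  8.0, 18.1, 61.4,  6.7, 0.0, 0, 0),
  c(2.4, 9.3, 6.0,  3.4,  7.6, 21.1, 50.3, 0.0, 0, 0),
  c(0.0, 0.2, 1.6,  6.4, 41.4, 50.4,  0.0, 0.0, 0, 0)
)

# fivenum() per bin: rows are min, lower hinge, median, upper hinge, max
five <- apply(sample_rows, 2, fivenum)

f_lhinge <- stepfun(bins, c(0, five[2, ]))
f_median <- stepfun(bins, c(0, five[3, ]))
f_uhinge <- stepfun(bins, c(0, five[4, ]))

# median as a solid step line, hinges as dashed grey lines
plot(f_median, do.points = FALSE, xlim = c(0, 500), ylim = c(0, max(five[4, ])),
     main = "median time spent (six sample rows)",
     xlab = "depth bin lower bound (m)", ylab = "% time spent")
lines(f_uhinge, do.points = FALSE, col = "grey50", lty = 2)
lines(f_lhinge, do.points = FALSE, col = "grey50", lty = 2)
```

The same pattern should carry over to the `f.median`, `f.uhinge` and `f.lhinge` objects built from `DTsummary`.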

using ggplot2

library(ggplot2)
ggplot(DTsummary, aes(x = hours)) + geom_step(aes(y = median))

enter image description here

Licensed under: CC-BY-SA with attribution
