I'm doing some clustering research and need to generate synthetic data that would look something like these examples:

[Figure: dataset examples]

These are 2D plots with two classes (red and black). How could I generate 2D data like this? It has a V structure, so I was thinking about generating points around straight lines. Is there a way to do that in R? I'm using R, but I'm open to other tools, as long as the data can be exported.


Solution

Here's a thought.

n <- c(200,200)                 # Number of points in each class
cls <- rep(1:2, n)              # Class memberships
i <- c(.2-.12*abs(rnorm(n[1])), # Noiseless x position
       -.2+.12*abs(rnorm(n[2])))
noise <- .04*(.2-abs(i))        # Noise level relative to `i`

# Final sample
x <- cbind(i, abs(.5*i)) + noise*matrix(rnorm(sum(n)*2), sum(n), 2)

plot(x[,1], x[,2], col=cls)


Other tips

Is there any reason to generate this very particular type of data? Any results drawn from this will likely not generalize to other datasets.

Anyway, the obvious way to generate this kind of data is to use a nonlinear projection, e.g. using the famous "abs" function (absolute value).

That is, project x to (in Python syntax, I don't like R): abs(x), or if you want some extra randomness: abs(x + noise()) + noise(), where noise() is any small random draw, e.g. random.gauss(0, .1).
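The abs-based projection described above can be sketched in plain Python roughly as follows. This is only an illustration under stated assumptions: the names make_v_dataset and write_csv, the slope of 0.5, the arm length of 0.2, the uniform spread along each arm, and the constant Gaussian noise are choices made here, not something either answer prescribes. Class 1 sits on the right arm of the V and class 2 on the left.

```python
import csv
import random


def make_v_dataset(n_per_class=200, slope=0.5, noise=0.01, seed=0):
    """Return (x, y, cls) rows scattered around the V-shaped curve y = slope * |x|.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)  # private generator, so results are reproducible
    rows = []
    # Class 1 takes the right arm (t >= 0), class 2 the left arm (t <= 0).
    for cls, sign in ((1, 1.0), (2, -1.0)):
        for _ in range(n_per_class):
            t = sign * rng.uniform(0.0, 0.2)            # noiseless x position
            x = t + rng.gauss(0.0, noise)               # jitter in x
            y = slope * abs(t) + rng.gauss(0.0, noise)  # abs() folds the line into a V
            rows.append((x, y, cls))
    return rows


def write_csv(rows, fileobj):
    """Write the rows with a header line to an already opened text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["x", "y", "class"])
    writer.writerows(rows)
```

To export the data, open a file with open("v_dataset.csv", "w", newline="") and pass it to write_csv together with the result of make_v_dataset(). The resulting CSV can then be loaded in R with read.csv if the clustering itself is done there.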

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow