Question

I am evaluating an algorithm and would like to use artificial data.

The algorithm works fine for one-dimensional artificial datasets, as seen in this StackOverflow answer.

I would like to test the algorithm on datasets with more than one dimension and certain characteristics (e.g. noise, correlation). Has anyone already implemented an 'artificial dataset generator' in R?

Any feedback would be very much appreciated. Thanks!


Solution

The mlbench package in R is a collection of functions for generating data of varying dimensionality and structure for benchmarking purposes. It includes both regression and classification data sets.
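As a rough sketch of what that looks like in practice, the snippet below draws one regression set and one classification set. It assumes mlbench is installed from CRAN; the sample sizes, the seed, and the noise levels are arbitrary values picked for illustration.

```r
# Assumes the mlbench package is installed: install.packages("mlbench")
library(mlbench)

set.seed(42)  # arbitrary seed so the simulated data are reproducible

# Regression: Friedman 1 benchmark with 10 inputs and Gaussian noise (sd = 1)
reg <- mlbench.friedman1(n = 200, sd = 1)
dim(reg$x)     # predictor matrix: 200 rows, 10 columns
length(reg$y)  # one response per row

# Classification: two interleaved spirals in 2-D with some added noise
cls <- mlbench.spirals(n = 300, cycles = 1.5, sd = 0.05)
table(cls$classes)

# Convert to a data frame so it can be passed to modelling functions
cls_df <- as.data.frame(cls)
```

Other generators in the package follow the same pattern, so swapping in a different structure is mostly a matter of changing the function name and its arguments.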

Of course, these data sets are all fairly artificial and so they may not really reflect "real life" performance, since they may not mirror the sort of structure that your algorithm is intended for. But it's a place to start, at least.
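If you need explicit control over correlation and noise, one alternative (not part of mlbench) is to simulate from a multivariate normal yourself. The following is only a minimal sketch using MASS::mvrnorm; the correlation rho, the noise level noise_sd, and the linear response are made-up values for illustration, not anything specific to your algorithm.

```r
# MASS is a recommended package that ships with standard R installations
set.seed(1)  # arbitrary seed for reproducibility

n   <- 500
rho <- 0.8   # illustrative target correlation between the two predictors

# Covariance matrix with unit variances and correlation rho
Sigma <- matrix(c(1, rho,
                  rho, 1), nrow = 2)

# Draw n correlated observations
X <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
colnames(X) <- c("x1", "x2")

# Hypothetical response: a linear signal plus Gaussian noise
noise_sd <- 0.5
y <- 2 * X[, "x1"] - X[, "x2"] + rnorm(n, sd = noise_sd)

sim <- data.frame(X, y = y)
cor(sim$x1, sim$x2)  # close to rho, up to sampling error
```

Raising noise_sd or changing rho lets you check how sensitive the algorithm is to each characteristic separately.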

Other tips

You could use the wakefield package to generate random data sets.

It makes it easy to create data frames and time series, adjust correlations, and even visualize the generated data, e.g.:

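# Install pacman if needed, then use it to load wakefield from GitHub
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/wakefield")
pacman::p_load(dplyr, tidyr, ggplot2)

set.seed(10)  # make the random data reproducible

# Build a 100-row data frame from wakefield's variable functions
r_data_frame(n=100,
    id,
    dob,
    animal,
    grade, grade,
    death,
    dummy,
    grade_letter,
    gender,
    paragraph,
    sentence
) %>%
   r_na() %>%                # randomly insert missing values
   plot(palette = "Set1")    # visualize column types and missingness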


Licensed under: CC-BY-SA with attribution