Question

If different classes of an application need to extract one or more random numbers, where should a random number generator be initialized in order to produce good random sequences?

In particular, I need to build some decision trees in order to train a random forest. The construction of each decision tree involves the following steps:

  1. The dataset (organized on multiple rows of data) is loaded.
  2. Some rows in this dataset are randomly selected in order to build a new dataset. This new dataset will be gradually split as the tree grows.
  3. This new dataset is used in order to grow a decision tree: the creation of each node needs the random selection of a few rows of this new dataset (before creating one node, you have to randomly generate some small different subsets of this new dataset).

The three steps listed above are repeated for every decision tree, so random numbers are generated many times. For example, the second step should give each decision tree a training set slightly different from the initial one, so the random number generator must not produce identical datasets (or at least the likelihood of that happening should be very low).

In essence, in this procedure we can identify two sources of randomness:

  • the generation of N random datasets, each used to train a single decision tree;
  • the M random extractions from a given dataset that must be performed before each node is created.

How many random number generators should I use? Since I have a class that implements the random forest, and another class that implements the decision tree, I thought I'd initialize a random number generator in the first class (the first source of randomness), and another random number generator in the second class (the second source of randomness). Is this correct?

In general, what are the guidelines for choosing the correct number of pseudo-random number generators?

Solution

It depends on how repeatable you need the sequence to be. For example, if you can't guarantee the order in which the rand() calls are made, but you need the same sequence on every run for testing, then you need a separate seed and generator for each source of randomness.
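
A minimal sketch of the separate-generator approach, written in Python as an assumption (the question does not name a language). The helper make_stream and the stream names "bootstrap" and "node" are hypothetical; the point is only that each stream keeps its own sequence no matter how calls to the two streams are interleaved.

```python
import random

def make_stream(master_seed, name):
    """Return a generator whose sequence depends only on (master_seed, name)."""
    # random.Random accepts a string seed and hashes it, so every name
    # gets its own reproducible sequence.
    return random.Random(f"{master_seed}:{name}")

def draw(rng, count):
    """Draw `count` integers from one stream."""
    return [rng.randrange(1000) for _ in range(count)]

# Run A: all bootstrap draws first, then all node draws.
bootstrap_a = make_stream(42, "bootstrap")
node_a = make_stream(42, "node")
run_a = (draw(bootstrap_a, 4), draw(node_a, 4))

# Run B: same seeds, but the calls to the two streams are interleaved.
bootstrap_b = make_stream(42, "bootstrap")
node_b = make_stream(42, "node")
boot_vals, node_vals = [], []
for _ in range(4):
    node_vals.append(node_b.randrange(1000))
    boot_vals.append(bootstrap_b.randrange(1000))
run_b = (boot_vals, node_vals)

print(run_a == run_b)  # True: call order across streams does not matter
```

With a single shared generator, run B would hand different numbers to each consumer than run A, because the two consumers would pull from one sequence in a different order.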

If you don't care about repeatability, then just use one generator with one seed, and let it run.

Other answers

Keep in mind that, whatever language you use, the numbers produced by its standard generator are pseudo-random: given the same seed, you will always get the same sequence. The generators shipped with mainstream languages and libraries are generally well tested, although their quality varies (C's rand() is a common weak spot), so a single properly seeded generator is usually enough.

Use only one random generator, but make sure it's well seeded. You can create it at the beginning of your main() and either generate sequences of random numbers up front for later use, or make calls to the generator as you go.
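
As a rough illustration, here is how one generator could be created once and handed to both classes from the question. This is a Python sketch under assumptions: the class and method names are hypothetical, the dataset is a toy list of integers, and sampling with replacement is only one possible reading of "randomly selected rows".

```python
import random

class DecisionTree:
    def __init__(self, rows, rng):
        self.rows = rows
        self.rng = rng  # the shared generator, never re-created here

    def candidate_subsets(self, n_subsets, subset_size):
        """Second source of randomness: small random subsets for one node."""
        return [self.rng.sample(self.rows, subset_size) for _ in range(n_subsets)]

class RandomForest:
    def __init__(self, n_trees, rng):
        self.n_trees = n_trees
        self.rng = rng
        self.trees = []

    def fit(self, dataset):
        for _ in range(self.n_trees):
            # First source of randomness: a resampled dataset per tree
            # (with replacement, which is an assumption of this sketch).
            sample = self.rng.choices(dataset, k=len(dataset))
            self.trees.append(DecisionTree(sample, self.rng))
        return self

def main(seed=None):
    # Seeded exactly once; seed=None lets Python seed from OS entropy.
    rng = random.Random(seed)
    dataset = list(range(20))  # toy stand-in for the real rows
    return RandomForest(3, rng).fit(dataset)

forest = main(seed=123)
print(forest.trees[0].candidate_subsets(n_subsets=2, subset_size=3))
```

Passing the generator in through the constructors keeps the seeding decision in one place, and a fixed seed still reproduces the whole forest as long as the order of calls does not change.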

Do NOT reseed the generator every time you call it. If the seed is the current time at one-second resolution, for example, every call made within the same second will produce the same number. Seed your generator exactly once.
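
The pitfall can be reproduced deterministically. In this Python sketch the constant now is a made-up value standing in for int(time.time()) read several times within the same second, and bad_draw is a hypothetical name for the anti-pattern.

```python
import random

def bad_draw(timestamp_seconds):
    """Anti-pattern: reseed the global generator on every call."""
    random.seed(timestamp_seconds)
    return random.randrange(1_000_000)

now = 1_700_000_000  # stand-in for int(time.time()) within one second

# Five calls in the "same second" all return the same value.
bad = [bad_draw(now) for _ in range(5)]
print(len(set(bad)))  # 1

# Seeding once and then drawing repeatedly advances the sequence.
rng = random.Random(now)
good = [rng.randrange(1_000_000) for _ in range(5)]
print(good)
```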

In fact, if you are on a Unix-like system, consider drawing your seed from /dev/random or /dev/urandom. Don't write your own generator: practically every system you are likely to use provides a native source of randomness or has libraries for it.

In general, consider generators that draw on external entropy sources (such as noise from computer hardware) rather than computing their output purely algorithmically.
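
In Python (my choice of language, not the asker's), both ideas are available in the standard library: os.urandom returns bytes from the operating system's entropy source, and random.SystemRandom draws every number from that source. The sketch below seeds an ordinary generator from the OS and records the seed so a run can be replayed; the variable names are illustrative only.

```python
import os
import random

# Take a 64-bit seed from the operating system's entropy source.
seed = int.from_bytes(os.urandom(8), "big")
print(f"seed = {seed}")  # log it so the run can be reproduced later

rng = random.Random(seed)
first_run = [rng.random() for _ in range(3)]

# Replaying with the logged seed gives the same numbers.
replay_rng = random.Random(seed)
replayed = [replay_rng.random() for _ in range(3)]
print(first_run == replayed)  # True

# Alternative: every draw comes from the OS. This generator ignores
# seeding, so its output cannot be replayed.
sys_rng = random.SystemRandom()
value = sys_rng.randrange(100)
```

Seeding a pseudo-random generator from the OS keeps repeatability when you need it for testing, whereas drawing every number from the OS gives it up.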

Use only one random number generator for all functions.

Using two or more random number generators can cause a problem. Many random number generators use the system time as their default seed. If you instantiate two generators close together in time, they may receive the same seed and therefore produce the same sequence of random numbers. With a single generator this cannot happen.
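
A small Python sketch of that collision, with the clock reading replaced by a fixed, made-up constant so the result is repeatable. Note that Python's own random.Random(), when given no seed, uses OS entropy where available, so the problem shown here applies to explicit time-based seeding.

```python
import random

clock = 1_700_000_000  # hypothetical time reading seen by both constructors

# Two generators created "at the same time" with a clock-based seed.
forest_rng = random.Random(clock)
tree_rng = random.Random(clock)

forest_draws = [forest_rng.randrange(100) for _ in range(5)]
tree_draws = [tree_rng.randrange(100) for _ in range(5)]
print(forest_draws == tree_draws)  # True: identical sequences

# If a second generator is really needed, one option is to seed it
# from the first instead of from the clock.
parent = random.Random(clock)
child = random.Random(parent.getrandbits(64))
parent_draws = [parent.randrange(10**9) for _ in range(5)]
child_draws = [child.randrange(10**9) for _ in range(5)]
```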

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow