Question

I'm building a decision tree system in Scala, but some of the entries in my data have identical attributes. I've gotten around this by implementing a "random" node type, allowing the query to randomly select which branch to traverse, but I'm getting a "MatchError" when trying to split the remaining examples at random. My current code:

def splitRandom(examples: Array[String]): Array[String] = {
  examples.collect { case x if r.nextInt(100) < 50 => x }
}

"examples" is an array of strings, with each string being a line containing a single data entry with all of its attributes.


Solution

collect isn't a good choice for random behavior because the same condition can be evaluated twice: first in isDefinedAt, and then again when computing the value. If the guard returns true the first time and false the second time for the same input, the match fails with a MatchError. Use filter instead:

examples.filter(_ => r.nextInt(100) < 50)
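
The double evaluation described above can be reproduced without any randomness. The following is a minimal sketch, not part of the original code: GuardDemo, guardCalls and pf are hypothetical names, and the guard only counts how often it runs. Whether collect itself evaluates the guard once or twice depends on the collections implementation of the Scala version in use, so the sketch calls isDefinedAt and apply explicitly.

```scala
object GuardDemo {
  // Hypothetical counter: how many times the guard has been evaluated.
  var guardCalls = 0

  // Same shape as the partial function passed to collect in the question,
  // but with a counting guard instead of a random one.
  val pf: PartialFunction[String, String] = {
    case x if { guardCalls += 1; true } => x
  }

  def main(args: Array[String]): Unit = {
    val defined = pf.isDefinedAt("a") // first evaluation of the guard
    val value = pf("a")               // second evaluation of the guard
    println(s"defined=$defined value=$value guardCalls=$guardCalls")
  }
}
```

With a random guard, the second evaluation can disagree with the first, which is what leaves the match without a matching case. filter calls its predicate once per element, so it avoids the problem.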

Other tips

Here is a solution that fits your issue:

import util.Random
// your_array and your_position are placeholders for your data and the split index
val shuffled = Random.shuffle(your_array)
val (first, second) = shuffled.splitAt(your_position)

I found this trick when I wanted a counterpart to rdd.randomSplit for a Scala List or Array.

You can convert between collection types (for example with toList or toArray) if needed.
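
The shuffle-and-split approach above could be wrapped in a small helper that takes and returns arrays, matching the signature style in the question. This is only a sketch under assumptions: RandomSplit, randomSplit, fraction and rng are hypothetical names that do not appear in the original, and the array is converted to a List before shuffling so that the same code should compile across Scala versions.

```scala
import scala.util.Random

object RandomSplit {
  // Hypothetical helper: shuffles the examples, then cuts them so that roughly
  // `fraction` of the entries land in the first array and the rest in the second.
  def randomSplit(
      examples: Array[String],
      fraction: Double,
      rng: Random
  ): (Array[String], Array[String]) = {
    val shuffled = rng.shuffle(examples.toList)
    val cut = (examples.length * fraction).toInt
    val (first, second) = shuffled.splitAt(cut)
    (first.toArray, second.toArray)
  }

  def main(args: Array[String]): Unit = {
    // Made-up data entries, one per line, as described in the question.
    val examples = Array("a,1", "b,2", "c,3", "d,4")
    val (left, right) = randomSplit(examples, 0.5, new Random())
    println(s"${left.length} / ${right.length}")
  }
}
```

Unlike the filter approach, this keeps both halves and fixes their sizes in advance; every entry ends up in exactly one of the two arrays.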

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow