Question

This is probably a newbie question about possible classification algorithms, so please bear with me. I have a dataset that comprises both nominal and numeric attributes, which may look like the example below (not the actual dataset). What kind of algorithm would be best to predict the class and measure the accuracy (preferably in Python/Java)?

Classes: classA, classB, classC

attribute1: Recurrence <Yes, No>
attribute2: Subject <Math, Science, Geography>
attribute3: ProbabilityA <0.0 - 1.0>
attribute4: ProbabilityB <0.0 - 1.0>
attribute5: ProbabilityC <0.0 - 1.0>

The nominal data can contain numeric values in [1, -1], where 1 represents present and -1 not present, or it can be a set of string values such as ['YES', 'NO'] or ['Type1', 'Type2', 'Type3']. The numeric values express the likelihood of an attribute. For example, in [0-1], the closer the value is to 1, the more likely it evaluates to true.
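To make the mixed attributes concrete, here is a minimal sketch (with made-up values, not the actual dataset) of turning one such record into a flat numeric feature vector: nominal values are mapped to numbers, and the probability attributes pass through unchanged. The mappings and field names are illustrative assumptions.

```python
def encode(record):
    """Turn one record with nominal and numeric attributes into a flat
    numeric feature vector. Mappings here are illustrative, not canonical."""
    recurrence = {"Yes": 1, "No": -1}                  # present / not present
    subject = {"Math": 0, "Science": 1, "Geography": 2}
    return [
        recurrence[record["Recurrence"]],
        subject[record["Subject"]],
        record["ProbabilityA"],                        # already numeric in [0, 1]
        record["ProbabilityB"],
        record["ProbabilityC"],
    ]

sample = {"Recurrence": "Yes", "Subject": "Math",
          "ProbabilityA": 0.8, "ProbabilityB": 0.1, "ProbabilityC": 0.1}
print(encode(sample))  # [1, 0, 0.8, 0.1, 0.1]
```

Note that mapping a three-valued nominal attribute like Subject to 0/1/2 imposes an artificial ordering; one-hot encoding is the usual alternative for algorithms that are sensitive to that.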

Was it helpful?

The solution

Well, this is by no means a "newbie question", and is in fact quite complicated. While Inti's suggestion is certainly a good start, it really depends upon so many factors that there is no easy "right answer".

Some things to consider:

  • Speed vs. accuracy
  • Memory constraints
  • Training set (how large of a data set you can use to "learn" how to classify)
  • Test data set (how much of the data set you'll keep "in reserve" to verify / measure the quality of your algo)
  • Implementation: e.g., will this run in a "batch mode", or will you need to classify each new observation in an ongoing fashion?
  • etc.

Until some more info like this is known, it's tough to give very precise details. (In general, on this forum, the more effort you put into the question, the more effort others put into their answers.)
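The train/test split mentioned in the list above can be sketched in a few lines: hold a fraction of the data "in reserve" and never train on it. This is an assumption-level illustration; libraries like scikit-learn ship a ready-made `train_test_split`.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle the data and hold out a fraction for testing.
    A minimal sketch; a fixed seed keeps the split reproducible."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```

Accuracy measured on the held-out test set is what you would report, since accuracy on the training set is optimistically biased.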

That being said, here are some buzz words to start looking up, to get your head around the possibilities:

  • random forest / CART / decision tree (different algos, but similar in concept)
  • Naive Bayes
  • SVM (likely not helpful with the nominal parameters you have)
  • Neural Net
  • Clustering
  • KNN, as Inti suggests
  • many more...

The world of potential options in machine learning algos is pretty huge, nothing works perfectly, and nothing works equally well in all situations. This wiki page is not so great, but it's a decent start on finding algos.

Once you've decided which algo you think will work for your case, then look up a library / implementation in Python or Java or what-have-you. Between SciPy and NumPy, Python has a pretty large library of possibilities. I suspect Java also has a huge library, but I personally know Python far better.

Other tips

KNN (K nearest neighbours). You can look at the tool Weka (though it is in Java). In fact the algorithm is quite simple and the results are good. The only problem is that KNN is a lazy classifier; hence, the training stage is fast (almost empty) and the classification stage is slower. That only matters if your training set is really big, because the algorithm is O(N*M), where N is the number of training instances and M is the number of attributes. In the worst case you can perform some filtering on your data; Weka has some methods to do that.
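A minimal sketch of KNN (with made-up feature vectors, not the asker's data) shows why it is "lazy": there is no training step at all, and each prediction scans every training instance, which is where the O(N*M) cost comes from.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training
    points (Euclidean distance). No training phase: every call scans
    all N training instances over M attributes, i.e. O(N*M)."""
    neighbours = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy training set: (feature vector, class label) pairs.
train = [([0.9, 0.1], "classA"), ([0.8, 0.2], "classA"),
         ([0.1, 0.9], "classB"), ([0.2, 0.8], "classB")]
print(knn_predict(train, [0.85, 0.15]))  # classA
```

Nominal attributes would first need a numeric encoding (or a custom distance function) before feeding them into a distance-based classifier like this.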

PS: In Weka the algorithm has a different name, IBk.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow