Domanda

I am trying to parse a collection of data that has two (or one) useful pieces, but may be organized in many different ways:

V01C01
Vol 1 Chapter 1
Chapter 1 Volume 1 - Alt title
V1.1
etc.

I don't want to use a massive collection of regexs, because there is no way to predict all of the combinations of how things will be organized (also some will have extraneous text). I feel like there is a branch of machine learning that may be perfect for this, but I'm not experienced in it enough to know.

È stato utile?

Soluzione

Well that is an interesting problem for sure and there are a couple of things you could try.

Making the assumption that you don't have labels on your data, then the first thing I would try to do, is to check the connections between each instance using a clustering algorithm like k-means (http://en.wikipedia.org/wiki/K-means_clustering), keep in mind that this wouldn't solve your problem but would help you to explore your data and hopefully find a set of features to train a supervised learning classifier.

In the case that you do have labels on your data, or you could manually tag your set. Then you are in front a more manageable problem. At first glance, it would look a lot like a text or document classification problem (like classify emails as Spam/NoSpam), in which case a naive bayes classifier could be a good first attempt to attack the problem since is a easy algorithm to implement and can provide reasonable good results.

About Naives Bayes Classifier (https://www.bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html)

I made some assumptions here and I might be wrong based on that. Maybe if you clarify some points (like if you are able to manually tag the data) we would be able to help you further.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top