Question

I want to do text categorization on a dataset of news articles. I have many features such as subject, keyword, summary, etc. All of these features are stored in one cell array of structs, with each struct looking like this:

       label: 'misc.forsale'
        subj: ' Motorcycle wanted.'
     keyword: [1x190 char]
   reference: []
organization: ' Worcester Polytechnic Institute'
        from: ' kedz@bigwpi.WPI.EDU (John Kedziora)'
     summary: []
       lines: ' 11'
       vocab: [4x2 double]

I want to classify them with

    class = classify(test, train, target, 'diaglinear');

but this function only accepts numeric arrays as input, not cell arrays or structs.

I can't convert this cell array into one multidimensional array because the number of features varies (for example, one subject has two words and another has three).

What can I do?


Solution

Do some feature extraction first. For example, tokenize the strings, then use TF-IDF.
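A minimal MATLAB sketch of that step, assuming the text fields of each struct have already been concatenated into a cell array of strings called docs (the documents and variable names here are made up for illustration):

    % docs: one string per document, built from the struct's text fields.
    docs = {' Motorcycle wanted. low miles', 'wanted: cheap motorcycle parts'};

    % Tokenize: lower-case and keep runs of letters.
    tokens = cellfun(@(d) regexp(lower(d), '[a-z]+', 'match'), docs, ...
                     'UniformOutput', false);

    % Vocabulary over all documents.
    vocab = unique([tokens{:}]);

    % Term-frequency matrix: one row per document, one column per term.
    nDocs = numel(docs);
    tf = zeros(nDocs, numel(vocab));
    for i = 1:nDocs
        [hit, idx] = ismember(tokens{i}, vocab);
        tf(i, :) = accumarray(idx(hit)', 1, [numel(vocab), 1])';
    end

    % TF-IDF weighting: idf = log(N / document frequency).
    df    = sum(tf > 0, 1);
    idf   = log(nDocs ./ max(df, 1));
    tfidf = tf .* idf;    % use bsxfun(@times, tf, idf) on MATLAB before R2016b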

You can include the field key with the tokens. This is a common practice in information retrieval; see the Xapian manual for an example.

Usually you will also do some stemming, e.g. Examples -> exampl. Then add a prefix so that the same word stays distinct depending on the field it occurred in, e.g. Sexampl when the subject contained examples and Kexampl when it appeared as a keyword.
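A sketch of the prefixing step (the field names and one-letter prefixes below are just an illustrative convention; stemming would be applied to the tokens before prefixing, e.g. with normalizeWords from the Text Analytics Toolbox if you have it):

    % One document from the cell array of structs.
    doc = struct('subj', ' Motorcycle wanted.', 'keyword', 'motorcycle parts cheap');

    fields   = {'subj', 'keyword'};   % which struct fields to use
    prefixes = {'S',    'K'};         % one tag per field

    allTokens = {};
    for f = 1:numel(fields)
        toks = regexp(lower(doc.(fields{f})), '[a-z]+', 'match');
        toks = strcat(prefixes{f}, toks);   % 'motorcycle' -> 'Smotorcycle' or 'Kmotorcycle'
        allTokens = [allTokens, toks];      %#ok<AGROW>
    end
    % allTokens is now {'Smotorcycle','Swanted','Kmotorcycle','Kparts','Kcheap'}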

Then you have a "bag of words" representation, which is used everywhere. The same trick is even used for mining images, where it is called "visual words"; those aren't English-language words either.
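Once every document is a fixed-length numeric row over the same vocabulary, classify can be used as intended. A small sketch with a synthetic TF-IDF matrix standing in for the real one (the train/test split is an assumption; classify needs the Statistics and Machine Learning Toolbox):

    rng(0);
    tfidf  = [rand(10, 5) + 1; rand(10, 5)];             % 20 documents x 5 terms
    labels = [repmat({'misc.forsale'},    10, 1); ...
              repmat({'rec.motorcycles'}, 10, 1)];        % one label per row

    trainIdx = [1:8, 11:18];                              % assumed train/test split
    testIdx  = [9:10, 19:20];

    predicted = classify(tfidf(testIdx, :), tfidf(trainIdx, :), ...
                         labels(trainIdx), 'diaglinear');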

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow