Question

I'm trying to classify a large set of documents (~100M) as valid or invalid, based on a small given set of labeled valid documents (~3k). I'd like to know whether the PU learning approach described by Elkan and Noto (2008) combined with the character-level string encoding described by Dos Santos and Gatti de Bayser (2014) (both cited at the end) is likely to give good results, and why.

Some preliminary info about the dataset (ignore if you want to skip ahead to the model):

  • Each document has <= 50 words.
  • Each word has <= 50 characters.
  • Each character is one-hot encoded over an alphabet of 39 symbols, matching the last dimension of the model input below (see the encoding sketch after this list).
  • Any randomly selected document pair is likely to have very few common words.
  • There are a significant number of proper nouns in the dataset that have to be captured.
  • Original set for training the network contains ~10k unlabeled documents and ~3k labeled ones. The assumption is that the 10k unlabeled documents selected will be mostly different from the 3k labeled ones.
  • Train-test-validation split is 80%-10%-10%.
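For concreteness, here is a minimal sketch of the kind of per-character one-hot encoding I have in mind; the exact 39-symbol ALPHABET below is only an illustration, chosen to match the last dimension of the model input:

import numpy as np

# Illustrative 39-symbol alphabet: 26 letters, 10 digits, 3 punctuation marks.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-.'"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_document(doc, max_words=50, max_chars=50):
  # One-hot encode a document into a (max_words, max_chars, 39) array,
  # zero-padding short words and short documents.
  x = np.zeros((max_words, max_chars, len(ALPHABET)), dtype=np.float32)
  for w, word in enumerate(doc.lower().split()[:max_words]):
    for c, ch in enumerate(word[:max_chars]):
      if ch in CHAR_INDEX:
        x[w, c, CHAR_INDEX[ch]] = 1.0
  return x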

The first goal is to write a classifier that distinguishes labeled and unlabeled documents. For completeness' sake, the code for this classifier (written using Keras) is as follows:

import keras
from keras import layers, losses, models, optimizers
from keras.backend import int_shape

def cnn_model(input_batch_shape):
  # Input: (batch, words, characters, one-hot alphabet), e.g. (16, 50, 50, 39).
  input_layer = layers.Input(batch_shape=input_batch_shape)
  # Project each character's one-hot vector down to a fixed 16-dim embedding.
  fixed_length_letter = layers.Dense(16, activation="relu")(input_layer)
  # Add a trailing channel axis so Conv3D can slide over (words, chars, embedding).
  _, n_words, n_chars, emb = int_shape(fixed_length_letter)
  reshaped_fixed_length_letter = layers.Reshape((n_words, n_chars, emb, 1))(fixed_length_letter)
  # Word-level convolutions and pooling over the character axis.
  conv_word_1 = layers.Conv3D(16, (1, n_chars // 5, emb), activation="relu")(reshaped_fixed_length_letter)
  conv_word_1 = layers.MaxPooling3D((1, 5, 1))(conv_word_1)
  conv_word_2 = layers.Conv3D(16, (1, int_shape(conv_word_1)[2] // 4, int_shape(conv_word_1)[3]), activation="relu")(conv_word_1)
  conv_word_2 = layers.MaxPooling3D((1, 4, 1))(conv_word_2)
  # Sentence-level convolutions and pooling over the word axis.
  conv_sentence_1 = layers.Conv3D(8, (int_shape(conv_word_2)[1] // 5, 1, 1))(conv_word_2)
  conv_sentence_1 = layers.MaxPooling3D((5, 1, 1))(conv_sentence_1)
  conv_sentence_2 = layers.Conv3D(8, (int_shape(conv_sentence_1)[1] // 4, 1, 1))(conv_sentence_1)
  conv_sentence_2 = layers.MaxPooling3D((4, 1, 1))(conv_sentence_2)
  flattened_sentence = layers.Flatten()(conv_sentence_2)
  output_layer = layers.Dense(1, activation="softmax")(flattened_sentence)
  return models.Model(inputs=input_layer, outputs=output_layer)

model_check_labeled = cnn_model((16, 50, 50, 39))
model_check_labeled.compile(optimizers.Adam(), losses.binary_crossentropy, metrics=["accuracy"])
model_check_labeled.summary()

As the code shows, I've dropped the part of Dos Santos and Gatti de Bayser's proposed architecture that also encodes whole words into vectors, since I don't think I have enough valid samples to learn a good word-level embedding for my vocabulary.
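For reference, the PU step from Elkan and Noto that this classifier would feed into is the calibration below. This is only a sketch; the array names (x_val_labeled for held-out labeled documents, x_new for documents to score) are placeholders:

import numpy as np

def pu_scores(model, x_val_labeled, x_new):
  # Elkan & Noto (2008): the labeled-vs-unlabeled classifier g(x) estimates
  # p(s=1|x). Under their "selected completely at random" assumption,
  # p(y=1|x) = g(x) / c, where c = p(s=1|y=1) is estimated as the mean
  # score g assigns to held-out labeled (valid) documents (estimator e1).
  c = model.predict(x_val_labeled).mean()
  g = model.predict(x_new)
  return np.clip(g / c, 0.0, 1.0)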

Training this classifier led to the following accuracy and loss curves:

[Figure: accuracy and loss curves]

As the graph shows, my classifier achieved only ~25% accuracy, meaning no meaningful features were captured at all. I would like to know whether this is expected, given the architecture of my classifier and the dataset described above. I would also like to know whether there are other approaches that would at least help classify new documents as labeled or unlabeled.

Please tell me if the description of the dataset sounds too vague; if so, I'll try to present some examples with similar features.

Papers cited:

  • Elkan, Charles & Noto, Keith (2008). "Learning classifiers from only positive and unlabeled data." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 213-220. doi:10.1145/1401890.1401920.
  • Dos Santos, Cicero & Gatti de Bayser, Maira (2014). "Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts." Proceedings of COLING 2014.

Solution

Well, I went over the code a couple more times, and I figured it out. There is a glaring error in the following line:

output_layer = layers.Dense(1, activation="softmax")(flattened_sentence)

Softmax applied to a single output unit normalizes over just that one value, so it always outputs 1.0 regardless of the input; the predictions are constant no matter what the network learns. I had used softmax because I wanted separate probabilities for the two classes, but forgot about that choice at some point while writing the generator for my input. Converting it to sigmoid gets the classifier to about 90% accuracy on the test set.
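You can verify the constant-output behaviour directly: softmax normalizes over the output units, and with a single unit there is nothing to normalize against:

import numpy as np

def softmax(z):
  # Numerically stable softmax over a vector of logits.
  e = np.exp(z - np.max(z))
  return e / e.sum()

print(softmax(np.array([-3.7])))  # [1.]
print(softmax(np.array([42.0])))  # [1.] -- always 1, whatever the logit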

The important lesson is that, for binary classification, softmax requires one output unit per class (with one-hot targets), while sigmoid needs only a single unit whose output varies between 0 and 1.
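Concretely, the fix is a one-line change; the commented-out alternative shows the equivalent two-unit softmax formulation, which would also require one-hot labels and categorical cross-entropy:

output_layer = layers.Dense(1, activation="sigmoid")(flattened_sentence)

# Equivalent alternative: one unit per class, one-hot labels, and
# losses.categorical_crossentropy instead of binary cross-entropy.
# output_layer = layers.Dense(2, activation="softmax")(flattened_sentence)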
