Question

I'm trying to build a multilingual WSD system that uses BERT as the embedding layer. To get better performance, after BERT has done its job (it is used frozen, as a transfer-learning component), I need to remove the subword tokens from its output so that I'm left with one vector per word. Is there a way to do so?
I've tried detaching BERT from the rest of the network's architecture and doing something like the code below, but I really need this to happen inside the model as a custom layer, and I'm not 100% sure that what I have is even right (a rough sketch of the custom layer I have in mind follows the code).

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from tqdm import tqdm


class Bert:
    def __init__(self):
        input_word_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_word_ids")
        input_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_mask")
        segment_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="segment_ids")
        print("downloading BERT...")
        bert = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/1", trainable=False, name="BERT")
        print("BERT downloaded")
        pooled_output, sequence_output = bert([input_word_ids, input_mask, segment_ids])
        self.model = tf.keras.models.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])
        self.model.summary()


    def predict(self, input_word_ids, input_mask, segment_ids, positional_ids, needed_padding, train_mode: bool = False):
        print("Starting BERT prediction...")
        pool_embs, all_embs = self.model.predict(
            {'input_word_ids': input_word_ids, 'input_mask': input_mask, 'segment_ids': segment_ids},
            verbose=1,
            batch_size=64
        )
        del pool_embs  # only the per-token embeddings are needed
        to_return = []
        print("Converting subword embeddings to word embeddings...")
        for i in tqdm(range(len(positional_ids))):
            indexes_to_extrapolate = np.concatenate((positional_ids[i], needed_padding[i]))
            # Truncate to the maximum sequence length of 64
            indexes_to_extrapolate = indexes_to_extrapolate[:64] if len(indexes_to_extrapolate) > 64 else indexes_to_extrapolate
            new_version = tf.gather(all_embs[i], tf.constant(indexes_to_extrapolate))
            if train_mode and new_version.shape[0] < 64:
                # It means that, originally, there had to be padding,
                # and if so, its index can be found in the first position of needed_padding[i]
                how_much_iteration = 64 - new_version.shape[0]
                if how_much_iteration > 0:
                    for iteratore in range(how_much_iteration):
                        tmp_padding_for_iteration = needed_padding[i][0]
                        new_version = tf.concat([new_version, tf.constant(all_embs[i][tmp_padding_for_iteration], shape=(1, 768))], 0)
            with open("registro_shape.txt", "a") as registro:
                registro.write("Shape --> " + str(new_version.shape) + "\n")
            if new_version.shape[0] > 64:
                print("Warning: sequence longer than 64 after gathering")
            to_return.append(new_version)
        return tf.stack(to_return)
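
To make clearer what I'm aiming for, this is roughly what I imagine the custom layer could look like in place of the detached predict() above: a layer that receives BERT's sequence_output together with the index of the first sub-token of every word and gathers only those positions, so the rest of the network sees one vector per word. The layer name WordPooling and the extra input word_start_ids are just placeholders I made up, and this is a rough sketch I haven't verified, not a working implementation:

class WordPooling(tf.keras.layers.Layer):
    # Hypothetical layer: for every sentence, keep only the embedding of the
    # first sub-token of each word, so the output has one vector per word.
    def call(self, inputs):
        sequence_output, word_start_ids = inputs
        # sequence_output: (batch, n_subtokens, 768)
        # word_start_ids:  (batch, n_words) int32, padded to a fixed length
        return tf.gather(sequence_output, word_start_ids, batch_dims=1)

word_start_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="word_start_ids")
word_embeddings = WordPooling()([sequence_output, word_start_ids])

The idea would be to feed word_start_ids as a fourth model input and plug word_embeddings into the BiLSTM instead of sequence_output, but I don't know if this is the right approach.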

EDIT: I'll try to give more context about the architecture of the network. This is the network I'm trying to build for the WSD task; note that it should perform multitask learning:

  1. BERT
  2. BiLSTM
  3. Attention layer
  4. 3 output layers

self.tokenizatore = FullTokenizer(bert_path, do_lower_case=False)

input_word_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="segment_ids")

print("downloading BERT...")
bert = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/1", trainable=False)
print("BERT downloaded")
pooled_output, sequence_output = bert([input_word_ids, input_mask, segment_ids])
LSTM = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(
        units=hidden_size,
        dropout=dropout,
        recurrent_dropout=recurrent_dropout,
        return_sequences=True,
        return_state=True
    )
)(sequence_output)
LSTM = self.produce_attention_layer(LSTM)
LSTM = tf.keras.layers.Dropout(0.5)(LSTM)

babelnet_output = tf.keras.layers.Dense(outputs_size[0], activation="softmax", name="babelnet")(LSTM)
domain_output = tf.keras.layers.Dense(outputs_size[1], activation="softmax", name="domain")(LSTM)
lexicon_output = tf.keras.layers.Dense(outputs_size[2], activation="softmax", name="lexicon")(LSTM)



def produce_attention_layer(self, LSTM):
    """
    Produces an attention layer like the one described in Raganato et al.,
    "Neural Sequence Learning Models for Word Sense Disambiguation", Section 3.2.
    :param LSTM: the list of outputs of the BiLSTM (sequence output plus the forward/backward states)
    :return: the sequence output of the BiLSTM, reweighted by the attention layer
    """
    # LSTM[0] is the sequence output; LSTM[1] and LSTM[3] are the forward and backward hidden states
    hidden_states = tf.keras.layers.Concatenate()([LSTM[1], LSTM[3]])
    ripetitore = tf.keras.layers.RepeatVector(tf.keras.backend.shape(LSTM[0])[1])(hidden_states)
    u = tf.keras.layers.Dense(1, activation="tanh")(ripetitore)
    attivazione = tf.keras.layers.Activation('softmax')(u)  # custom softmax(axis=1) loaded in this notebook
    dotor = tf.keras.layers.Multiply()([LSTM[0], attivazione])

    return dotor
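
One doubt I have about the attention layer above: as far as I know, RepeatVector expects a plain Python integer, while tf.keras.backend.shape(LSTM[0])[1] is a tensor, so this may break with a variable sequence length. A workaround I'm considering (just a sketch, not tested) is to do the repetition with tf.tile inside a Lambda:

def repeat_over_timesteps(args):
    sequence, hidden = args
    timesteps = tf.shape(sequence)[1]
    # (batch, 2 * units) -> (batch, timesteps, 2 * units)
    return tf.tile(tf.expand_dims(hidden, 1), [1, timesteps, 1])

ripetitore = tf.keras.layers.Lambda(repeat_over_timesteps)([LSTM[0], hidden_states])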

