Generating Dinosaur names with Tensorflow RNN

https://datascience.stackexchange.com/questions/74891

11-12-2020

Question

I am trying to adapt the "Text generation with an RNN" tutorial to generate new dinosaur names from a list of existing ones. To train the RNN, the tutorial divides the text into example character sequences of equal length:

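import numpy as np
import tensorflow as tf

# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# The unique characters in the file
vocab = sorted(set(text))
idx2char = np.array(vocab)
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
# Encode the whole text as an array of integer indices (used below)
text_as_int = np.array([char2idx[c] for c in text])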
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)
# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
# Convert to sequences of the same length
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
# Sequences as text
for item in sequences.take(2):
    print("----")
    print(repr(''.join(idx2char[item.numpy()])))

Output:

----
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
----
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'

My problem differs from the tutorial in that I have a list of names of different lengths instead of a single block of text:

aachenosaurus
aardonyx
abdallahsaurus
abelisaurus
abrictosaurus
abrosaurus
abydosaurus
acanthopholis

In my case, the character sequences are names. Since I can't train an RNN on sequences of different lengths (please correct me if I am wrong here), I need to pad all my names with spaces to the length of the longest name, which is 26.

My longest name is lisboasaurusliubangosaurus, so, for example, aardonyx should be padded as:

"lisboasaurusliubangosaurus"
"aardonyx                  "

I tried to pad my sequences with:

# Convert individual characters to sequences of the desired size.
sequences = char_dataset.padded_batch(seq_length+1, padded_shapes=seq_length, drop_remainder=True)

This results in an error:

ValueError: The padded shape (26,) is not compatible with the corresponding input component shape ().

Questions:

Is it possible to train a TensorFlow RNN with sequences of variable length?
How do I pad short sequences?

Thanks!

Solution

Check out this answer:

https://stackoverflow.com/a/60230236/12642230

Alternative Solution:

TensorFlow provides a function, pad_sequences(), to do that:

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences

The default value of padding is 'pre'; you may want to change it to 'post' to get the layout you describe, and pass the maximum length via maxlen, which is 26 in your case. You would also need to add a special padding character to your dictionary of characters to indices, and pass its index as the padding value (the value argument).
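As a rough illustration of the suggestion above, here is a minimal sketch that pads a few of the names from the question with pad_sequences(). The `names` list, the `PAD` constant, and the choice of a space as the padding character (placed first so that it gets index 0) are assumptions made for this example, not part of the original answer.

```python
import numpy as np
import tensorflow as tf

# A few names from the question; in practice this would be the full list.
names = ["aachenosaurus", "aardonyx", "abdallahsaurus", "abelisaurus"]

# Assumption: a space is used as the padding character, as in the question.
# It is put first in the vocabulary so that its index is 0.
PAD = " "
vocab = [PAD] + sorted(set("".join(names)))
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = np.array(vocab)

# Encode each name separately, giving integer sequences of different lengths.
encoded = [[char2idx[ch] for ch in name] for name in names]

# Pad every sequence on the right up to the longest name length (26 in the question).
max_len = 26
padded = tf.keras.preprocessing.sequence.pad_sequences(
    encoded, maxlen=max_len, padding="post", value=char2idx[PAD]
)

print(padded.shape)  # one row per name, max_len columns
print(repr("".join(idx2char[padded[1]])))  # "aardonyx" followed by padding spaces
```

The resulting array can then be passed to tf.data.Dataset.from_tensor_slices() so that each dataset element is one padded name rather than one character.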

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange