Question

I am relatively new to deep learning (I have some experience with CNNs in PyTorch), and I am not sure how to tackle the following idea. I want to parse a sentence, e.g. I like trees., one-hot encode the parse output of each word, and feed that into an ML system. The output for each sentence is a floating-point number. As an example, the sentence I like trees. could be pre-processed and encoded as fixed-size feature vectors per token:

[[0 1 0 0 1] [1 0 0 0 0] [1 0 1 1 0] [0 0 0 0 1]]

or flattened

[0 1 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 1]
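A minimal NumPy sketch of these two shapes, using the toy vectors above (a plain feed-forward network would take the flat form, an RNN the per-token form):

```python
import numpy as np

# Toy per-token feature vectors for "I like trees ." (4 tokens x 5 features).
sentence = np.array([
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
])

flat = sentence.flatten()            # shape (20,): one long vector per sentence
print(sentence.shape, flat.shape)    # (4, 5) (20,)
```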

However, sentence lengths differ, of course. From what I know, there are some fixes for this: padding (when a sentence is shorter than your defined cut-off) or truncating (when it is longer). Another solution I often see mentioned is an RNN/LSTM, but I have no experience with them (only a basic theoretical notion).
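A minimal sketch of the padding/truncation fix, assuming a chosen cut-off `MAX_LEN` and zero-padding on the right (both names and values here are hypothetical):

```python
import numpy as np

MAX_LEN = 10     # chosen cut-off (assumption)
N_FEATURES = 5   # features per token, as in the toy example above

def pad_or_truncate(tokens: np.ndarray) -> np.ndarray:
    """Pad with all-zero rows up to MAX_LEN, or cut off longer sentences."""
    tokens = tokens[:MAX_LEN]                            # truncate if too long
    pad = np.zeros((MAX_LEN - len(tokens), N_FEATURES))  # 0 or more zero rows
    return np.vstack([tokens, pad])                      # pad if too short
```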

The expected output (label) of this sentence could be something like 50.2378.

My question, then, is: which architecture is best suited for this task? RNNs are popular in NLP, so I am leaning towards them, but I am not sure whether a regression task fits well (or how it fits) with the architecture of an RNN. I have searched for RNNs and regression, but I can only find use cases involving time series, not NLP or one-hot encoded features.
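To show how regression can fit an RNN, here is a minimal PyTorch sketch of one reasonable setup (an assumption, not the only option): an LSTM reads the per-token feature vectors, its last hidden state goes through a linear layer producing a single number, and the model is trained with MSE loss. The class name and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class SentenceRegressor(nn.Module):
    def __init__(self, n_features=5, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, 1)   # regression head: one float

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        _, (h_n, _) = self.lstm(x)        # h_n: (1, batch, hidden_size)
        return self.head(h_n[-1]).squeeze(-1)   # (batch,)

model = SentenceRegressor()
criterion = nn.MSELoss()                  # standard loss for regression
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One dummy training step on random data shaped like the padded sentences.
x = torch.randn(8, 10, 5)                 # batch of 8 padded sentences
y = torch.randn(8)                        # one target float per sentence
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The key point is that only the output layer and the loss differ from a classifier: a single linear unit with no softmax, trained against MSE instead of cross-entropy.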

In essence, what I have is a dataset of sentences that are pre-processed to get some features per token. Let's assume, for brevity's sake, that these features are syntactic, e.g. the word's POS tag and its dependency information (subj, obj, and so on). These features would then be one-hot encoded (I assume) to get an easy-to-use dataset. The inputs, thus, are sentences encoded in such a way that the token-level information is preserved, as in the flattened example above.
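A minimal sketch of that per-token encoding, assuming tiny hypothetical tag inventories; in a real pipeline the tags would come from a parser such as spaCy:

```python
import numpy as np

# Hypothetical, tiny tag inventories; real ones come from your parser.
POS_TAGS = ["NOUN", "VERB", "PRON", "PUNCT"]
DEP_LABELS = ["subj", "obj", "root", "punct"]

def encode_token(pos: str, dep: str) -> np.ndarray:
    """Concatenate a one-hot POS vector and a one-hot dependency vector."""
    vec = np.zeros(len(POS_TAGS) + len(DEP_LABELS))
    vec[POS_TAGS.index(pos)] = 1
    vec[len(POS_TAGS) + DEP_LABELS.index(dep)] = 1
    return vec

# "I like trees ." -> one multi-hot row per token (toy analysis, not real parser output)
tagged = [("PRON", "subj"), ("VERB", "root"), ("NOUN", "obj"), ("PUNCT", "punct")]
encoded = np.stack([encode_token(p, d) for p, d in tagged])
```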

The output for every sentence is some floating-point number, representing some value that has been calculated beforehand. In theory, these values indicate the sentence's equivalence with its translation, but that is not at all important for the system. The important part is that every input is a sentence mapped to a number.

Some dummy data: the actual sentence, its vector representation, and the output.

I like trees.                   [0 1 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 1]                           50.2378
What is that thing?             [1 1 0 1 0 0 0 0 1 1 1 0 0 0 0 1 1 0 1 0 0 1 1 0 1]               20.1237
Who are you?                    [0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 1]                           1.6589
The cookies smell good today.   [0 0 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 0 1 0 1 0 1 1 1 0 0 0 0 1]     18.6295
I do!                           [0 1 0 0 1 1 0 1 1 0 0 0 1 0 1]                                     24.5489   
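Pairs like these map naturally onto a PyTorch Dataset; a minimal sketch, assuming the encoded sentences have already been padded to a common length (the class name is hypothetical):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SentenceScoreDataset(Dataset):
    """Pairs a padded (seq_len, n_features) tensor with its target float."""
    def __init__(self, encoded_sentences, targets):
        self.x = [torch.as_tensor(s, dtype=torch.float32) for s in encoded_sentences]
        self.y = torch.as_tensor(targets, dtype=torch.float32)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

# loader = DataLoader(SentenceScoreDataset(xs, ys), batch_size=8, shuffle=True)
```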

The goal, then, is that a given unseen sentence can be pre-processed, one-hot encoded, and given as input, and that an output (floating-point number) is predicted by the trained model/function.
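At inference time, an unseen sentence would go through the same pipeline; a sketch reusing the hypothetical `encode_token()` and trained `model` from above (the model's `input_size` must match the encoder's feature count, and the tag sequence here is a toy analysis, not real parser output):

```python
import numpy as np
import torch

# Toy pre-parsed analysis of "Who are you?" (hypothetical tags).
new_sentence = [("PRON", "subj"), ("VERB", "root"), ("PRON", "obj"), ("PUNCT", "punct")]

x = np.stack([encode_token(p, d) for p, d in new_sentence])
x = torch.as_tensor(x, dtype=torch.float32).unsqueeze(0)  # (1, seq_len, n_features)

model.eval()
with torch.no_grad():
    prediction = model(x).item()  # the predicted floating-point number
```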

No correct solution
