Serialization, classification in pyBrain, machine learning, prediction

https://stackoverflow.com/questions/20460619

30-08-2022

Question

Here is an example of my training data (I have 1000 films for training). I need to predict the 'budget' of each film:

film_1 = {
    'title': 'The Hobbit: An Unexpected Journey',
    'article_size': 25000,
    'producer': ['Peter Jackson', 'Fran Walsh', 'Zane Weiner'],
    'release_date': some_date(2013, 11, 28),
    'running_time': 169,
    'country': ['New Zealand', 'UK', 'USA'],
    'budget': dec('200000000')
}

Keys such as 'title', 'producer', and 'country' can be viewed as features in machine learning, while values such as 'The Hobbit: An Unexpected Journey', 25000, etc., can be viewed as the values used in the learning process. However, training input is usually expected to be real numbers rather than strings. Do I need to convert string fields like 'title', 'producer', and 'country' to int (through something like classification or serialization), or apply some other manipulation, so that I can use this data as a training set for my network?

Solution

I was wondering whether this is what you need:

film_list = ['title', 'article_size', 'producer', 'release_date', 'running_time', 'country', 'budget']
# Pair each feature name with an integer index.
flist = list(enumerate(film_list))
label = [seq[0] for seq in flist]
name = [seq[1] for seq in flist]
print(label)
print(name)

>>[0, 1, 2, 3, 4, 5, 6]
['title', 'article_size', 'producer', 'release_date', 'running_time', 'country', 'budget']

Or you can use your dictionary directly:

labels = list(film_1.keys())
print(labels)

# But before Python 3.7 a dict does not preserve insertion order, so the keys come back in arbitrary order (not sorted); labels[0] may give you 'producer' instead of 'title':
>>['producer', 'title', 'country', 'release_date', 'budget', 'article_size', 'running_time']
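
Note that numbering the keys only labels the feature names. The string values themselves still have to become numbers before a network can train on them. Below is a minimal sketch of one common way to do that, multi-hot encoding of the list-valued fields, and not necessarily what the original answer had in mind. It assumes that `some_date` and `dec` in the question stand for `datetime.date` and `decimal.Decimal`. The helper names (`build_vocabulary`, `multi_hot`, `to_features`) and the second film record are hypothetical, made up purely for illustration. 'title' is left out on the assumption that it is close to unique per film and so carries little signal as a category.

```python
from datetime import date
from decimal import Decimal

films = [
    {
        'title': 'The Hobbit: An Unexpected Journey',
        'article_size': 25000,
        'producer': ['Peter Jackson', 'Fran Walsh', 'Zane Weiner'],
        'release_date': date(2013, 11, 28),
        'running_time': 169,
        'country': ['New Zealand', 'UK', 'USA'],
        'budget': Decimal('200000000'),
    },
    {
        # Made-up placeholder record, only here so the encoding has
        # more than one row to work with.
        'title': 'Placeholder Film',
        'article_size': 1000,
        'producer': ['Peter Jackson'],
        'release_date': date(2000, 1, 1),
        'running_time': 90,
        'country': ['USA'],
        'budget': Decimal('1000000'),
    },
]


def build_vocabulary(records, key):
    """Map every distinct string found under `key` to a column index."""
    values = sorted({v for r in records for v in r[key]})
    return {v: i for i, v in enumerate(values)}


def multi_hot(values, vocab):
    """Return a 0/1 list with a 1 at the index of each value present."""
    row = [0] * len(vocab)
    for v in values:
        row[vocab[v]] = 1
    return row


def to_features(record, producer_vocab, country_vocab):
    """Turn one film dict into a flat list of numbers."""
    numeric = [
        float(record['article_size']),
        float(record['running_time']),
        # Dates become a day count so they are ordinary numbers.
        float(record['release_date'].toordinal()),
    ]
    return (numeric
            + multi_hot(record['producer'], producer_vocab)
            + multi_hot(record['country'], country_vocab))


producer_vocab = build_vocabulary(films, 'producer')
country_vocab = build_vocabulary(films, 'country')

# X holds the numeric inputs, y the prediction target (budget).
X = [to_features(f, producer_vocab, country_vocab) for f in films]
y = [float(f['budget']) for f in films]
```

Each row of `X` can then be passed to whatever the training library expects as an input vector, with the matching entry of `y` as the target. With 1000 films the producer vocabulary may get large, so in practice you would probably keep only the most frequent producers and scale the numeric columns. Treat both as tuning choices rather than requirements.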

Licensed under: CC-BY-SA with attribution
