Question

I've tried several different methods, some of which I found on here which include making a Node class and nested dictionaries, but I can't seem to get them to work.

My code currently takes in several lines of DNA (a, t, g, c) and stores them as a numpy array. It then finds the attribute that gives the most gain and splits the data into 4 new numpy arrays (depending on whether an a, t, g, or c is present at that attribute).

I'm unable to write a recursive function that builds the tree. I'm quite new to Python and to programming in general, so please describe in detail what I should do.

Thanks for any help


Solution 3

If you want to use a decision tree in Python, you can use the decision tree module from scikit-learn rather than writing your own decision tree class and logic: http://scikit-learn.org/stable/modules/tree.html. With the scikit-learn decision tree module you can keep the decision tree objects in memory, or write certain attributes of the tree to a file or database.

scikit-learn, like the other Python libraries that are part of the Anaconda distribution, is pretty much the standard for data exploration and analysis in Python. You can get Anaconda from Continuum here: http://continuum.io/downloads

EDIT 1

I came across this on Hacker News. It's about building a decision tree in Python using PostgreSQL as the database you pull values from. It might be interesting to check out: http://www.garysieling.com/blog/building-decision-tree-python-postgres-data

Other tips

If you want to implement a decision tree from scratch, I recommend building your tree using classes. A tree is composed of nodes, where a node contains other nodes recursively and leaves are terminal nodes. For a binary tree, these classes can be something like:

class Node(object):
    def __init__(self):
        self.split_variable = None
        self.left_child = None
        self.right_child = None

    def get_name(self):
        return 'Node'

class Leaf(object):
    def __init__(self):
        self.value = None

    def get_name(self):
        return 'Leaf'

For the Node class: 'split_variable' will contain the name of the variable used in the split, i.e. one of [a,t,g,c], and 'left_child' and 'right_child' will be new instances of Node or Leaf. The True/False presence of that variable is mapped to the left/right children. (In the case of a regression tree, or any split on a continuous variable, you'll need to add a fourth attribute, 'split_value', to the Node class and map values less/greater than it to the left/right children.)

For the Leaf class: 'value' contains the value assigned to the tree's class variable (i.e. the majority class in the case of a discrete variable, or the mean in the case of a continuous one).
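
The recursive build step the question asks about can be sketched along the same lines. This is only a minimal sketch, assuming each row is a sequence of bases (a string or a numpy row both work) with one class label per row, and that a split produces one child per base rather than a left/right pair. The names build_tree, information_gain, split_rows, classify, 'children' and 'default' are illustrative assumptions, not part of the classes above.

```python
from collections import Counter
from math import log2


class Leaf:
    def __init__(self, value):
        self.value = value  # predicted class label


class Node:
    def __init__(self, split_variable, children, default):
        self.split_variable = split_variable  # index of the position tested
        self.children = children              # maps a base to a Node or Leaf
        self.default = default                # majority label, for unseen bases


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def split_rows(rows, labels, position):
    """Group rows and labels by the base found at `position`."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[position], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups


def information_gain(rows, labels, position):
    groups = split_rows(rows, labels, position)
    remainder = sum(len(sub_labels) / len(labels) * entropy(sub_labels)
                    for _, sub_labels in groups.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, positions):
    majority = Counter(labels).most_common(1)[0][0]
    # Base cases: the subset is pure, or there is nothing left to split on.
    if len(set(labels)) == 1 or not positions:
        return Leaf(majority)
    best = max(positions, key=lambda p: information_gain(rows, labels, p))
    remaining = [p for p in positions if p != best]
    # Recursive case: one subtree per base observed at the best position.
    children = {base: build_tree(sub_rows, sub_labels, remaining)
                for base, (sub_rows, sub_labels)
                in split_rows(rows, labels, best).items()}
    return Node(best, children, majority)


def classify(tree, row):
    if isinstance(tree, Leaf):
        return tree.value
    child = tree.children.get(row[tree.split_variable])
    if child is None:  # base never seen at this position during training
        return tree.default
    return classify(child, row)


rows = ["atg", "acg", "ttg", "tcg"]  # made-up sequences
labels = ["x", "x", "y", "y"]        # made-up class labels
tree = build_tree(rows, labels, [0, 1, 2])
print(tree.split_variable)    # 0
print(classify(tree, "tcc"))  # y
```

The key point is that build_tree calls itself on each subset and returns what it built, so every call hands a finished subtree back to its parent. The two base cases (a pure subset, or no positions left) are what stop the recursion.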

To complete your implementation you'll need functions that walk your tree to evaluate and/or visualise it. These functions call themselves recursively until the whole path through the tree has been walked. This is where you can use the get_name() methods of the classes to tell nodes from leaves. How you implement this part really depends on how you store your data; I suggest using pandas DataFrames, which are like tables. A sample evaluate function could be (pseudocode):

def evaluate_tree(your_data, node):
    # True/present goes to the left child, False/absent to the right child
    if your_data[node.split_variable]:
        if node.left_child.get_name() == 'Node':
            # return the recursive result so it propagates up to the caller
            return evaluate_tree(your_data, node.left_child)
        elif node.left_child.get_name() == 'Leaf':
            return node.left_child.value
    else:
        if node.right_child.get_name() == 'Node':
            return evaluate_tree(your_data, node.right_child)
        elif node.right_child.get_name() == 'Leaf':
            return node.right_child.value
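
A function for visualising the tree follows the same recursive pattern. The sketch below is one possible way to do it, assuming the binary Node and Leaf classes shown earlier; describe_tree and the make_leaf/make_node helpers are hypothetical names introduced only for this example, and the small tree is made up.

```python
class Node(object):
    def __init__(self):
        self.split_variable = None
        self.left_child = None
        self.right_child = None

    def get_name(self):
        return 'Node'


class Leaf(object):
    def __init__(self):
        self.value = None

    def get_name(self):
        return 'Leaf'


def make_leaf(value):
    leaf = Leaf()
    leaf.value = value
    return leaf


def make_node(split_variable, left_child, right_child):
    node = Node()
    node.split_variable = split_variable
    node.left_child = left_child
    node.right_child = right_child
    return node


def describe_tree(node, depth=0):
    """Return the tree as a list of indented text lines."""
    indent = '  ' * depth
    if node.get_name() == 'Leaf':
        return [indent + '-> ' + str(node.value)]
    lines = [indent + 'if ' + str(node.split_variable) + ':']
    lines += describe_tree(node.left_child, depth + 1)   # True branch
    lines.append(indent + 'else:')
    lines += describe_tree(node.right_child, depth + 1)  # False branch
    return lines


# A small hand-built tree, only to exercise the function.
root = make_node('a',
                 make_leaf('class 1'),
                 make_node('t', make_leaf('class 2'), make_leaf('class 3')))
print('\n'.join(describe_tree(root)))
```

Returning a list of lines rather than printing inside the recursion keeps the function easy to test and lets the caller decide where the output goes.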

Good luck!

A dict is probably what you want.

An example of a node is:

{'sex': {'yes': 'send email', 'no': 'not send email'}}
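
A nested dict like this can be walked with a short recursive function. This is a sketch under the assumption that every node is a dict with exactly one key (the attribute tested) and that anything that is not a dict is a leaf; classify and the dna_tree example are hypothetical, and a branch value missing from the tree would raise a KeyError here.

```python
def classify(tree, sample):
    """Follow nested dicts until a non-dict value (a leaf) is reached."""
    if not isinstance(tree, dict):
        return tree
    attribute = next(iter(tree))  # each node has exactly one key
    branches = tree[attribute]
    return classify(branches[sample[attribute]], sample)


email_tree = {'sex': {'yes': 'send email', 'no': 'not send email'}}
# Made-up DNA tree: keys are positions in the sequence, branches are bases.
dna_tree = {0: {'a': 'x', 't': {1: {'c': 'y', 'g': 'x'}}}}

print(classify(email_tree, {'sex': 'yes'}))  # send email
print(classify(dna_tree, 'tc'))              # y
```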
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow