Question

Background: I am working on an NLP problem where I need to extract different types of features from text documents, depending on different types of context. My current setup has a FeatureExtractor base class that is subclassed once per context type. Each subclass independently calculates its own set of features and returns a pandas DataFrame.

All of these subclasses are called by a wrapper class, FeatureExtractionRunner, which runs every extractor on the document and returns the output for all context types.

Problem: As it is set up right now, this pattern leads to a lot of subclasses. I currently have 14 subclasses, one for each of my 14 contexts, and the number might grow. That is too many classes to maintain. Is there an alternative way of doing this with less subclassing?

Here is some representative sample code for what I explained:

from abc import ABCMeta, abstractmethod

class FeatureExtractor(metaclass=ABCMeta):
    #base feature extractor class
    def __init__(self, document):
        self.document = document
        
        
    @abstractmethod
    def doc_to_features(self):
        raise NotImplementedError
    
    
class ExtractorTypeA(FeatureExtractor):
    #do some feature calculations.....
    
    def _calculate_shape_features(self):
        return None
    
    def _calculate_size_features(self):
        return None
    
    def doc_to_features(self):
        #calls all the fancy feature calculation methods like 
        # the helpers read self.document themselves, so no argument is passed
        f1 = self._calculate_shape_features()
        f2 = self._calculate_size_features()
        #do some calculations on the document and return a pandas dataframe by merging them  (merge f1, f2....etc)
        data = "dataframe-1"
        return data
    
    
class ExtractorTypeB(FeatureExtractor):
    #do some feature calculations.....
    
    def _calculate_some_fancy_features(self):
        return None
    
    def _calculate_some_more_fancy_features(self):
        return None
    
    def doc_to_features(self):
        #calls all the fancy feature calculation methods
        # the helpers read self.document themselves, so no argument is passed
        f1 = self._calculate_some_fancy_features()
        f2 = self._calculate_some_more_fancy_features()
        #do some calculations on the document and return a pandas dataframe (merge f1, f2 etc)
        data = "dataframe-2"
        return data
    
class ExtractorTypeC(FeatureExtractor):
    #do some feature calculations.....
    
    def doc_to_features(self):
        #do some calculations on the document and return a pandas dataframe
        data = "dataframe-3"
        return data

class FeatureExtractionRunner:
    #a class to call all types of feature extractors 
    def __init__(self, document, *args, **kwargs):
        self.document = document
        self.type_a = ExtractorTypeA(self.document)
        self.type_b = ExtractorTypeB(self.document)
        self.type_c = ExtractorTypeC(self.document)
        #more of these extractors would be there
        
    def call_all_type_of_extractors(self):
        type_a_features = self.type_a.doc_to_features()
        type_b_features = self.type_b.doc_to_features()
        type_c_features = self.type_c.doc_to_features()
        #more such extractors would be there....
        
        return [type_a_features, type_b_features, type_c_features]
        
        
all_type_of_features = FeatureExtractionRunner("some document").call_all_type_of_extractors()

So the base class calculates some features by default, each expressed as a method, maybe around 10 methods in total. Each subclass has all the default features plus some additional special features of its own, ranging from 2 or 3 methods to 6 at most. These special methods are specific to their context, so the other subclasses neither know about them nor need them, and they are not shared.

Solution

Your class structure seems reasonable. You already extracted common functionality into the base class. Maybe you can find additional commonalities between some of the extractors and put that into a new intermediate level of the class hierarchy. Or put it into a free function. Nothing forces you to use a strict OOP approach.

But in the end, if your use case has complex individual behaviour, your implementation will reflect that complexity. There’s no way around it.
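To make the free-function suggestion above a bit more concrete, here is a minimal sketch. The helper names `shape_features` and `size_features` and the features they compute are hypothetical, and plain dicts stand in for the pandas DataFrames so the example stays self-contained:

```python
from abc import ABC, abstractmethod


def shape_features(document):
    """Hypothetical shared helper: features describing the text's shape."""
    return {"n_upper": sum(ch.isupper() for ch in document)}


def size_features(document):
    """Hypothetical shared helper: features describing the text's size."""
    return {"n_chars": len(document), "n_words": len(document.split())}


class FeatureExtractor(ABC):
    def __init__(self, document):
        self.document = document

    @abstractmethod
    def doc_to_features(self):
        raise NotImplementedError


class ExtractorTypeA(FeatureExtractor):
    def doc_to_features(self):
        # Compose the shared free functions instead of inheriting them.
        return {**shape_features(self.document), **size_features(self.document)}


class ExtractorTypeC(FeatureExtractor):
    def doc_to_features(self):
        # This context only needs the size features.
        return size_features(self.document)
```

Each extractor then picks the helpers it needs, so shared calculations do not have to live in the base class or in an extra intermediate class.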

What you can do easily is avoid the code duplication in the runner. A simple solution is a list of Extractor types that acts as a registry:

class ExtractorTypeA …
class ExtractorTypeB …
#...

# Doesn’t have to be global.
# Could also be an attribute of the runner.
extractor_classes = [
    ExtractorTypeA,
    ExtractorTypeB,
    # ...
]

class FeatureExtractionRunner:
    def __init__(self, document, *args, **kwargs):
        self.document = document
        # a list comprehension over the registry replaces the duplicated code
        self.extractors = [
            Extractor(self.document) for Extractor in extractor_classes
        ]

    def call_all_type_of_extractors(self):
        return [extr.doc_to_features() for extr in self.extractors]

Now the only bit of duplication left is the registry itself. If you add an extractor class you must add it to the registry explicitly. That might not be a problem because it’s easy to document, and forgetting it is easily caught by unit tests.
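A unit test for this could look roughly like the sketch below. It assumes the registry is the module-level list `extractor_classes` from above; the helper `all_subclasses` and the test name are made up for illustration, and the extractor bodies are placeholders:

```python
from abc import ABC, abstractmethod


class FeatureExtractor(ABC):
    def __init__(self, document):
        self.document = document

    @abstractmethod
    def doc_to_features(self):
        raise NotImplementedError


class ExtractorTypeA(FeatureExtractor):
    def doc_to_features(self):
        return "dataframe-1"


class ExtractorTypeB(FeatureExtractor):
    def doc_to_features(self):
        return "dataframe-2"


extractor_classes = [ExtractorTypeA, ExtractorTypeB]


def all_subclasses(cls):
    """Recursively collect every direct and indirect subclass of cls."""
    found = set()
    for sub in cls.__subclasses__():
        found.add(sub)
        found |= all_subclasses(sub)
    return found


def test_registry_is_complete():
    # Fails as soon as a subclass exists that was not added to the registry.
    missing = all_subclasses(FeatureExtractor) - set(extractor_classes)
    assert not missing, f"Not registered: {sorted(c.__name__ for c in missing)}"
```

If you introduce abstract intermediate classes that should not be registered, the test would need to skip them, for example with `inspect.isabstract`.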

You could get fancy with a custom metaclass that takes care of registering automatically. Untested off the top of my head:

extractor_classes = []

class MetaFeatureExtractor(ABCMeta):
    def __init__(cls, name, bases, attrs):
        super().__init__(name, bases, attrs)

        # registry must only contain subclasses
        if name != 'FeatureExtractor':
            extractor_classes.append(cls)

class FeatureExtractor(metaclass=MetaFeatureExtractor):
    # ...

class ExtractorTypeA(FeatureExtractor) …
class ExtractorTypeB(FeatureExtractor) …
#...
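As an alternative to the metaclass, and a different technique from the one sketched above, Python 3.6+ offers the `__init_subclass__` hook, which can do the same automatic registration without a custom metaclass. This is a minimal sketch: the `registry` attribute name is my own choice, and the extractor bodies are placeholders:

```python
import inspect
from abc import ABC, abstractmethod


class FeatureExtractor(ABC):
    # Hypothetical attribute name; filled automatically as subclasses are defined.
    registry = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once for every subclass, but never for FeatureExtractor itself.
        FeatureExtractor.registry.append(cls)

    def __init__(self, document):
        self.document = document

    @abstractmethod
    def doc_to_features(self):
        raise NotImplementedError


class ExtractorTypeA(FeatureExtractor):
    def doc_to_features(self):
        return "dataframe-1"


class ExtractorTypeB(FeatureExtractor):
    def doc_to_features(self):
        return "dataframe-2"


class FeatureExtractionRunner:
    def __init__(self, document):
        self.document = document
        # Skip abstract intermediate classes; they cannot be instantiated.
        self.extractors = [
            extractor(document)
            for extractor in FeatureExtractor.registry
            if not inspect.isabstract(extractor)
        ]

    def call_all_type_of_extractors(self):
        return [extr.doc_to_features() for extr in self.extractors]
```

With either approach, a subclass is only registered once the module that defines it has been imported, so the runner has to make sure all extractor modules are loaded first.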
Licensed under: CC-BY-SA with attribution