How to parse a loosely structured document

https://stackoverflow.com/questions/11811060

24-06-2021

Question

I am analyzing data feeds which have data somewhat like this

RAM 4 GB DDR3
RAM 16GB DIMM
memory 4GB DDR3 MHz         // no value for MHz 
memory 4GB DDR3 1333 MHz    // the position of MHz is not fixed
ram 6GB, 1333 MHz, DDR3     // comma used as delimiter

Processor Intel Core i7-3612QM
Processor Intel Core i7 2630QM
processor i3-380,2.53 GHz          //380 used for model number instead of 380M and model number separated by '-' and clock speed separated by ','
Processor Core i3-380 2.53 GHz 
Processor Intel Ci3 - 2330 (2nd Gen), 2.53 GHz   // multiple symbols used as delimiters(',','-')

Hard drive 500GB 5400RPM
Hard Disk Drive 1.5 TB
Hard Disk 256 GB

Now I need to work out what each part of a specification means. For example, in ram 6GB, 1333 MHz, DDR3 I need to figure out that 6GB is the capacity, 1333 MHz is the frequency and DDR3 is the type of RAM. The problem, as you can see, is that the entries are very irregular: some entries have fields that others lack, and the separator is sometimes whitespace, sometimes a comma and sometimes a hyphen.

My first reaction was to use regex, but I soon realised that it was stupid. Then I thought I could split on the separator (the comma in the case above), but even the separator is not fixed. Splitting would also be useless for an entry like memory 4 GB 1333 MHz DDR3: using whitespace as the separator would make 4, GB, 1333 and MHz look like four separate fields, when actually 4 GB and 1333 MHz are the two fields.

Also, how can I programmatically decide that Intel Core i3, Core i3, i3-380 and Ci3 all imply Intel Core i3? I understand that I have to tell the library once that Intel Core i3, Core i3 and Ci3 mean the same thing, but later, when analyzing the text, it should be able to figure this out by itself. The lists of entries above show how variable the entries can be. Is there a Python library (or one in any other language) that can help me with these tasks?

Solution

If you're able to build a set of classes that directly correspond to each type of entry, then that's probably the way to go. For example, a class for RAM might be:

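import re

class Memory:
    def __init__(self, s):
        # Case-insensitive check so "ram", "RAM" and "memory" all pass.
        if 'ram' not in s.lower() and 'memory' not in s.lower():
            raise ValueError("Not a string that describes RAM.")
        # re.search scans the whole string; re.match only matches at the start.
        match = re.search(r'(\d+) ?GB', s)
        if match is None:
            raise ValueError("No capacity in GB found.")
        self.capacity = int(match[1])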

Then just try each class until one fits.
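A minimal, self-contained sketch of what "try each class until one fits" could look like. The class names, the `capacity_gb` attribute, the `PARSERS` list and the `parse_entry` helper are all made up for illustration, and treating 1 TB as 1000 GB is an assumption.

```python
import re


class Memory:
    """Hypothetical parser for RAM lines."""

    def __init__(self, s):
        if not re.search(r'\b(ram|memory)\b', s, re.IGNORECASE):
            raise ValueError("Not a string that describes RAM.")
        m = re.search(r'(\d+)\s*GB', s, re.IGNORECASE)
        self.capacity_gb = int(m.group(1)) if m else None


class HardDrive:
    """Hypothetical parser for hard-drive lines."""

    def __init__(self, s):
        if not re.search(r'\bhard\s+(disk|drive)\b', s, re.IGNORECASE):
            raise ValueError("Not a string that describes a hard drive.")
        m = re.search(r'(\d+(?:\.\d+)?)\s*(GB|TB)', s, re.IGNORECASE)
        if m:
            size = float(m.group(1))
            # Assumption: decimal units, so 1 TB counts as 1000 GB.
            self.capacity_gb = size * 1000 if m.group(2).upper() == 'TB' else size
        else:
            self.capacity_gb = None


# Candidate classes, tried in order.
PARSERS = [Memory, HardDrive]


def parse_entry(line):
    """Return an instance of the first class that accepts the line, else None."""
    for cls in PARSERS:
        try:
            return cls(line)
        except ValueError:
            continue  # this class does not fit; try the next one
    return None
```

With these definitions, `parse_entry("RAM 16GB DIMM")` returns a `Memory` whose `capacity_gb` is 16, while a line that no class recognises, such as a processor entry, returns `None`.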

OTHER TIPS

First off, are you sure there's no other systematic way to get the device info? Most system utilities provide a standardized way to export information.

If you absolutely must parse this structure, you will have to use regular expressions (regex), which are the usual tool for this kind of loosely structured document.

Although this document as a whole doesn't have a uniform structure, each line in it does have its own standardized structure.

Logic:

1) Parse the file one line at a time.
2) Read the first token and use it to choose the approach for parsing the remainder of that line.

For example, if you encounter the token "RAM", you know that it is followed by a numeric size, a unit and then the type.
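The two steps above could be sketched roughly as follows. The handler names, the `HANDLERS` table and the dictionary keys are hypothetical. Because the sample data in the question shows the fields in varying order, this sketch searches for each field independently rather than assuming a fixed position.

```python
import re


def parse_ram(rest):
    """Extract RAM fields wherever they occur in the remainder of the line."""
    # \b stops the "3" in "DDR3 MHz" from being read as a frequency.
    capacity = re.search(r'\b(\d+)\s*GB', rest, re.IGNORECASE)
    frequency = re.search(r'\b(\d+)\s*MHz', rest, re.IGNORECASE)
    ram_type = re.search(r'\b(DDR\d?|DIMM)\b', rest, re.IGNORECASE)
    return {
        'kind': 'ram',
        'capacity_gb': int(capacity.group(1)) if capacity else None,
        'frequency_mhz': int(frequency.group(1)) if frequency else None,
        'type': ram_type.group(1).upper() if ram_type else None,
    }


def parse_drive(rest):
    """Extract hard-drive fields from the remainder of the line."""
    capacity = re.search(r'\b(\d+(?:\.\d+)?)\s*(GB|TB)\b', rest, re.IGNORECASE)
    rpm = re.search(r'\b(\d+)\s*RPM\b', rest, re.IGNORECASE)
    return {
        'kind': 'drive',
        'capacity': float(capacity.group(1)) if capacity else None,
        'unit': capacity.group(2).upper() if capacity else None,
        'rpm': int(rpm.group(1)) if rpm else None,
    }


# First token (lower-cased) -> handler for the rest of the line.
HANDLERS = {'ram': parse_ram, 'memory': parse_ram, 'hard': parse_drive}


def parse_line(line):
    """Dispatch on the first token; return None for unknown or empty lines."""
    tokens = line.split(None, 1)
    if not tokens:
        return None
    handler = HANDLERS.get(tokens[0].lower())
    if handler is None:
        return None
    rest = tokens[1] if len(tokens) > 1 else ''
    return handler(rest)
```

For instance, `parse_line("ram 6GB, 1333 MHz, DDR3")` gives a capacity of 6, a frequency of 1333 and the type `DDR3`, and `parse_line("memory 4GB DDR3 MHz")` leaves the frequency as `None` because no number precedes `MHz`.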

Happy Coding!

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow