Question

I need to extract fields like the document number, date, and invoice amount from a bunch of .csv files, which I believe are referred to as "unstructured text." I have some labeled input files and will use NLTK and Python to design a data extraction algorithm.

For the first round of classification, I plan to use tf-idf weighting with a classifier to identify the document type - there are multiple files that use the same format.
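
Here is roughly the pipeline I have in mind for that first step (a sketch assuming scikit-learn; the file names and labels below are made up):

```python
# Sketch of step one: classify each raw file into a document type.
# Assumes scikit-learn; the file paths and labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled examples: raw text of each file plus its known document type.
train_texts = [open(p, encoding="utf-8").read() for p in ["inv_a_001.csv", "inv_b_001.csv"]]
train_labels = ["vendor_a_invoice", "vendor_b_invoice"]

# tf-idf turns each document into a weighted bag-of-words vector;
# the classifier then learns which terms distinguish the formats.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Predict the type of a new, unlabeled file.
new_text = open("unknown_001.csv", encoding="utf-8").read()
print(model.predict([new_text])[0])
```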

At this point, I need a way to extract the fields from a document, given that it is of type X. I thought about using features like the "most common numbers" or the "largest number with a comma" to find the invoice amount, for example, but since the invoice amount can be any numerical value, I believe the sample size would be smaller than the number of possible features? (I have no training here, bear with me.)

Is there a better way to do the second part? I think the first part should be okay, but I'm not sure the second part will work, or if I even really understand the problem. How is my approach in general? I'm new to this kind of thing and this was the best I could come up with.

Solution

I am not sure that a classifier is the best way to approach this problem. If the fields can be extracted easily with regular expressions, then that is the best way to do it. If, however, you want to use classifiers, here are two questions you need to ask yourself.
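
For example, if every file of a given type puts the amount after a fixed label, a couple of regular expressions may be all you need. The patterns below are placeholders you would adapt to your actual formats:

```python
import re

# Placeholder patterns -- adapt these to the actual layout of each document type.
AMOUNT_RE = re.compile(r"Invoice Amount[:;,]?\s*\$?([\d,]+\.\d{2})")
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
DOCNUM_RE = re.compile(r"Document\s*(?:No|Number)[.:]?\s*(\w+)")

def extract_fields(text):
    """Return whichever fields the patterns can find in the raw text."""
    fields = {}
    for name, pattern in [("amount", AMOUNT_RE), ("date", DATE_RE), ("doc_number", DOCNUM_RE)]:
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields

print(extract_fields("Document No. A123, 05/31/2024, Invoice Amount: $1,234.56"))
```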

One, what does the unlabelled data look like, and can you design good features from it? Depending on the kind of feature vector you design, the complexity of the classification task may range from very easy to impossible. (A perceptron cannot solve XOR on the raw inputs; it only works if you add a suitable nonlinear combination of the input variables, such as their product, as an extra feature.)
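
To make that concrete, here is a small sketch (assuming scikit-learn and NumPy): XOR is not linearly separable on the raw inputs, but adding the product of the two inputs as a third feature makes it separable.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# XOR: not linearly separable on the raw two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Cannot reach perfect accuracy on the raw inputs.
print(Perceptron().fit(X, y).score(X, y))

# Adding the product x1*x2 as an extra feature makes the classes separable.
X_extra = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
print(Perceptron().fit(X_extra, y).score(X_extra, y))  # typically reaches 1.0
```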

Two, what does the labelled data look like? Is it representative of the entire dataset, or does it only contain very specific formats? If it is the latter, your classifier will not work well on files that are not represented in the labelled data.

If you just want to test-run a classifier first, you can address the problem of having more features than training samples by using regularization. Regularization pushes the training algorithm toward the simplest solution that still fits the data (think Occam's razor).

Almost all machine-learning packages in Python expose regularization options you can use, so enjoy.
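
For example, in scikit-learn, turning regularization on or tuning its strength is a single argument (the parameter values here are only illustrative):

```python
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC

# All of these expose a regularization knob:
LogisticRegression(penalty="l2", C=0.1)   # smaller C = stronger penalty = simpler model
LinearSVC(C=0.1)                          # same convention as above
Perceptron(penalty="l2", alpha=0.001)     # larger alpha = stronger penalty
```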

Licensed under: CC-BY-SA with attribution