Is there a process flow to follow for text analytics?

https://datascience.stackexchange.com/questions/16786

16-10-2019

Question

I am trying to draw a process flow (like a template) to follow when working on text analysis projects. So far, I've come up with this.

Text Analytics Steps

1. Data Collection
   a. Acquire data
   b. Convert data into plain text
2. Remove Duplicate Entries
3. Text Parsing and Extracting Features
   a. Tokenization
   b. Parsing
      i. Remove HTML characters
      ii. Decode complex symbols to UTF-8
      iii. Spell check
      iv. Apostrophe look-up
      v. Remove punctuation marks
      vi. Remove expressions / emojis
      vii. Split attached words
      viii. Slangs look-up
      ix. Remove URLs
   c. Lemmatization / Stemming (Normalization of Tokens)
   d. Parts-of-Speech Tagging
4. Text Filtering
   a. Remove start-words
   b. Remove stop-words
   c. Remove irrelevant words based on frequency
5. Text Transformation
   a. Bag of Words Representation
   b. TF-IDF
6. Text Mining / Analysis (whichever analysis needed)
   a. Text Categorization
   b. Text Classification (supervised)
   c. Topic Modeling (unsupervised)
   d. Text Clustering
   e. Similarity Analysis
   f. Sentiment Analysis

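
To make a few of these steps concrete, here is a minimal sketch of steps 2 to 5 using only the Python standard library. The stop-word list, the function names, and the sample documents are assumptions made for illustration. The sketch cleans the text before tokenizing it (a choice made here for simplicity) and uses one common TF-IDF variant, tf × log(N / df). Spell checking, slang look-up, lemmatization, and POS tagging are left out.

```python
import html
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real projects use a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "on"}


def clean(text):
    """Steps 3b.i, 3b.ix, 3b.v: drop HTML tags, URLs, and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation
    return text.lower()


def preprocess(doc):
    """Steps 3a and 4b: whitespace tokenization, then stop-word removal."""
    return [tok for tok in clean(doc).split() if tok not in STOP_WORDS]


def tf_idf(docs_tokens):
    """Step 5b: one weight dict per document, using tf * log(N / df)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))           # count each term once per document
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)               # step 5a: bag of words
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


raw_docs = [
    "<p>The cat sat on the mat.</p>",
    "The dog sat on the log: http://example.com",
    "<p>The cat sat on the mat.</p>",          # exact duplicate of the first entry
]
docs = list(dict.fromkeys(raw_docs))           # step 2: drop duplicates, keep order
tokenized = [preprocess(d) for d in docs]
for doc_weights in tf_idf(tokenized):
    print(doc_weights)
```

A term that appears in every document (here "sat") gets a weight of zero under this variant, which is one way step 4c (filtering by frequency) and step 5b interact.
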
Is this flow in the right order of steps?
What are the steps/sub-steps that I am missing?
Does the process flow look like a template or go-to flow chart when undertaking any text analytics project?

Edit: Updated process flow

Solution

This is a great place to start! Although it does not lay out a single "process flow", the book "Speech and Language Processing" by Daniel Jurafsky and James H. Martin walks through the calculations and steps involved in analyzing text, and you will find it useful.

The reason I say that no process flow is provided is that the book explains, in great detail, the pros and cons of the methods applied at each stage of a pipeline and how each choice can change the results. For example, when calculating perplexity (an inverse measure of how well a language model predicts the next word in a sentence, so lower is better), you should keep the beginnings and ends of sentences as well as the stop words, whereas other methods require stop words to be removed.
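
As a rough illustration of the perplexity point, here is a minimal sketch of a bigram language model with add-one smoothing. The marker tokens `<s>` and `</s>`, the function names, and the toy training sentences are assumptions made for this example, not code from the book. Stop words such as "the" are kept on purpose, and the end marker is counted as a predicted token while the start marker is not.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence-start and sentence-end markers


def pad(sentence):
    """Lowercase, split on whitespace, and add boundary markers."""
    return [BOS] + sentence.lower().split() + [EOS]


def train_bigram(sentences):
    """Count bigrams and their left contexts; stop words are NOT removed."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        toks = pad(sentence)
        contexts.update(toks[:-1])           # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
        vocab.update(toks[1:])               # predictable tokens: words and </s>
    return contexts, bigrams, vocab


def perplexity(sentence, contexts, bigrams, vocab):
    """exp of the average negative log-probability per predicted token."""
    toks = pad(sentence)
    log_prob = 0.0
    for prev, word in zip(toks, toks[1:]):
        # Add-one (Laplace) smoothing so unseen bigrams get non-zero probability.
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
        log_prob += math.log(p)
    n_predicted = len(toks) - 1              # counts </s>, not <s>
    return math.exp(-log_prob / n_predicted)


model = train_bigram(["the cat sat", "the dog sat"])
print(perplexity("the cat sat", *model))     # word order seen in training
print(perplexity("sat cat the", *model))     # scrambled order, higher perplexity
```

Removing "the" or the boundary markers before training would change every conditional probability in this model, which is why a preprocessing step that helps a bag-of-words classifier can hurt a language-model evaluation.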

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange