Does building a corpus make sense on a documentation project?

https://datascience.stackexchange.com/questions/67942

08-12-2020

Question

I have zero experience in data science or machine learning, so I cannot tell whether building a corpus applies to the problem I am trying to solve.

I am trying to build a reference site for cloud technologies such as AWS and Google Cloud.

I was able to build structured data and identify the primary entities within a single ecosystem using standard web scraping and SQL queries.

But I want a mechanism that can autonomously identify entities, the information relevant to each entity, and the other entities it has relationships with.

Given that a specific ecosystem's documentation follows a certain style, can I use a few entities as training documents and then have a model classify the information I mentioned above?

Is the starting point for this to build a corpus? I tried out the NLTK categorized corpus builder.

Is it fine to include a specific document in multiple categories? For example, an instance in AWS can be in the category ec2 and in a general category Computing unit.

Anyway, does the problem I am trying to solve fit into the general NLP/ML space?

Solution

Anyway, does the problem I am trying to solve fit into the general NLP/ML space?

Generally speaking, feeding the source data in bulk to an ML system is unlikely to give the kind of structured output you expect. You would most likely have to guide the process toward what you want to obtain, and this might take a lot of time and effort (depending on the requirements).

That being said, there are indeed NLP methods meant to extract specific pieces of information from text, and they usually work quite well with domain-specific data (provided it's done correctly, caveats apply). I'm just going to list a few typical tasks which can be done:

Named Entity Recognition would be the most common and probably the simplest, since there are many existing libraries. Most libraries use a pre-trained model, but it's likely to give much better results when it's trained on the kind of data it's applied to (of course that usually means manually annotating your own training set).
Text classification can be used to automatically assign documents to a category (class) among a set of predefined categories. This is supervised, so you would also need a training set containing labelled documents. Again, there are a number of algorithms and libraries available.
Simple information retrieval methods based on measuring semantic similarity between terms and documents can be used to search, e.g., for documents relevant to a term.
Topic modeling is an unsupervised approach which groups similar documents together (clustering). Since it's unsupervised it doesn't require any training data, but on the other hand what the algorithm finds as "topics" (groups) is often different from what a human would expect.
Extracting relations between concepts (typically between named entities) is a more advanced task which usually requires more work in order to capture the specifics of the job. I'm not aware of any general library for that.
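To make the information retrieval item above more concrete, here is a minimal sketch using only the Python standard library. It ranks documents against a query with TF-IDF weights and cosine similarity, which is a purely lexical stand-in for semantic similarity (it only matches shared words). The document names and texts in DOCS are invented for illustration, and the helper names (build_index, weigh, cosine, search) are hypothetical, not taken from any library.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; in practice these would be scraped documentation pages.
DOCS = {
    "ec2": "launch and manage virtual machine instances in the cloud",
    "s3": "store and retrieve objects in buckets",
    "lambda": "run functions without managing servers or instances",
}


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def weigh(tokens, idf):
    """Turn a token list into a sparse TF-IDF vector (terms unknown to idf are dropped)."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(docs):
    """Return (idf, vectors): IDF weights and one TF-IDF vector per document."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    # Smoothed IDF, so a term present in every document still gets a positive weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = {name: weigh(tokens, idf) for name, tokens in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors):
    """Rank all documents by similarity to the query, best match first."""
    query_vector = weigh(tokenize(query), idf)
    ranked = sorted(
        ((cosine(query_vector, vec), name) for name, vec in vectors.items()),
        reverse=True,
    )
    return [(name, round(score, 3)) for score, name in ranked]


idf, vectors = build_index(DOCS)
results = search("virtual machine instances", idf, vectors)
print(results)
```

With this toy data the "ec2" page ranks first because it shares all three query words, while "s3" shares none and scores zero. A real system would need better tokenization, and capturing similarity between different words with related meanings would require something beyond word overlap, such as embeddings.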

Overall, many things are possible, but the first step would be to design the system precisely, typically using some of the existing tasks as building blocks.

Is it fine to include a specific document in multiple categories? For example, an instance in AWS can be in the category ec2 and in a general category Computing unit.

Yes, it's fine, but if you want an ML classifier to be able to predict several classes for the same document, you will need to use multi-label classification (the "standard" is single-label).
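The difference can be illustrated with a small standard-library sketch. It does not train anything: the per-label scores are made-up numbers standing in for whatever a real classifier would output, and the label names, the 0.5 threshold, and the function names are assumptions for illustration. It only shows how labels are represented and how the final decision differs between the two settings.

```python
# Hypothetical label vocabulary for a cloud documentation site.
LABELS = ["compute", "ec2", "storage"]


def to_multi_hot(label_set, labels=LABELS):
    """Encode a set of labels as a 0/1 vector with one position per known label."""
    return [1 if label in label_set else 0 for label in labels]


def single_label_decision(scores, labels=LABELS):
    """Single-label setting: exactly one class wins (the highest score)."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index]


def multi_label_decision(scores, labels=LABELS, threshold=0.5):
    """Multi-label setting: each label is accepted or rejected independently."""
    return {label for label, score in zip(labels, scores) if score >= threshold}


# Training target for a document that belongs to both "ec2" and "compute".
target = to_multi_hot({"ec2", "compute"})

# Made-up scores for one document, in the same order as LABELS.
scores = [0.8, 0.9, 0.1]

print(target)
print(single_label_decision(scores))
print(multi_label_decision(scores))
```

In the single-label setting the document can only end up in "ec2", whereas the multi-label decision keeps both "ec2" and "compute". Libraries such as scikit-learn provide ready-made tools for this setup (for example MultiLabelBinarizer for the encoding and OneVsRestClassifier for training one binary classifier per label).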

Is the starting point for this to build a corpus? I tried out the NLTK categorized corpus builder.

I'd recommend building a corpus only once you have a clear idea of how you're going to use it. It's also usually an experimental process with lots of back and forth, so try to progress iteratively rather than starting with strong assumptions or decisions which might later turn out not to be relevant.

Licensed under: CC-BY-SA with attribution
