Text mining on large database (data mining)

https://stackoverflow.com/questions/2633598

26-09-2019
|

Question

I have a large database of resumes (CV), and a certain table skills grouping all users skills.

inside that table there's a field skill_text that describes the skill in full text.

I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table with standarized skills..

Here are some examples skills extracted from the DB :

Sectoral and competitive analysis
Business Development (incl. in international settings)
Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
Creative work (Photoshop, In-Design, Illustrator)
checking and reporting back on campaign progress
organising and attending events and exhibitions
Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.

The output shoud be something like :

Sectoral and competitive analysis
Business Development
Specific structure and road design software -
Macao
AutoCAD
Photoshop
In-Design
Illustrator
organising events
Development
Aptana Studio
PHP
HTML
CSS
JavaScript
SQL
AJAX
Mix marketing
Viral Marketing
Social network marketing
emailing
SEO
One to one marketing

As you see only skills remains no other representation text.

I know this is possible using text mining technics but how to do it ? the database is realy large.. it's a good thing because we can calculate text frequency and decide if it's a real skill or just meaningless text... The big problem is .. how to determin that "blablabla" is a skill ?

Edit : please don't tell me to use standard things like a text tokinzer, or regex .. because users input skills in a very arbitrary way !!

thanks

Solution

If I was doing this programmatically I would:

Extract all punctuation delimited data (or perhaps just brackets and commas) into a new table (with no primary key, just skill) so Creative work (Photoshop, In-Design, Illustrator) becomes

 Skill            
 -------------
 Creative work    
 Photoshop        
 In-Design        
 Illustrator

Then, after you've proceed all CVs, query for the most common skills (this is MySQL)

SELECT skill, COUNT(1) cnt FROM newTable GROUP BY skill ORDER BY cnt DESC;

Which may look like this contrived example

 Skill            Cnt
 ---------------------
 Photoshop        3293
 Illustrator      2134
 Creative work     932
 In-Design         123

Then you decide, from the top X skills, which you want to capture, which must map to other skills (Indesign and In-design should map to the same skill, for example) and which to discard, then script the process using a data map.

Use the data map to write a new word frequency table (this time skill_id, skill, frequency) and the second time when parsing the data also write to a lookup table (cv_id,skill_id). Your data will then be in a state where each CV is mapped to a number of skills, and each skill to a number of CVs. You can query for the most popular skills, CVs matching certain criteria etc.

OTHER TIPS

Many databases will do this for you via their full-text search functionality. I know that PostgreSQL's full-text search would be able to do this easily with the aid of a custom dictionary.

Alternatively, you can use PHP's strtok or equivalent to index your text. Once indexed you can compare to dictionary, or simply use occurrences to create a sheet for yourself. Word clouds are made in a similar fashion.

Doing this well requires knowledge; otherwise what's to tell "organising events" is a 'skill' while "creative work" isn't? But a stupid program can take a first cut at it by analyzing statistics of collocations: see the answers to How to extract common / significant phrases from a series of text entries and Algorithms to detect phrases and keywords from text.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow