Problem

I am trying to classify (cluster) our company's Curriculum Vitae (CVs). There are about 100 CVs in total. The idea is to find similar people based on their CV content. I have already transformed the Word docs into text files and read all of the candidates into a Python dictionary with the format:

cvdict = { 'name1' : 'cv text', 'name2' : 'cv text', ... }

I have also removed most punctuation, lowercased the text, removed numbers, etc., and removed words shorter than a given length x (4).
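Roughly, the cleanup looks like this (a simplified sketch; the helper name `clean`, the regex, and the toy dictionary are illustrative only):

    import re

    # Toy stand-in for the real dictionary of ~100 CVs described above.
    cvdict = {'name1': 'CV text ...', 'name2': 'CV text ...'}

    def clean(text, min_len=4):
        """Lowercase, strip punctuation and digits, drop tokens shorter than min_len."""
        text = text.lower()
        text = re.sub(r'[^a-z\s]', ' ', text)   # anything that is not a letter becomes a space
        return ' '.join(w for w in text.split() if len(w) >= min_len)

    cvdict = {name: clean(cv) for name, cv in cvdict.items()}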

My questions:

  1. Is clustering the correct approach? If not, which machine learning algorithm would be a suitable initial focus for this task?

  2. Any pointers to some Python code I can use to traverse this dictionary and 'cluster' the content? Based on the clustering of the content, it should output the 'keys' = candidate names as clustered groups.


Solution

So from what I understood, you want to see potential groups/clusters in the set of CVs. The idea of cvdict is great, but you also need to convert all the texts to numbers; you are halfway there. Think about a matrix/Excel sheet/table where each row holds the profile of one employee:

    name1, cv_text1
    name2, cv_text2
    name3, cv_text3
    ...

Yes, as you can guess, the length of cv_text can vary: some people have a lengthy resume, some do not. Which words can categorize the company's employees? Somehow we need to make all the profiles the same size. Also, not all words are informative; you need to think about which words capture your idea. In machine learning this is called a "feature" vector or matrix. So my suggestion would be to derive a set of words and mark whether the person has mentioned that word among their skills:

            management  marketing  customers  statistics  programming
    name1        1          1          0          0           0
    name2        0          0          0          1           1
    name3        0          0          1          1           0

Or, instead of a 0/1 matrix, you can record how many times that word was mentioned in the resume. Again, you can simply extract all possible words from all resumes. NLTK is an awesome module for doing text analysis, and it has some built-in functions to help you polish your text; have a look at the first half of this slide.
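For illustration, here is one way to build such a count matrix with scikit-learn's CountVectorizer (a sketch only; the variable names and the 'english' stop list are my choices, not requirements):

    from sklearn.feature_extraction.text import CountVectorizer

    # cvdict as described in the question: {candidate name: cleaned CV text}
    names = list(cvdict.keys())
    texts = [cvdict[name] for name in names]

    # Rows = CVs, columns = words from the combined vocabulary,
    # cells = how often that word occurs in that CV.
    vectorizer = CountVectorizer(stop_words='english')
    matrix = vectorizer.fit_transform(texts)          # sparse (n_cvs x n_words) matrix

    print(matrix.shape)
    print(vectorizer.get_feature_names_out()[:10])    # peek at the vocabulary (recent scikit-learn)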

Then you can use any kind of clustering method, for example hierarchical clustering: https://code.activestate.com/recipes/578834-hierarchical-clustering-heatmap-python/. There are already packages for doing this kind of analysis, in either scipy or scikit-learn, and I am sure you can find tons of examples for each. The key step is the one you are already working on: representing your data as a matrix.
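A rough sketch of the hierarchical route with scipy, reusing `matrix` and `names` from the snippet above (the Ward linkage and the cut into 5 groups are arbitrary example choices):

    from collections import defaultdict
    from scipy.cluster.hierarchy import linkage, fcluster

    # Ward linkage on the dense word-count vectors.
    Z = linkage(matrix.toarray(), method='ward')

    # Cut the tree into 5 groups; pick whatever count makes sense for your data.
    labels = fcluster(Z, t=5, criterion='maxclust')

    # Collect candidate names per cluster, which is the output the question asks for.
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[label].append(name)

    for label, members in sorted(groups.items()):
        print(label, members)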

Other Tips

A couple more hints to add to the answer above:

  1. I would not throw away words fewer than 4 characters long. Instead, I would use a stop list of common words; you don't want to throw away things like C++ or C#.

  2. One good technique for building the matrix above is the TF-IDF metric. It is essentially a measure of how frequently a word occurs in a particular document vs. how frequently it occurs in the entire collection. Words like 'the' are very common, so they get downgraded very quickly; if only 5 people in your company know C++, this will boost the metric for that word a lot (see the sketch after this list).

  3. You might want to consider using a stemmer like the Porter algorithm, which will map words like 'statistics' and 'statistical' onto the same stem (also shown in the sketch after this list).

  4. Most machine learning algorithms have a problem with very wide matrices. Unfortunately, your resume base is only 100 documents, which is quite low compared to how many potential terms you will have. The reason these techniques work for Google and the NSA is that human languages tend to have tens of thousands of words in active use vs. the billions of documents they have to index. For your task I would try to shrink your dataset to no more than 30-40 columns, and be very aggressive about throwing away the common words.

  5. Unfortunately, the biggest weakness of most clustering techniques is that you have to set the number of clusters in advance. A common approach is to set up some measure of how good your clusters are, run the clustering algorithm first with very few clusters, and keep increasing the count until the metric starts to drop off (see the second sketch after this list). Look up Andrew Ng's machine learning course on the interwebs; he explains this technique very well.

  6. Of course, hierarchical clustering is not affected by point 5.

  7. Instead of clustering you can try building a decision tree. Although not super accurate, decision trees have a great advantage for visualizing the built model; by looking at the tree you can easily see why the groups were built the way they were.

  8. Besides scipy and scikit-learn, which are very good, take a look at the Orange toolbox. It has a lot of good algorithms with good visualization tools, and the way you program it is just by connecting boxes with arrows. Once you are satisfied with your model, you can easily dump it out to run as a script.
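To make points 2 and 3 concrete, here is a sketch combining NLTK's Porter stemmer with scikit-learn's TfidfVectorizer; `max_features=40` just illustrates the aggressive shrinking from point 4, and the variable names are mine:

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = PorterStemmer()

    def stem_tokens(text):
        # Map variants such as 'statistics' and 'statistical' onto a common stem.
        return [stemmer.stem(word) for word in text.split()]

    # TF-IDF downweights words that appear in almost every CV and boosts rarer,
    # more distinctive ones; max_features keeps the matrix narrow (point 4).
    tfidf = TfidfVectorizer(tokenizer=stem_tokens, max_features=40)
    weighted = tfidf.fit_transform(texts)   # texts as in the earlier snippet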
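And for point 5, one common recipe is to sweep the cluster count and watch a quality score; below is a sketch using k-means and the silhouette score, neither of which is the only valid choice:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Try a range of cluster counts and look for where the score peaks or levels off.
    for k in range(2, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        labels = km.fit_predict(weighted)
        print(k, round(silhouette_score(weighted, labels), 3))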

Hope this helps.

License: CC-BY-SA with attribution