What algorithms should I use to perform job classification based on resume data?





Note that I am doing everything in R.



The problem is as follows:



Basically, I have a list of resumes (CVs). Some candidates have prior work experience and some don't. The goal is to classify each candidate into a job sector based on the text of their CV. I am particularly interested in the cases where a candidate has no experience or is still a student: for those, I want to predict which job sector the candidate will most likely belong to after graduation.



Question 1: I know machine learning algorithms, but I have never done NLP before. I came across Latent Dirichlet allocation on the internet; however, I am not sure whether it is the best approach to tackle my problem.



My original idea: make this a supervised learning problem.
Suppose we already have a large amount of labelled data, meaning that we have correctly labelled the job sectors for a list of candidates. We train a model using ML algorithms (e.g. nearest neighbour...), then feed in the unlabelled data, i.e. the candidates who have no work experience / are students, and try to predict which job sector they will belong to.



Update
Question 2: Would it be a good idea to extract everything in a resume into a text file, so that each resume is associated with a text file containing unstructured strings, and then apply text-mining techniques to those files to make the data structured, or even to build a frequency matrix of the terms used? For example, a text file may look something like this:



I deployed ML algorithm in this project and... Skills: Java, Python, c++ ...



This is what I mean by 'unstructured', i.e. collapsing everything into a single-line string.



Is this approach wrong? Please correct me if you think it is.



Question 3: The tricky part is: how do I identify and extract the keywords? By using the tm package in R? What algorithm is the tm package based on? Should I use NLP algorithms? If so, which algorithms should I look at? Please also point me to some good resources.



Any ideas would be great.










Tags: machine-learning, classification, nlp, text-mining






asked Jul 3 '14 at 16:11 by user1769197
edited Jul 9 '14 at 0:19 by Stephane Rolland




















          4 Answers
          Check out this link.



          Here, they take you from loading unstructured text through to creating a wordcloud. You can adapt this strategy: instead of creating a wordcloud, create a frequency matrix of the terms used. The idea is to take the unstructured text and structure it somehow. You change everything to lowercase (or uppercase), remove stop words, and find frequent terms for each job function via document-term matrices. You also have the option of stemming the words. If you stem words, you will be able to detect different forms of a word as the same word. For example, 'programmed' and 'programming' could both be stemmed to 'program'. You can then add the occurrence of these frequent terms as weighted features when training your ML model.



          You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.



          Example:



          1) Load libraries and build the example data



          library(tm)
          library(SnowballC)

          doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
          job1 = "Software Engineer"
          doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
          job2 = "Quality Assurance"
          doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
          job3 = "Software Engineer"
          jobInfo = data.frame("text" = c(doc1,doc2,doc3),
          "job" = c(job1,job2,job3))


          2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.



          # Convert to lowercase
          jobInfo$text = sapply(jobInfo$text,tolower)

          # Remove Punctuation
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

          # Remove extra white space
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

          # Remove stop words
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
          )

          # Stem words (Also try without stemming?)
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
          )


          3) Make a corpus source and document term matrix.



          # Create Corpus Source
          jobCorpus = Corpus(VectorSource(jobInfo$text))

          # Create Document Term Matrix
          jobDTM = DocumentTermMatrix(jobCorpus)

          # Create Term Frequency Matrix
          jobFreq = as.matrix(jobDTM)


          Now we have the frequency matrix, jobFreq, which is a (3 by x) matrix: 3 documents and x distinct terms.



          Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another way is to keep it simple and look at the percentage of documents for each job title that use a given word, say "java" would have 80% occurrence in 'software engineer' and only 50% occurrence in 'quality assurance'.
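
          For instance, here is a rough sketch of both options, building on the jobFreq and jobInfo objects above (the "at least two documents" cutoff is an arbitrary choice for illustration):

          # Keep only terms that occur in at least two of the three documents
          termDocCount = colSums(jobFreq > 0)
          commonFreq = jobFreq[, termDocCount >= 2, drop = FALSE]

          # Fraction of resumes per job title that mention each term;
          # a term present in both 'Software Engineer' resumes but not the
          # 'Quality Assurance' one gets 1 and 0 respectively
          jobShare = apply(jobFreq > 0, 2, function(x) tapply(x, jobInfo$job, mean))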



          Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.






          answered Jul 3 '14 at 17:06 by nfmcclure

          • I would love to see your example. – user1769197, Jul 3 '14 at 22:03

          • Updated with quick example. – nfmcclure, Jul 3 '14 at 22:52
          Just extract keywords and train a classifier on them. That's all, really.



          Most of the text in CVs is not actually related to skills. E.g. consider the sentence "I'm experienced and highly efficient in Java". Here only 1 out of 7 words is a skill name; the rest is just noise that's going to drag your classification accuracy down.



          Most CVs are not really structured. Or they are structured too freely. Or they use unusual names for sections. Or file formats that don't preserve structure when converted to text. I have experience extracting dates, times, names, addresses and even people's intents from unstructured text, but not a skill (or university, or anything else) list, not even close.



          So just tokenize (and possibly stem) your CVs, select only the words from a predefined list (you can use LinkedIn or something similar to compile this list), create a feature vector, and try out a couple of classifiers (say, SVM and Naive Bayes).



          (Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even naive implementation will work well.)
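
          As a minimal sketch of that pipeline in R (the skill list, the toy CVs and their sector labels below are invented purely for illustration; the e1071 package provides svm() and naiveBayes()):

          library(e1071)   # provides svm() and naiveBayes()

          # Hypothetical skill dictionary -- in practice compile a much larger list,
          # e.g. from LinkedIn skill tags as suggested above
          skills  = c("java", "python", "sql", "excel")

          # Toy CVs and their (known) job sectors
          cvs     = c("five years of java development and sql tuning",
                      "reporting analyst skilled in excel and sql",
                      "python scripting for data cleaning and excel automation")
          sectors = factor(c("Software Engineer", "Business Analyst", "Data Engineer"))

          # One binary feature per skill keyword: does the CV mention it?
          features = t(sapply(cvs, function(txt) {
            tokens = strsplit(tolower(txt), "[^a-z]+")[[1]]
            as.integer(skills %in% tokens)
          }))
          colnames(features) = skills

          # Any standard classifier works on this matrix; SVM as suggested above
          model = svm(features, sectors)
          predict(model, features)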






          answered Jul 4 '14 at 22:46 by ffriend

          • Say I am analyzing LinkedIn data; do you think it would be a good idea for me to merge the previous work experience, education, recommendations and skills of one profile into one text file and extract keywords from it? – user1769197, Jul 5 '14 at 14:46

          • LinkedIn now has skill tags that people assign to themselves and that other users can endorse, so basically there's no need to extract keywords manually. But in the case of less structured data, yes, it may be helpful to merge everything and then retrieve keywords. However, remember the main rule: try it out. Theory is good, but only practical experiments with different approaches will reveal the best one. – ffriend, Jul 5 '14 at 20:33

          • @ffriend, how do we get that keyword list? – NG_21, Jan 28 '16 at 9:53

          • @ffriend What is the best way to extract "experience" = '5 years', "Language" = 'C' from the following sentence: "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C"? I used RAKE with NLTK and it just removed the stopwords + punctuation, but from the above sentence I don't need words like developing, bug-tracking, systems, creating, data, etc. Thanks – Khalid Usman, Oct 4 '16 at 13:27

          • @KhalidUsman: since you already work with NLTK, take a look at named entity recognition tools, especially the "Chunking with Regular Expressions" section. In general, you would want to use a dictionary of keywords (e.g. "years", "C", etc.) and a simple set of rules (like "contains 'C'" or "<number> years") to extract named entities out of free-form text. – ffriend, Oct 4 '16 at 18:41
          This is a tricky problem. There are many ways to handle it. I guess resumes can be treated as semi-structured documents. Sometimes it's beneficial to have some minimal structure in the documents. I believe you would see some tabular data in resumes; you might want to treat these as attribute-value pairs. For example, you would get a list of terms for the attribute "Skill set".



          The key idea is to manually configure a list of key phrases such as "skill", "education", "publication" etc. The next step is to extract terms which pertain to these key phrases either by exploiting the structure in some way (such as tables) or by utilizing the proximity of terms around these key phrases, e.g. the fact that the word "Java" is in close proximity to the term "skill" might indicate that the person is skilled in Java.
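
          A small illustration of the proximity idea in R (the resume text and the 4-token window below are made-up choices for illustration, not part of the answer):

          library(tm)   # only for its English stop-word list

          resume = "Education: BSc Computer Science. Skills: Java, Python and SQL. Projects: built a bug tracker."

          # Tokenize: strip punctuation, lowercase, split on whitespace
          tokens = strsplit(tolower(gsub("[[:punct:]]", " ", resume)), "[[:space:]]+")[[1]]

          # Positions of the configured key phrase, here "skills"
          anchor = which(tokens == "skills")

          # Take up to 4 tokens following each occurrence and drop stop words
          nearby = unlist(lapply(anchor, function(i) head(tokens[-seq_len(i)], 4)))
          setdiff(nearby, stopwords())   # leaves "java" "python" "sql"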



          After you extract this information, the next step could be to build up a feature vector for each of these key phrases. You can then represent a document as a vector with different fields (one for each key phrase). For example, consider the following two resumes represented with two fields, namely project and education.



          Doc1: project: (java, 3) (c, 4), education: (computer, 2), (physics, 1)



          Doc2: project: (java, 3) (python, 2), education: (maths, 3), (computer, 2)



          In the above example, I show each term with its frequency. Of course, while extracting the terms you need to stem and remove stop-words. It is clear from the examples that the person whose resume is Doc1 is more skilled in C than the person behind Doc2. Implementation-wise, it's very easy to represent documents as field vectors in Lucene.



          Now, the next step is to retrieve a ranked list of resumes given a job specification. In fact, that's fairly straightforward if you represent the queries (job specs) as field vectors as well. You just need to retrieve a ranked list of candidates (resumes) using Lucene from a collection of indexed resumes.
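
          The answer points at Lucene for the indexing and retrieval part; purely to illustrate the ranking idea in R, here is a cosine-similarity version for the 'project' field of the two example resumes above, with a made-up job-spec query:

          # Term weights for the 'project' field of Doc1 and Doc2 (from the example above)
          doc1_project  = c(java = 3, c = 4, python = 0)
          doc2_project  = c(java = 3, c = 0, python = 2)

          # A job spec expressed as a query vector over the same field
          query_project = c(java = 2, c = 1, python = 0)

          cosine = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

          scores = c(Doc1 = cosine(doc1_project, query_project),
                     Doc2 = cosine(doc2_project, query_project))
          sort(scores, decreasing = TRUE)   # Doc1 ranks higher for this C-heavy job spec

          # With several fields (project, education, ...), compute one similarity
          # per field and combine them, e.g. as a weighted sum.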






          • Algorithm-wise: what would you recommend? – user1769197, Jul 3 '14 at 21:59

          • You mean an algorithm for computing the most similar resume vectors given a query job vector? You can use any standard algorithm such as BM25 or a Language Model... – Debasis, Jul 4 '14 at 11:35

          • I have never heard of these algorithms at all. Are these NLP algorithms or ML algorithms? – user1769197, Jul 4 '14 at 13:32

          • These are standard retrieval models... a retrieval model defines how to compute the similarity between a document (a resume in your case) and a query (a job in your case). – Debasis, Jul 4 '14 at 15:23

          • I have no knowledge of information retrieval; do you think machine learning algorithms like clustering / nearest neighbour will also work in my case? – user1769197, Jul 4 '14 at 16:37
          I work for an online jobs site and we build solutions to recommend jobs based on resumes. Our approach takes a person's job title (or desired job title if they are a student and it is known), along with the skills we extract from their resume and their location (which is very important to most people), and finds matching jobs based on that.



          In terms of document classification, I would take a similar approach. I would recommend computing a tf-idf matrix for each resume as a standard bag-of-words model, extracting just the person's job title and skills (for which you will need to define a list of skills to look for), and feeding that into an ML algorithm. I would recommend trying kNN and an SVM; the latter works very well with high-dimensional text data. Linear SVMs tend to do better than non-linear ones (e.g. using RBF kernels). If that gives reasonable results, I would then play with extracting features using a natural-language parser/chunker, and also some custom-built phrases matched by regexes.
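
          A minimal sketch of the tf-idf plus linear-SVM part in R (the toy resumes and labels below are invented; the tm and e1071 packages are assumed to be installed):

          library(tm)
          library(e1071)

          # Toy labelled resumes; in practice this is your labelled training set
          resumes = c("java developer building web applications and sql databases",
                      "qa engineer writing test plans and automated selenium tests",
                      "java backend services sql tuning and code reviews",
                      "manual and automated testing of software releases")
          sector  = factor(c("Software Engineer", "Quality Assurance",
                             "Software Engineer", "Quality Assurance"))

          # Standard bag-of-words document-term matrix with tf-idf weighting
          dtm = DocumentTermMatrix(Corpus(VectorSource(resumes)),
                                   control = list(weighting = weightTfIdf,
                                                  stopwords = TRUE))
          x = as.matrix(dtm)

          # Linear SVM, which tends to work well on high-dimensional text features
          model = svm(x, sector, kernel = "linear")
          predict(model, x)   # sanity check on the training data

          (knn() from the class package could be used on the same matrix in place of the SVM.)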






          • Do you still use SVM when you have 3 or more classes? And what features do you want to extract using a natural language parser? For what purpose? – user1769197, Jul 8 '14 at 15:40

          • You can train n SVMs for n classes using a one-vs-the-rest strategy. scikit-learn has code to do that automatically. Technically you need n-1 classifiers, but I've found that having n works better. – Simon, May 18 '15 at 19:54

          • @Simon Can you write out the complete steps for this recommendation system? I have a little experience (an MS thesis) in ML, but am totally new to the IR field. Now I am working on this system and I wrote the following steps: 1. Use NLTK to extract keywords, 2. Calculate scores for keywords and phrases, 3. Stemmer, 4. Categorization (the most challenging task) and 5. Frequency matrix, tf-idf or the BM25 algorithm. Am I on the right implementation path? Thanks – Khalid Usman, Oct 4 '16 at 13:17

          • @KhalidUsman I can't tell you exactly how it works, that may get me in trouble. The easiest solution would be to put the data into Solr or Elasticsearch and use their MLT recommender implementations. A more sophisticated approach is to extract key words and phrases, push the docs through LSA, and do k-NN on the resulting vectors. Then you may wish to use other signals such as collaborative filtering and overall popularity. – Simon, Oct 16 '16 at 19:27

          • @Simon, thanks for your guidance. I am applying the second way: I have extracted keywords/keyphrases using RAKE+NLTK and after that I was planning to apply tf-idf or BM25. Am I right? Can you please elaborate on the kNN way a little bit, i.e. how to apply kNN on keywords? Should I make the keywords features? Thanks – Khalid Usman, Oct 17 '16 at 11:03
          Your Answer





          StackExchange.ifUsing("editor", function ()
          return StackExchange.using("mathjaxEditing", function ()
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
          );
          );
          , "mathjax-editing");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "557"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f662%2fwhat-algorithms-should-i-use-to-perform-job-classification-based-on-resume-data%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          4 Answers
          4






          active

          oldest

          votes








          4 Answers
          4






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          14












          $begingroup$

          Check out this link.



          Here, they will take you through loading unstructured text to creating a wordcloud. You can adapt this strategy and instead of creating a wordcloud, you can create a frequency matrix of terms used. The idea is to take the unstructured text and structure it somehow. You change everything to lowercase (or uppercase), remove stop words, and find frequent terms for each job function, via Document Term Matrices. You also have the option of stemming the words. If you stem words you will be able to detect different forms of words as the same word. For example, 'programmed' and 'programming' could be stemmed to 'program'. You can possibly add the occurrence of these frequent terms as a weighted feature in your ML model training.



          You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.



          Example:



          1) Load libraries and build the example data



          library(tm)
          library(SnowballC)

          doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
          job1 = "Software Engineer"
          doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
          job2 = "Quality Assurance"
          doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
          job3 = "Software Engineer"
          jobInfo = data.frame("text" = c(doc1,doc2,doc3),
          "job" = c(job1,job2,job3))


          2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.



          # Convert to lowercase
          jobInfo$text = sapply(jobInfo$text,tolower)

          # Remove Punctuation
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

          # Remove extra white space
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

          # Remove stop words
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
          )

          # Stem words (Also try without stemming?)
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
          )


          3) Make a corpus source and document term matrix.



          # Create Corpus Source
          jobCorpus = Corpus(VectorSource(jobInfo$text))

          # Create Document Term Matrix
          jobDTM = DocumentTermMatrix(jobCorpus)

          # Create Term Frequency Matrix
          jobFreq = as.matrix(jobDTM)


          Now we have the frequency matrix, jobFreq, that is a (3 by x) matrix, 3 entries and X number of words.



          Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another way is to keep it simple and have a percentage of words used in each job description, say "java" would have 80% occurrence in 'software engineer' and only 50% occurrence in 'quality assurance'.



          Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.






          share|improve this answer











          $endgroup$












          • $begingroup$
            I would love to see your example.
            $endgroup$
            – user1769197
            Jul 3 '14 at 22:03











          • $begingroup$
            Updated with quick example.
            $endgroup$
            – nfmcclure
            Jul 3 '14 at 22:52















          14












          $begingroup$

          Check out this link.



          Here, they will take you through loading unstructured text to creating a wordcloud. You can adapt this strategy and instead of creating a wordcloud, you can create a frequency matrix of terms used. The idea is to take the unstructured text and structure it somehow. You change everything to lowercase (or uppercase), remove stop words, and find frequent terms for each job function, via Document Term Matrices. You also have the option of stemming the words. If you stem words you will be able to detect different forms of words as the same word. For example, 'programmed' and 'programming' could be stemmed to 'program'. You can possibly add the occurrence of these frequent terms as a weighted feature in your ML model training.



          You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.



          Example:



          1) Load libraries and build the example data



          library(tm)
          library(SnowballC)

          doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
          job1 = "Software Engineer"
          doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
          job2 = "Quality Assurance"
          doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
          job3 = "Software Engineer"
          jobInfo = data.frame("text" = c(doc1,doc2,doc3),
          "job" = c(job1,job2,job3))


          2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.



          # Convert to lowercase
          jobInfo$text = sapply(jobInfo$text,tolower)

          # Remove Punctuation
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

          # Remove extra white space
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

          # Remove stop words
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
          )

          # Stem words (Also try without stemming?)
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
          )


          3) Make a corpus source and document term matrix.



          # Create Corpus Source
          jobCorpus = Corpus(VectorSource(jobInfo$text))

          # Create Document Term Matrix
          jobDTM = DocumentTermMatrix(jobCorpus)

          # Create Term Frequency Matrix
          jobFreq = as.matrix(jobDTM)


          Now we have the frequency matrix, jobFreq, that is a (3 by x) matrix, 3 entries and X number of words.



          Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another way is to keep it simple and have a percentage of words used in each job description, say "java" would have 80% occurrence in 'software engineer' and only 50% occurrence in 'quality assurance'.



          Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.






          share|improve this answer











          $endgroup$












          • $begingroup$
            I would love to see your example.
            $endgroup$
            – user1769197
            Jul 3 '14 at 22:03











          • $begingroup$
            Updated with quick example.
            $endgroup$
            – nfmcclure
            Jul 3 '14 at 22:52













          14












          14








          14





          $begingroup$

          Check out this link.



          Here, they will take you through loading unstructured text to creating a wordcloud. You can adapt this strategy and instead of creating a wordcloud, you can create a frequency matrix of terms used. The idea is to take the unstructured text and structure it somehow. You change everything to lowercase (or uppercase), remove stop words, and find frequent terms for each job function, via Document Term Matrices. You also have the option of stemming the words. If you stem words you will be able to detect different forms of words as the same word. For example, 'programmed' and 'programming' could be stemmed to 'program'. You can possibly add the occurrence of these frequent terms as a weighted feature in your ML model training.



          You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.



          Example:



          1) Load libraries and build the example data



          library(tm)
          library(SnowballC)

          doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
          job1 = "Software Engineer"
          doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
          job2 = "Quality Assurance"
          doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
          job3 = "Software Engineer"
          jobInfo = data.frame("text" = c(doc1,doc2,doc3),
          "job" = c(job1,job2,job3))


          2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.



          # Convert to lowercase
          jobInfo$text = sapply(jobInfo$text,tolower)

          # Remove Punctuation
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

          # Remove extra white space
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

          # Remove stop words
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
          )

          # Stem words (Also try without stemming?)
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
          )


          3) Make a corpus source and document term matrix.



          # Create Corpus Source
          jobCorpus = Corpus(VectorSource(jobInfo$text))

          # Create Document Term Matrix
          jobDTM = DocumentTermMatrix(jobCorpus)

          # Create Term Frequency Matrix
          jobFreq = as.matrix(jobDTM)


          Now we have the frequency matrix, jobFreq, that is a (3 by x) matrix, 3 entries and X number of words.



          Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another way is to keep it simple and have a percentage of words used in each job description, say "java" would have 80% occurrence in 'software engineer' and only 50% occurrence in 'quality assurance'.



          Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.






          share|improve this answer











          $endgroup$



          Check out this link.



          Here, they will take you through loading unstructured text to creating a wordcloud. You can adapt this strategy and instead of creating a wordcloud, you can create a frequency matrix of terms used. The idea is to take the unstructured text and structure it somehow. You change everything to lowercase (or uppercase), remove stop words, and find frequent terms for each job function, via Document Term Matrices. You also have the option of stemming the words. If you stem words you will be able to detect different forms of words as the same word. For example, 'programmed' and 'programming' could be stemmed to 'program'. You can possibly add the occurrence of these frequent terms as a weighted feature in your ML model training.



          You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.



          Example:



          1) Load libraries and build the example data



          library(tm)
          library(SnowballC)

          doc1 = "I am highly skilled in Java Programming. I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
          job1 = "Software Engineer"
          doc2 = "Tested new software releases for major program enhancements. Designed and executed test procedures and worked with relational databases. I helped organize and lead meetings and work independently and in a group setting."
          job2 = "Quality Assurance"
          doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers. Perform database design and debugging of current releases."
          job3 = "Software Engineer"
          jobInfo = data.frame("text" = c(doc1,doc2,doc3),
          "job" = c(job1,job2,job3))


          2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.



          # Convert to lowercase
          jobInfo$text = sapply(jobInfo$text,tolower)

          # Remove Punctuation
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

          # Remove extra white space
          jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

          # Remove stop words
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
          )

          # Stem words (Also try without stemming?)
          jobInfo$text = sapply(jobInfo$text, function(x)
          paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
          )


          3) Make a corpus source and document term matrix.



          # Create Corpus Source
          jobCorpus = Corpus(VectorSource(jobInfo$text))

          # Create Document Term Matrix
          jobDTM = DocumentTermMatrix(jobCorpus)

          # Create Term Frequency Matrix
          jobFreq = as.matrix(jobDTM)


          Now we have the frequency matrix, jobFreq, that is a (3 by x) matrix, 3 entries and X number of words.



          Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another way is to keep it simple and have a percentage of words used in each job description, say "java" would have 80% occurrence in 'software engineer' and only 50% occurrence in 'quality assurance'.



          Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited 27 mins ago









          Stephen Rauch

          1,52551330




          1,52551330










          answered Jul 3 '14 at 17:06









          nfmcclurenfmcclure

          463310




          463310











          • $begingroup$
            I would love to see your example.
            $endgroup$
            – user1769197
            Jul 3 '14 at 22:03











          • $begingroup$
            Updated with quick example.
            $endgroup$
            – nfmcclure
            Jul 3 '14 at 22:52
















          • $begingroup$
            I would love to see your example.
            $endgroup$
            – user1769197
            Jul 3 '14 at 22:03











          • $begingroup$
            Updated with quick example.
            $endgroup$
            – nfmcclure
            Jul 3 '14 at 22:52















          $begingroup$
          I would love to see your example.
          $endgroup$
          – user1769197
          Jul 3 '14 at 22:03





          $begingroup$
          I would love to see your example.
          $endgroup$
          – user1769197
          Jul 3 '14 at 22:03













          $begingroup$
          Updated with quick example.
          $endgroup$
          – nfmcclure
          Jul 3 '14 at 22:52




          $begingroup$
          Updated with quick example.
          $endgroup$
          – nfmcclure
          Jul 3 '14 at 22:52











          10












          $begingroup$

          Just extract keywords and train a classifier on them. That's all, really.



          Most of the text in CVs is not actually related to skills. E.g. consider sentence "I'm experienced and highly efficient in Java". Here only 1 out of 7 words is a skill name, the rest is just a noise that's going to put your classification accuracy down.



          Most of CVs are not really structured. Or structured too freely. Or use unusual names for sections. Or file formats that don't preserve structure when translated to text. I have experience extracting dates, times, names, addresses and even people intents from unstructured text, but not a skill (or university or anything) list, not even closely.



          So just tokenize (and possibly stem) your CVs, select only words from predefined list (you can use LinkedIn or something similar to grab this list), create a feature vector and try out a couple of classifiers (say, SVM and Naive Bayes).



          (Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even naive implementation will work well.)






          share|improve this answer











          $endgroup$












          • $begingroup$
            Say I am analyzing linkedin data, do you think it would be a good idea for me to merge the previous work experience, educations recommendations and skills of one profile into one text file and extract keywords from it ?
            $endgroup$
            – user1769197
            Jul 5 '14 at 14:46











          • $begingroup$
            LinkedIn now has skill tags that people assign themselves and other users can endorse, so basically there's no need to extract keywords manually. But in case of less structured data - yes, it may be helpful to merge everything and then retrieve keywords. However, remember main rule: try it out. Theory is good, but only practical experiments with different approaches will reveal best one.
            $endgroup$
            – ffriend
            Jul 5 '14 at 20:33










          • $begingroup$
            @ffriend, How do we get that keyword list ?
            $endgroup$
            – NG_21
            Jan 28 '16 at 9:53






          • 1




            $begingroup$
            @ffriend What is the best way to extract "experience" = '5 years' , "Language" = 'C' from the following sentence. "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C". I used Rake with NLTK and it just removed the stopword + punctuations, but from the above sentence i don't need words like developing, bug-tracking, systems, creating, data etc. Thanks
            $endgroup$
            – Khalid Usman
            Oct 4 '16 at 13:27






          • 3




            $begingroup$
            @KhalidUsman: since you already work with NLTL, take a look at named entity recognition tools, especially "Chunking with Regular Expressions" section. In general, you would want to use a dictionary of keywords (e.g. "years", "C", etc.) and simple set of rules (like "contains 'C'" or "<number> years") to extract named entities out of a free-form text.
            $endgroup$
            – ffriend
            Oct 4 '16 at 18:41















          10












          $begingroup$

          Just extract keywords and train a classifier on them. That's all, really.



          Most of the text in CVs is not actually related to skills. E.g. consider sentence "I'm experienced and highly efficient in Java". Here only 1 out of 7 words is a skill name, the rest is just a noise that's going to put your classification accuracy down.



          Most of CVs are not really structured. Or structured too freely. Or use unusual names for sections. Or file formats that don't preserve structure when translated to text. I have experience extracting dates, times, names, addresses and even people intents from unstructured text, but not a skill (or university or anything) list, not even closely.



          So just tokenize (and possibly stem) your CVs, select only words from predefined list (you can use LinkedIn or something similar to grab this list), create a feature vector and try out a couple of classifiers (say, SVM and Naive Bayes).



          (Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even naive implementation will work well.)






          share|improve this answer











          $endgroup$












          • $begingroup$
            Say I am analyzing linkedin data, do you think it would be a good idea for me to merge the previous work experience, educations recommendations and skills of one profile into one text file and extract keywords from it ?
            $endgroup$
            – user1769197
            Jul 5 '14 at 14:46











          • $begingroup$
            LinkedIn now has skill tags that people assign themselves and other users can endorse, so basically there's no need to extract keywords manually. But in case of less structured data - yes, it may be helpful to merge everything and then retrieve keywords. However, remember main rule: try it out. Theory is good, but only practical experiments with different approaches will reveal best one.
            $endgroup$
            – ffriend
            Jul 5 '14 at 20:33










          • $begingroup$
            @ffriend, How do we get that keyword list ?
            $endgroup$
            – NG_21
            Jan 28 '16 at 9:53






          • 1




            $begingroup$
            @ffriend What is the best way to extract "experience" = '5 years' , "Language" = 'C' from the following sentence. "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C". I used Rake with NLTK and it just removed the stopword + punctuations, but from the above sentence i don't need words like developing, bug-tracking, systems, creating, data etc. Thanks
            $endgroup$
            – Khalid Usman
            Oct 4 '16 at 13:27






          • 3




            $begingroup$
            @KhalidUsman: since you already work with NLTL, take a look at named entity recognition tools, especially "Chunking with Regular Expressions" section. In general, you would want to use a dictionary of keywords (e.g. "years", "C", etc.) and simple set of rules (like "contains 'C'" or "<number> years") to extract named entities out of a free-form text.
            $endgroup$
            – ffriend
            Oct 4 '16 at 18:41













          10












          10








          10





          $begingroup$

          Just extract keywords and train a classifier on them. That's all, really.



          Most of the text in CVs is not actually related to skills. E.g. consider sentence "I'm experienced and highly efficient in Java". Here only 1 out of 7 words is a skill name, the rest is just a noise that's going to put your classification accuracy down.



          Most of CVs are not really structured. Or structured too freely. Or use unusual names for sections. Or file formats that don't preserve structure when translated to text. I have experience extracting dates, times, names, addresses and even people intents from unstructured text, but not a skill (or university or anything) list, not even closely.



          So just tokenize (and possibly stem) your CVs, select only words from predefined list (you can use LinkedIn or something similar to grab this list), create a feature vector and try out a couple of classifiers (say, SVM and Naive Bayes).



          (Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even naive implementation will work well.)






          share|improve this answer











          $endgroup$



          Just extract keywords and train a classifier on them. That's all, really.



          Most of the text in CVs is not actually related to skills. E.g. consider sentence "I'm experienced and highly efficient in Java". Here only 1 out of 7 words is a skill name, the rest is just a noise that's going to put your classification accuracy down.



          Most of CVs are not really structured. Or structured too freely. Or use unusual names for sections. Or file formats that don't preserve structure when translated to text. I have experience extracting dates, times, names, addresses and even people intents from unstructured text, but not a skill (or university or anything) list, not even closely.



          So just tokenize (and possibly stem) your CVs, select only words from predefined list (you can use LinkedIn or something similar to grab this list), create a feature vector and try out a couple of classifiers (say, SVM and Naive Bayes).



          (Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even naive implementation will work well.)







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited May 23 '17 at 12:38









          Community

          1




          1










          answered Jul 4 '14 at 22:46









          ffriendffriend

          2,4911016




          2,4911016











          • $begingroup$
            Say I am analyzing linkedin data, do you think it would be a good idea for me to merge the previous work experience, educations recommendations and skills of one profile into one text file and extract keywords from it ?
            $endgroup$
            – user1769197
            Jul 5 '14 at 14:46











          • $begingroup$
            LinkedIn now has skill tags that people assign themselves and other users can endorse, so basically there's no need to extract keywords manually. But in case of less structured data - yes, it may be helpful to merge everything and then retrieve keywords. However, remember main rule: try it out. Theory is good, but only practical experiments with different approaches will reveal best one.
            $endgroup$
            – ffriend
            Jul 5 '14 at 20:33










          • $begingroup$
@ffriend, how do we get that keyword list?
            $endgroup$
            – NG_21
            Jan 28 '16 at 9:53










            $begingroup$
@ffriend What is the best way to extract "experience" = '5 years', "Language" = 'C' from the following sentence: "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C"? I used RAKE with NLTK and it just removed the stopwords and punctuation, but from the above sentence I don't need words like developing, bug-tracking, systems, creating, data, etc. Thanks
            $endgroup$
            – Khalid Usman
            Oct 4 '16 at 13:27










            $begingroup$
@KhalidUsman: since you already work with NLTK, take a look at named-entity recognition tools, especially the "Chunking with Regular Expressions" section. In general, you would want to use a dictionary of keywords (e.g. "years", "C", etc.) and a simple set of rules (like "contains 'C'" or "<number> years") to extract named entities out of free-form text (a rough sketch follows below).
            $endgroup$
            – ffriend
            Oct 4 '16 at 18:41
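To make the rule-based extraction in the comment above concrete, here is a rough, purely illustrative sketch using plain regular expressions (the language dictionary and patterns are assumptions, not a robust extractor):

    import re

    sentence = ("I have spent 5 years developing bug-tracking systems "
                "and creating data managing system applications in C")

    # Dictionary of known language/skill keywords (extend as needed).
    languages = {"C", "C++", "Java", "Python"}

    # Rule 1: "<number> years" -> years of experience
    experience = re.search(r"(\d+)\s+years?", sentence)

    # Rule 2: any token found in the language dictionary -> language
    tokens = re.findall(r"[A-Za-z+#]+", sentence)
    found_languages = [t for t in tokens if t in languages]

    print({"experience": experience.group(1) + " years" if experience else None,
           "language": found_languages})
    # -> {'experience': '5 years', 'language': ['C']}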
















          $begingroup$

This is a tricky problem. There are many ways to handle it. I guess resumes can be treated as semi-structured documents. Sometimes it's beneficial to have some minimal structure in the documents. I believe you would see some tabular data in resumes, and you might want to treat it as attribute-value pairs. For example, you would get a list of terms for the attribute "Skill set".



The key idea is to manually configure a list of key phrases such as "skill", "education", "publication", etc. The next step is to extract terms which pertain to these key phrases, either by exploiting the structure in some way (such as tables) or by utilizing the proximity of terms around these key phrases; e.g. the fact that the word "Java" occurs in close proximity to the term "skill" might indicate that the person is skilled in Java.



After you extract this information, the next step could be to build up a feature vector for each of these key phrases. You can then represent a document as a vector with different fields (one for each key phrase). For example, consider the following two resumes represented with two fields, namely project and education.



          Doc1: project: (java, 3) (c, 4), education: (computer, 2), (physics, 1)



          Doc2: project: (java, 3) (python, 2), education: (maths, 3), (computer, 2)



In the above example, each term is shown with its frequency. Of course, while extracting the terms you need to stem and remove stop-words. It is clear from the examples that the person whose resume is Doc1 is more skilled in C than the person behind Doc2. Implementation-wise, it's very easy to represent documents as field vectors in Lucene.



Now, the next step is to retrieve a ranked list of resumes given a job specification. In fact, that's fairly straightforward if you represent queries (job specs) as field vectors as well: you just need to retrieve a ranked list of candidates (resumes) from a collection of resumes indexed in Lucene.
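In Lucene the fields would be declared on the index and scored by the engine; the underlying idea is simple enough to sketch in plain Python (the per-field weights and the toy job spec below are assumptions made only for illustration):

    import math
    from collections import Counter

    # Field-vector representation of the two example resumes above.
    docs = {
        "Doc1": {"project": Counter({"java": 3, "c": 4}),
                 "education": Counter({"computer": 2, "physics": 1})},
        "Doc2": {"project": Counter({"java": 3, "python": 2}),
                 "education": Counter({"maths": 3, "computer": 2})},
    }

    # A job spec expressed with the same fields.
    job_spec = {"project": Counter({"c": 1, "java": 1}),
                "education": Counter({"computer": 1})}

    field_weights = {"project": 0.7, "education": 0.3}  # assumption: projects matter more

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in a)
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def score(doc, query):
        # Weighted sum of per-field cosine similarities.
        return sum(w * cosine(query.get(f, Counter()), doc.get(f, Counter()))
                   for f, w in field_weights.items())

    ranking = sorted(docs, key=lambda d: score(docs[d], job_spec), reverse=True)
    print(ranking)  # Doc1 should rank first for this C-heavy job spec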






$endgroup$

answered Jul 3 '14 at 20:47 – Debasis
          • $begingroup$
Algorithm-wise, what would you recommend?
            $endgroup$
            – user1769197
            Jul 3 '14 at 21:59










          • $begingroup$
You mean an algorithm for computing the most similar resume vectors given a query job vector? You can use any standard algorithm such as BM25 or a Language Model (a toy BM25 sketch follows these comments)...
            $endgroup$
            – Debasis
            Jul 4 '14 at 11:35










          • $begingroup$
I have never heard of these algorithms at all. Are these NLP algorithms or ML algorithms?
            $endgroup$
            – user1769197
            Jul 4 '14 at 13:32










          • $begingroup$
These are standard retrieval models. A retrieval model defines how to compute the similarity between a document (a resume in your case) and a query (a job in your case).
            $endgroup$
            – Debasis
            Jul 4 '14 at 15:23










          • $begingroup$
I have no knowledge about information retrieval; do you think machine learning algorithms like clustering / nearest neighbour will also work in my case?
            $endgroup$
            – user1769197
            Jul 4 '14 at 16:37
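For reference, BM25 is compact enough to sketch from scratch; this toy version (the usual k1/b defaults, made-up documents) is only meant to show what a retrieval model actually scores:

    import math
    from collections import Counter

    resumes = ["java c developer backend services",
               "python data analysis machine learning",
               "accounting excel auditing finance"]
    job_spec = "java backend developer"

    k1, b = 1.5, 0.75
    docs = [r.split() for r in resumes]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term

    def bm25(query, doc):
        tf = Counter(doc)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        return score

    ranked = sorted(range(N), key=lambda i: bm25(job_spec, docs[i]), reverse=True)
    print([resumes[i] for i in ranked])  # the Java resume should come out on top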















          $begingroup$

I work for an online jobs site and we build solutions to recommend jobs based on resumes. Our approach takes a person's job title (or desired job title, if a student and known), along with the skills we extract from their resume and their location (which is very important to most people), and finds matching jobs based on that.



In terms of document classification, I would take a similar approach. I would recommend computing a tf-idf matrix for each resume as a standard bag-of-words model, extracting just the person's job title and skills (for which you will need to define a list of skills to look for), and feeding that into an ML algorithm. I would recommend trying kNN and an SVM; the latter works very well with high-dimensional text data. Linear SVMs tend to do better than non-linear ones (e.g. using RBF kernels). If that is outputting reasonable results, I would then play with extracting features using a natural language parser/chunker, and also some custom-built phrases matched by regexes.
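A rough sketch of this tf-idf + linear SVM approach, assuming scikit-learn (the resumes and sector labels are made up; LinearSVC handles the multi-class case with one-vs-rest by default):

    # tf-idf bag-of-words features fed into a linear SVM.
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.neighbors import KNeighborsClassifier  # kNN alternative to compare

    resumes = ["java developer spring sql backend",
               "registered nurse patient care hospital",
               "financial analyst excel forecasting budgets"]
    sectors = ["IT", "Healthcare", "Finance"]

    # Swap LinearSVC() for KNeighborsClassifier(n_neighbors=3) to try the kNN variant.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(resumes, sectors)

    print(model.predict(["sql and java course projects, backend internship"]))
    # expected to lean towards 'IT'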






$endgroup$

answered Jul 7 '14 at 18:36 – Simon
          • $begingroup$
Do you still use SVM when you have 3 or more classes? And what features do you want to extract using a natural language parser? For what purpose?
            $endgroup$
            – user1769197
            Jul 8 '14 at 15:40










          • $begingroup$
You can train n SVMs for n classes using a one-vs-the-rest strategy. scikit-learn has code to do that automatically. Technically you need n-1 classifiers, but I've found that having n works better.
            $endgroup$
            – Simon
            May 18 '15 at 19:54










          • $begingroup$
@Simon Can you write out the complete steps for this recommendation system? I have a little experience in ML (from implementing my MS thesis), but I am totally new to the IR field. Now I am working on this system and I wrote down the following steps: 1. use NLTK to extract keywords, 2. calculate scores for keywords and phrases, 3. stemming, 4. categorization (the most challenging task) and 5. frequency matrix, tf-idf or BM25. Am I on the right track? Thanks
            $endgroup$
            – Khalid Usman
            Oct 4 '16 at 13:17











          • $begingroup$
@KhalidUsman I can't tell you exactly how it works; that may get me in trouble. The easiest solution would be to put the data into Solr or Elasticsearch and use their MLT recommender implementations. A more sophisticated approach is to extract key words and phrases, push the docs through LSA, and do k-nn on the resulting vectors (see the sketch after these comments). Then you may wish to use other signals such as collaborative filtering and overall popularity.
            $endgroup$
            – Simon
            Oct 16 '16 at 19:27










          • $begingroup$
@Simon, thanks for your guidance. I am applying the second way: I have extracted keywords/keyphrases using RAKE+NLTK and after that I was planning to apply tf-idf or BM25. Am I right? Can you please elaborate on the kNN way a little bit, i.e. how to apply kNN on keywords; should I make the keywords features? Thanks
            $endgroup$
            – Khalid Usman
            Oct 17 '16 at 11:03
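A small sketch of the "LSA, then k-nn on the resulting vectors" idea from the comment above, assuming scikit-learn (toy documents, and only two latent dimensions because the corpus is tiny):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.neighbors import NearestNeighbors

    docs = ["java backend developer sql",
            "python machine learning data",
            "nurse patient care hospital",
            "excel budgets financial analyst"]

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)

    # LSA: project the tf-idf vectors into a low-dimensional latent space.
    lsa = TruncatedSVD(n_components=2, random_state=0)
    X_lsa = lsa.fit_transform(X)

    # k-nn in the latent space: for a new resume, return the most similar documents.
    nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X_lsa)
    query = lsa.transform(tfidf.transform(["sql and java developer"]))
    distances, indices = nn.kneighbors(query)
    print([docs[i] for i in indices[0]])  # the Java/SQL resume should come back first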














