Natural Language Processing (NLP) is a field of study involving the interaction between computers and human spoken and written language, which implies both understanding and communicating. As you might have already guessed, this can become very complicated, and many people have simplified it to the point of being only about data-driven word clouds or sentiment analysis. Both of those fall into the realm of NLP, but they are not the extent of it.
Over the next few posts, I am going to start with relatively simple analysis of text: developing the Corpus, cleaning up the data with some standard preprocessing techniques, and then performing word frequency analysis and producing a word cloud and a cluster plot. Initially, I will use the R programming language, then move to Python for the sentiment analysis, and finally use a combination of both R and Python to demonstrate the purpose of an ontology in refining word meanings in a given context.
This series of posts will be a work in progress, so if you find errors, or see a simpler way of doing something, please post your comments. I will not be offended. More elegant code is preferred, but my focus for the posts will be demonstrating the concepts and obtaining valid results, no matter how ugly the code. So, for you coding aficionados, please contribute your refined solutions.
Data: Your Resume in Text Format
This exercise is (hopefully) data agnostic, meaning that you can use the code as written below and place your text-formatted document(s) in a subdirectory ‘resumes’ within your working directory. For that matter, you could place text files of Shakespeare’s entire body of work in the ‘resumes’ directory and run the code with that data.
However, the idea is to keep it simple. Open two or three of your resumes in Word, or Pages, and save them as text files in the ‘resumes’ directory. We will then apply some Natural Language Processing techniques to the data using R and several well-known packages that provide text mining capabilities.
Create a Corpus:
Note: the code is in RMarkdown. You can omit the first and last lines if you are running an R script.
```{r loadData, cache=TRUE}
# Here we are simply setting a relative path to the text files.
# I placed the resume text files in a directory called 'resumes'.
dirname <- file.path("./resumes/")

# Next we will load R's Text Mining package, called 'tm'. It provides
# functions for manipulating text in the R programming language.
library(tm)

# Create Corpus
docs <- Corpus(DirSource(dirname))
```
Remove Punctuation and Special Characters, Change to Lowercase:
With the documents loaded into the variable `docs`, the data must now be modified: punctuation and special characters are removed, all letters are converted to lowercase, and in general the data is standardized to make it easier for the computer to process.
```{r dataPreprocessing, cache=TRUE}
# Replace slashes, at-signs, and pipes with spaces. content_transformer()
# wraps a plain function so tm_map can apply it to the content of each
# document (required by tm 0.6 and later).
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

# You can now inspect each document/resume with the following commands:
# inspect(docs[1])
# inspect(docs[2])
# inspect(docs[3])

# With the following commands, punctuation is removed and all letters
# are converted to lower case.
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, content_transformer(tolower))
```
Combining Words to Maintain Context:
In many instances, words only have the correct meaning if their order is maintained. For example, ‘oracle’ by itself could be thought of as the all-knowing, all-seeing ‘oracle’, but in the context of a resume in the technology field, ‘oracle’ is likely to mean the technology company Oracle. If you combine ‘oracle’ with ‘sales’ and ‘representative’, then you have a better idea of the meaning.
In the next step, words are combined where appropriate; a phrase can be reduced to a meaningful acronym, or the words can simply be joined with an underscore. By using the `inspect(docs[#])` command, you can scan through the document(s) and determine which words should stay grouped.
```{r dataPreprocessing2, cache=TRUE}
# Define a reusable transformer that replaces one phrase with another,
# then collapse multi-word phrases into single tokens.
replacePhrase <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, replacePhrase, "oracle certified professional", "ocp")
docs <- tm_map(docs, replacePhrase, "oracle administrator", "oracle_admin")
docs <- tm_map(docs, replacePhrase, "machine learning", "mach_learn")
docs <- tm_map(docs, replacePhrase, " r ", " r_program ")
docs <- tm_map(docs, replacePhrase, "sun certified systems administrator", "SCSA")
docs <- tm_map(docs, replacePhrase, "financial industry business ontology", "FIBO")
```
Remove Numbers and Stopwords:
Removing numbers is another task typically performed on the Corpus, but for certain types of documents you might instead want to add number-bearing terms to the list of combined words above. For example, ‘24×7 Doorman’ would be reduced to ‘x’ and ‘doorman’, and ‘Oracle 11g’ would become ‘oracle’ and ‘g’ (not ‘oracle g’). So, removing numbers is something that should be considered, but with caution.
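If you do want to preserve such a term, one option is to fold it into a digit-free token before the numbers are stripped. This is a sketch of my own, reusing the `replacePhrase` helper from the phrase-combining step above; the replacement tokens are just illustrative choices:

```{r protectNumbers, eval=FALSE}
# Replace number-bearing phrases with digit-free tokens so they
# survive removeNumbers with their meaning intact.
docs <- tm_map(docs, replacePhrase, "24x7 doorman", "twentyfourseven_doorman")
docs <- tm_map(docs, replacePhrase, "oracle 11g", "oracle_eleven_g")
```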
Following the removal of numbers, we remove what are called “stopwords”. With the ‘tm’ package loaded in R, you can see the list of these words by running `stopwords("english")`. Stopwords are words that don’t carry much meaning and, as a result, dilute the significance of a search query.
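To get a feel for what is on that list, you can peek at the first few entries (a quick look of my own, not part of the cleaning pipeline):

```{r stopwordsPeek, eval=FALSE}
# Show the first ten entries of tm's built-in English stopword list.
head(stopwords("english"), 10)
```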
```{r dataPreprocessing3, cache=TRUE}
# Remove numbers from the documents.
docs <- tm_map(docs, removeNumbers)

# Remove the standard English stopwords.
docs <- tm_map(docs, removeWords, stopwords("english"))

# inspect(docs[2])   # Check to see if it worked.
```
Stemming and Removing Whitespace and Unwanted Words:
At times there are words that should be removed because they do not carry any real meaning in the context of the Corpus, for example ‘email’, ‘inc’, or any other word that you might observe as you scan through the documents. Removing these words from the Corpus is rather straightforward.
Also, to group words more accurately, common suffixes such as “es”, “ing”, and “s” are removed. This process is referred to as “stemming”, and its purpose is to make sure the computed counts and frequencies are not diluted across variants of the same word. To accomplish stemming, another R package called “SnowballC” is used.
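To see what the stemmer does before applying it to the whole Corpus, SnowballC's `wordStem` function can be tried on a few sample words. This is a standalone illustration of mine, not part of the pipeline:

```{r stemDemo, eval=FALSE}
library(SnowballC)
# Each word is reduced to its stem, so variants of the same word
# are counted as a single term.
wordStem(c("architect", "architecture", "models", "modeling", "analyses"))
```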
If you inspect your documents now, you will see that a lot of whitespace has been introduced. This is also removed in the processing steps below.
The final step in the data preprocessing is converting your documents into plain text documents.
```{r dataPreprocessing4, cache=TRUE}
# Removing particular words:
docs <- tm_map(docs, removeWords, c("north carolina", "texas", " inc ", "email", " tx ", " v ", " g "))

# Stem the documents with the SnowballC package.
library(SnowballC)
docs <- tm_map(docs, stemDocument)

# Strip the whitespace introduced by the steps above.
docs <- tm_map(docs, stripWhitespace)

# Convert the documents to plain text documents.
docs <- tm_map(docs, PlainTextDocument)
```
Data Analysis:
At this point you are hopefully ready to start processing and analyzing the text. The first step in this process is the conversion of the text documents into a document-term matrix.
```{r dataAnalysis, cache=TRUE}
# Convert the Corpus into a document-term matrix.
dtm <- DocumentTermMatrix(docs)
dtm
```
```
<<DocumentTermMatrix (documents: 3, terms: 448)>>
Non-/sparse entries: 874/470
Sparsity           : 35%
Maximal term length: 62
Weighting          : term frequency (tf)
```
Running `inspect(dtm)` directly produces messy output, so first observe the object's structure with the command `str(dtm)`. Once you see the structure, you can use commands like `inspect(dtm[1:3, 1:20])`, which shows the frequency of the first 20 terms in all 3 documents.
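As a minimal sketch, those inspection commands can be run as a chunk of their own (the chunk name is my own):

```{r inspectDtm, eval=FALSE}
# Examine the structure of the document-term matrix object.
str(dtm)

# Show the frequency of the first 20 terms in all 3 documents.
inspect(dtm[1:3, 1:20])
```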
The following creates a transposed matrix that will be used in the analysis below. Just to refresh your memory, the transpose of an m x n matrix A is the n x m matrix A^T; basically, the rows of matrix A become the columns of matrix A^T.
```{r dataAnalysis2, cache=TRUE}
# Create the transposed (term-document) matrix.
tdm <- TermDocumentMatrix(docs)
print(tdm)

# Total frequency of each term across all documents, and an ordering index.
freq <- colSums(as.matrix(dtm))
ord <- order(freq)
```
```
<<TermDocumentMatrix (terms: 448, documents: 3)>>
Non-/sparse entries: 874/470
Sparsity           : 35%
Maximal term length: 62
Weighting          : term frequency (tf)
```
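As a quick sanity check of my own (not part of the original workflow), you can confirm that `tdm` has the dimensions of `dtm` swapped:

```{r checkTranspose, eval=FALSE}
# dtm is documents x terms; tdm is terms x documents.
dim(dtm)   # documents, terms
dim(tdm)   # terms, documents
```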
Remove Sparse Terms:
At this point, you will want to remove the terms that occur so rarely as to be insignificant; the `removeSparseTerms` function drops these sparse terms from the matrix.
```{r dataAnalysis3, cache=TRUE}
# A maximum sparsity of 0.1 keeps only terms that appear in
# (almost) every document.
dtms <- removeSparseTerms(dtm, 0.1)

# Sort the terms by descending frequency.
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
# findFreqTerms(dtm, lowfreq=10)
wf <- data.frame(word=names(freq), freq=freq)
# head(wf)

# Plot the terms that occur more than 10 times.
library(ggplot2)
p <- ggplot(subset(wf, freq>10), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```
Looks like someone uses the word “data” a little too often in their resume! 🙂
Possibly I should have grouped “big data”, “data architect”, “data modeling”, and “data analysis”. Also, keep in mind that this is three resumes all being analyzed as one Corpus.
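If you wanted to make that grouping, a minimal sketch would be to add substitutions like these to the phrase-combining step earlier in the pipeline and rerun the analysis (the underscore tokens are my own choice):

```{r groupDataPhrases, eval=FALSE}
# Keep multi-word data phrases together as single tokens.
docs <- tm_map(docs, replacePhrase, "big data", "big_data")
docs <- tm_map(docs, replacePhrase, "data architect", "data_architect")
docs <- tm_map(docs, replacePhrase, "data modeling", "data_modeling")
docs <- tm_map(docs, replacePhrase, "data analysis", "data_analysis")
```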
Create a Word Cloud:
```{r wordcloud, cache=TRUE, fig.width=11, fig.height=6}
library(wordcloud)

# Fix the random seed so the cloud layout is reproducible.
set.seed(1)

# Use a six-color palette from RColorBrewer.
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freq), freq, min.freq=2, max.words=100, rot.per=0.2, colors=dark2)
```
Not the most beautiful word cloud I’ve seen, but you can play around with it to make it what you are looking for. 🙂
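For example, here is a variant with parameter values you might experiment with (the values are arbitrary choices of mine): a larger size range, fewer words, and no rotated terms:

```{r wordcloud2, eval=FALSE}
# Same data, different look: bigger size range, at most 50 words,
# plotted in order of frequency with no rotation.
wordcloud(names(freq), freq, scale=c(5, 0.5), min.freq=3, max.words=50, random.order=FALSE, rot.per=0, colors=dark2)
```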
Dendrogram and k-Means Cluster:
First we will create a dendrogram, a hierarchical diagram showing the taxonomy of the clusters in the data. It is useful for visualizing the clustering of the data and for establishing the value of k. The objective of k-means clustering is to partition the data into k clusters, each data point assigned to the cluster with the nearest mean value. In R, the distance matrix is calculated with `dist` using the Euclidean method, and `hclust` is the function used to perform the hierarchical cluster analysis. The k-means procedure is:
1. After visualizing the dendrogram, determine the starting value for the number of clusters k
2. Randomly assign each data point to a cluster
3. Compute the cluster centroids, which become the new means
4. Each point is then reassigned to the closest cluster centroid (mean)
5. Re-compute cluster centroid/means
6. Repeat steps 4 and 5 until no improvement is made
```{r dendrogram, cache=TRUE}
# Remove sparse terms, leaving a denser matrix for clustering.
dtmss <- removeSparseTerms(dtm, 0.1)
# inspect(dtmss)

library(cluster)

# Euclidean distances between terms, then hierarchical clustering
# using Ward's method.
d <- dist(t(dtmss), method="euclidean")
fit <- hclust(d=d, method="ward.D2")
fit

plot.new()
plot(fit, hang=-1)
```
```{r dendrogramWGroups, cache=TRUE, fig.width=11, fig.height=6}
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=6)           # "k=" defines the number of clusters you are using
rect.hclust(fit, k=6, border="red")  # draw the dendrogram with red borders around the 6 clusters
```
```{r kMeansClustPlot, cache=TRUE, fig.width=11, fig.height=6}
library(fpc)

# k-means clustering on the term distance matrix, with k=6.
d <- dist(t(dtmss), method="euclidean")
kfit <- kmeans(d, 6)

# Plot the clusters.
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=1, lines=0)
```
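If you would rather not rely solely on eyeballing the dendrogram to choose k, one common alternative (my addition, not part of the original workflow) is an elbow plot of the within-cluster sum of squares over a range of k values:

```{r elbowPlot, eval=FALSE}
# Total within-cluster sum of squares for k = 2..10; look for the
# "elbow" where adding more clusters stops paying off.
wss <- sapply(2:10, function(k) kmeans(as.matrix(d), centers=k, nstart=10)$tot.withinss)
plot(2:10, wss, type="b", xlab="Number of clusters k", ylab="Within-cluster sum of squares")
```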