Latent Dirichlet allocation (LDA) is an unsupervised topic model, similar in spirit to k-means clustering, and one of its applications is to discover common themes, or topics, that occur across a collection of documents. In a nutshell, a topic is characterized by a distribution over words, and these latent, or undiscovered, topics are represented as random mixtures within documents. The process followed here is based on a case study provided by MIT Professional Education Digital Programs, Data Science: Data to Insights, “Finding Themes In Project Descriptions,” presented by Professor Tamara Broderick. MIT provided numerous case studies in this course, and I decided to implement this one because it is pertinent to my current discussions on automated integration. All code for this project is provided on Github.
Topic Models:
Many text-mining algorithms are based on word frequency and assign each word (or document) to one and only one cluster. This can be misleading: a single word can occur multiple times in several articles, and each article can cover one or many topics. So, unlike k-means clustering, it makes sense to allow each article (and each word) to be associated with multiple topics. A model that allows this is referred to as a mixed membership model, and LDA is one of these models. Feature allocation, mixed membership, and admixture are all terms for the idea that data points can belong to multiple groups simultaneously.
This model has many scientific applications (e.g. genetics), but can be used as well in business and content management systems to capture concepts and topics that cross multiple lines of business. Just to provide one example, an algorithm such as LDA would be beneficial in marketing research, looking at multiple news articles from various news sources to see what topics pop up in the same articles or group of articles. New associations can be discovered that spark ideas for target markets, or product ideas.
Case Study using Articles Written by MIT EECS Faculty:
There are four main labs in the Electrical Engineering and Computer Science department at MIT:
- Computer Science and Artificial Intelligence Laboratory (CSAIL)
- Laboratory for Information and Decision Systems (LIDS)
- Microsystems Technology Laboratories (MTL)
- Research Laboratory of Electronics (RLE)
The use case here is that a great deal of research occurs at MIT, and it is difficult for professors to keep up with everyone’s work. This study is designed to enable a professor to get a high-level summary of what other professors are working on. According to Professor Tamara Broderick, “this line of thought suggests we should try to find latent groups of research topics among the faculty.” Also, because professors can work on multiple research topics, we need a mixed membership model to properly capture the topics. As mentioned earlier, LDA is one of these mixed membership models.
Similarly, for the project on automated integration, pulling together documents throughout a company or healthcare organization, possibly from a content management system or physician notes from the Electronic Medical Records (EMR) system, project plans, architecture review board submissions, and more, it would be possible to automate topic discovery for disparate lines of business. These topics could define possible Subject Areas (SA) within the organization and both entity and attribute associations within that topic, or SA.
Gathering Documents:
For this case study, I scraped abstracts from 1,119 articles written by MIT EECS faculty members. A service provided by Cornell University called arXiv.org (pronounced "archive") maintains a website of scholarly articles. By scraping both MIT’s website for faculty member names and then the arXiv.org site for articles authored by that faculty, I was able to obtain the abstract from each article.
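arXiv also exposes its article metadata through a public Atom API, which is one way this kind of scrape can be done. The sketch below, which is my own illustration and not the project's actual scraper (function names such as `parse_abstracts` are hypothetical), fetches an author's feed and pulls the title and abstract out of each entry:

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def parse_abstracts(atom_xml):
    """Extract (title, abstract) pairs from an arXiv Atom feed document."""
    root = ET.fromstring(atom_xml)
    results = []
    for entry in root.iter(ATOM_NS + "entry"):
        title = entry.find(ATOM_NS + "title").text.strip()
        summary = entry.find(ATOM_NS + "summary").text.strip()
        results.append((title, summary))
    return results

def fetch_author_feed(last_name, max_results=25):
    """Query arXiv's public export API for articles by an author (network call)."""
    url = ("http://export.arxiv.org/api/query?search_query=au:%s&max_results=%d"
           % (last_name, max_results))
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")
```

In practice you would call `parse_abstracts(fetch_author_feed("Williams"))` for each faculty name scraped from MIT's site, adding exception handling around each request.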
You will notice in the code that I used a lot of exception handling to skip past the occasional anomaly in the data, like an article missing one or all of the 6 attributes I was looking for. I used a counter in the exception handling and identified 10 errors during the import process. Of course, if each error represented a page of 25 articles, I could have missed as many as 250 documents, but this number will suffice for the purposes here.
Exploratory Data Analysis and Data Preprocessing:
For most Natural Language Processing (NLP) work, a number of standard data preprocessing steps are usually performed. This consists of converting all text to lowercase, tokenizing, and removing meaningless words (stopwords, e.g., the, in, that, off, once, here, there) from what is called the corpus, or body of text. Punctuation, numbers, and special characters are then removed from the corpus, followed by grouping words that only retain their meaning in combination (e.g., no fees, dogs allowed, no pets, oracle database, data architect, blood pressure, arterial pressure, heart rate). Finally, whitespace is removed, and the corpus is put through a process called stemming and lemmatization, where common suffixes are removed, like “ing,” “ly,” “es,” and “s.” You might recognize the root word lemma from the “Words Have Power: Part 2” article: the heading that indicates the subject of an annotation or a literary composition or a dictionary entry.
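The steps above can be sketched in a few lines of plain Python. This is a deliberately crude stand-in (the real pipeline would use a proper library such as NLTK with its stopword lists and PorterStemmer; the stopword set and suffix list here are illustrative, not the project's actual ones):

```python
import re

# Tiny illustrative stopword set; a real pipeline would use a full list.
STOPWORDS = {"the", "in", "that", "off", "once", "here", "there",
             "a", "an", "of", "and", "to", "is"}
SUFFIXES = ("ing", "ly", "es", "s")  # common suffixes to strip

def crude_stem(word):
    """Strip one common suffix -- a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop punctuation/numbers, remove stopwords, then stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]
```

For example, `preprocess("Removing the stopwords, once here!")` collapses to the stemmed tokens `["remov", "stopword"]`.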
Figure 1 below shows the distribution of abstracts since 1990 (1,119 articles). As you can see, most of the abstracts posted have been in the last eight to ten years. For visualization purposes, all abstracts from 1990-2009 were combined which is reflected in Fig 2.
In some cases, the data available online did not provide all information on MIT faculty such as their department, and I was not able to find articles for all faculty members. The faculty members with no published papers were removed from the faculty list. The number of faculty members with articles found on arXiv.org, by department are:
There are 5 groups/departments, counting “UNK”:
- CSAIL: 52 faculty members
- RLE: 29 faculty members
- MTL: 4 faculty members
- LIDS: 12 faculty members
- UNK: 8 faculty members
Therefore, 105 faculty members account for 1,119 abstracts/articles (avg 10.6 articles per faculty member). The figure below shows the distribution of articles by department:
The CSAIL department is much larger than the others, so you would expect them to have more published articles (also, this makes the assumption they all use arXiv.org). The following figure shows the frequency of the top 5 words used in abstracts by department:
Again, the code is available on Github, along with the data. Just a cautionary note: I didn’t worry about optimizing the application for performance, and since my machine has 64GB of RAM, I didn’t worry about memory. That said, I captured a lot of data and brought it all in as a dictionary object with the following structure:
Dictionary Structure:
key: 2017-03-25_Williams
value: List of the following items:
index: 0 type: <class ‘str’> value: Williams, Virginia
index: 1 type: <class ‘str’> value: 2017/03/25
index: 2 type: <class ‘tuple’> value: (‘2017’, 1)
index: 3 type: <class ‘str’> value: https://arxiv.org//abs/1703.08713
index: 4 type: <class ‘str’> value: Title: A systematic comparison of two-equation RANS turbulence models applied to shock-cloud interactions
index: 5 type: <class ‘str’> value: Abstract: Turbulence models attempt to account . . .
index: 6 type: <class ‘list’> value: [Abstract: list of tokenized and stemmed words . . . ]
index: 7 type: <class ‘list’> value: [Title: list of tokenized and stemmed words. . . . ]
index: 8 type: <class ‘list’> value: [Abstract/Title Combined: list of tokenized and stemmed words. . . . ]
index: 9 type: <class ‘str’> value: Department: CSAIL

The key was constructed using the citation date, concatenated with an underscore and the author’s last name. I went on the assumption that two professors with the same last name would not submit an article on the same date. That is obviously possible, but the probability is low.
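Building one of these records might look like the sketch below. The function name and the token lists are my own illustration, not the project's actual code; only the key construction and the 10-item value layout follow the structure described above:

```python
def make_entry(author, cite_date, year_count, url, title, abstract,
               abstract_tokens, title_tokens, combined_tokens, department):
    """Build one dictionary record keyed by 'YYYY-MM-DD_LastName'."""
    last_name = author.split(",")[0].strip()
    key = "%s_%s" % (cite_date.replace("/", "-"), last_name)
    value = [author, cite_date, year_count, url, title, abstract,
             abstract_tokens, title_tokens, combined_tokens, department]
    return key, value

articles = {}
key, value = make_entry(
    "Williams, Virginia", "2017/03/25", ("2017", 1),
    "https://arxiv.org/abs/1703.08713",
    "Title: A systematic comparison ...", "Abstract: Turbulence models ...",
    ["turbul", "model"], ["systemat", "comparison"],
    ["turbul", "model", "systemat", "comparison"], "CSAIL")
articles[key] = value  # key is "2017-03-25_Williams"
```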
The Model: LDA-SVI
The analysis was performed by creating a vocabulary from the 50 most frequently used words in each department’s articles, stored in a collections Counter object. I followed Matthew Hoffman’s LDA Stochastic Variational Inference (SVI) [3] protocol with a few exceptions. The original code required significant modifications since it was written in Python 2, and Python 3 was used for this analysis. Hoffman provides no visualization that I could find. However, the module that executes the algorithm (onlineldavb.py) was virtually untouched; the only changes made were those required to convert it to Python 3.6. The references below provide detail on the different algorithms and models.
Hoffman defines SVI as an application of stochastic optimization that, when applied to the LDA model, produces a more efficient algorithm.
Modifications:
Since the data sources are not randomly selected from Wikipedia, all of the code for data extraction from arXiv had to be written, along with all of the data preprocessing and presentation. The data and code are provided on Github. One advantage of the online LDA model is that it trains incrementally in small batches, which makes it a good choice for large datasets, and particularly for streaming data.
The Model and Parameters:
With k-means clustering it is necessary to predefine the number of clusters; likewise, with LDA-SVI you must predefine the number of topics, denoted by the parameter K. There are numerous heuristics for determining an optimal K, and applying them to this model suggested that a small value of K = 5 yields more meaningful results.
Variable notation:
Here is a brief overview of the input variables for the LDA algorithm. Variables involved in the calculations:
- λ: what we want in the end (the variational posterior over the vocabulary for each topic)
- vocab: As defined by Hoffman: “A set of words to recognize. When analyzing documents, any word not in this set will be ignored.”
- Κ: 5 (number of topics desired)
- D: 1119 (total number of documents available)
- α: 1/Κ (parameter for per-document topic distribution)
- η: 1/Κ (parameter for per-topic vocab distribution)
- τ: 1024 (delay that down weights early iterations)
- κ: 0.7 (forgetting rate, controls how quickly old information is forgotten; the larger the value, the slower it is)
- max iterations: the maximum number of iterations the updates should run for. The code written for this analysis checks the difference between two consecutive values of the model’s perplexity; when the difference falls below a threshold, the algorithm is said to have converged.
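The convergence check described above can be sketched as a small driver loop. This is my own illustration of the idea, not the project's actual code; `update_step` stands in for one mini-batch update of the onlineldavb model and is assumed to return the current perplexity:

```python
def run_until_converged(update_step, tol=0.001, max_iterations=1000):
    """Iterate variational updates until two consecutive perplexities
    differ by less than tol, or max_iterations is reached.

    Returns (iterations_used, final_perplexity).
    """
    prev = None
    perplexity = float("inf")
    for i in range(1, max_iterations + 1):
        perplexity = update_step()  # one mini-batch update; returns perplexity
        if prev is not None and abs(prev - perplexity) < tol:
            return i, perplexity
        prev = perplexity
    return max_iterations, perplexity
```

With a threshold of 0.001, a run like the one in this analysis would stop as soon as the perplexity curve flattens out.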
Results:
The convergence threshold was set to 0.001, and the model converged after 292 iterations. The model returned two file sets: gamma (γ) and lambda (λ). The γ file contains one column per topic with the variational topic weights theta (Θ), one document per line. The λ file contains the variational distributions over the vocabulary, with one row of distribution values per topic. Since the vocabulary contained 148 words, each λ file contained 5 rows, each with 148 distribution values.
The data from the λ files were used to determine the words with the greatest values per topic. In the figure below, the size of each word reflects its proportional share of all words within that topic:
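Extracting those top words from a λ matrix is a simple ranking. The sketch below is illustrative (it takes λ as a plain list of rows rather than the files the model actually writes) and normalizes each topic's weights so the values can be read as proportions:

```python
def top_words_per_topic(lam, vocab, n=5):
    """For each topic row in lam, return the n (word, share) pairs with the
    largest weights, where share is the word's fraction of the topic's
    total weight."""
    topics = []
    for row in lam:
        total = sum(row)
        ranked = sorted(zip(vocab, row), key=lambda wv: wv[1], reverse=True)
        topics.append([(word, weight / total) for word, weight in ranked[:n]])
    return topics
```

These (word, share) pairs are exactly what drives the word sizes in the topic figure.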
Looking at the exploratory analysis along with the results, my guess, without further analysis, is that: 1) Topic #1 is associated with articles written by LIDS faculty, 2) Topic #2 crosses over between LIDS and RLE, 3) Topic #3 is predominantly RLE, 4) Topic #4 is predominantly CSAIL, and 5) Topic #5 is predominantly RLE. The γ files can now be analyzed to associate the topics with specific documents. We could also infer, using γ, the topic distribution of each document, and use this to guess the main focus of each publication scraped from arXiv. One way to do this is, for each topic, to sum the normalized probability of each word in that topic (so that a less-used word weighs less) over all words in the document, and then compare this value across topics.
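That per-document scoring idea can be sketched directly. Again this is an illustration under the assumptions above (λ passed as a list of rows, documents as lists of preprocessed tokens), not the project's actual code:

```python
def score_document(doc_tokens, lam, vocab):
    """Score each topic for a document by summing the topic's normalized
    word probabilities over the document's words; return (best_topic, scores)."""
    index = {word: j for j, word in enumerate(vocab)}
    scores = []
    for row in lam:
        total = sum(row)
        probs = [weight / total for weight in row]  # normalize the topic row
        # Words outside the vocabulary are ignored, as in the model itself.
        scores.append(sum(probs[index[w]] for w in doc_tokens if w in index))
    return scores.index(max(scores)), scores
```

Comparing the resulting scores across topics gives a rough guess at each publication's main focus.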
Conclusion:
K-means clustering is used to discover clusters, and Latent Dirichlet Allocation enables us to discover topics. Here we have discovered topics that cross disciplines within MIT’s EECS departments. At this point, the LDA analysis of the research documents at MIT could be extended for several use cases. The data could be analyzed within each department, and then cross trained with different vocabularies against other department research documentation. The bottom line from this analysis is that we are now able to define coherent themes of research across MIT EECS, and could use the same techniques for similar problems outside of this domain.
Topic models like LDA have numerous applications in scientific as well as commercial domains. There is so much data available today that it is hard to imagine ways in which these algorithms could not be used. Wikipedia alone publishes over 800 new articles per day! With all of this data out there, you could easily define a vocabulary with keywords of interest and mine a wealth of information. Of course, I am only talking about textual data here, but these algorithms have applications for images as well. As the data grows, we want our K to grow with it, which will require even more advanced algorithms to manage it all. I think we have all experienced information overload, but with algorithms like LDA, we are empowered to sift through the minutiae and extract meaningful information.
References
Figure 2: Lab Group interests for different values of K.