Aiming to deliver faster and better results than most methods to date, a combined team from the MIT-IBM Watson AI Lab and the MIT Geometric Data Processing Group has developed a technique that combines several popular AI tools: topic modelling, word embeddings and optimal transport.
The researchers say their approach can scan millions of documents using only a person’s historical preferences, or the preferences of a group of people, as a basis.
“There’s a ton of text on the internet,” Justin Solomon, lead author on the research and MIT assistant professor, said. “Anything to help cut through all that material is extremely useful.”
The algorithm devised by Solomon and his colleagues summarises a collection of texts into topics, based on words commonly used across the collection. It then breaks each text down into its five to fifteen most important topics, with a ranking that indicates how important each topic is to the text as a whole. Embeddings, which are numerical representations of data (in this case, words), help to make the similarities between words clear. Optimal transport, in turn, calculates the most efficient way of moving objects (here, data points) between multiple destinations.
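To make these two ingredients concrete, here is a minimal Python sketch. The two-dimensional word vectors, the topic names and the helper functions `word_distance` and `rank_topics` are all invented for illustration; they are not taken from the researchers' work, and real embeddings typically have hundreds of dimensions.

```python
import math

# Hypothetical 2-D word embeddings, made up for this example.
EMBEDDINGS = {
    "ship": (0.9, 0.1),
    "boat": (0.8, 0.2),
    "court": (0.1, 0.9),
}


def word_distance(a, b):
    """Euclidean distance between the embeddings of two words."""
    return math.dist(EMBEDDINGS[a], EMBEDDINGS[b])


def rank_topics(proportions):
    """Sort a document's topics from most to least important."""
    return sorted(proportions.items(), key=lambda item: item[1], reverse=True)


# Related words sit closer together in embedding space than unrelated ones.
print(word_distance("ship", "boat") < word_distance("ship", "court"))

# A document described by (hypothetical) topic proportions that sum to 1.
print(rank_topics({"sea": 0.2, "law": 0.7, "romance": 0.1}))
```

The distance between word vectors is what later serves as the "cost" of moving one word onto another when optimal transport is applied.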
The embeddings make it possible to apply optimal transport twice: first to compare topics within the text collection, and then to measure how closely similar themes overlap between texts. According to the researchers, this works particularly well when scanning large collections of books and documents. In an evaluation of 1,720 pairs of titles from the Project Gutenberg dataset, the algorithm compared all of the pairs in one second. The researchers say this is more than 800 times faster than the best previous method.
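The two-level idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the vocabulary, vectors, topics and documents are invented, and the transport problem is solved approximately with Sinkhorn iterations (an entropy-regularised solver chosen here for brevity; the article does not say which solver the team used).

```python
import math


def sinkhorn_cost(a, b, cost, reg=0.05, iters=500):
    """Approximate optimal transport cost between histograms a and b.

    Assumption: entropic regularisation (Sinkhorn), so the result is
    close to, but not exactly, the true optimal transport cost.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two histograms.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


# Hypothetical toy vocabulary: ship, boat, court, judge (2-D embeddings).
VECTORS = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]

# Hypothetical topics, each a distribution over the four words above.
TOPICS = [
    [0.5, 0.5, 0.0, 0.0],  # a "sea" topic
    [0.0, 0.0, 0.5, 0.5],  # a "law" topic
]

# Level 1: the cost of moving between topics is itself a transport cost,
# computed over words using distances between their embeddings.
word_cost = [[math.dist(p, q) for q in VECTORS] for p in VECTORS]
topic_cost = [[sinkhorn_cost(s, t, word_cost) for t in TOPICS] for s in TOPICS]


def document_distance(doc_a, doc_b):
    """Level 2: transport between documents' topic proportions."""
    return sinkhorn_cost(doc_a, doc_b, topic_cost)


sea_story = [0.9, 0.1]     # mostly the "sea" topic
sea_novel = [0.8, 0.2]     # similar mix of topics
trial_report = [0.1, 0.9]  # mostly the "law" topic

print(document_distance(sea_story, sea_novel))
print(document_distance(sea_story, trial_report))
```

In this sketch the expensive word-level comparison is done once per pair of topics, and each pair of documents is then compared over a handful of topics rather than over every word, which gives an intuition for why such an approach can be fast on large collections.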