February 20, 2018
The latest project out of Google Brain, the company’s machine learning research lab, uses AI software to write Wikipedia-style articles by summarizing information on the Internet. But it’s not easy to condense social media, blogs, articles, memes and other digital information into salient articles, and the project’s results have been mixed. In a paper just accepted at the International Conference on Learning Representations (ICLR), the team describes how difficult the task has been.
The Register reports that other companies have taken a stab at something similar, including “Salesforce, [which] trained a recurrent neural network with reinforcement learning to take information and retell it in a nutshell” and had decent results, although “the computer-generated sentences are simple and short.”
In Google Brain’s results, “the sentences are longer and seem more natural,” although “the software-scribbled passage is a bit difficult to read without clear capital letters at the start of new sentences, and most sentences have the same rigid structure.” The results are also “longer than the corresponding entry in Wikipedia.”
The process involves scraping the pages linked in a Wikipedia article’s reference section, most of which are used for training. The paragraphs from each page are then ranked and joined into one long document. Next, the text is “encoded and shortened, by splitting it into 32,000 individual words and used as input,” which is “then fed into an abstractive model, where the long sentences in the input are cut shorter.”
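The ranking and shortening steps described above can be sketched roughly as follows. This is a minimal illustration, not Google Brain’s code: the article does not say how paragraphs are scored, so the tf-idf style scoring, the use of the article title as the query, the simple regex tokenizer, and the function names `rank_paragraphs` and `build_model_input` are all assumptions made for the example. Only the 32,000 limit comes from the text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_paragraphs(query, paragraphs):
    """Order paragraphs by a tf-idf style overlap with the query.

    The scoring function is an assumption; the article only says
    that paragraphs are ranked.
    """
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)
    query_terms = set(tokenize(query))

    def score(doc):
        total = 0.0
        for term in query_terms:
            # Document frequency: number of paragraphs containing the term.
            df = sum(1 for d in docs if term in d)
            if df:
                total += doc[term] * math.log(1 + n / df)
        return total

    # sorted() is stable, so tied paragraphs keep their original order.
    order = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [paragraphs[i] for i in order]


def build_model_input(query, paragraphs, max_tokens=32000):
    """Concatenate ranked paragraphs and keep only the first max_tokens tokens."""
    tokens = []
    for paragraph in rank_paragraphs(query, paragraphs):
        tokens.extend(tokenize(paragraph))
    return tokens[:max_tokens]


if __name__ == "__main__":
    scraped = [
        "Cats are small mammals.",
        "The weather is nice today.",
        "Cats, cats and more cats purr.",
    ]
    print(build_model_input("cats", scraped, max_tokens=4))
```

Because the input is cut off at a fixed length, anything ranked low never reaches the abstractive model, so the quality of the ranking step limits what the final summary can contain.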
“The generated sentences are taken from the earlier extraction phase and aren’t built from scratch, which explains why the structure is pretty repetitive and stiff.”
“The extraction phase is a bottleneck that determines which parts of the input will be fed to the abstraction stage,” said Mohammad Saleh, a Google Brain team member and co-author of the paper. “Ideally, we would like to pass all the input from reference documents. Designing models and hardware that can support longer input sequences is currently an active area of research that can alleviate these limitations.”
But, The Register noted, “we are still a very long way off from effective text summarization or generation.”