A team from Google Brain and Imperial College London has built an AI system called Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence, or Pegasus. The system combines Google's Transformer architecture with a pre-training objective tailored for generating abstractive summaries.
The team states that the system achieves state-of-the-art results on 12 summarization tasks spanning news, science, stories, instruction manuals, emails, patents, and legislative bills, and that it shows “surprising” performance on low-resource summarization, surpassing previous top results in those settings.
The team devised a pre-training task that masks entire sentences in documents; the model must then generate the missing sentences. Pre-training used datasets of web and news articles, including articles from a new collection, called HugeNews, that the researchers compiled themselves.
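To make this gap-sentence objective concrete, here is a minimal Python sketch of how such training pairs could be constructed. It is an illustration only: the mask token name and the random sentence selection are assumptions (the researchers select sentences by importance rather than at random).

```python
import random

MASK_TOKEN = "<mask_1>"  # sentence-level mask token; the exact token name is assumed


def make_gap_sentence_example(sentences, gap_ratio=0.3):
    """Build a (masked_input, target) pre-training pair from a list of sentences.

    A fraction of the sentences is removed from the document and replaced by a
    mask token; the model's target is to generate the removed sentences.
    (The paper picks the most important sentences; random choice is a simplification.)
    """
    n_gaps = max(1, int(len(sentences) * gap_ratio))
    gap_ids = sorted(random.sample(range(len(sentences)), n_gaps))
    masked_input = " ".join(
        MASK_TOKEN if i in gap_ids else s for i, s in enumerate(sentences)
    )
    target = " ".join(sentences[i] for i in gap_ids)
    return masked_input, target


doc = [
    "Pegasus was built by Google Brain and Imperial College London.",
    "It is pre-trained by masking whole sentences in documents.",
    "The model must then regenerate the missing sentences.",
]
print(make_gap_sentence_example(doc))
```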
Best model selected
In a series of experiments, the team selected the best-performing Pegasus model, which has 568 million parameters. It was pre-trained either on 750GB of text extracted from 350 million web pages or on HugeNews, which comprises 1.5 billion articles totaling 3.8TB of news.
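For readers who want to try a Pegasus model in practice, the sketch below shows how a pre-trained checkpoint can be used for summarization via the Hugging Face transformers library; the library and the google/pegasus-xsum checkpoint are not mentioned in the article and are assumed here for illustration.

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"  # assumed publicly available checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

article = (
    "Google Brain and Imperial College London have presented Pegasus, "
    "a Transformer model pre-trained by generating masked gap sentences."
)

# Tokenize the source document and generate an abstractive summary.
batch = tokenizer(article, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```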
According to the researchers, Pegasus ultimately reached a high linguistic quality, for example in terms of coherence and fluency, and no countermeasures were needed to correct ‘disfluencies’. Moreover, in a low-resource setting with only 100 example articles, the AI system generated summaries of a quality comparable to a model trained on a full dataset (ranging from 20,000 to 200,000 articles).
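As a rough illustration of such low-resource fine-tuning, the following sketch adapts a pre-trained Pegasus checkpoint on a handful of (document, summary) pairs using the Hugging Face transformers library; the checkpoint name, the tiny dataset, and the training hyperparameters are assumptions, not details from the article or the paper.

```python
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-large"  # assumed pre-trained (not fine-tuned) checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

tiny_dataset = [  # hypothetical fine-tuning pairs; a real run would use ~100 of these
    ("Full article text goes here ...", "Reference summary goes here."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for document, summary in tiny_dataset:
    inputs = tokenizer(document, truncation=True, return_tensors="pt")
    labels = tokenizer(summary, truncation=True, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy on the summary tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```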