THE NEED FOR TEXT GENERATION
Rapid advancements in artificial intelligence have changed how we live our everyday lives. We can text people quickly using autosuggested content, we can communicate freely online using translation services, and we can book plane tickets through automated dialog applications. These conveniences that we experience every day are powered by text generation technology. For example, the autosuggest on your phone analyzes the text you have already written and suggests what you might want to type next by modeling the distribution of likely next words. This fundamental technology is language modeling, and it underlies many of the advancements in neural text generation systems today. While generation technology already permeates our everyday lives, what if we want to work towards increasingly complex applications? Imagine systems that can help users draft entire emails, summarize long message threads, make a complicated book easier to read, or freely chit-chat with someone all day. Unlike the next-word autosuggest on your phone, these applications are at the frontier of text generation research today. In this series of chapters, we discuss the challenges facing the development of more complex generation systems, and provide several novel methods to create such advanced systems.
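To make the idea of "the distribution of likely next words" concrete, here is a minimal sketch of a count-based bigram model. It is only an illustration: real autosuggest systems use far richer models, and the function names and toy corpus below are hypothetical, not taken from this work.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Return P(next | prev) as a dict, or {} if prev was never seen."""
    followers = counts.get(prev.lower())
    if not followers:
        return {}
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}


def suggest(counts, prev, k=3):
    """Return up to k most likely next words (ties broken alphabetically)."""
    dist = next_word_distribution(counts, prev)
    return sorted(dist, key=lambda w: (-dist[w], w))[:k]


# Hypothetical toy corpus standing in for a user's message history.
toy_corpus = [
    "see you at the airport",
    "see you at the station",
    "meet me at the airport",
]
counts = train_bigram_counts(toy_corpus)
print(suggest(counts, "the"))
```

A neural language model replaces the count table with a learned function of the whole preceding context, but the interface is the same: a context goes in and a distribution over next words comes out.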
The Challenge of Multilingual Generation
Our particular focus in the subsequent chapters of this section is the challenge of multilingual text generation, or generation in languages beyond English. For a plethora of complex reasons, scientific progress in text generation has often focused on improving the performance of systems only in English. However, billions of people around the world speak languages beyond English, and the vast majority of text generation applications are applicable to them as well. For example, sentence simplification systems can help children more easily understand complex written text regardless of language. In part, systems have been limited to English due to a lack of data for training models in other languages, though a number of other factors contribute (such as the lack of high-quality evaluation datasets). Further, developing a separate system for each individual language would cause the number of systems that need to be created to explode. Thus, we focus on methodologies that can create datasets for multiple different languages, as well as models that can produce generations in more than one language at a time.
STRUCTURE OF THIS SECTION
In the subsequent chapters, we describe the methodology and applications of text generation systems that take two kinds of input: first, natural language, and second, a structured form of meaning representation called an Abstract Meaning Representation. We provide brief background on text-to-text and meaning-representation-to-text as broad paradigms, and introduce specific challenges in each area. We then present methods for specific applications of generation systems in each of these paradigms, with a particular focus on multilingual generation.
Leveraging Unsupervised Pretraining
Unsupervised pretraining has demonstrated large improvements on generative tasks, by using sequence-to-sequence models as denoising auto-encoders on large quantities of data (Lewis et al., 2020a; Liu et al., 2020b), training with noising functions such as span-based masking or shuffling sentence order. We leverage these pretrained models to further improve our unsupervised approach to text simplification. For English, we finetune the pretrained generative model BART (Lewis et al., 2020a) on our newly created mined training corpora. BART is a pretrained sequence-to-sequence model that can be seen as a generalization of other recent pretrained models such as BERT (Devlin et al., 2019). For non-English, we use its multilingual generalization MBART (Liu et al., 2020b), which was pretrained on 25 languages.
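The two noising functions named above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pretraining code of BART or MBART: tokens are whitespace-separated words, each sentence gets exactly one masked span of a fixed maximum length, and the `<mask>` placeholder and all function names are hypothetical.

```python
import random

MASK = "<mask>"  # assumed placeholder token, for illustration only


def mask_span(tokens, start, length):
    """Replace tokens[start:start + length] with a single mask token."""
    if start < 0 or length < 0 or start + length > len(tokens):
        raise ValueError("span out of range")
    return tokens[:start] + [MASK] + tokens[start + length:]


def random_span_mask(tokens, span_length, rng):
    """Mask one randomly positioned span of at most span_length tokens."""
    span_length = min(span_length, len(tokens))
    start = rng.randrange(len(tokens) - span_length + 1)
    return mask_span(tokens, start, span_length)


def shuffle_sentences(sentences, rng):
    """Return the sentences in a random order, leaving the input untouched."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def make_denoising_pair(sentences, span_length, rng):
    """Build a (noised input, clean target) pair for a denoising objective."""
    target = " ".join(sentences)
    noised = [
        " ".join(random_span_mask(sentence.split(), span_length, rng))
        for sentence in shuffle_sentences(sentences, rng)
    ]
    return " ".join(noised), target


rng = random.Random(0)  # fixed seed so the example is reproducible
document = [
    "The cat sat on the mat.",
    "It was a sunny day.",
    "Nobody disturbed it.",
]
noised, target = make_denoising_pair(document, span_length=2, rng=rng)
print(noised)
print(target)
```

A sequence-to-sequence model trained on such pairs reads the corrupted text and is asked to reproduce the clean text, which is the sense in which it acts as a denoising auto-encoder.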
French and Spanish Simplification
Our unsupervised approach to simplification can be applied to many languages. As for English, we first create a corpus of paraphrases composed of 1.4 million sequence pairs in French and 1.0 million sequence pairs in Spanish (Table 3.1). We evaluate the quality of our mined corpus in Table 3.4. Unlike English, where labeled parallel training data has been created using Simple English Wikipedia, no such datasets exist for French or Spanish. We compare to several baselines, namely the identity, truncation and pivot baselines. Training a Transformer sequence-to-sequence model on our mined data achieves stronger results than all of these baselines in French, and stronger results than all of them except the pivot baseline in Spanish.
We use MBART, which was trained on 25 languages, whereas BART was trained only on English. Similar to what we observed in English, we achieve the best results by combining MBART+ACCESS and training on mined data. This model outperforms our strongest baseline by +8.25 SARI in French. In Spanish, it matches the performance of the pivot baseline. As shown in the English results in Table 3.2, MBART has a small loss in performance of 1.54 SARI compared to its monolingual English counterpart BART, probably because it handles 25 languages instead of one. A monolingual BART trained for French or Spanish would therefore likely perform even better. Note also that the pivot baseline uses a supervised English simplification model (BART+ACCESS on MINED + WIKILARGE), whereas our Spanish model is unsupervised.
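The three baselines are only named here, so the following sketch relies on assumptions about their usual meaning: identity returns the source unchanged, truncation keeps a leading fraction of the words, and pivot translates into English, simplifies there, and translates back. The `keep_ratio` default and the three callables passed to `pivot_baseline` are hypothetical placeholders, not values or systems from this work.

```python
def identity_baseline(sentence):
    """Return the source sentence unchanged."""
    return sentence


def truncation_baseline(sentence, keep_ratio=0.8):
    """Keep only the leading fraction of words (keep_ratio is an assumed value)."""
    words = sentence.split()
    if not words:
        return sentence
    n_keep = max(1, int(len(words) * keep_ratio))
    return " ".join(words[:n_keep])


def pivot_baseline(sentence, to_english, simplify_english, from_english):
    """Translate to English, simplify in English, then translate back.

    The three callables stand in for real translation and simplification
    systems, which are outside the scope of this sketch.
    """
    return from_english(simplify_english(to_english(sentence)))


source = "uno dos tres cuatro cinco seis siete ocho nueve diez"
print(identity_baseline(source))
print(truncation_baseline(source))

# Stub callables, used only to show how the pivot pipeline composes.
print(pivot_baseline(source, str.upper, lambda s: s.split()[0], str.lower))
```

Baselines like these matter because simplification metrics can reward copying or shortening the input, so a trained model should be shown to do better than either.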
ABSTRACT MEANING REPRESENTATIONS
Abstract Meaning Representations (AMRs) are a type of semantic meaning representation, first introduced in Banarescu et al. (2013). At a high level, AMRs represent sentences as single-root, directed acyclic graphs. This formalism lends AMRs several advantages as semantic representations, including that they are easy to read and have a standard methodology for evaluation. Overall, AMRs are used to represent sentences in a form that can abstract away from morphological and syntactic variability.
… DEFINITION AMRs are defined with a unique root, corresponding to the top node of the tree structure. Each node in the graph has a variable associated with it, labeled with a concept. Each edge represents a relationship.
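The definition above maps naturally onto a small data structure. The sketch below is one possible encoding, not a standard AMR library: the class name, its methods, and the example graph (with PropBank-style concept and role labels chosen for illustration) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class AMRGraph:
    root: str        # variable of the unique root node
    concepts: dict   # variable -> concept label
    edges: list      # (source variable, relation, target variable)

    def children(self, var):
        """Return (relation, target) pairs for edges leaving var."""
        return [(rel, tgt) for src, rel, tgt in self.edges if src == var]

    def is_rooted_dag(self):
        """True if the graph is acyclic and every node is reachable from root."""
        state = {}  # variable -> "visiting" or "done"

        def visit(var):
            if state.get(var) == "visiting":
                return False  # back edge: the graph has a cycle
            if state.get(var) == "done":
                return True
            state[var] = "visiting"
            for _, tgt in self.children(var):
                if tgt not in self.concepts or not visit(tgt):
                    return False
            state[var] = "done"
            return True

        if self.root not in self.concepts or not visit(self.root):
            return False
        return set(state) == set(self.concepts)

    def to_penman(self):
        """Serialize to a bracketed, PENMAN-style string."""
        seen = set()

        def render(var):
            if var in seen:
                return var  # re-entrant node: repeat the variable only
            seen.add(var)
            parts = [f"{var} / {self.concepts[var]}"]
            parts += [f"{rel} {render(tgt)}" for rel, tgt in self.children(var)]
            return "(" + " ".join(parts) + ")"

        return render(self.root)


# Illustrative graph for "The man described the mission as a disaster."
graph = AMRGraph(
    root="d",
    concepts={"d": "describe-01", "m": "man", "m2": "mission", "d2": "disaster"},
    edges=[("d", ":ARG0", "m"), ("d", ":ARG1", "m2"), ("d", ":ARG2", "d2")],
)
print(graph.is_rooted_dag())
print(graph.to_penman())
```

Because the graph stores concepts and relations rather than word order or inflection, differently worded sentences with the same meaning can share a single encoding of this kind.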
“There is no such thing as information overload. There is only bad design.” (Edward Tufte)
While this information loss from the original natural language into structured information is a limitation, it is also an advantage. This advantage stems from the fact that AMRs project variable forms of the same sentence onto a consistent, simple representation. For example, the following four very different sentence structures:
• The man described the mission as a disaster.
• The man’s description of the mission: disaster.
• As the man described it, the mission was a disaster.
• The man described the mission as disastrous.
Table of contents:
1.1 Thesis Outline
II TEXT GENERATION WITHOUT RETRIEVAL
2 TEXT GENERATION
2.1 The Need for Text Generation
2.2 Structure of this Section
3 TEXT-TO-TEXT GENERATION
3.1 Text-to-Text Generation
3.2 Multilingual Sentence Simplification
4 MEANING REPRESENTATION-TO-TEXT GENERATION
4.1 Structured Input to Text Generation
4.2 Abstract Meaning Representations
4.3 Multilingual AMR-to-Text Generation
III TEXT GENERATION WITH RETRIEVAL
5 RETRIEVAL FOR KNOWLEDGE-BASED TEXT GENERATION
5.1 The Need for Knowledge
5.3 Structure of this Section
6 KNOWLEDGE FROM A SINGLE DOCUMENT
6.1 Motivation: Document-Level Knowledge
6.2 Fact Checking as a Knowledge-Based Text Generation Task
6.3 Generating Fact Checking Briefs
7 SCALING KNOWLEDGE ACCESS TO MULTIPLE DOCUMENTS IN WIKIPEDIA
7.1 Motivation: Knowledge from Multiple Documents
7.2 Dialogue as a Knowledge-Based Text Generation Task
7.3 Augmenting Transformers with KNN-Based Composite Memory
8 SCALING KNOWLEDGE ACCESS TO THE OPEN WEB
8.1 Motivation: Knowledge from the Open Web
8.2 Wikipedia Article Writing as a Knowledge-Based Text Generation Task
8.3 Generating Biographies for Marginalized Groups on Wikipedia
9 KNOWLEDGE ON THE WEB, IN STRUCTURED FORM
9.1 Motivation: Knowledge in Structured Form
9.2 Long-form Question Answering as a Knowledge-Based Text Generation Task
9.3 Using Local Knowledge Graph Construction to Scale Seq2Seq Models to Multi-Document Inputs