Leveraging Natural Supervision: Naturally-Occurring Data Structures

1 Jun 2024


(1) Mingda Chen.

2.2 Naturally-Occurring Data Structures

There are rich structures in textual data beyond plain text: e.g., conversational structures in internet forums and document structures in an online encyclopedia. These structures naturally emerge in people’s daily lives and possess knowledge that is unlikely to be captured by strings of words and has the potential to transfer to various downstream tasks. Leveraging such structures has a long history in a range of NLP tasks. A thorough review of these works is beyond the scope of the thesis. Below we seek to cover resources relevant to the remainder of the thesis (Section 2.2.1) and a few other resources that have attracted increasing attention in recent years (Section 2.2.2).

Bilingual Text (Bitext). Bitext is comprised of parallel corpora. Recent large-scale datasets are mostly mined from the web (Resnik, 1999; Resnik and Smith, 2003). One popular data resource is official documents on government websites: e.g., the European Parliament (Koehn, 2005) and the United Nations (Rafalovitch and Dale, 2009; Eisele and Chen, 2010; Chen and Eisele, 2012; Ziemski et al., 2016). Other resources involve parallel multilingual subtitles for movies and television series (Lison and Tiedemann, 2016) and transcripts for TED talks (Qi et al., 2018). Smith et al. (2013) and El-Kishky et al. (2020) exploit URLs in the Common Crawl corpus[3] to identify parallel text pairs. There have also been attempts to automatically extract parallel sentences from the content of multilingual Wikipedia articles (Smith et al., 2010; Schwenk et al., 2021). Others have created datasets by mixing various data resources (Tiedemann, 2012; Bojar et al., 2016).
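As a rough illustration of URL-based matching in the spirit of the approaches above (not the actual pipelines of Smith et al. (2013) or El-Kishky et al. (2020)), the following Python sketch pairs pages whose URLs differ only in a language code in the path. The language-code set, the function names, and the example.org URLs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny set of language codes (an assumption for this sketch).
LANG_CODES = {"en", "fr", "de", "es"}


def language_neutral_key(url):
    """Replace the first language-code path segment with '*'.

    Returns (key, language code), or (key, None) if no code is found.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    lang = None
    for i, segment in enumerate(segments):
        if segment.lower() in LANG_CODES:
            lang = segment.lower()
            segments[i] = "*"
            break
    return parts.netloc + "/".join(segments), lang


def candidate_pairs(urls):
    """Group URLs by language-neutral key and pair those in different languages."""
    groups = defaultdict(dict)
    for url in urls:
        key, lang = language_neutral_key(url)
        if lang is not None:
            # Keep the first URL seen for each (key, language) combination.
            groups[key].setdefault(lang, url)
    pairs = []
    for by_lang in groups.values():
        for a, b in combinations(sorted(by_lang), 2):
            pairs.append((by_lang[a], by_lang[b]))
    return pairs


urls = [
    "https://example.org/en/news/budget.html",
    "https://example.org/fr/news/budget.html",
    "https://example.org/en/about.html",
    "https://example.org/contact.html",
]
print(candidate_pairs(urls))
```

Pairs found this way are only candidates: matching URLs do not guarantee parallel content, so a realistic pipeline would likely follow this step with content-based filtering and sentence alignment.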

Bitext itself is a crucial training resource for machine translation systems (Brown et al., 1990; Brants et al., 2007; Wu et al., 2016). Outside of machine translation, bitext has been used for learning traditional word representations (Wang et al., 1996; Och, 1999; Faruqui and Dyer, 2014; Lu et al., 2015), contextualized word representations (Kawakami and Dyer, 2015; McCann et al., 2017), sentence representations (Hill et al., 2016; España-Bonet et al., 2017; Grégoire and Langlais, 2018; Guo et al., 2018; Wieting et al., 2019) and paraphrase generation models (Barzilay and McKeown, 2001; Bannard and Callison-Burch, 2005; Mallinson et al., 2017; Wieting and Gimpel, 2018; Hu et al., 2019).

Wikipedia. Wikipedia is comprised of documents with rich metadata, which can be used as naturally-occurring supervision for a variety of NLP tasks. One example is hyperlinks, which have been used for parsing (Spitkovsky et al., 2010; Søgaard, 2017; Shi et al., 2021a), named entity recognition (Kazama and Torisawa, 2007; Nothman et al., 2008; Richman and Schone, 2008; Ghaddar and Langlais, 2017), entity disambiguation and linking (Bunescu and Paşca, 2006; Cucerzan, 2007; Mihalcea, 2007; Mihalcea and Csomai, 2007; Milne and Witten, 2008; Hoffart et al., 2011; Le and Titov, 2019), coreference resolution (Rahman and Ng, 2011; Singh et al., 2012a; Zheng et al., 2013; Eirew et al., 2021), and generating Wikipedia articles (Liu* et al., 2018).

Wikipedia document categories have been used for text classification (Gantner and Schmidt-Thieme, 2009; Chu et al., 2021b,c), semantic parsing (Choi et al., 2015), and word similarities (Strube and Ponzetto, 2006). Besides particular tasks, there is work that attempts to study the Wikipedia categories from a non-empirical perspective. Zesch and Gurevych (2007) analyze the differences between the graphs from WordNet (Fellbaum, 1998) and the ones from Wikipedia categories. Ponzetto and Strube (2007) and Nastase and Strube (2008) extract knowledge of entities from the Wikipedia category graphs using predefined rules. Nastase et al. (2010) build a dataset based on Wikipedia article or category titles as well as the relations between categories and pages.

Wikipedia edit histories have been used for sentence compression (Yamangil and Nelken, 2008; Yatskar et al., 2010), writing assistants (Zesch, 2012; Cahill et al., 2013; Grundkiewicz and Junczys-Dowmunt, 2014; Boyd, 2018), paraphrasing (Max and Wisniewski, 2010), splitting and rephrasing (Botha et al., 2018), studying atomic edits (Faruqui et al., 2018), and modeling editors’ behaviors (Jaidka et al., 2021).

Additionally, by aligning sentences in Wikipedia to those in simple Wikipedia, Wikipedia has been used for text simplification (Zhu et al., 2010) and learning sentence representations (Wieting and Gimpel, 2017). Through pairing Wikipedia with structured information (e.g., knowledge bases, such as Wikidata (Vrandečić and Krötzsch, 2014) and WordNet, or infoboxes on Wikipedia pages), researchers have created datasets for question answering (Hewlett et al., 2016), constructing knowledge graphs (Suchanek et al., 2007; Hoffart et al., 2013; Safavi and Koutra, 2020; Wang et al., 2021c), table parsing (Herzig et al., 2020; Yin et al., 2020a), and data-to-text generation (Lebret et al., 2016; Bao et al., 2018a; Jin et al., 2020b; Agarwal et al., 2021; Wang et al., 2021a).

Fandom. Fandom has rich information for individual wiki items (e.g., episodes for television series). Similar to Wikipedia, wiki items on Fandom have consistent article structures and comprehensive information contributed by fans. However, unlike Wikipedia, it hosts wikis mainly on entertainment. Due to these characteristics, Fandom has been used for text summarization (Yu et al., 2016), dialogue summarization (Rameshkumar and Bailey, 2020), paraphrase extraction (Regneri and Wang, 2012), constructing a sensorial lexicon (Tekiroğlu et al., 2014), question answering (Maqsud et al., 2014), character description summarization (Shi et al., 2021b), entity linking (Logeswaran et al., 2019), and knowledge graph construction (Chu et al., 2021a).

Researchers build text classification datasets from movie reviews and news articles (Maas et al., 2011; Zhang et al., 2015). Lan et al. (2017) create a paraphrase dataset by linking tweets through shared URLs. Völske et al. (2017) create a summarization dataset by taking advantage of the common practice of appending a “TL;DR” to long posts. Fan et al. (2018b) build a story generation dataset using the subreddit r/WritingPrompts. Khodak et al. (2018) create a dataset for sarcasm detection by leveraging the fact that Reddit users tend to add the marker “/s” to the end of sarcastic statements. Yang et al. (2018) learn sentence embeddings from Reddit using its conversational structures. Joshi et al. (2017) construct question-answer pairs from 14 trivia and quiz-league websites. Fan et al. (2019a) build a question answering dataset with long-form answers using the subreddit r/explainlikeimfive. Chakrabarty et al. (2019) mine the acronyms IMO/IMHO (in my (humble) opinion) for claim detection. Zhang et al. (2020b) train a GPT-like model on conversations from Reddit for dialogue generation. Iyer et al. (2018) and Agashe et al. (2019) use GitHub to construct datasets for code generation. Other researchers have adapted Stack Overflow to build datasets for question answering (Dhingra et al., 2017), question clarification (Rao and Daumé III, 2018), semantic parsing (Ye et al., 2020), and source code summarization (Iyer et al., 2016). In addition to the material presented in this thesis, I have also contributed a dataset for ill-formed question rewriting based on Stack Exchange question edit histories (Chu et al., 2020).
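Two of the marker-based heuristics above are simple enough to sketch in a few lines of Python. The sketch below is only an illustration of the general idea, not the actual pipelines of Völske et al. (2017) or Khodak et al. (2018); the regular expressions, function names, and example posts are assumptions made for this example.

```python
import re

# Assumed pattern for the "TL;DR" marker, allowing "TLDR" and trailing punctuation.
TLDR = re.compile(r"\btl;?dr\b[:,\-\s]*", re.IGNORECASE)

# Assumed pattern for a trailing "/s" marker preceded by whitespace or the start of
# the text, so that URLs ending in "/s" are not treated as sarcasm markers.
SARCASM = re.compile(r"(?:^|\s)/s\s*$")


def split_tldr(post):
    """Return (content, summary) if a TL;DR marker splits the post, else None."""
    matches = list(TLDR.finditer(post))
    if not matches:
        return None
    last = matches[-1]  # treat the last marker as the start of the summary
    content = post[:last.start()].strip()
    summary = post[last.end():].strip()
    if content and summary:
        return content, summary
    return None


def label_sarcasm(comment):
    """Return (text without the marker, True) if the comment ends with '/s'."""
    match = SARCASM.search(comment)
    if match:
        return comment[:match.start()].rstrip(), True
    return comment.strip(), False


print(split_tldr("I spent three weeks debugging a flaky test. TL;DR: it was a timezone bug."))
print(label_sarcasm("Great, another Monday. /s"))
```

In both cases the marker written by the author acts as the label, so no manual annotation is needed; the cost is noise, since posts without a marker are not necessarily unsummarized or sincere.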

This paper is available on arXiv under a CC 4.0 license.

[3] The Common Crawl corpus is a freely-available crawl of the web.