CADS: A Systematic Literature Review on the Challenges of Abstractive Dialogue Summarization

Main Article Content

Frederic Kirstein
Jan Philip Wahle
Bela Gipp
Terry Ruas

Abstract

Abstractive dialogue summarization is the task of distilling conversations into informative and concise summaries. Although focused reviews have been conducted on this topic, there is a lack of comprehensive work that details the core challenges of dialogue summarization, unifies the differing understanding of the task, and aligns proposed techniques, datasets, and evaluation metrics with the challenges. This article summarizes the research on Transformer-based abstractive summarization for English dialogues by systematically reviewing 1262 unique research papers published between 2019 and 2024, relying on the Semantic Scholar and DBLP databases. We cover the main challenges present in dialogue summarization (i.e., language, structure, comprehension, speaker, salience, and factuality) and link them to corresponding techniques such as graph-based approaches, additional training tasks, and planning strategies, which typically rely heavily on BART-based encoder-decoder models. Recent advances in training methods have led to substantial improvements in language-related challenges. However, challenges such as comprehension, factuality, and salience remain difficult and present significant research opportunities. We further investigate how these approaches are typically analyzed, covering the datasets for the subdomains of dialogue (e.g., meeting, customer service, and medical), the established automatic metrics (e.g., ROUGE), and common human evaluation approaches for assigning scores and evaluating annotator agreement. We observe that only a few datasets (i.e., SAMSum, AMI, DialogSum) are widely used. Despite its limitations, the ROUGE metric is the most commonly used, while human evaluation, considered the gold standard, is frequently reported without sufficient detail on the inter-annotator agreement and annotation guidelines.
Additionally, we discuss the possible implications of the recently explored large language models and conclude that our described challenge taxonomy remains relevant despite a potential shift in relevance and difficulty.

Article Details

Section
Articles
Author Biographies

Frederic Kirstein, Georg-August-Universität Göttingen

Frederic Kirstein is an industrial Ph.D. candidate at Mercedes-Benz AG, Department of Research and Development, under the guidance of Prof. Gipp and Dr. Ruas.
His research focuses on artificial intelligence (AI), natural language processing (NLP), and text summarization. Frederic received his Master's degree in Computer Science from the University of Mainz.
During his Ph.D., he has aimed to publish his work in top-tier NLP and AI venues, such as EMNLP and LREC.
He has also been a reviewer for venues including IEEE, ACL, EMNLP, and JCDL.
His research interests include effective summarization of long and complex inputs, personalization of generated text, and reducing reliance on expensive human assessment.
Given his close ties to industry, his research typically has a human-centered focus, exploring and pushing the boundaries of how new approaches can be of value to a wider audience.

Jan Philip Wahle, Georg-August-Universität Göttingen

Jan Philip Wahle is a researcher at the University of Göttingen in Germany under the guidance of Prof. Gipp and Dr. Ruas.
He received his Master's degree in Computer Science from the University of Wuppertal and worked for two years at the automotive company Aptiv PLC before continuing with his Ph.D. studies.
During his Ph.D., Jan has been a visiting researcher at the National Research Council Canada and the University of Toronto.
His research has been published and presented at top-tier conferences such as ACL, EMNLP, EACL, LREC, JCDL, and others.
One of his primary interests is understanding how NLP research can be performed sustainably and responsibly.
His research is most concerned with how we influence the broader exchange of ideas, who the main actors of our field are, and how we can improve academic integrity through automated methods.
Specific to academic integrity, he has been researching paraphrasing and plagiarism detection from human and machine-generated text.

Bela Gipp, Georg-August-Universität Göttingen

Prof. Bela Gipp (Dr.-Ing.) is full professor of Scientific Information Analytics at the University of Göttingen in Germany.
Previously, he held professorships at the University of Konstanz and the University of Wuppertal (BUW).
He completed his postdoctoral research at the University of California, Berkeley, and at the National Institute of Informatics (NII) in Tokyo.
His lab's research lies at the intersection of Information Science and Data Science, with a primary focus on Information Retrieval and Natural Language Processing in domains such as recommender systems, text reuse, and the detection of media bias.

Terry Ruas, Georg-August-Universität Göttingen

Dr. Terry Ruas is a senior researcher at the University of Göttingen, Germany, in the GippLab.
His research is focused on natural language processing (NLP) and artificial intelligence (AI).
Terry obtained his Ph.D. in Computer Science at the University of Michigan (USA), a Master's in Information Engineering at the Federal University of ABC (Brazil), and a Bachelor's in Computer Science and Science & Technology at the Federal University of ABC (Brazil).
He also worked as a visiting researcher at the National Institute of Informatics (Japan) at the Aizawa Laboratory.
In industry, he worked at IBM (Brazil) for six years in different positions (e.g., Software Product Manager, Team Leader, IT Specialist).
His research interests mainly lie in the overlap of NLP, AI, and machine learning.
More specifically, he is passionate about research challenges that require extracting and applying semantic features from textual data to solve significant problems such as paraphrase generation and detection, text generation and summarization, media bias, and scientometrics.
Terry has published numerous papers in top-tier NLP and AI venues (e.g., ACL, EMNLP, COLING, LREC, ESWA) and has been a reviewer for over 30 venues (e.g., ACM, IEEE, Elsevier, ACL).
His collaborations extend across Brazil, Canada, the Czech Republic, Germany, India, Japan, and other countries.
He is also an enthusiastic speaker and has given talks on NLP and AI worldwide.