Challenges in Pretraining
Track Editors:
Golnoosh Farnadi, McGill University
Ferdinando Fioretto, University of Virginia
Lucie Flek, University of Bonn
Raphael Fischer, Lamarr Institute for Machine Learning and Artificial Intelligence
Marco Antonio Stranisci, University of Turin
Overview
The advent of Large Language Models (LLMs) marked a paradigm shift in machine learning, moving its focus from task-specific training of systems to self-supervised pretraining on large data resources. This turn has had a broad impact on Artificial Intelligence (AI) research, raising new challenges in the design and development of machine learning systems. While the growing interest in language modeling is driving rapid innovation, research on pretraining remains fragmented. Training LLMs is a highly intersectoral endeavor that raises research challenges spanning a wide range of topics, addressed by the multiple communities involved in the design of these technologies. The resulting research outputs, however, are scattered across many venues, limiting a comprehensive understanding of the general challenges related to pretraining.
This special track addresses that gap by seeking submissions on the state of the art and open challenges in pretraining. Specifically, the track focuses on the following broad content areas:
● Pretraining Datasets. The quality of pretraining datasets is crucial for developing LLMs. Recent research focuses not only on larger datasets but also on data selection strategies aimed at improving performance efficiently. Data quality is also tied to governance in pretraining pipelines, requiring strong stewardship and archival practices. Methods such as active learning and data minimization help reduce the amount of data needed while maintaining high-quality pretraining.
● Model Architectures. A large body of research explores what and how LLMs learn during pretraining, including studies on uncertainty, multilingual and token-free learning, and Mixture of Experts methods. It also examines architectural innovations: although autoregressive models dominate, alternatives such as neurosymbolic and spiking neural networks aim to create more cognitively inspired learning approaches.
● Computational Resources. The growing computational demands of training LLMs strongly influence pretraining strategies. Approaches such as federated learning support distributed training while protecting data ownership and shaping model representations. Efficiently distributing computational loads across possibly heterogeneous platforms remains a complex optimization problem in itself, while alternative methods focus on training models under constrained computational budgets.
This special track seeks contributions on the following, non-exhaustive list of topics:
● Advanced data selection and filtering strategies for pretraining
● Data stewardship and archival practices of pretraining data
● Data minimization and Active Learning approaches
● Neurosymbolic and cognitively inspired learning architectures
● Quantification and analysis of uncertainty in pretraining
● Tokenizer-free and alternative input representation approaches
● Distributed and federated learning for large-scale pretraining
● Language modeling under low-computational settings
● Benchmarking and resource efficiency analyses in pretraining
The special track welcomes submissions that introduce innovative methodological contributions relevant to its topics. Papers that primarily rely on standard methods or established benchmarks fall outside the scope of this track.
Key Dates:
JAIR special tracks have a submission window, and papers can be submitted at any time within it. Submissions are reviewed as they arrive, and accepted papers go to production and are published asynchronously as soon as they are ready, first as part of the usual JAIR pipeline and later on a JAIR webpage dedicated to the special track.
Target timeline:
● Submission period: April 30, 2026 - October 31, 2026
● First round of review and authors' notification: December 2026
● Resubmissions: February 2027
● Second round of review and authors' notification: April 2027
● Final manuscripts: June 2027
Accepted submissions will be added to this page on publication.