An Overview of the Data-Loader Landscape: Abstract and Intro

4 Jun 2024


(1) Iason Ofeidis, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};

(2) Diego Kiedanski, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};

(3) Leandros Tassiulas, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven;

(4) Levon Ghukasyan, Activeloop, Mountain View, CA, USA.


Dataloaders, in charge of moving data from storage into GPUs while training machine learning models, might hold the key to drastically improving the performance of training jobs. Recent advances have shown promise not only by considerably decreasing training time but also by offering new features such as loading data from remote storage like S3. In this paper, we are the first to distinguish the dataloader as a separate component in the Deep Learning (DL) workflow and to outline its structure and features. Finally, we offer a comprehensive comparison of the different dataloading libraries available, their trade-offs in terms of functionality, usability, and performance, and the insights derived from them.


Training a (deep) machine learning model requires a dataset, a model, and the hardware, which for real problems involves one or more GPUs.

We are always interested in reducing the total computational time required to train a model. This is desirable for several reasons: lower costs, faster iteration, and greater accessibility for smaller teams, among other things.

The relationship between the main components of an ML pipeline and the running time is often explicit: a larger dataset takes longer, a larger model takes longer, and a faster GPU reduces the total running time. One key piece in the puzzle that is often overlooked is the glue between all these parts: the dataloader. The dataloader is in charge of loading the data from its permanent storage (RAM, disk, or the network), applying the necessary transformations, and sending the transformed data to the appropriate device so the model can ingest it.
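The three responsibilities just described (read from storage, transform, hand off to the device) can be sketched in a few lines of pure Python. This is an illustrative toy, not any framework's actual implementation; `to_device` is a stand-in for a real host-to-GPU transfer (e.g. `tensor.to("cuda")` in PyTorch).

```python
# Minimal sketch of a dataloader's three responsibilities:
# (1) read samples from storage, (2) apply transformations,
# (3) batch and hand the result to the training device.

def to_device(batch, device="cpu"):
    # Placeholder: a real implementation would copy to GPU memory.
    return batch

class SimpleDataLoader:
    def __init__(self, dataset, batch_size, transform=None, device="cpu"):
        self.dataset = dataset                      # any iterable storage
        self.batch_size = batch_size
        self.transform = transform or (lambda x: x)
        self.device = device

    def __iter__(self):
        batch = []
        for sample in self.dataset:
            batch.append(self.transform(sample))
            if len(batch) == self.batch_size:
                yield to_device(batch, self.device)
                batch = []
        if batch:                                   # last, possibly partial batch
            yield to_device(batch, self.device)

batches = list(SimpleDataLoader(range(10), batch_size=4, transform=lambda x: x * 2))
# batches == [[0, 2, 4, 6], [8, 10, 12, 14], [16, 18]]
```

Real dataloaders add parallel worker processes, pinned memory, and prefetching on top of this basic loop, which is where the performance differences measured in this paper come from.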

Most developers assume that the default dataloader in their respective machine learning framework (Pytorch, Tensorflow, Jax) is already optimized for their application and do not often rely on third-party data loaders. Interestingly enough, it has recently been shown that data loaders can be one of the more significant bottlenecks of ML pipelines (Mohan et al., 2020). As a result, we have seen many new libraries and research projects dedicated to optimizing and improving dataloader performance.

For example, FFCV (Leclerc et al., 2022), a new open-source library developed by a research team at MIT, managed to train ImageNet in a fraction of the time it would take using the default PyTorch dataloader. Such gains can dramatically reduce the operational costs of companies and research teams that depend on infrastructure as a service (IaaS), such as Amazon Web Services (AWS) and Google Cloud Platform (GCP).

Another promising feature offered by dataloaders is the ability to load data stored remotely; for example, from an S3 bucket. This has many practical advantages: the time to set up the dataset locally is avoided, the required disk capacity on the computing machine is reduced, and the risk of team members using different versions of the same dataset is diminished. The natural drawback of having to stream the data while training is that network transfer speeds are usually slower than disk I/O, so the model should take longer to train. Interestingly, we have observed that some libraries, such as Hub (Team, 2022a) and Deep Lake (Hambardzumyan et al., 2022), achieve better performance over the network than the default Pytorch dataloader reading data locally in some scenarios. This is possible because the dataloader manages to pre-fetch the required data before the GPU needs it. We offer a more extensive discussion in Section 5.

Not all libraries support remote loading, and those that do do not necessarily integrate with the same remote storage services. Since the number of available libraries implementing dataloaders is growing, we set out to build a comprehensive benchmark to illuminate the current state of the art, identify which problems appear to have been solved already, and discover the most promising areas for improvement in future research.

At this point, it should be mentioned that one particular difference between our experiments and other works such as (Kumar & Sivathanu, 2020) and (Mohan et al., 2020) is that we focus on jobs that run on small to medium workstations with limited capacity (GPU, RAM, SSD). These are more likely to reflect the hardware available to most individuals and small teams in the industry, for whom the budget does not permit the usage of large-scale clusters.

1.1 Contributions

We can summarize our contributions as follows:

• Open-source Code: We built an open-source benchmark that compares the most popular data loading libraries in Pytorch[1]. The project will remain available to the community so new libraries and datasets can be added as interest in them increases. We also expect to update the numerical results obtained in these benchmarks following any major updates to any of the libraries benchmarked in this paper.

• Viability of Remote Training: We show that it is possible to train a machine learning model using a data stream over a public internet connection under reasonable circumstances. In particular, we point out the impact of the compute resources serving the data. Our result differs from that of (Mohan et al., 2020) since we do not assume that the dataset is cached locally after the download.

• Hyperparameter Optimization for Speed: Traditional hyperparameter approaches aim to increase the overall accuracy of the model being trained. In this paper, we show how we can also optimize for speed (processed samples over time) as a proxy for total running time. This optimization is hardware-dependent, so it makes sense to perform it before long-running jobs. This process should be at least an order of magnitude faster than equivalent time-to-accuracy metrics.
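The speed-optimization contribution above amounts to a simple search loop: measure throughput (samples per second) of a short warm run for each candidate configuration and keep the fastest, with no need to train to convergence. The sketch below is a hedged illustration of that idea, not the paper's actual tuning code; `make_loader_for` and the `delay` parameter are hypothetical stand-ins for building a real dataloader from a configuration (e.g. number of workers or batch size).

```python
import time

def throughput(make_loader, n_batches=50):
    # Samples per second over a short run of the loader.
    loader = make_loader()
    start = time.perf_counter()
    samples = 0
    for i, batch in enumerate(loader):
        samples += len(batch)
        if i + 1 >= n_batches:
            break
    return samples / (time.perf_counter() - start)

def tune(configs, make_loader_for):
    # Keep the configuration with the highest measured throughput.
    return max(configs, key=lambda c: throughput(lambda: make_loader_for(c)))

def make_loader_for(cfg):
    # Hypothetical loader factory: `delay` simulates per-batch cost
    # that a real knob (workers, batch size) would influence.
    def gen():
        for _ in range(100):
            time.sleep(cfg["delay"])
            yield [0] * 8
    return gen()

best = tune([{"delay": 0.005}, {"delay": 0.0}], make_loader_for)
# best == {"delay": 0.0}
```

Because each candidate is measured for only a few dozen batches, the whole search costs seconds rather than the hours a full time-to-accuracy sweep would take, which is the source of the order-of-magnitude speedup claimed above.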

This paper is available on arxiv under CC 4.0 license.

[1] Github Repository: