Links

GitHub

Tags

#Deduplication #LLM #BigData #MachineLearning

LSHBloom: Memory-efficient, Extreme-scale Document Deduplication

Deduplicating training datasets is a critical step in training Large Language Models (LLMs). Duplicate documents in training data can degrade model performance and lead to memorization. However, deduplication at the scale of modern datasets (billions of documents) is computationally expensive and memory-intensive.

We introduce LSHBloom, a novel approach that extends MinHashLSH by replacing the traditional LSH index with a memory-efficient Bloom filter index. This approach significantly reduces memory usage and enhances throughput while maintaining high deduplication accuracy.
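The core idea can be sketched in a few lines of Python. The sketch below assumes the standard MinHashLSH banding scheme (a signature of num_perm hash values split into b bands of r rows each) and, in place of per-band hash tables mapping band hashes to buckets of document IDs, keeps one Bloom filter per band. Names such as BloomFilter and LSHBloomIndex are illustrative only, not the project's actual API, and the datasketch library stands in for whichever MinHash implementation is used.

```python
import hashlib
import math

from datasketch import MinHash  # pip install datasketch


class BloomFilter:
    """Minimal Bloom filter using double hashing (illustrative only)."""

    def __init__(self, capacity: int, error_rate: float = 0.001):
        # Standard sizing formulas for an (n, p) Bloom filter.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: bytes):
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> bool:
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


class LSHBloomIndex:
    """One Bloom filter per LSH band, instead of per-band hash tables."""

    def __init__(self, num_perm=128, bands=16, capacity=1_000_000):
        assert num_perm % bands == 0
        self.bands, self.rows = bands, num_perm // bands
        self.filters = [BloomFilter(capacity) for _ in range(bands)]

    def is_duplicate(self, minhash: MinHash) -> bool:
        """Insert the signature band by band; flag a duplicate candidate
        if any band slice was already present in that band's filter."""
        duplicate = False
        for i in range(self.bands):
            band = minhash.hashvalues[i * self.rows:(i + 1) * self.rows]
            if self.filters[i].add(band.tobytes()):
                duplicate = True
        return duplicate


# Usage: stream documents, keeping only those whose signature is unseen.
index = LSHBloomIndex(num_perm=128, bands=16)
docs = ["the quick brown fox", "the quick brown fox", "an unrelated text"]
for doc in docs:
    mh = MinHash(num_perm=128)
    for token in doc.split():
        mh.update(token.encode("utf8"))
    print(doc, "->", "duplicate" if index.is_duplicate(mh) else "new")
```

Because a Bloom filter stores only membership bits rather than document IDs, an index like this cannot report which earlier document a new arrival matched; it can only flag probable duplicates (with a small, tunable false-positive rate). That is precisely the trade that buys the memory savings.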

Our experiments show that LSHBloom achieves a 12x speedup and reduces disk space usage by 18x compared to MinHashLSH, making it feasible to deduplicate extreme-scale datasets on commodity hardware.
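For intuition on where the space savings come from, a rough back-of-envelope estimate helps: a Bloom filter needs only about -ln(p)/ln(2)^2 bits per inserted element, versus the band hash plus a document ID that a classic LSH index must store per entry. The parameters below are hypothetical, chosen only to illustrate the arithmetic; the measured 18x figure reported above reflects the real system and storage formats, which this toy model ignores.

```python
import math

# Hypothetical parameters, for intuition only (not the paper's exact setup):
n_docs = 1_000_000_000   # documents
bands, rows = 16, 8      # 128-perm signature split into 16 bands of 8 rows
p = 0.001                # Bloom filter false-positive rate

# Bloom filter: about -ln(p)/ln(2)^2 bits per inserted band slice.
bloom_bits_per_entry = -math.log(p) / math.log(2) ** 2
bloom_bytes = n_docs * bands * bloom_bits_per_entry / 8

# Classic LSH index: each band entry stores the 8-byte hash values of the
# band slice plus (at least) an 8-byte document ID.
lsh_bytes = n_docs * bands * (rows * 8 + 8)

print(f"Bloom index : {bloom_bytes / 1e9:,.1f} GB")  # ~28.8 GB
print(f"LSH index   : {lsh_bytes / 1e9:,.1f} GB")    # ~1,152 GB
print(f"ratio       : {lsh_bytes / bloom_bytes:.0f}x")
```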

Publications

Funding and Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory (ANL), and is based on research supported by the DOE Office of Science Advanced Scientific Computing Research program and by Laboratory Directed Research and Development (LDRD) funding from ANL, provided by the Director, DOE Office of Science, both under Contract No. DE-AC02-06CH11357.

People

Arham Khan
Aswathy Ajith
Ian Foster
Kyle Chard