XTask
High synchronization overhead in frameworks like GNU OpenMP impedes fine-grained task parallelism on many-core architectures. We introduce three advances to GNU OpenMP: a lock-less concurrent queue (XQueue), a scalable distributed tree barrier, and two NUMA-aware, lock-less load-balancing strategies.
On the Barcelona OpenMP Task Suite (BOTS) benchmarks, XQueue and the tree barrier together improve performance by up to 1522.8× over the original GNU OpenMP. The load-balancing strategies provide a further improvement of up to 4×.
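To give a flavor of what "lock-less" means for a task queue, here is a minimal sketch of a single-producer, single-consumer ring buffer built on C11 atomics. It is an illustration only, not XQueue's actual design or code: the names `spsc_ring`, `spsc_push`, `spsc_pop`, and `RING_CAPACITY` are hypothetical, and the sketch assumes exactly one thread pushes and exactly one thread pops.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity; the index masking below requires a power of two. */
#define RING_CAPACITY 8
static_assert((RING_CAPACITY & (RING_CAPACITY - 1)) == 0,
              "RING_CAPACITY must be a power of two");

/* Illustrative single-producer/single-consumer ring (not XQueue itself). */
typedef struct {
    void *slots[RING_CAPACITY];
    _Atomic size_t head; /* next slot to pop; written only by the consumer */
    _Atomic size_t tail; /* next slot to push; written only by the producer */
} spsc_ring;

static void spsc_init(spsc_ring *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns false if the ring is full. No lock is taken. */
static bool spsc_push(spsc_ring *q, void *task) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == RING_CAPACITY) {
        return false;
    }
    q->slots[tail & (RING_CAPACITY - 1)] = task;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. No lock is taken. */
static bool spsc_pop(spsc_ring *q, void **task) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *task = q->slots[head & (RING_CAPACITY - 1)];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has a single writer, neither side needs a mutex or a compare-and-swap loop; the acquire/release pairs are enough to order the slot accesses. A runtime that wants many producers and consumers would need additional structure on top of a building block like this.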
Publications
- W. Wang, M. Gonthier, H. Lai, P. Nookala, H. Pan, I. Foster, I. Raicu, K. Chard: Exploring Fine-Grained Parallelism in Data-flow Runtime Systems on Many-Core Systems. Proceedings of the SC '25 Research Posters of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC Posters '25)
- W. Wang, M. Gonthier, P. Nookala, H. Pan, I. Foster, I. Raicu, K. Chard: Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems. 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS '25)