Introduction

What is PCCL?

The Prime Collective Communications Library (PCCL) is a lightweight, fault-tolerant framework for collective operations over standard TCP/IP. It is designed from the ground up to span everything from tightly coupled clusters to Internet-scale wide-area networks. Unlike traditional HPC-focused solutions (e.g., MPI, or vendor-specific libraries such as NCCL), PCCL lets peers join or leave dynamically at any time. A central master node tracks membership and orchestrates group operations, so you get the benefits of collective algorithms (All-Reduce, broadcast, model-state sync) while retaining the flexibility to run on heterogeneous, unreliable links without a synchronized launch.

Under the hood, PCCL’s core design principle is fault tolerance at wire speed. Any I/O failure (a socket drop, an abrupt peer exit, or a flaky WAN link) unwinds as quickly as a successful operation completes: the failure-handling paths are engineered to match the latency of the success case. PCCL exposes a thin C99 API (with C++ internals), plus first-class Python bindings and native PyTorch/FSDP support. A joining peer receives only an initial state snapshot. After that, no communication is needed beyond the regular pseudo-gradient reduction: because PCCL guarantees that collective operations produce bit-identical results on all peers, each peer can advance independently.
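To illustrate why bit-identical results are not automatic, and one way they can arise, here is a minimal single-process sketch of a ring all-reduce in plain Python. It does not use PCCL's API, and it is not a claim about how PCCL implements its reductions; the function name `ring_all_reduce` and the chunking scheme are assumptions made for this example. The point it shows: floating-point addition is not associative, so peers that summed the same values in different orders could disagree in the last bits. In a ring all-reduce, each chunk is summed once in a fixed order and then copied to every peer, so all peers end up with exactly the same bytes.

```python
import random


def ring_all_reduce(peer_buffers):
    """Simulate a ring all-reduce (sum) over equal-length lists of floats.

    Single-process toy for illustration only; not PCCL's API.
    Returns one result list per peer.
    """
    n = len(peer_buffers)
    length = len(peer_buffers[0])
    bufs = [list(b) for b in peer_buffers]  # copies; inputs stay untouched

    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]
    chunks = [range(bounds[i], bounds[i + 1]) for i in range(n)]

    # Phase 1, reduce-scatter: at step s, peer p sends chunk (p - s) % n to
    # peer p + 1, which adds it to its own copy. After n - 1 steps, peer p
    # holds the complete sum of chunk (p + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so every peer "sends" at once.
        sends = []
        for p in range(n):
            c = (p - step) % n
            sends.append((c, [bufs[p][i] for i in chunks[c]]))
        for p in range(n):
            c, data = sends[(p - 1) % n]
            for i, v in zip(chunks[c], data):
                bufs[p][i] = bufs[p][i] + v

    # Phase 2, all-gather: completed chunks travel around the ring and
    # overwrite the stale partial values, so every peer gets the same bits.
    for step in range(n - 1):
        sends = []
        for p in range(n):
            c = (p + 1 - step) % n
            sends.append((c, [bufs[p][i] for i in chunks[c]]))
        for p in range(n):
            c, data = sends[(p - 1) % n]
            for i, v in zip(chunks[c], data):
                bufs[p][i] = v

    return bufs


if __name__ == "__main__":
    # Summation order matters for floats: these two expressions differ.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

    random.seed(0)
    peers = [[random.uniform(-1.0, 1.0) for _ in range(10)] for _ in range(4)]
    results = ring_all_reduce(peers)
    # Every peer holds exactly the same values, compared with ==, not a tolerance.
    print(all(r == results[0] for r in results))
```

Because every peer ends with identical values, each can apply the reduced result to its local state without any further exchange to confirm agreement.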

On performance, PCCL can saturate long-fat-pipe WAN links. It achieves sustained throughput of 25 Gbit/s in transatlantic runs and up to 45 Gbit/s in tests between more closely co-located European peers, limited only by the underlying NIC speeds. It does this by running lightweight concurrent all-reduces that dispatch to a connection pool.
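A rough back-of-the-envelope calculation suggests why spreading traffic over many connections can help on such links. A single TCP connection can keep at most one window of unacknowledged data in flight per round trip, so filling a link requires roughly bandwidth times round-trip time bytes in flight (the bandwidth-delay product). The sketch below is an illustration only: the round-trip times (100 ms and 20 ms) and the 16 MiB per-connection window are assumed values, not measurements from PCCL's runs, and the helper names are hypothetical.

```python
def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    # bits/s * ms / 1000 gives bits per round trip; divide by 8 for bytes.
    return bandwidth_bits_per_s * rtt_ms // (8 * 1000)


def connections_needed(bandwidth_bits_per_s: int, rtt_ms: int, window_bytes: int) -> int:
    """Minimum connection count if each is capped at window_bytes in flight."""
    bdp = bdp_bytes(bandwidth_bits_per_s, rtt_ms)
    return max(1, -(-bdp // window_bytes))  # ceiling division, at least one


if __name__ == "__main__":
    WINDOW = 16 * 1024 * 1024  # assumed 16 MiB effective window per connection

    # Assumed transatlantic case: 25 Gbit/s at 100 ms round-trip time.
    print(bdp_bytes(25_000_000_000, 100), connections_needed(25_000_000_000, 100, WINDOW))

    # Assumed intra-European case: 45 Gbit/s at 20 ms round-trip time.
    print(bdp_bytes(45_000_000_000, 20), connections_needed(45_000_000_000, 20, WINDOW))
```

Under these assumptions, the transatlantic case needs about 312.5 MB in flight, which is 19 connections at 16 MiB each, while the intra-European case needs about 112.5 MB, or 7 connections. Real figures depend on the actual round-trip times, socket buffer sizes, and congestion behavior.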
