
TinySockets

PCCL relies on standard TCP sockets for:

  • Master connections (long-lived to the orchestrator)
  • Peer-to-Peer ring connections (one or two per ring neighbor)
  • Shared-State Distribution “one-off” ephemeral connections (similar to HTTP GET/POST style transfers)
  • Bandwidth Benchmark connections (short-lived, used to measure throughput between pairs)

Queued Socket Mechanism

For the master connection, and potentially any socket that carries messages for multiple “logical” operations, PCCL uses a queued socket approach. It maintains an internal queue of incoming packets, and each part of the library consumes only the packets that match a particular predicate, leaving all others in the queue. This avoids concurrency issues where multiple threads might accidentally consume each other’s data.
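The predicate-based consumption described above can be sketched as follows. This is a minimal illustration, not PCCL's actual implementation: the `Packet` layout, the `PacketQueue` class, and its method names are all hypothetical, and a standard mutex and condition variable stand in for whatever synchronization the library really uses.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical packet type; PCCL's real packet layout is not shown here.
struct Packet {
    uint16_t type;                 // hypothetical discriminator field
    std::vector<uint8_t> payload;
};

using PacketPredicate = std::function<bool(const Packet&)>;

class PacketQueue {
public:
    // Called by the single thread that reads packets off the socket.
    void push(Packet p) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            packets_.push_back(std::move(p));
        }
        cv_.notify_all();  // any waiter might be interested in this packet
    }

    // Removes and returns the oldest packet matching `pred`, or nullopt if
    // none is queued. Packets that do not match stay in the queue.
    std::optional<Packet> try_pop_matching(const PacketPredicate& pred) {
        assert(pred && "predicate must be callable");
        std::lock_guard<std::mutex> lock(mutex_);
        return pop_locked(pred);
    }

    // Blocks until a packet matching `pred` is available, then returns it.
    Packet pop_matching(const PacketPredicate& pred) {
        assert(pred && "predicate must be callable");
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            if (auto p = pop_locked(pred)) return std::move(*p);
            cv_.wait(lock);
        }
    }

private:
    // Caller must hold mutex_.
    std::optional<Packet> pop_locked(const PacketPredicate& pred) {
        for (auto it = packets_.begin(); it != packets_.end(); ++it) {
            if (pred(*it)) {
                Packet out = std::move(*it);
                packets_.erase(it);
                return out;
            }
        }
        return std::nullopt;
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<Packet> packets_;
};
```

A consumer interested only in, say, packets of type 2 would call `queue.pop_matching([](const Packet& p) { return p.type == 2; })`. Packets of other types remain queued for whichever part of the library asks for them.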

Dedicated RX/TX Threads

PCCL uses dedicated RX and TX threads so that sending and receiving proceed concurrently. Other threads add read or write requests to a queue for a particular tag, and the data is sent or received on that thread’s behalf. The RX thread reads from the socket and dispatches the data to the correct queue according to the received tag. The TX thread sends data from the queue to the socket, prepending the tag so the receiver can tell messages apart. The TX thread is woken via threadpark, a custom-built lightweight thread parking library that uses futex-like APIs on all major operating systems for efficient wake-ups.
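The tag-based hand-off between caller threads and the RX/TX threads might look roughly like the sketch below. It is an illustration under several assumptions, not PCCL's code: `TxWorker`, `RxDispatcher`, and the helper functions are hypothetical names, a `send` callback stands in for the socket write, the tag is assumed to be an 8-byte big-endian prefix, and `std::condition_variable` is used in place of threadpark, whose API is not shown here.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Appends `v` to `out` as 8 big-endian bytes.
inline void append_be64(Bytes& out, uint64_t v) {
    for (int shift = 56; shift >= 0; shift -= 8)
        out.push_back(static_cast<uint8_t>(v >> shift));
}

// Reads 8 big-endian bytes starting at `p`.
inline uint64_t read_be64(const uint8_t* p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | p[i];
    return v;
}

// RX side: routes each tag-prefixed message into the queue for its tag.
class RxDispatcher {
public:
    // Would be called by the RX thread for every message read from the socket.
    void dispatch(const Bytes& wire) {
        assert(wire.size() >= 8 && "message must start with an 8-byte tag");
        const uint64_t tag = read_be64(wire.data());
        std::lock_guard<std::mutex> lock(mutex_);
        queues_[tag].emplace_back(wire.begin() + 8, wire.end());
    }

    // Returns the oldest payload received for `tag`, if any.
    std::optional<Bytes> try_receive(uint64_t tag) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = queues_.find(tag);
        if (it == queues_.end() || it->second.empty()) return std::nullopt;
        Bytes out = std::move(it->second.front());
        it->second.pop_front();
        return out;
    }

private:
    std::mutex mutex_;
    std::map<uint64_t, std::deque<Bytes>> queues_;
};

// TX side: callers enqueue (tag, payload) pairs; one worker thread drains
// the queue and passes tag-prefixed messages to `send` (the socket stand-in).
class TxWorker {
public:
    explicit TxWorker(std::function<void(const Bytes&)> send)
        : send_(std::move(send)), thread_([this] { run(); }) {}

    ~TxWorker() {
        { std::lock_guard<std::mutex> lock(mutex_); stop_ = true; }
        cv_.notify_one();
        thread_.join();  // the worker drains pending requests before exiting
    }

    void submit(uint64_t tag, Bytes payload) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.emplace_back(tag, std::move(payload)); }
        cv_.notify_one();  // wake the TX thread (threadpark's role in PCCL)
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
            if (queue_.empty()) return;  // stop requested, nothing left to send
            auto [tag, payload] = std::move(queue_.front());
            queue_.pop_front();
            lock.unlock();               // do not hold the lock while sending
            Bytes wire;
            append_be64(wire, tag);
            wire.insert(wire.end(), payload.begin(), payload.end());
            send_(wire);
            lock.lock();
        }
    }

    std::function<void(const Bytes&)> send_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::pair<uint64_t, Bytes>> queue_;
    bool stop_ = false;
    std::thread thread_;  // declared last so it starts after the other members
};
```

Because a single worker owns all writes, messages submitted under the same tag reach the wire in submission order, and callers never touch the socket directly.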

P2P Frame Wrapping

PCCL wraps all data transmitted over the P2P sockets in a frame of the following format:

struct Frame {
    uint64_t size_bytes; // size of the data that follows; big-endian
    uint64_t tag;        // tag of the data that follows; big-endian
    uint64_t stream_ctr; // counter used to differentiate logical streams; big-endian
};

The size_bytes field is the size, in bytes, of the data that follows. The tag field distinguishes messages. The stream_ctr field differentiates logical streams of data. For example, it distinguishes messages from previously aborted collective operations, where some messages may still make it out even though, from the peer’s perspective, the operation has already been observed as aborted.
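To make the wire layout concrete, here is a small sketch of encoding and decoding the frame header in big-endian byte order. It assumes the three fields are written back to back with no padding, giving a 24-byte header. The helper names and the `is_current_stream` check are hypothetical; in particular, the exact rule PCCL applies to frames from a stale stream is not specified here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mirrors the three header fields described above (host byte order in memory).
struct Frame {
    uint64_t size_bytes;
    uint64_t tag;
    uint64_t stream_ctr;
};

// Assumption: the header is the three fields back to back, with no padding.
constexpr std::size_t kFrameHeaderSize = 3 * sizeof(uint64_t);

// Writes `v` to dst[0..7], most significant byte first.
inline void store_be64(uint8_t* dst, uint64_t v) {
    for (int i = 0; i < 8; ++i) dst[i] = static_cast<uint8_t>(v >> (56 - 8 * i));
}

// Reads a big-endian 64-bit value from src[0..7].
inline uint64_t load_be64(const uint8_t* src) {
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | src[i];
    return v;
}

// Serializes the header independently of the host's endianness.
inline std::array<uint8_t, kFrameHeaderSize> encode_header(const Frame& f) {
    std::array<uint8_t, kFrameHeaderSize> buf{};
    store_be64(buf.data() + 0, f.size_bytes);
    store_be64(buf.data() + 8, f.tag);
    store_be64(buf.data() + 16, f.stream_ctr);
    return buf;
}

// Parses a header from the first kFrameHeaderSize bytes of `buf`.
inline Frame decode_header(const uint8_t* buf, std::size_t len) {
    assert(len >= kFrameHeaderSize && "need a full header");
    (void)len;  // silence unused-parameter warnings when asserts are disabled
    return Frame{load_be64(buf), load_be64(buf + 8), load_be64(buf + 16)};
}

// Hypothetical receiver-side check: treat a frame as belonging to the
// current logical stream only if its counter matches the expected one.
inline bool is_current_stream(const Frame& f, uint64_t current_stream_ctr) {
    return f.stream_ctr == current_stream_ctr;
}
```

Under this sketch, a receiver would read 24 bytes, decode the header, and then read `size_bytes` bytes of payload. A frame whose `stream_ctr` does not match the stream the receiver currently expects could be recognized as left over from an aborted operation.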