A Distributed Inference Framework for Running Models That Exceed Total Memory

driaforall 2 hours ago

Today we are shipping dnet, a distributed inference framework that lets Apple Silicon clusters run models that exceed their physical memory.

We fuse pipelined-ring parallelism, disk streaming, and scheduling that is aware of Apple's unified memory architecture (UMA), so “out of memory” stops being the limit.

https://github.com/firstbatchxyz/dnet?tab=readme-ov-file

In alpha, we ship a pipelined-ring strategy inspired by PRIMA.CPP. dnet’s solver (distilp) extends it so devices can punch above their memory: layers stream from disk mid-round and the loads overlap with compute, so total model size can exceed total cluster RAM.
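To make the overlap idea concrete, here is a minimal Python sketch of streaming layers from disk while compute runs. It is an illustration only, not dnet's or distilp's actual code or API: `load_layer`, `apply_layer`, and `run_streamed` are hypothetical names, the "disk read" and "compute" are trivial stand-ins, and a single background thread plays the role of the I/O path.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Hypothetical stand-in for reading one layer's weights from disk."""
    return {"index": index, "scale": index + 1}


def apply_layer(layer, activation):
    """Hypothetical stand-in for running one layer's compute."""
    return activation * layer["scale"]


def run_streamed(num_layers, activation, load=load_layer, compute=apply_layer):
    """Run layers in order while prefetching the next one in the background.

    At most two layers are resident at a time (the one being computed and
    the one being loaded), so the layer count is not bounded by memory.
    """
    if num_layers == 0:
        return activation
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load, 0)  # start loading the first layer
        for i in range(num_layers):
            layer = pending.result()  # block until layer i is in memory
            if i + 1 < num_layers:
                # Kick off the next load so it overlaps with this compute.
                pending = io_pool.submit(load, i + 1)
            activation = compute(layer, activation)
            del layer  # drop the reference so the weights can be freed
    return activation


if __name__ == "__main__":
    print(run_streamed(4, 1))
```

In this toy version the load of layer i + 1 is submitted before layer i's compute starts, so when compute takes at least as long as a load, the disk read is hidden behind it. A real scheduler would also have to decide which layers stay resident on which device, which is the part the sketch leaves out.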

Please let us know if you have any questions or feedback!