This repository contains a Python-based simulation of the paper: “Congestion-Aware Routing and Dynamic Replication for Scalable Mixture-of-Experts Inference”. The project aims to implement and validate the core concepts of the Helios framework in a simulated environment.
Mixture-of-Experts (MoE) models are becoming increasingly large, necessitating their deployment across distributed GPU clusters. This introduces a significant performance bottleneck due to inter-server communication for expert routing. The Helios paper proposes a two-phase framework to address this challenge:

1. **Initial placement**: a congestion-aware, offline expert placement computed by solving an Integer Linear Program (ILP) over the cluster topology.
2. **Runtime adaptation**: dynamic replication of experts, triggered periodically as the workload shifts.

This adaptive approach allows the system to handle the dynamic nature of production workloads, significantly outperforming static placement methods.
This project provides a discrete-event simulation environment to model and evaluate the Helios framework. You can:

- Compare Helios against static placement strategies (RoundRobin, Greedy) on Fat-Tree and Dragonfly topologies.
- Evaluate both the initial placement and the runtime adaptation phase under stationary and non-stationary workloads.
- Measure P99 latency, goodput, and maximum link congestion for each strategy.

The simulation is built on `networkx` (to represent the cluster topology as a graph), `simpy` (to model the asynchronous nature of a distributed system), and `pulp` (to solve the placement ILP).

```
helios/
├── src/
│   ├── __init__.py
│   ├── helios.py          # Core Helios framework implementation (placement and adaptation)
│   ├── network.py         # Network topology generation (Fat-Tree, Dragonfly)
│   ├── simulation.py      # The main simulation logic and components
│   └── utils.py           # Utility functions and data structures
├── main.py                # Main script to run simulations
├── requirements.txt       # Python dependencies
└── README.md              # This file
```
Clone the repository and install the Python dependencies (including `pulp`, which is used to solve the placement ILP):

```bash
git clone https://github.com/ahmadpanah/helios.git
cd helios
pip install -r requirements.txt
```
The main entry point for running simulations is `main.py`. You can configure the simulation parameters within this file.

To run a simulation comparing Helios with other static placement strategies under a non-stationary workload:

```bash
python main.py
```
You can modify `main.py` to change parameters such as:

- The network topology (`'fat_tree'` or `'dragonfly'`)
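For illustration, a configuration block in `main.py` might look like the sketch below. The parameter names are assumptions for this example, not necessarily the exact names used in the repository:

```python
# Illustrative configuration; the actual parameter names in main.py may differ.
CONFIG = {
    "topology": "fat_tree",        # or "dragonfly"
    "num_gpus": 64,                # GPUs in the simulated cluster
    "num_layers": 16,              # MoE layers in the model
    "experts_per_layer": 8,        # experts per MoE layer
    "workload": "non_stationary",  # or "stationary"
    "adaptation_interval": 5.0,    # simulated seconds between adaptation rounds
}
```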
The initial placement is modeled as an Integer Linear Program (ILP) in `src/helios.py`. The formulation uses a binary decision variable `X_{l,e,u}`, which is 1 if expert `e` of layer `l` is placed on GPU `u`.
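A minimal sketch of how such a formulation can be written with `pulp` is shown below. The placeholder communication cost and the per-GPU capacity constraint are assumptions for illustration; the actual model in `src/helios.py` may use different costs, constraints, and an explicit congestion term.

```python
import pulp

# Illustrative problem sizes; real values come from the simulation configuration.
layers, experts, gpus = range(2), range(4), range(4)
capacity = 2                                   # assumed max experts per GPU
cost = {(l, e, u): 1.0                         # placeholder communication cost
        for l in layers for e in experts for u in gpus}

prob = pulp.LpProblem("expert_placement", pulp.LpMinimize)

# X[l][e][u] = 1 if expert e of layer l is placed on GPU u.
X = pulp.LpVariable.dicts("X", (layers, experts, gpus), cat=pulp.LpBinary)

# Objective: minimize the total (placeholder) communication cost of the placement.
prob += pulp.lpSum(cost[l, e, u] * X[l][e][u]
                   for l in layers for e in experts for u in gpus)

# Each expert of each layer must be placed on exactly one GPU.
for l in layers:
    for e in experts:
        prob += pulp.lpSum(X[l][e][u] for u in gpus) == 1

# Assumed capacity constraint: at most `capacity` experts per GPU.
for u in gpus:
    prob += pulp.lpSum(X[l][e][u] for l in layers for e in experts) <= capacity

prob.solve(pulp.PULP_CBC_CMD(msg=False))
placement = {(l, e): u for l in layers for e in experts for u in gpus
             if pulp.value(X[l][e][u]) > 0.5}
```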
The runtime adaptation logic is also implemented in `src/helios.py` and is triggered periodically within the simulation loop in `src/simulation.py`.
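The periodic trigger can be modeled as a dedicated `simpy` process. The sketch below is an assumption about how that wiring might look; the interval and the callback are illustrative stand-ins for the actual adaptation call in `src/simulation.py`.

```python
import simpy

ADAPTATION_INTERVAL = 5.0  # simulated seconds between adaptation rounds (illustrative)

def adaptation_process(env, adapt_fn):
    """Periodically invoke the runtime adaptation logic inside the simulation loop."""
    while True:
        yield env.timeout(ADAPTATION_INTERVAL)
        adapt_fn(env.now)  # e.g. re-evaluate link congestion, then replicate experts

env = simpy.Environment()
env.process(adaptation_process(env, lambda now: print(f"adaptation round at t={now}")))
env.run(until=16)  # prints adaptation rounds at t=5.0, 10.0 and 15.0
```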
The `src/network.py` module contains functions to generate graph representations of:

- Fat-Tree topologies
- Dragonfly topologies
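For illustration, a simplified k-ary Fat-Tree generator built with `networkx` might look like the sketch below; the actual generators in `src/network.py` may differ in structure and attributes.

```python
import networkx as nx

def fat_tree(k: int = 4) -> nx.Graph:
    """Simplified k-ary Fat-Tree: (k/2)^2 core switches, k pods with k/2
    aggregation and k/2 edge switches each, and k/2 hosts per edge switch."""
    g = nx.Graph()
    half = k // 2
    cores = [f"core_{i}" for i in range(half * half)]
    g.add_nodes_from(cores, kind="core")

    for pod in range(k):
        aggs = [f"agg_{pod}_{i}" for i in range(half)]
        edges = [f"edge_{pod}_{i}" for i in range(half)]
        g.add_nodes_from(aggs, kind="agg")
        g.add_nodes_from(edges, kind="edge")
        # Full bipartite wiring between aggregation and edge switches in a pod.
        g.add_edges_from((a, e) for a in aggs for e in edges)
        # Aggregation switch i connects to core switches i*half .. i*half + half - 1.
        for i, a in enumerate(aggs):
            for j in range(half):
                g.add_edge(a, cores[i * half + j])
        # Each edge switch serves k/2 hosts (GPU servers in the simulation).
        for i, e in enumerate(edges):
            for h in range(half):
                host = f"host_{pod}_{i}_{h}"
                g.add_node(host, kind="host")
                g.add_edge(e, host)
    return g

g = fat_tree(k=4)  # 4 core, 8 aggregation and 8 edge switches, 16 hosts
```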
When you run the simulation, you will see output similar to this, showing the performance metrics for the different strategies:
```
--- Initial Placement Performance (Stationary Workload) on Fat-Tree ---
Strategy: RoundRobin,     P99 Latency: 240.5 ms, Goodput: 1247.4 tokens/sec, Max Link Congestion: 75.8%
Strategy: Greedy,         P99 Latency: 205.1 ms, Goodput: 1462.7 tokens/sec, Max Link Congestion: 63.2%
Strategy: Helios-Static,  P99 Latency: 145.3 ms, Goodput: 2064.7 tokens/sec, Max Link Congestion: 33.1%

--- Dynamic Adaptation Performance (Non-Stationary Workload) on Fat-Tree ---
Strategy: Helios-Static,  P99 Latency: 185.6 ms, Goodput: 1616.4 tokens/sec, Max Link Congestion: 68.9%
Strategy: Helios-Dynamic, P99 Latency: 99.8 ms, Goodput: 3006.0 tokens/sec, Max Link Congestion: 35.4%
```
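The reported metrics can be derived from the per-request traces that the simulator collects. As an assumed illustration of what they mean (not necessarily how this repository computes them):

```python
import numpy as np

# Toy data standing in for the traces collected during a run.
latencies_ms = np.array([120.0, 95.0, 310.0, 140.0, 88.0])  # per-request latencies
tokens_served, sim_seconds = 5000, 2.0

p99_latency = np.percentile(latencies_ms, 99)  # tail latency in ms
goodput = tokens_served / sim_seconds          # tokens/sec successfully served
# Max link congestion is the highest utilization observed on any link in the topology.
```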
Contributions to this project are welcome! Please feel free to open an issue or submit a pull request.