Helios: A Python Simulation of Dynamic MoE Inference

This repository contains a Python-based simulation of the framework described in the paper “Congestion-Aware Routing and Dynamic Replication for Scalable Mixture-of-Experts Inference”. The project aims to implement and validate the core concepts of the Helios framework in a simulated environment.

Table of Contents

  - About the Paper
  - Project Overview
  - Features
  - Project Structure
  - Getting Started
  - How to Run the Simulation
  - Core Concepts Implemented
  - Network Topologies
  - Example Output
  - Contributing

About the Paper

Mixture-of-Experts (MoE) models are becoming increasingly large, necessitating their deployment across distributed GPU clusters. This introduces a significant performance bottleneck due to inter-server communication for expert routing. The Helios paper proposes a two-phase framework to address this challenge:

  1. Congestion-Aware Initial Placement: An initial, robust placement of experts onto GPUs is computed by solving an Integer Linear Program (ILP). This model optimizes for true network latency (considering bandwidth) rather than just hop count.
  2. Runtime Adaptation Engine: A continuous runtime engine monitors expert utilization and network congestion. It mitigates performance hotspots in real-time by dynamically replicating “hot” experts onto underutilized GPUs.

This adaptive approach allows the system to handle the dynamic nature of production workloads, significantly outperforming static placement methods.

Project Overview

This project provides a discrete-event simulation environment to model and evaluate the Helios framework. You can:

  - Generate Fat-Tree and Dragonfly network topologies to serve as the simulated GPU cluster.
  - Compute initial expert placements with different strategies (RoundRobin, Greedy, Helios-Static).
  - Enable the runtime adaptation engine (Helios-Dynamic) to replicate hot experts during a run.
  - Compare strategies under stationary and non-stationary workloads using P99 latency, goodput, and maximum link congestion.

Features

  - Congestion-aware initial placement formulated as an Integer Linear Program (ILP)
  - Runtime adaptation engine that monitors expert utilization and replicates hot experts onto underutilized GPUs
  - Fat-Tree and Dragonfly topology generation
  - Baseline placement strategies (RoundRobin, Greedy) for comparison
  - Stationary and non-stationary workload generation with P99 latency, goodput, and link-congestion metrics

Project Structure

helios/
├── src/
│   ├── __init__.py
│   ├── helios.py           # Core Helios framework implementation (placement and adaptation)
│   ├── network.py          # Network topology generation (Fat-Tree, Dragonfly)
│   ├── simulation.py       # The main simulation logic and components
│   └── utils.py            # Utility functions and data structures
├── main.py                 # Main script to run simulations
├── requirements.txt        # Python dependencies
└── README.md               # This file

Getting Started

Prerequisites

  - Python 3 and pip
  - The Python packages listed in requirements.txt (installed in the next step)

Installation

  1. Clone the repository:
    git clone https://github.com/ahmadpanah/helios.git
    cd helios
    
  2. Install the required Python packages:
    pip install -r requirements.txt
    

How to Run the Simulation

The main entry point for running simulations is main.py. You can configure the simulation parameters within this file.

To run a simulation comparing Helios with other static placement strategies under a non-stationary workload:

python main.py

You can modify main.py to change parameters such as:

  - The network topology (Fat-Tree or Dragonfly)
  - The workload pattern (stationary or non-stationary)
  - The placement strategies to compare (RoundRobin, Greedy, Helios-Static, Helios-Dynamic)
  - The cluster and model sizes (number of GPUs and experts)
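
For example, a configuration block in main.py might look like the following sketch. The parameter names and values here are hypothetical, chosen only to illustrate the kind of settings involved; check main.py for the actual names.

    # Hypothetical configuration -- the real parameter names in main.py may differ.
    config = {
        "topology": "fat_tree",        # or "dragonfly"
        "workload": "non_stationary",  # or "stationary"
        "strategies": ["RoundRobin", "Greedy", "Helios-Static", "Helios-Dynamic"],
        "num_gpus": 64,                # assumed cluster size
        "num_experts": 128,            # assumed number of experts
        "adaptation_interval": 0.1,    # seconds between runtime adaptation checks
    }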

Core Concepts Implemented

Phase 1: Congestion-Aware Initial Placement

The initial placement is modeled as an Integer Linear Program (ILP) in src/helios.py. The ILP assigns experts to GPUs so as to minimize communication cost over the network topology, using bandwidth-aware link latencies rather than hop counts.
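
As a rough illustration of this kind of formulation (not the repository's actual code), the assignment can be sketched with PuLP: binary expert-to-GPU variables, an assumed per-GPU capacity constraint, and an objective that weights expert traffic by bandwidth-aware link latency. The toy sizes and the traffic/latency inputs are placeholders.

    # Minimal ILP sketch (illustrative only; not the project's exact formulation).
    # x[e][g] = 1 if expert e is placed on GPU g.
    import pulp

    num_experts, num_gpus, capacity = 8, 4, 3                 # toy sizes (placeholders)
    traffic = [[1.0] * num_gpus for _ in range(num_experts)]  # tokens routed from GPU src to expert e
    latency = [[0.5] * num_gpus for _ in range(num_gpus)]     # bandwidth-aware latency between GPUs

    prob = pulp.LpProblem("expert_placement", pulp.LpMinimize)
    x = [[pulp.LpVariable(f"x_{e}_{g}", cat="Binary") for g in range(num_gpus)]
         for e in range(num_experts)]

    # Each expert is placed on exactly one GPU.
    for e in range(num_experts):
        prob += pulp.lpSum(x[e]) == 1

    # Each GPU hosts at most `capacity` experts (assumed memory constraint).
    for g in range(num_gpus):
        prob += pulp.lpSum(x[e][g] for e in range(num_experts)) <= capacity

    # Objective: total communication cost, weighting traffic by the latency
    # between a token's source GPU and the expert's host GPU.
    prob += pulp.lpSum(traffic[e][src] * latency[src][g] * x[e][g]
                       for e in range(num_experts)
                       for src in range(num_gpus)
                       for g in range(num_gpus))

    prob.solve()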

Phase 2: Runtime Adaptation Engine

The runtime adaptation logic is also implemented in src/helios.py and is triggered periodically from the simulation loop in src/simulation.py. It monitors expert utilization and network congestion, and replicates hot experts onto underutilized GPUs when hotspots appear.
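
In spirit, one adaptation step might look like the sketch below. The function name, thresholds, and replication policy are assumptions for illustration and do not mirror the repository's actual interface.

    # Illustrative periodic adaptation step (assumed names and thresholds).
    def adaptation_step(expert_load, gpu_utilization, placement,
                        hot_threshold=0.8, cold_threshold=0.3):
        """Replicate over-utilized ('hot') experts onto under-utilized GPUs."""
        hot_experts = [e for e, load in expert_load.items() if load > hot_threshold]
        cold_gpus = sorted((g for g, u in gpu_utilization.items() if u < cold_threshold),
                           key=lambda g: gpu_utilization[g])
        for expert, gpu in zip(hot_experts, cold_gpus):
            # Add a replica; the router can then split this expert's traffic
            # between the original copy and the new one.
            placement.setdefault(expert, []).append(gpu)
        return placement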

Network Topologies

The src/network.py module contains functions to generate graph representations of:

  - Fat-Tree topologies
  - Dragonfly topologies
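
For intuition, a k-ary Fat-Tree can be built as a networkx graph roughly as follows; the actual generators in src/network.py may differ in naming and in how link attributes are attached.

    import networkx as nx

    def fat_tree(k):
        """Simplified k-ary fat-tree (k even): core, aggregation, and edge
        switches plus GPU hosts, with assumed uniform link bandwidths."""
        g = nx.Graph()
        core = [("core", i) for i in range((k // 2) ** 2)]
        for pod in range(k):
            aggs = [("agg", pod, i) for i in range(k // 2)]
            edges = [("edge", pod, i) for i in range(k // 2)]
            # Edge <-> aggregation links inside the pod.
            for e in edges:
                for a in aggs:
                    g.add_edge(e, a, bandwidth=10)
            # Aggregation <-> core links.
            for i, a in enumerate(aggs):
                for j in range(k // 2):
                    g.add_edge(a, core[i * (k // 2) + j], bandwidth=40)
            # GPU hosts hanging off each edge switch.
            for i, e in enumerate(edges):
                for h in range(k // 2):
                    g.add_edge(("gpu", pod, i, h), e, bandwidth=10)
        return g

    topo = fat_tree(4)  # 16 GPU hosts for k = 4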

Example Output

When you run the simulation, you will see output similar to this, showing the performance metrics for the different strategies:

--- Initial Placement Performance (Stationary Workload) on Fat-Tree ---
Strategy: RoundRobin, P99 Latency: 240.5 ms, Goodput: 1247.4 tokens/sec, Max Link Congestion: 75.8%
Strategy: Greedy, P99 Latency: 205.1 ms, Goodput: 1462.7 tokens/sec, Max Link Congestion: 63.2%
Strategy: Helios-Static, P99 Latency: 145.3 ms, Goodput: 2064.7 tokens/sec, Max Link Congestion: 33.1%

--- Dynamic Adaptation Performance (Non-Stationary Workload) on Fat-Tree ---
Strategy: Helios-Static, P99 Latency: 185.6 ms, Goodput: 1616.4 tokens/sec, Max Link Congestion: 68.9%
Strategy: Helios-Dynamic, P99 Latency: 99.8 ms, Goodput: 3006.0 tokens/sec, Max Link Congestion: 35.4%
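
The headline numbers can be derived from per-token records collected during a run. A minimal sketch, assuming lists of completion latencies and per-link utilizations (which is not necessarily how src/simulation.py stores them):

    import numpy as np

    def summarize(latencies_ms, tokens_completed, sim_time_s, link_utilization):
        """Compute P99 latency, goodput, and max link congestion from raw records."""
        p99 = np.percentile(latencies_ms, 99)           # tail latency in ms
        goodput = tokens_completed / sim_time_s         # tokens served per second
        max_congestion = 100.0 * max(link_utilization)  # busiest link, as a percentage
        return p99, goodput, max_congestion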

Contributing

Contributions to this project are welcome! Please feel free to open an issue or submit a pull request.