Helios: A Python Simulation of Dynamic MoE Inference

This repository contains a Python-based simulation of the framework described in the paper “Congestion-Aware Routing and Dynamic Replication for Scalable Mixture-of-Experts Inference”. The project aims to implement and validate the core concepts of the Helios framework in a simulated environment.

Table of Contents

  - About the Paper
  - Project Overview
  - Features
  - Project Structure
  - Getting Started
  - How to Run the Simulation
  - Core Concepts Implemented
  - Network Topologies
  - Example Output
  - Contributing

About the Paper

Mixture-of-Experts (MoE) models are becoming increasingly large, necessitating their deployment across distributed GPU clusters. This introduces a significant performance bottleneck due to inter-server communication for expert routing. The Helios paper proposes a two-phase framework to address this challenge:

  1. Congestion-Aware Initial Placement: An initial, robust placement of experts onto GPUs is computed by solving an Integer Linear Program (ILP). This model optimizes for true network latency (considering bandwidth) rather than just hop count.
  2. Runtime Adaptation Engine: A continuous runtime engine monitors expert utilization and network congestion. It mitigates performance hotspots in real-time by dynamically replicating “hot” experts onto underutilized GPUs.

This adaptive approach allows the system to handle the dynamic nature of production workloads, significantly outperforming static placement methods.

Project Overview

This project provides a discrete-event simulation environment to model and evaluate the Helios framework. You can:

  - Generate Fat-Tree and Dragonfly network topologies to serve as the simulated GPU cluster.
  - Compute initial expert placements with different strategies (RoundRobin, Greedy, Helios-Static).
  - Enable the runtime adaptation engine (Helios-Dynamic) to replicate hot experts during a run.
  - Compare strategies under stationary and non-stationary workloads using P99 latency, goodput, and maximum link congestion.

Features

  - Congestion-aware initial placement formulated as an Integer Linear Program (ILP)
  - Runtime adaptation engine that monitors expert utilization and replicates hot experts onto underutilized GPUs
  - Fat-Tree and Dragonfly topology generation
  - Baseline placement strategies (RoundRobin, Greedy) for comparison
  - Stationary and non-stationary workload generation with P99 latency, goodput, and link-congestion metrics

Project Structure

helios/
├── src/
│   ├── __init__.py
│   ├── helios.py           # Core Helios framework implementation (placement and adaptation)
│   ├── network.py          # Network topology generation (Fat-Tree, Dragonfly)
│   ├── simulation.py       # The main simulation logic and components
│   └── utils.py            # Utility functions and data structures
├── main.py                 # Main script to run simulations
├── requirements.txt        # Python dependencies
└── README.md               # This file

Getting Started

Prerequisites

  - Python 3 and pip
  - The Python packages listed in requirements.txt (installed in the next step)

Installation

  1. Clone the repository:
    git clone https://github.com/ahmadpanah/helios.git
    cd helios
    
  2. Install the required Python packages:
    pip install -r requirements.txt
    

How to Run the Simulation

The main entry point for running simulations is main.py. You can configure the simulation parameters within this file.

To run a simulation comparing Helios with other static placement strategies under a non-stationary workload:

python main.py

You can modify main.py to change parameters such as:

  - The network topology (Fat-Tree or Dragonfly)
  - The workload pattern (stationary or non-stationary)
  - The placement strategies to compare (RoundRobin, Greedy, Helios-Static, Helios-Dynamic)
  - The cluster and model sizes (number of GPUs and experts)
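
For example, a configuration block in main.py might look like the following sketch. The parameter names and values here are hypothetical, chosen only to illustrate the kind of settings involved; check main.py for the actual names.

    # Hypothetical configuration -- the real parameter names in main.py may differ.
    config = {
        "topology": "fat_tree",        # or "dragonfly"
        "workload": "non_stationary",  # or "stationary"
        "strategies": ["RoundRobin", "Greedy", "Helios-Static", "Helios-Dynamic"],
        "num_gpus": 64,                # assumed cluster size
        "num_experts": 128,            # assumed number of experts
        "adaptation_interval": 0.1,    # seconds between runtime adaptation checks
    }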

Core Concepts Implemented

Phase 1: Congestion-Aware Initial Placement

The initial placement is modeled as an Integer Linear Program (ILP) in src/helios.py. The ILP assigns experts to GPUs so as to minimize communication cost over the network topology, using bandwidth-aware link latencies rather than hop counts.
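
As a rough illustration of this kind of formulation (not the repository's actual code), the assignment can be sketched with PuLP: binary expert-to-GPU variables, an assumed per-GPU capacity constraint, and an objective that weights expert traffic by bandwidth-aware link latency. The toy sizes and the traffic/latency inputs are placeholders.

    # Minimal ILP sketch (illustrative only; not the project's exact formulation).
    # x[e][g] = 1 if expert e is placed on GPU g.
    import pulp

    num_experts, num_gpus, capacity = 8, 4, 3                 # toy sizes (placeholders)
    traffic = [[1.0] * num_gpus for _ in range(num_experts)]  # tokens routed from GPU src to expert e
    latency = [[0.5] * num_gpus for _ in range(num_gpus)]     # bandwidth-aware latency between GPUs

    prob = pulp.LpProblem("expert_placement", pulp.LpMinimize)
    x = [[pulp.LpVariable(f"x_{e}_{g}", cat="Binary") for g in range(num_gpus)]
         for e in range(num_experts)]

    # Each expert is placed on exactly one GPU.
    for e in range(num_experts):
        prob += pulp.lpSum(x[e]) == 1

    # Each GPU hosts at most `capacity` experts (assumed memory constraint).
    for g in range(num_gpus):
        prob += pulp.lpSum(x[e][g] for e in range(num_experts)) <= capacity

    # Objective: total communication cost, weighting traffic by the latency
    # between a token's source GPU and the expert's host GPU.
    prob += pulp.lpSum(traffic[e][src] * latency[src][g] * x[e][g]
                       for e in range(num_experts)
                       for src in range(num_gpus)
                       for g in range(num_gpus))

    prob.solve()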

Phase 2: Runtime Adaptation Engine

The runtime adaptation logic is also implemented in src/helios.py and is triggered periodically from the simulation loop in src/simulation.py. It monitors expert utilization and network congestion, and replicates hot experts onto underutilized GPUs when hotspots appear.
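
In spirit, one adaptation step might look like the sketch below. The function name, thresholds, and replication policy are assumptions for illustration and do not mirror the repository's actual interface.

    # Illustrative periodic adaptation step (assumed names and thresholds).
    def adaptation_step(expert_load, gpu_utilization, placement,
                        hot_threshold=0.8, cold_threshold=0.3):
        """Replicate over-utilized ('hot') experts onto under-utilized GPUs."""
        hot_experts = [e for e, load in expert_load.items() if load > hot_threshold]
        cold_gpus = sorted((g for g, u in gpu_utilization.items() if u < cold_threshold),
                           key=lambda g: gpu_utilization[g])
        for expert, gpu in zip(hot_experts, cold_gpus):
            # Add a replica; the router can then split this expert's traffic
            # between the original copy and the new one.
            placement.setdefault(expert, []).append(gpu)
        return placement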

Network Topologies

The src/network.py module contains functions to generate graph representations of:

  - Fat-Tree topologies
  - Dragonfly topologies
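
For intuition, a k-ary Fat-Tree can be built as a networkx graph roughly as follows; the actual generators in src/network.py may differ in naming and in how link attributes are attached.

    import networkx as nx

    def fat_tree(k):
        """Simplified k-ary fat-tree (k even): core, aggregation, and edge
        switches plus GPU hosts, with assumed uniform link bandwidths."""
        g = nx.Graph()
        core = [("core", i) for i in range((k // 2) ** 2)]
        for pod in range(k):
            aggs = [("agg", pod, i) for i in range(k // 2)]
            edges = [("edge", pod, i) for i in range(k // 2)]
            # Edge <-> aggregation links inside the pod.
            for e in edges:
                for a in aggs:
                    g.add_edge(e, a, bandwidth=10)
            # Aggregation <-> core links.
            for i, a in enumerate(aggs):
                for j in range(k // 2):
                    g.add_edge(a, core[i * (k // 2) + j], bandwidth=40)
            # GPU hosts hanging off each edge switch.
            for i, e in enumerate(edges):
                for h in range(k // 2):
                    g.add_edge(("gpu", pod, i, h), e, bandwidth=10)
        return g

    topo = fat_tree(4)  # 16 GPU hosts for k = 4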

Example Output

When you run the simulation, you will see output similar to this, showing the performance metrics for the different strategies:

--- Initial Placement Performance (Stationary Workload) on Fat-Tree ---
Strategy: RoundRobin, P99 Latency: 240.5 ms, Goodput: 1247.4 tokens/sec, Max Link Congestion: 75.8%
Strategy: Greedy, P99 Latency: 205.1 ms, Goodput: 1462.7 tokens/sec, Max Link Congestion: 63.2%
Strategy: Helios-Static, P99 Latency: 145.3 ms, Goodput: 2064.7 tokens/sec, Max Link Congestion: 33.1%

--- Dynamic Adaptation Performance (Non-Stationary Workload) on Fat-Tree ---
Strategy: Helios-Static, P99 Latency: 185.6 ms, Goodput: 1616.4 tokens/sec, Max Link Congestion: 68.9%
Strategy: Helios-Dynamic, P99 Latency: 99.8 ms, Goodput: 3006.0 tokens/sec, Max Link Congestion: 35.4%
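
The headline numbers can be derived from per-token records collected during a run. A minimal sketch, assuming lists of completion latencies and per-link utilizations (which is not necessarily how src/simulation.py stores them):

    import numpy as np

    def summarize(latencies_ms, tokens_completed, sim_time_s, link_utilization):
        """Compute P99 latency, goodput, and max link congestion from raw records."""
        p99 = np.percentile(latencies_ms, 99)           # tail latency in ms
        goodput = tokens_completed / sim_time_s         # tokens served per second
        max_congestion = 100.0 * max(link_utilization)  # busiest link, as a percentage
        return p99, goodput, max_congestion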

Contributing

Contributions to this project are welcome! Please feel free to open an issue or submit a pull request.