This repository contains a Python simulation of the framework presented in the research paper: “Collaborative Deployment of Large AI Models on the Edge: A Microservice Approach to Heterogeneous Training and Quantized Inference”.
This project provides a functional, high-level implementation of the core architectural components and algorithms, demonstrating how to manage the lifecycle of Large AI Models (LAMs) in resource-constrained and heterogeneous edge computing environments.
Deploying Large AI Models (LAMs) like GPT-4 and Gemini on real-time Internet-of-Things (IoT) devices is a significant challenge. This is due to the severe mismatch between the massive computational and memory requirements of LAMs and the limited resources of edge devices. The problem is compounded by deep system heterogeneity, where devices vary widely in computational power (e.g., high-end GPUs vs. low-power MCUs) and supported numerical precisions (e.g., FP32 vs. INT8).
This framework provides a unified, modular, and adaptive solution to overcome these barriers.
This simulation implements the two core innovations presented in the paper:

- Heterogeneity-aware federated fine-tuning (FedFT-H, Algorithm 1): collaborative LoRA-based training across devices with mixed ranks and numerical precisions.
- Precision-aware inference orchestration (Algorithm 2): QoS-driven scheduling of quantized microservice variants across the edge network.
The framework is coordinated by a central Edge Server that manages two synergistic workflows: Collaborative Training and Dynamic Inference.
```
+----------------------------------------------------------------------------+
|                         EDGE SERVER (Orchestrator)                          |
|                                                                            |
|  +------------------------------+            +--------------------------+  |
|  | Heterogeneity-Aware          |Global LoRA | Precision-Aware          |  |
|  | Aggregation                  |<---------->| Inference Orchestrator   |  |
|  | (Algorithm 1)                |   Update   | (Algorithm 2)            |  |
|  | - De-Quantization            |            | - Lyapunov Optimization  |  |
|  | - Adaptive Rank Aggregation  |            | - QoS-based Scheduling   |  |
|  +------------------------------+            +--------------------------+  |
|        ^             |                                   |                 |
| Uplink |             | Downlink                          | Deployment      |
| (LoRA) |             | (LoRA)                            | Results         |
+--------|-------------|-----------------------------------|-----------------+
         |             |                                   |
         |             v                                   v
+----------------------------------------------------------------------------+
|                        HETEROGENEOUS EDGE NETWORK                          |
|                                                                            |
|   [COLLABORATIVE TRAINING]              [DYNAMIC INFERENCE]                |
|                                                                            |
|  +------------+  +------------+        +-------------+  +-------------+   |
|  |  Device 1  |  |  Device 2  |        |  Device 1   |  |  Device 3   |   |
|  |  (H-Tier)  |  |  (M-Tier)  |        |  (H-Tier)   |  |  (L-Tier)   |   |
|  | r=16, FP16 |  | r=8, FP16  |        |  m1_FP16    |  |  m2_INT8    |   |
|  +------------+  +------------+        +-------------+  +-------------+   |
|                                                                            |
|  +------------+  +------------+        +-------------+                    |
|  |  Device 3  |  |  Device 4  |        |  Device 2   |                    |
|  |  (L-Tier)  |  |  (M-Tier)  |        |  (M-Tier)   |                    |
|  | r=4, INT8  |  | r=8, FP16  |        |  m3_FP16    |                    |
|  +------------+  +------------+        +-------------+                    |
+----------------------------------------------------------------------------+
```
The codebase is organized into a modular structure to separate concerns.
```
collaborative_edge_lam/
├── main.py               # Main script to run all simulations
├── config.py             # Central configuration for devices, model, network
└── framework/
    ├── __init__.py
    ├── server.py         # Implements the EdgeServer orchestrator logic
    ├── device.py         # Implements the EdgeDevice client logic
    ├── microservice.py   # Defines microservice variants and the portfolio
    ├── lora.py           # Helper data class for LoRA updates
    └── utils.py          # Logging and other utility functions
```
This project's only third-party dependency is NumPy (`pip install numpy`); no other installation is required.
```bash
git clone https://github.com/ahmadpanah/Collaborative-Edge-LAM.git
cd Collaborative-Edge-LAM
```
Ensure you have Python 3.8+ installed.
```bash
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```
To run all three experiments described in the paper, simply execute the `main.py` script:

```bash
python collaborative_edge_lam/main.py
```
The script will print a detailed log of the simulation process and a summary of the results for each experiment, comparing our framework against the baselines.
This workflow, detailed in Algorithm 1 of the paper, allows devices of varying capabilities to train a model together.

Device side (`device.py`): each device trains a local LoRA adapter whose rank (`r`) and numerical precision match its hardware tier (e.g., `r=16`/FP16 on an H-Tier device, `r=4`/INT8 on an L-Tier device).

Server side (`server.py`): the server de-quantizes incoming INT8 updates and zero-pads lower-rank LoRA matrices so that every contribution can be aggregated at the maximum rank (`r_max`) in the federation. A minimal sketch of this step follows.
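The full Algorithm 1 logic lives in `server.py`; the snippet below is only a rough NumPy sketch of the de-quantize/pad/average idea, with illustrative field names (`num_samples`, `scale`, `precision`) that are not taken from the repo's actual API.

```python
import numpy as np

def dequantize(M, scale):
    """Recover an approximate float matrix from a symmetric INT8 quantization."""
    return M.astype(np.float32) * scale

def pad_to_rank(B, A, r_max):
    """Zero-pad a rank-r LoRA pair (B: d_out x r, A: r x d_in) up to r_max.

    Appending zero columns to B and zero rows to A leaves the product B @ A
    unchanged, so low-rank updates can be summed with high-rank ones.
    """
    r = A.shape[0]
    return (np.pad(B, ((0, 0), (0, r_max - r))),
            np.pad(A, ((0, r_max - r), (0, 0))))

def aggregate(updates, r_max):
    """Data-size-weighted average of heterogeneous client LoRA updates."""
    total = sum(u["num_samples"] for u in updates)
    B_glob, A_glob = 0.0, 0.0
    for u in updates:
        B, A = u["B"], u["A"]
        if u["precision"] == "INT8":       # step 1: undo on-device quantization
            B, A = dequantize(B, u["scale"]), dequantize(A, u["scale"])
        B, A = pad_to_rank(B, A, r_max)    # step 2: lift every update to r_max
        w = u["num_samples"] / total       # step 3: FedAvg-style weighting
        B_glob, A_glob = B_glob + w * B, A_glob + w * A
    return B_glob, A_glob
```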
This workflow, detailed in Algorithm 2 of the paper, enables efficient, parallelized inference.
Microservice Portfolio (`microservice.py`): the system maintains a portfolio of services, where each logical function (e.g., `CoT_Step`) exists in multiple versions (`FP16`, `INT8`, `Pruned_INT8`), each with different accuracy/performance trade-offs.
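As a rough illustration of what such a portfolio might look like (the class shape and all numbers below are made up for the example, not taken from `microservice.py`):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MicroserviceVariant:
    function: str       # logical function, e.g. "CoT_Step"
    precision: str      # "FP16", "INT8", or "Pruned_INT8"
    accuracy: float     # task accuracy (%) of this variant
    latency_ms: float   # estimated per-call latency
    memory_gb: float    # resident footprint when deployed

# Illustrative numbers only: one logical function, three quantized variants.
PORTFOLIO = [
    MicroserviceVariant("CoT_Step", "FP16",        97.0, 55.0, 1.6),
    MicroserviceVariant("CoT_Step", "INT8",        95.5, 30.0, 0.8),
    MicroserviceVariant("CoT_Step", "Pruned_INT8", 93.0, 18.0, 0.5),
]
```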
Orchestration (`server.py`): when a request with a QoS constraint (e.g., `min_accuracy > 95%`) is received, the orchestrator first filters the portfolio down to the `(microservice, device)` pairs that satisfy the QoS, then chooses among the survivors by weighing the `PerformanceCost` (latency, energy) against the `QueueCost` (current device load), following the Lyapunov-based optimization of Algorithm 2. A minimal scheduling sketch follows.
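The sketch below reuses the hypothetical `MicroserviceVariant` class from the previous snippet; the drift-plus-penalty score and the `V` knob follow the standard Lyapunov recipe and are an assumption about the exact shape of Algorithm 2, not a transcription of it.

```python
def schedule(request, portfolio, devices, V=1.0):
    """Pick the (microservice, device) pair minimizing a drift-plus-penalty
    score: current queue backlog plus V times the performance cost.

    In practice the two cost terms would be normalized to comparable units.
    """
    best_pair, best_score = None, float("inf")
    for svc in portfolio:
        if svc.accuracy < request["min_accuracy"]:
            continue                                   # QoS filter
        for dev in devices:
            if svc.precision not in dev["supported_precisions"]:
                continue                               # hardware filter
            perf_cost = svc.latency_ms + dev["energy_per_call_j"]
            score = dev["queue_backlog"] + V * perf_cost
            if score < best_score:
                best_pair, best_score = (svc, dev), score
    return best_pair

# Hypothetical device states: an H-Tier device under load, an idle L-Tier one.
devices = [
    {"name": "dev1_H", "supported_precisions": {"FP16", "INT8"},
     "queue_backlog": 3.0, "energy_per_call_j": 0.9},
    {"name": "dev3_L", "supported_precisions": {"INT8"},
     "queue_backlog": 0.5, "energy_per_call_j": 0.2},
]
choice = schedule({"min_accuracy": 95.0}, PORTFOLIO, devices, V=1.0)
# Pruned_INT8 fails the QoS filter; the INT8 variant on the idle L-Tier
# device wins on combined queue + performance cost.
```

A larger `V` favors raw latency/energy performance; a smaller `V` keeps per-device queues short.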
Federated Unlearning (`server.py`): instead of retraining the entire model from scratch, we simulate the “right to be forgotten.” The `federated_unlearning` method retrieves the forgotten client’s last weighted contribution and effectively “subtracts” it from the global model. This process is orders of magnitude faster than retraining, as demonstrated in the simulation; a minimal sketch of the idea follows.
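A sketch of the subtraction step, assuming the server cached each client's last weighted LoRA contribution at aggregation time (the cache layout is hypothetical):

```python
def federated_unlearning(global_B, global_A, contributions, client_id):
    """Undo a client's last aggregated LoRA contribution in O(1) matrix ops.

    `contributions` is assumed to map client_id -> (w*B, w*A), the weighted
    terms that were added into the global adapter for that client.
    """
    wB, wA = contributions.pop(client_id)
    return global_B - wB, global_A - wA
```

Because only two matrix subtractions are needed, the cost is independent of dataset size, which is why it is orders of magnitude cheaper than retraining.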
Running the `main.py` script will produce output similar to the following, summarizing the results of the three experiments.
```
--- Training Performance Summary (after 50 rounds as per paper) ---
Training Strategy       Final Accuracy (%)    Total Comm. Cost (GB)
-------------------------------------------------------------------
Full-Model FedAvg       85.5%                 ~280.0 GB
Naive Fed-LoRA (r=4)    72.1%                 ~5.1 GB
Ours (FedFT-H)          84.9% (Simulated)     ~5.3 GB
```
FedFT-H achieves near full-model accuracy with >98% communication savings.
```
--- Inference Performance Summary (4-Step CoT Task) ---
Deployment Strategy     Avg. Latency (ms)    Memory Footprint
-------------------------------------------------------------
Cloud-Centric           850.4                N/A (Cloud-Side)
Monolithic Edge         400.0                14.5 GB (on one device)
Ours (Microservice)     172.0                4.2 GB (Total Active)
```
Our microservice approach shows a 57% latency reduction relative to the monolithic edge baseline (400.0 ms → 172.0 ms) through parallelization.
```
--- Federated Unlearning Efficacy & Cost Comparison ---
State / Method              Model Accuracy (%)    Time Cost
------------------------------------------------------------
Original Trained Model      84.9%                 N/A
Full Retraining             84.5%                 ~5 Hours
Ours (Orthogonal Unlearn)   84.3% (Simulated)     ~0.01 Seconds
```
Our unlearning method is orders of magnitude faster than retraining with minimal accuracy loss.
Contributions are welcome! If you have ideas for improvements, please open an issue to discuss what you would like to change. Pull requests are also appreciated.
This project is licensed under the MIT License.