Swarm Development Guide

Learn how to develop distributed algorithms using the Manta Swarm API. This guide covers the complete process from basic concepts to advanced orchestration patterns.

Note

A Swarm represents a distributed computing workflow that coordinates multiple tasks across nodes in a cluster.

Overview

The Swarm API provides a high-level abstraction for defining distributed algorithms:

  • Tasks: Individual units of computation

  • Graphs: Task dependencies and execution flow

  • Modules: Packaged code for task execution

  • Orchestration: Coordination and scheduling

Core Concepts

Swarm

A complete distributed algorithm definition including tasks, dependencies, and global state.

Task

An individual computation unit that executes on one or more nodes.

Module

Packaged Python code that implements task logic.

Graph

The execution flow defining task dependencies and conditions.

Scheduling

Rules for task distribution across nodes.

Quick Example

from manta import Swarm, Task, Module

class MySwarm(Swarm):
    """Simple federated learning swarm"""

    def __init__(self):
        super().__init__()
        self.name = "federated_mnist"

        # Set global parameters
        self.set_global("epochs", 10)
        self.set_global("batch_size", 32)

    def execute(self):
        # Define module
        fl_module = Module(
            name="fl_trainer",
            python_program="trainer.py",
            image="manta:pytorch"
        )

        # Create tasks
        aggregator = Task(
            module=fl_module,
            method="any",
            maximum=1,
            alias="aggregator"
        )

        workers = Task(
            module=fl_module,
            method="all",
            alias="workers"
        )

        # Define execution flow
        return aggregator(workers)

Development Workflow

  1. Design Your Algorithm: identify computation units, define the data flow, and plan resource requirements.

  2. Create Task Modules: write the task implementation, define entry points, and specify dependencies.

  3. Build Task Graph: connect tasks with dependencies, set execution conditions, and configure scheduling.

  4. Configure Swarm: set global parameters, define network settings, and configure fault tolerance.

  5. Deploy and Monitor: deploy to the cluster, monitor execution, and collect results (see the sketch after this list).
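
A minimal sketch of steps 4 and 5, using only APIs shown in this guide; the actual deployment call is cluster-specific and not covered here:

# Build the swarm from the Quick Example above
swarm = MySwarm()

# Step 4: adjust global parameters before building the graph
swarm.set_global("epochs", 20)

# Step 5 (partial): build the task graph locally to catch errors
# before deployment; prepare_graph is also used in the Validation
# section below.
graph = swarm.prepare_graph()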

Key Components

Swarm Class

The base class for all swarms:

from manta import Swarm

class MySwarm(Swarm):
    name = "my_algorithm"

    def execute(self):
        # Define task graph
        pass

Task Definition

Tasks represent computation units:

from manta import Task, Module

task = Task(
    module=module,          # Module to execute
    method="all",          # Scheduling method
    fixed=False,           # Dynamic node assignment
    maximum=10,            # Max nodes to use
    network="overlay",     # Network type
    gpu=False,            # GPU requirement
    alias="worker"        # Task identifier
)

Module Packaging

Modules contain task implementation:

from manta import Module

module = Module(
    name="my_module",
    python_program=code,    # Python code or file
    image="manta:latest",   # Container image
    requirements=["numpy"], # Dependencies
    resources={            # Resource requirements
        "cpu": 2,
        "memory": "4Gi"
    }
)
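
Because python_program accepts either code or a file path, a module can also be built from an inline string; a minimal sketch (the program body here is purely illustrative):

# Inline program source instead of a file path
inline_module = Module(
    name="hello_task",
    python_program="print('hello from the swarm')",
    image="manta:latest"
)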

Graph Construction

Connect tasks to form execution flow:

# Sequential execution
task1 = Task(module1)
task2 = Task(module2)
task3 = Task(module3)

# Chain tasks
flow = task3(task2(task1))

# Parallel execution
parallel_tasks = [Task(module) for _ in range(5)]
aggregator = Task(aggregator_module)

# Aggregate parallel results
flow = aggregator(parallel_tasks)

Execution Flow

Task Lifecycle

  1. Initialization: Task is created and configured

  2. Scheduling: Task assigned to nodes

  3. Execution: Code runs on assigned nodes

  4. Result Collection: Outputs gathered

  5. Condition Check: Next tasks triggered

Scheduling Methods

  • "all": Execute on all available nodes

  • "any": Execute on any single node

  • "fixed": Execute on specific nodes

  • "maximum": Execute on up to N nodes

Execution Conditions

  • FINISHED: Trigger when task completes

  • RESULT: Trigger when result available
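
This guide does not show the API for attaching conditions to a task, so the following is purely hypothetical; it assumes Task accepts a condition keyword, which may differ in the actual API:

# HYPOTHETICAL: condition attachment is not documented here; the
# `condition` keyword is an assumption, not a confirmed parameter.
downstream = Task(
    module=module,
    method="any",
    condition="RESULT"  # trigger as soon as the upstream result exists
)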

Advanced Features

Global State Management

class StatefulSwarm(Swarm):
    def __init__(self):
        super().__init__()
        # Set global parameters shared across tasks
        # (initial_weights is assumed to be defined elsewhere)
        self.set_global("model_weights", initial_weights)
        self.set_global("hyperparameters", {
            "learning_rate": 0.01,
            "momentum": 0.9
        })
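
Globals set this way can be read back with get_global, which takes an optional default (see the Dynamic Task Generation example below); continuing StatefulSwarm:

    def execute(self):
        # Read a global set in __init__; the second argument to
        # get_global is a default used when the key is missing.
        hyper = self.get_global("hyperparameters", {})
        learning_rate = hyper.get("learning_rate", 0.01)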

Network Configuration

from manta import Swarm, Task, Driver  # import path for Driver assumed

class NetworkedSwarm(Swarm):
    def __init__(self):
        super().__init__()
        # Add custom networks with different drivers
        self.add_network("backend", Driver.BRIDGE)
        self.add_network("frontend", Driver.OVERLAY)

    def execute(self):
        # Assign tasks to the named networks
        # (backend_module is assumed to be defined elsewhere)
        backend_task = Task(
            module=backend_module,
            network="backend"
        )

Dynamic Task Generation

class DynamicSwarm(Swarm):
    def execute(self):
        # Generate tasks based on a runtime parameter
        # (worker_module and self.aggregator are assumed to be
        # defined elsewhere)
        num_workers = self.get_global("num_workers", 5)

        workers = [
            Task(
                module=worker_module,
                alias=f"worker_{i}"
            )
            for i in range(num_workers)
        ]

        # Feed every worker's output into a single aggregator task
        return self.aggregator(workers)

Error Handling

Task Failure Recovery

task = Task(
    module=module,
    method="any",
    maximum=3,  # Try up to 3 nodes
    excepted_ids=["failed_node_1"]  # Exclude failed nodes
)
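
This can be combined with global state so that retries skip nodes already known to have failed; a sketch, assuming failure IDs are recorded elsewhere under an illustrative "failed_nodes" key:

# "failed_nodes" is an illustrative key, not a built-in
failed = self.get_global("failed_nodes", [])
retry_task = Task(
    module=module,
    method="any",
    maximum=3,
    excepted_ids=failed
)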

Validation

# Validate the swarm before deployment
swarm = MySwarm()
graph = swarm.prepare_graph()

# Check for cycles (has_cycles is a user-supplied helper; see the
# sketch below)
if has_cycles(graph):
    raise ValueError("Task graph contains cycles")
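
has_cycles is not part of the API shown in this guide; a minimal depth-first sketch, assuming the graph is a mapping from each task to a list of its successors:

def has_cycles(graph):
    """Detect a cycle via DFS; graph maps node -> list of successors."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for succ in graph.get(node, []):
            if color.get(succ, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(succ, WHITE) == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)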

Performance Optimization

Resource Allocation

# Optimize resource usage
task = Task(
    module=module,
    method="any",
    maximum=0.5,  # Use 50% of available nodes
    gpu=True      # Require GPU nodes
)

Data Locality

# Keep data local to nodes
task = Task(
    module=module,
    fixed=True,  # Pin to specific nodes
    specified_ids=["node1", "node2"]  # Data-local nodes
)

Examples

Federated Learning

class FederatedLearning(Swarm):
    name = "fl_experiment"

    def execute(self):
        # Training tasks on each node
        trainers = Task(
            module=trainer_module,
            method="all",
            alias="trainers"
        )

        # Aggregation on single node
        aggregator = Task(
            module=aggregator_module,
            method="any",
            maximum=1,
            alias="aggregator"
        )

        # Scheduler for iterations
        scheduler = Task(
            module=scheduler_module,
            method="any",
            maximum=1,
            alias="scheduler"
        )

        # Chain trainers -> aggregator -> scheduler; the scheduler
        # then triggers the next training iteration
        return scheduler(aggregator(trainers))

MapReduce Pattern

class MapReduce(Swarm):
    name = "mapreduce_job"

    def execute(self):
        # Map phase - parallel processing
        mappers = [
            Task(
                module=map_module,
                alias=f"mapper_{i}"
            )
            for i in range(10)
        ]

        # Reduce phase - aggregation
        reducer = Task(
            module=reduce_module,
            method="any",
            maximum=1,
            alias="reducer"
        )

        # Connect mappers to reducer
        return reducer(mappers)

Best Practices

  1. Modular Design: Keep tasks focused and reusable

  2. Error Handling: Implement retry logic and fallbacks

  3. Resource Planning: Set appropriate resource limits

  4. Testing: Test locally before cluster deployment

  5. Monitoring: Add logging and metrics collection (see the sketch after this list)

  6. Documentation: Document task dependencies and flow
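
For practice 5, a task module can rely on the standard library's logging module; a minimal sketch (the loop and metric values are purely illustrative):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
log = logging.getLogger("fl_trainer")

# Illustrative training loop with placeholder metrics
for epoch in range(3):
    loss = 1.0 / (epoch + 1)
    log.info("epoch=%d loss=%.4f", epoch, loss)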

Next Steps