Cluster Command

The manta_node cluster command manages multiple node instances as a coordinated group, simplifying deployment of multi-node setups on a single machine.

Overview

The cluster command enables:

  • Starting multiple nodes with one command

  • Flexible configuration per node

  • Bulk management of node groups

  • Simplified testing and development

  • Easy cluster teardown

Synopsis

# Quick start format
manta_node cluster <count>

# Full command format
manta_node cluster start <count> [options]
manta_node cluster stop

Arguments and Options

Quick Start

manta_node cluster <count>

Quickly start a cluster, prompting for each node’s configuration

  • count: Number of nodes to start (e.g., 2, 3, 5)

  • Interactive configuration selection

Start Subcommand

manta_node cluster start <count> [options]

Start a cluster of nodes

Arguments:

  • count: Number of nodes to start

Options:

  • --config <name>: Use the same configuration for all nodes

  • --count <n>: Alternative way to specify count

Stop Subcommand

manta_node cluster stop

Stop all cluster nodes (nodes with “-cluster-” in their name)

Usage Examples

Interactive Cluster Setup

Start 3 nodes with individual configuration:

$ manta_node cluster 3
Starting cluster with 3 nodes
You will be prompted to select a configuration for each node.

Available configurations:
  1. default
  2. gpu-config
  3. cpu-config

Node 1/3:
Select configuration for node 1 [default]: gpu-config
Alias for node 1 [gpu-config]: gpu-worker-1

Node 2/3:
Select configuration for node 2 [default]: gpu-config
Alias for node 2 [gpu-config]: gpu-worker-2

Node 3/3:
Select configuration for node 3 [default]: cpu-config
Alias for node 3 [cpu-config]: cpu-worker-1

Cluster configuration summary:
  Node 1: config='gpu-config', alias='gpu-worker-1'
  Node 2: config='gpu-config', alias='gpu-worker-2'
  Node 3: config='cpu-config', alias='cpu-worker-1'

Start cluster with this configuration? [Y/n]: y

Starting nodes...
Starting node 1/3: gpu-worker-1
   Started gpu-worker-1
Starting node 2/3: gpu-worker-2
   Started gpu-worker-2
Starting node 3/3: cpu-worker-1
   Started cpu-worker-1

Successfully started 3 nodes

Uniform Cluster Setup

Start multiple nodes with the same configuration:

$ manta_node cluster start 5 --config production
Starting cluster with 5 nodes using config 'production'...
Starting 5 nodes...
Starting node 1/5: production-cluster-1
   Started production-cluster-1
Starting node 2/5: production-cluster-2
   Started production-cluster-2
Starting node 3/5: production-cluster-3
   Started production-cluster-3
Starting node 4/5: production-cluster-4
   Started production-cluster-4
Starting node 5/5: production-cluster-5
   Started production-cluster-5

Successfully started 5 nodes

Stop Cluster

Stop all cluster nodes:

$ manta_node cluster stop
Found 5 cluster nodes:
  - production-cluster-1 (PID: 12345)
  - production-cluster-2 (PID: 12346)
  - production-cluster-3 (PID: 12347)
  - production-cluster-4 (PID: 12348)
  - production-cluster-5 (PID: 12349)

Stop all cluster nodes? [y/N]: y

Stopping cluster nodes...
   Stopped production-cluster-1
   Stopped production-cluster-2
   Stopped production-cluster-3
   Stopped production-cluster-4
   Stopped production-cluster-5

All 5 cluster nodes stopped successfully

Cluster Management

Node Naming

Cluster nodes are named automatically:

With uniform config:

  • Pattern: <config>-cluster-<number>

  • Example: production-cluster-1, production-cluster-2

With individual configs:

  • Uses specified aliases or config names

  • Maintains user-provided names

Identification:

  • Cluster nodes contain -cluster- in their name

  • Makes bulk operations possible

  • Distinguishes from standalone nodes
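
Because bulk operations key off the -cluster- substring, selecting cluster nodes is a plain pattern match. Note the flip side: interactive-mode nodes whose aliases lack -cluster- (for example gpu-worker-1 above) are not matched by cluster stop and must be stopped individually with manta_node stop. A minimal sketch:

# List only cluster nodes
manta_node status | grep -- '-cluster-'

# Count running cluster nodes
manta_node status | grep -c -- '-cluster-'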

Configuration Selection

Interactive mode prompts for:

  1. Configuration choice: Select from available configs

  2. Node alias: Optionally override default alias

  3. Confirmation: Review before starting

Uniform mode uses:

  1. Single configuration: Applied to all nodes

  2. Auto-generated aliases: Sequential numbering

  3. No interaction: Fully automated
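
If a script needs the per-node flexibility of interactive mode, the prompts can be answered over stdin in the order they appear in the interactive example above (config and alias per node, then the final confirmation). A sketch, assuming the CLI reads prompt answers from standard input:

# Two nodes: config + alias for each, then "y" to confirm
printf 'gpu-config\ngpu-worker-1\ncpu-config\ncpu-worker-1\ny\n' \
    | manta_node cluster 2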

Resource Considerations

When starting multiple nodes:

Resource multiplication:

  • Each node consumes configured resources

  • 5 nodes × 2 GB RAM = 10 GB total RAM needed

  • CPU cores shared among nodes
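
A minimal pre-flight sketch that checks the multiplied footprint against available RAM before starting (the 2048 MB per-node figure is an assumed estimate; substitute your config's actual usage):

NODES=5
PER_NODE_MB=2048                                  # assumed per-node footprint
NEED_MB=$((NODES * PER_NODE_MB))
AVAIL_MB=$(free -m | awk '/^Mem:/ {print $7}')    # "available" column

if [ "$AVAIL_MB" -lt "$NEED_MB" ]; then
    echo "Need ${NEED_MB} MB, only ${AVAIL_MB} MB available; aborting"
else
    manta_node cluster start "$NODES" --config production
fi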

Recommended limits:

Nodes | Min RAM | Min CPU   | Recommended
------|---------|-----------|--------------
2     | 8 GB    | 4 cores   | Development
3-5   | 16 GB   | 8 cores   | Testing
5-10  | 32 GB   | 16 cores  | Small cluster
10+   | 64+ GB  | 32+ cores | Production

Safety checks:

  • Warning for >10 nodes

  • Confirmation required for large clusters

  • Resource validation before starting

Cluster Operations

Starting Clusters

Best practices for cluster startup:

  1. Check resources first:

    # Check available resources
    free -h
    nproc
    df -h
    
  2. Verify configurations:

    # List and validate configs
    manta_node config list
    manta_node config validate production
    
  3. Start incrementally:

    # Start small, then scale
    manta_node cluster 2
    manta_node status
    manta_node cluster 3
    

Managing Clusters

Monitor and control cluster nodes:

View cluster status:

# See all nodes including cluster
manta_node status

# Filter cluster nodes
manta_node status | grep cluster

Stop specific cluster nodes:

# Stop individual cluster node
manta_node stop production-cluster-3

# Stop range of nodes
for i in {1..3}; do
    manta_node stop production-cluster-$i
done

Restart cluster:

# Stop all cluster nodes
manta_node cluster stop

# Start fresh cluster
manta_node cluster start 5 --config production

Cluster Patterns

Development Cluster

For local development and testing:

# Create dev cluster with mixed configs
manta_node cluster 3
# Choose: dev, dev, test configs

# Run tests
python run_tests.py

# Clean up
manta_node cluster stop

GPU Cluster

For machine learning workloads:

# Start GPU cluster
manta_node cluster start 4 --config gpu-enabled

# Verify GPU nodes
manta_node status | grep gpu

# Deploy ML tasks
python deploy_training.py

Heterogeneous Cluster

Mixed node types:

# Start with different roles
manta_node cluster 5
# Node 1: gpu-config (trainer)
# Node 2: gpu-config (trainer)
# Node 3: cpu-config (aggregator)
# Node 4: edge-config (data source)
# Node 5: edge-config (data source)

Automation Examples

Bash Script

Automated cluster management:

#!/bin/bash
# cluster_manager.sh

start_cluster() {
    local count=$1
    local config=$2

    echo "Starting cluster of $count nodes..."

    if manta_node cluster start "$count" --config "$config"; then
        echo "Cluster started successfully"
        manta_node status
    else
        echo "Failed to start cluster"
        exit 1
    fi
}

stop_cluster() {
    echo "Stopping cluster..."
    manta_node cluster stop
}

restart_cluster() {
    stop_cluster
    sleep 2
    start_cluster "$@"
}

# Usage
case "$1" in
    start)
        start_cluster "${2:-3}" "${3:-default}"
        ;;
    stop)
        stop_cluster
        ;;
    restart)
        restart_cluster "${2:-3}" "${3:-default}"
        ;;
    *)
        echo "Usage: $0 {start|stop|restart} [count] [config]"
        ;;
esac

Python Script

Programmatic cluster control:

import subprocess
import time

class ClusterManager:
    def __init__(self):
        self.nodes = []

    def start_cluster(self, count, config=None):
        """Start a cluster of nodes."""
        cmd = ['manta_node', 'cluster', 'start', str(count)]
        if config:
            cmd.extend(['--config', config])

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            print(f"Started cluster of {count} nodes")
            self.nodes = self.get_cluster_nodes()
            return True
        else:
            print(f"Failed: {result.stderr}")
            return False

    def stop_cluster(self):
        """Stop all cluster nodes."""
        result = subprocess.run(
            ['manta_node', 'cluster', 'stop'],
            input='y\n',
            capture_output=True,
            text=True
        )
        return result.returncode == 0

    def get_cluster_nodes(self):
        """Get list of cluster nodes."""
        result = subprocess.run(
            ['manta_node', 'status', '--plain'],
            capture_output=True,
            text=True
        )

        nodes = []
        for line in result.stdout.split('\n'):
            if 'cluster' in line and 'Instance:' in line:
                node_id = line.split(': ')[1]
                nodes.append(node_id)

        return nodes

    def scale_cluster(self, new_count, config=None):
        """Scale the cluster to a new size."""
        current = len(self.nodes)

        if new_count > current:
            # Scale up: start the missing nodes as an extra batch.
            # Assumes the new batch's auto-generated names do not
            # collide with the existing cluster nodes.
            additional = new_count - current
            print(f"Scaling up by {additional} nodes")
            self.start_cluster(additional, config)
        elif new_count < current:
            # Scale down: stop the surplus nodes individually.
            for node_id in self.nodes[new_count:]:
                print(f"Stopping {node_id}")
                subprocess.run(['manta_node', 'stop', node_id])
            self.nodes = self.nodes[:new_count]

# Usage
manager = ClusterManager()
manager.start_cluster(5, 'production')
time.sleep(10)
manager.stop_cluster()

Troubleshooting

Cluster Start Failures

Some nodes fail to start:

Starting nodes...
Starting node 1/3: node-1
  ✓ Started node-1
Starting node 2/3: node-2
  ✗ Failed to start node-2: Port already in use
Starting node 3/3: node-3
  ✓ Started node-3

Failed to start 1 node:
  - node-2

Solutions:

  1. Check port conflicts

  2. Verify configurations

  3. Check resource availability

  4. Review log files
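
A quick triage pass over those four checks (ss and free are standard tools; the log path depends on your [logging] settings, so it is left as a comment):

# 1. Port conflicts: list TCP ports already in use
ss -ltn

# 2. Verify the configuration the failed node used
manta_node config validate production

# 3. Resource availability
free -h
nproc

# 4. Review log files (location depends on your logging configuration)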

Resource Exhaustion

System runs out of resources:

# Check resource usage
free -h
top -bn1 | head -20

# Reduce cluster size
manta_node cluster stop
manta_node cluster start 2 --config lightweight

Cluster Node Identification

Can’t distinguish cluster nodes:

# List only cluster nodes
manta_node status | grep '\-cluster\-'

# Get cluster node PIDs
for pid in $(manta_node status --plain | \
             grep cluster | \
             grep -oP 'PID: \K\d+'); do
    echo "Cluster node PID: $pid"
done

Performance Optimization

Cluster Configuration

Optimize for performance:

# cluster-optimized.toml
[tasks]
max_concurrent = 1  # Reduce per node

[resources]
reserve_cpu_percent = 5  # Lower reservation
reserve_memory_mb = 256

[logging]
level = "WARNING"  # Reduce log overhead
log_to_console = false

Load Balancing

Distribute work evenly:

  1. Use similar configurations: Ensures uniform capacity

  2. Monitor node load: Check CPU/memory regularly

  3. Adjust task distribution: Configure task limits

  4. Stagger startup: Add delays between starts
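
If simultaneous startup causes contention, staggering individual node starts is a simple alternative to one bulk cluster start (a sketch; the worker-$i names are illustrative):

# Start nodes one at a time with a pause between starts
for i in {1..5}; do
    manta_node start "worker-$i"
    sleep 5
done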

Resource Isolation

Prevent resource conflicts:

# CPU affinity for nodes
taskset -c 0-3 manta_node start node1 &
taskset -c 4-7 manta_node start node2 &

# Memory limit (MemoryMax supersedes the older MemoryLimit;
# systemd-run --uid requires root)
sudo systemd-run --uid="$USER" \
  --property=MemoryMax=4G \
  manta_node start limited

Best Practices

Development Clusters

  1. Start small: Begin with 2-3 nodes

  2. Use lightweight configs: Reduce resource usage

  3. Quick iteration: Stop/start frequently (see the helper below)

  4. Monitor logs: Watch for errors

  5. Clean shutdown: Always use cluster stop
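
For the stop/start loop in quick iteration, a small helper keeps turnaround short (a sketch using only commands shown above; dev is an assumed config name):

#!/bin/bash
# dev_cycle.sh - tear down and relaunch a small dev cluster
echo y | manta_node cluster stop        # answer the confirmation prompt
sleep 2
manta_node cluster start 2 --config dev
manta_node status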

Production Clusters

  1. Resource planning: Calculate total needs

  2. Gradual scaling: Start nodes incrementally

  3. Health monitoring: Check status regularly

  4. Failover planning: Handle node failures

  5. Maintenance windows: Schedule restarts

Testing Clusters

  1. Consistent configs: Use same config for reproducibility

  2. Automated setup: Script cluster creation

  3. Baseline metrics: Record normal resource usage (see the snippet after this list)

  4. Stress testing: Run at maximum capacity

  5. Clean environment: Stop all nodes between tests
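
For baseline metrics, a timestamped snapshot before and after each test run is usually enough (a sketch with standard tools):

# Record a resource and cluster-state baseline
{
    date
    free -m
    nproc
    manta_node status
} > "baseline-$(date +%Y%m%d-%H%M%S).txt"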

See Also