Cluster Command

The manta_node cluster command manages multiple node instances as a coordinated group, simplifying deployment of multi-node setups on a single machine.

Overview

The cluster command enables:

  • Starting multiple nodes with one command

  • Flexible configuration per node

  • Bulk management of node groups

  • Simplified testing and development

  • Easy cluster teardown

Synopsis

# Quick start format
manta_node cluster <count>

# Full command format
manta_node cluster start <count> [options]
manta_node cluster stop

Arguments and Options

Quick Start

manta_node cluster <count>

Quickly start a cluster, prompting for each node’s configuration

  • count: Number of nodes to start (e.g., 2, 3, 5)

  • Interactive configuration selection

Start Subcommand

manta_node cluster start <count> [options]

Start a cluster of nodes

Arguments:

  • count: Number of nodes to start

Options:

  • --config <name>: Use the same configuration for all nodes

  • --count <n>: Alternative way to specify count

Stop Subcommand

manta_node cluster stop

Stop all cluster nodes (nodes with “-cluster-” in their name)

Usage Examples

Interactive Cluster Setup

Start 3 nodes with individual configuration:

$ manta_node cluster 3
Starting cluster with 3 nodes
You will be prompted to select a configuration for each node.

Available configurations:
  1. default
  2. gpu-config
  3. cpu-config

Node 1/3:
Select configuration for node 1 [default]: gpu-config
Alias for node 1 [gpu-config]: gpu-worker-1

Node 2/3:
Select configuration for node 2 [default]: gpu-config
Alias for node 2 [gpu-config]: gpu-worker-2

Node 3/3:
Select configuration for node 3 [default]: cpu-config
Alias for node 3 [cpu-config]: cpu-worker-1

Cluster configuration summary:
  Node 1: config='gpu-config', alias='gpu-worker-1'
  Node 2: config='gpu-config', alias='gpu-worker-2'
  Node 3: config='cpu-config', alias='cpu-worker-1'

Start cluster with this configuration? [Y/n]: y

Starting nodes...
Starting node 1/3: gpu-worker-1
   Started gpu-worker-1
Starting node 2/3: gpu-worker-2
   Started gpu-worker-2
Starting node 3/3: cpu-worker-1
   Started cpu-worker-1

Successfully started 3 nodes

Uniform Cluster Setup

Start multiple nodes with the same configuration:

$ manta_node cluster start 5 --config production
Starting cluster with 5 nodes using config 'production'...
Starting 5 nodes...
Starting node 1/5: production-cluster-1
   Started production-cluster-1
Starting node 2/5: production-cluster-2
   Started production-cluster-2
Starting node 3/5: production-cluster-3
   Started production-cluster-3
Starting node 4/5: production-cluster-4
   Started production-cluster-4
Starting node 5/5: production-cluster-5
   Started production-cluster-5

Successfully started 5 nodes

Stop Cluster

Stop all cluster nodes:

$ manta_node cluster stop
Found 5 cluster nodes:
  - production-cluster-1 (PID: 12345)
  - production-cluster-2 (PID: 12346)
  - production-cluster-3 (PID: 12347)
  - production-cluster-4 (PID: 12348)
  - production-cluster-5 (PID: 12349)

Stop all cluster nodes? [y/N]: y

Stopping cluster nodes...
   Stopped production-cluster-1
   Stopped production-cluster-2
   Stopped production-cluster-3
   Stopped production-cluster-4
   Stopped production-cluster-5

All 5 cluster nodes stopped successfully

Cluster Management

Node Naming

Cluster nodes are named automatically:

With uniform config:

  • Pattern: <config>-cluster-<number>

  • Example: production-cluster-1, production-cluster-2

With individual configs:

  • Uses specified aliases or config names

  • Maintains user-provided names

Identification:

  • Cluster nodes contain -cluster- in their name

  • Makes bulk operations possible

  • Distinguishes from standalone nodes
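
Because bulk operations key off the -cluster- substring, selecting cluster nodes is a plain pattern match. Note the flip side: interactive-mode nodes whose aliases lack -cluster- (for example gpu-worker-1 above) are not matched by cluster stop and must be stopped individually with manta_node stop. A minimal sketch:

# List only cluster nodes
manta_node status | grep -- '-cluster-'

# Count running cluster nodes
manta_node status | grep -c -- '-cluster-'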

Configuration Selection

Interactive mode prompts for:

  1. Configuration choice: Select from available configs

  2. Node alias: Optionally override default alias

  3. Confirmation: Review before starting

Uniform mode uses:

  1. Single configuration: Applied to all nodes

  2. Auto-generated aliases: Sequential numbering

  3. No interaction: Fully automated
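
If a script needs the per-node flexibility of interactive mode, the prompts can be answered over stdin in the order they appear in the interactive example above (config and alias per node, then the final confirmation). A sketch, assuming the CLI reads prompt answers from standard input:

# Two nodes: config + alias for each, then "y" to confirm
printf 'gpu-config\ngpu-worker-1\ncpu-config\ncpu-worker-1\ny\n' \
    | manta_node cluster 2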

Resource Considerations

When starting multiple nodes:

Resource multiplication:

  • Each node consumes configured resources

  • 5 nodes × 2 GB RAM = 10 GB total RAM needed

  • CPU cores shared among nodes
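
A minimal pre-flight sketch that checks the multiplied footprint against available RAM before starting (the 2048 MB per-node figure is an assumed estimate; substitute your config's actual usage):

NODES=5
PER_NODE_MB=2048                                  # assumed per-node footprint
NEED_MB=$((NODES * PER_NODE_MB))
AVAIL_MB=$(free -m | awk '/^Mem:/ {print $7}')    # "available" column

if [ "$AVAIL_MB" -lt "$NEED_MB" ]; then
    echo "Need ${NEED_MB} MB, only ${AVAIL_MB} MB available; aborting"
else
    manta_node cluster start "$NODES" --config production
fi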

Recommended limits:

Nodes | Min RAM | Min CPU   | Recommended
------|---------|-----------|--------------
2     | 8 GB    | 4 cores   | Development
3-5   | 16 GB   | 8 cores   | Testing
5-10  | 32 GB   | 16 cores  | Small cluster
10+   | 64+ GB  | 32+ cores | Production

Safety checks:

  • Warning for >10 nodes

  • Confirmation required for large clusters

  • Resource validation before starting

Cluster Operations

Starting Clusters

Best practices for cluster startup:

  1. Check resources first:

    # Check available resources
    free -h
    nproc
    df -h
    
  2. Verify configurations:

    # List and validate configs
    manta_node config list
    manta_node config validate production
    
  3. Start incrementally:

    # Start small, then scale
    manta_node cluster 2
    manta_node status
    manta_node cluster 3
    

Managing Clusters

Monitor and control cluster nodes:

View cluster status:

# See all nodes including cluster
manta_node status

# Filter cluster nodes
manta_node status | grep cluster

Stop specific cluster nodes:

# Stop individual cluster node
manta_node stop production-cluster-3

# Stop range of nodes
for i in {1..3}; do
    manta_node stop production-cluster-$i
done

Restart cluster:

# Stop all cluster nodes
manta_node cluster stop

# Start fresh cluster
manta_node cluster start 5 --config production

Cluster Patterns

Development Cluster

For local development and testing:

# Create dev cluster with mixed configs
manta_node cluster 3
# Choose: dev, dev, test configs

# Run tests
python run_tests.py

# Clean up
manta_node cluster stop

GPU Cluster

For machine learning workloads:

# Start GPU cluster
manta_node cluster start 4 --config gpu-enabled

# Verify GPU nodes
manta_node status | grep gpu

# Deploy ML tasks
python deploy_training.py

Heterogeneous Cluster

Mixed node types:

# Start with different roles
manta_node cluster 5
# Node 1: gpu-config (trainer)
# Node 2: gpu-config (trainer)
# Node 3: cpu-config (aggregator)
# Node 4: edge-config (data source)
# Node 5: edge-config (data source)

Automation Examples

Bash Script

Automated cluster management:

#!/bin/bash
# cluster_manager.sh

start_cluster() {
    local count=$1
    local config=$2

    echo "Starting cluster of $count nodes..."

    if manta_node cluster start "$count" --config "$config"; then
        echo "Cluster started successfully"
        manta_node status
    else
        echo "Failed to start cluster"
        exit 1
    fi
}

stop_cluster() {
    echo "Stopping cluster..."
    manta_node cluster stop
}

restart_cluster() {
    stop_cluster
    sleep 2
    start_cluster "$@"
}

# Usage
case "$1" in
    start)
        start_cluster "${2:-3}" "${3:-default}"
        ;;
    stop)
        stop_cluster
        ;;
    restart)
        restart_cluster "${2:-3}" "${3:-default}"
        ;;
    *)
        echo "Usage: $0 {start|stop|restart} [count] [config]"
        ;;
esac

Python Script

Programmatic cluster control:

import subprocess
import time

class ClusterManager:
    def __init__(self):
        self.nodes = []

    def start_cluster(self, count, config=None):
        """Start a cluster of nodes."""
        cmd = ['manta_node', 'cluster', 'start', str(count)]
        if config:
            cmd.extend(['--config', config])

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            print(f"Started cluster of {count} nodes")
            self.nodes = self.get_cluster_nodes()
            return True
        else:
            print(f"Failed: {result.stderr}")
            return False

    def stop_cluster(self):
        """Stop all cluster nodes."""
        result = subprocess.run(
            ['manta_node', 'cluster', 'stop'],
            input='y\n',
            capture_output=True,
            text=True
        )
        return result.returncode == 0

    def get_cluster_nodes(self):
        """Get list of cluster nodes."""
        result = subprocess.run(
            ['manta_node', 'status', '--plain'],
            capture_output=True,
            text=True
        )

        nodes = []
        for line in result.stdout.split('\n'):
            if 'cluster' in line and 'Instance:' in line:
                node_id = line.split(': ')[1]
                nodes.append(node_id)

        return nodes

    def scale_cluster(self, new_count, config=None):
        """Scale the cluster to a new size."""
        current = len(self.nodes)

        if new_count > current:
            # Scale up: start the missing nodes as an extra batch.
            # Assumes the new batch's auto-generated names do not
            # collide with the existing cluster nodes.
            additional = new_count - current
            print(f"Scaling up by {additional} nodes")
            self.start_cluster(additional, config)
        elif new_count < current:
            # Scale down: stop the surplus nodes individually.
            for node_id in self.nodes[new_count:]:
                print(f"Stopping {node_id}")
                subprocess.run(['manta_node', 'stop', node_id])
            self.nodes = self.nodes[:new_count]

# Usage
manager = ClusterManager()
manager.start_cluster(5, 'production')
time.sleep(10)
manager.stop_cluster()

Troubleshooting

Cluster Start Failures

Some nodes fail to start:

Starting nodes...
Starting node 1/3: node-1
  ✓ Started node-1
Starting node 2/3: node-2
  ✗ Failed to start node-2: Port already in use
Starting node 3/3: node-3
  ✓ Started node-3

Failed to start 1 node:
  - node-2

Solutions:

  1. Check port conflicts

  2. Verify configurations

  3. Check resource availability

  4. Review log files
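
A quick triage pass over those four checks (ss and free are standard tools; the log path depends on your [logging] settings, so it is left as a comment):

# 1. Port conflicts: list TCP ports already in use
ss -ltn

# 2. Verify the configuration the failed node used
manta_node config validate production

# 3. Resource availability
free -h
nproc

# 4. Review log files (location depends on your logging configuration)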

Resource Exhaustion

System runs out of resources:

# Check resource usage
free -h
top -bn1 | head -20

# Reduce cluster size
manta_node cluster stop
manta_node cluster start 2 --config lightweight

Cluster Node Identification

Can’t distinguish cluster nodes:

# List only cluster nodes
manta_node status | grep '\-cluster\-'

# Get cluster node PIDs
for pid in $(manta_node status --plain | \
             grep cluster | \
             grep -oP 'PID: \K\d+'); do
    echo "Cluster node PID: $pid"
done

Performance Optimization

Cluster Configuration

Optimize for performance:

# cluster-optimized.toml
[tasks]
max_concurrent = 1  # Reduce per node

[resources]
reserve_cpu_percent = 5  # Lower reservation
reserve_memory_mb = 256

[logging]
level = "WARNING"  # Reduce log overhead
log_to_console = false

Load Balancing

Distribute work evenly:

  1. Use similar configurations: Ensures uniform capacity

  2. Monitor node load: Check CPU/memory regularly

  3. Adjust task distribution: Configure task limits

  4. Stagger startup: Add delays between starts
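
If simultaneous startup causes contention, staggering individual node starts is a simple alternative to one bulk cluster start (a sketch; the worker-$i names are illustrative):

# Start nodes one at a time with a pause between starts
for i in {1..5}; do
    manta_node start "worker-$i"
    sleep 5
done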

Resource Isolation

Prevent resource conflicts:

# CPU affinity for nodes
taskset -c 0-3 manta_node start node1 &
taskset -c 4-7 manta_node start node2 &

# Memory limit (MemoryMax supersedes the older MemoryLimit;
# systemd-run --uid requires root)
sudo systemd-run --uid="$USER" \
  --property=MemoryMax=4G \
  manta_node start limited

Best Practices

Development Clusters

  1. Start small: Begin with 2-3 nodes

  2. Use lightweight configs: Reduce resource usage

  3. Quick iteration: Stop/start frequently (see the helper below)

  4. Monitor logs: Watch for errors

  5. Clean shutdown: Always use cluster stop
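
For the stop/start loop in quick iteration, a small helper keeps turnaround short (a sketch using only commands shown above; dev is an assumed config name):

#!/bin/bash
# dev_cycle.sh - tear down and relaunch a small dev cluster
echo y | manta_node cluster stop        # answer the confirmation prompt
sleep 2
manta_node cluster start 2 --config dev
manta_node status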

Production Clusters

  1. Resource planning: Calculate total needs

  2. Gradual scaling: Start nodes incrementally

  3. Health monitoring: Check status regularly

  4. Failover planning: Handle node failures

  5. Maintenance windows: Schedule restarts

Testing Clusters

  1. Consistent configs: Use same config for reproducibility

  2. Automated setup: Script cluster creation

  3. Baseline metrics: Record normal resource usage (see the snippet after this list)

  4. Stress testing: Run at maximum capacity

  5. Clean environment: Stop all nodes between tests
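
For baseline metrics, a timestamped snapshot before and after each test run is usually enough (a sketch with standard tools):

# Record a resource and cluster-state baseline
{
    date
    free -m
    nproc
    manta_node status
} > "baseline-$(date +%Y%m%d-%H%M%S).txt"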

See Also