Cluster Command¶
The manta_node cluster command manages multiple node instances as a coordinated group, simplifying deployment of multi-node setups on a single machine.
Overview¶
The cluster command enables:
Starting multiple nodes with one command
Flexible configuration per node
Bulk management of node groups
Simplified testing and development
Easy cluster teardown
Synopsis¶
# Quick start format
manta_node cluster <count>
# Full command format
manta_node cluster start <count> [options]
manta_node cluster stop
Arguments and Options¶
Quick Start¶
manta_node cluster <count>
Quickly start a cluster, prompting for each node’s configuration
count
: Number of nodes to start (e.g., 2, 3, 5)
Configuration is selected interactively for each node.
Start Subcommand¶
manta_node cluster start <count> [options]
Start a cluster of nodes
Arguments:
count
: Number of nodes to start
Options:
--config <name>
: Use the same configuration for all nodes
--count <n>
: Alternative way to specify count (see the example below)
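For example, assuming --count simply substitutes for the positional argument, the following two invocations would be equivalent:
# Both start five nodes with the 'production' configuration
manta_node cluster start 5 --config production
manta_node cluster start --count 5 --config production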
Stop Subcommand¶
manta_node cluster stop
Stop all cluster nodes (nodes with “-cluster-” in their name)
Usage Examples¶
Interactive Cluster Setup¶
Start 3 nodes with individual configuration:
$ manta_node cluster 3
Starting cluster with 3 nodes
You will be prompted to select a configuration for each node.
Available configurations:
1. default
2. gpu-config
3. cpu-config
Node 1/3:
Select configuration for node 1 [default]: gpu-config
Alias for node 1 [gpu-config]: gpu-worker-1
Node 2/3:
Select configuration for node 2 [default]: gpu-config
Alias for node 2 [gpu-config]: gpu-worker-2
Node 3/3:
Select configuration for node 3 [default]: cpu-config
Alias for node 3 [cpu-config]: cpu-worker-1
Cluster configuration summary:
Node 1: config='gpu-config', alias='gpu-worker-1'
Node 2: config='gpu-config', alias='gpu-worker-2'
Node 3: config='cpu-config', alias='cpu-worker-1'
Start cluster with this configuration? [Y/n]: y
Starting nodes...
Starting node 1/3: gpu-worker-1
✓ Started gpu-worker-1
Starting node 2/3: gpu-worker-2
✓ Started gpu-worker-2
Starting node 3/3: cpu-worker-1
✓ Started cpu-worker-1
Successfully started 3 nodes
Uniform Cluster Setup¶
Start multiple nodes with the same configuration:
$ manta_node cluster start 5 --config production
Starting cluster with 5 nodes using config 'production'...
Starting 5 nodes...
Starting node 1/5: production-cluster-1
✓ Started production-cluster-1
Starting node 2/5: production-cluster-2
✓ Started production-cluster-2
Starting node 3/5: production-cluster-3
✓ Started production-cluster-3
Starting node 4/5: production-cluster-4
✓ Started production-cluster-4
Starting node 5/5: production-cluster-5
✓ Started production-cluster-5
Successfully started 5 nodes
Stop Cluster¶
Stop all cluster nodes:
$ manta_node cluster stop
Found 5 cluster nodes:
- production-cluster-1 (PID: 12345)
- production-cluster-2 (PID: 12346)
- production-cluster-3 (PID: 12347)
- production-cluster-4 (PID: 12348)
- production-cluster-5 (PID: 12349)
Stop all cluster nodes? [y/N]: y
Stopping cluster nodes...
✓ Stopped production-cluster-1
✓ Stopped production-cluster-2
✓ Stopped production-cluster-3
✓ Stopped production-cluster-4
✓ Stopped production-cluster-5
All 5 cluster nodes stopped successfully
Cluster Management¶
Node Naming¶
Cluster nodes are named automatically:
With uniform config:
Pattern: <config>-cluster-<number>
Example: production-cluster-1, production-cluster-2
With individual configs:
Uses specified aliases or config names
Maintains user-provided names
Identification:
Cluster nodes contain -cluster- in their name
Makes bulk operations possible (see the sketch below)
Distinguishes from standalone nodes
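Because the marker is a plain substring, bulk operations can be scripted with ordinary text tools. A minimal sketch, assuming the Instance: <name> line format that manta_node status --plain emits (the same format the Python example later on this page parses):
# Stop every node whose name carries the -cluster- marker
# (essentially what `manta_node cluster stop` does for you)
for node in $(manta_node status --plain | grep 'Instance:' | \
              awk -F': ' '{print $2}' | grep -- '-cluster-'); do
    manta_node stop "$node"
done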
Configuration Selection¶
Interactive mode prompts for:
Configuration choice: Select from available configs
Node alias: Optionally override default alias
Confirmation: Review before starting
Uniform mode uses:
Single configuration: Applied to all nodes
Auto-generated aliases: Sequential numbering
No interaction: Fully automated
Resource Considerations¶
When starting multiple nodes:
Resource multiplication:
Each node consumes configured resources
5 nodes × 2GB RAM = 10GB total RAM needed (see the pre-flight sketch below)
CPU cores shared among nodes
Recommended limits:
Nodes | Min RAM | Min CPU   | Recommended for
------|---------|-----------|-----------------
2     | 8 GB    | 4 cores   | Development
3-5   | 16 GB   | 8 cores   | Testing
5-10  | 32 GB   | 16 cores  | Small cluster
10+   | 64+ GB  | 32+ cores | Production
Safety checks:
Warning for >10 nodes
Confirmation required for large clusters
Resource validation before starting
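You can also run a manual pre-flight check before starting. A minimal sketch for Linux (the per-node figure is an assumption; match it to your node configuration):
# Compare N nodes x per-node RAM against the memory actually available
NODES=5
PER_NODE_MB=2048                                    # assumed per-node footprint
NEEDED_MB=$((NODES * PER_NODE_MB))
AVAILABLE_MB=$(free -m | awk '/^Mem:/ {print $7}')  # "available" column of free
if [ "$AVAILABLE_MB" -lt "$NEEDED_MB" ]; then
    echo "Need ${NEEDED_MB} MB for ${NODES} nodes, only ${AVAILABLE_MB} MB available"
fi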
Cluster Operations¶
Starting Clusters¶
Best practices for cluster startup:
Check resources first:
# Check available resources
free -h
nproc
df -h
Verify configurations:
# List and validate configs
manta_node config list
manta_node config validate production
Start incrementally:
# Start small, then scale
manta_node cluster 2
manta_node status
manta_node cluster 3
Managing Clusters¶
Monitor and control cluster nodes:
View cluster status:
# See all nodes including cluster
manta_node status
# Filter cluster nodes
manta_node status | grep cluster
Stop specific cluster nodes:
# Stop individual cluster node
manta_node stop production-cluster-3
# Stop range of nodes
for i in {1..3}; do
manta_node stop production-cluster-$i
done
Restart cluster:
# Stop all cluster nodes
manta_node cluster stop
# Start fresh cluster
manta_node cluster start 5 --config production
Cluster Patterns¶
Development Cluster¶
For local development and testing:
# Create dev cluster with mixed configs
manta_node cluster 3
# Choose: dev, dev, test configs
# Run tests
python run_tests.py
# Clean up
manta_node cluster stop
GPU Cluster¶
For machine learning workloads:
# Start GPU cluster
manta_node cluster start 4 --config gpu-enabled
# Verify GPU nodes
manta_node status | grep gpu
# Deploy ML tasks
python deploy_training.py
Heterogeneous Cluster¶
Mixed node types:
# Start with different roles
manta_node cluster 5
# Node 1: gpu-config (trainer)
# Node 2: gpu-config (trainer)
# Node 3: cpu-config (aggregator)
# Node 4: edge-config (data source)
# Node 5: edge-config (data source)
Automation Examples¶
Bash Script¶
Automated cluster management:
#!/bin/bash
# cluster_manager.sh

start_cluster() {
    local count=$1
    local config=$2
    echo "Starting cluster of $count nodes..."
    if manta_node cluster start "$count" --config "$config"; then
        echo "Cluster started successfully"
        manta_node status
    else
        echo "Failed to start cluster"
        exit 1
    fi
}

stop_cluster() {
    echo "Stopping cluster..."
    # Answer the confirmation prompt so the script does not hang
    printf 'y\n' | manta_node cluster stop
}

restart_cluster() {
    stop_cluster
    sleep 2
    start_cluster "$@"
}

# Usage
case "$1" in
    start)
        start_cluster "${2:-3}" "${3:-default}"
        ;;
    stop)
        stop_cluster
        ;;
    restart)
        restart_cluster "${2:-3}" "${3:-default}"
        ;;
    *)
        echo "Usage: $0 {start|stop|restart} [count] [config]"
        ;;
esac
Python Script¶
Programmatic cluster control:
import subprocess
import time


class ClusterManager:
    def __init__(self):
        self.nodes = []

    def start_cluster(self, count, config=None):
        """Start a cluster of nodes."""
        cmd = ['manta_node', 'cluster', 'start', str(count)]
        if config:
            cmd.extend(['--config', config])
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            print(f"Started cluster of {count} nodes")
            self.nodes = self.get_cluster_nodes()
            return True
        else:
            print(f"Failed: {result.stderr}")
            return False

    def stop_cluster(self):
        """Stop all cluster nodes."""
        # Answer the confirmation prompt with 'y'
        result = subprocess.run(
            ['manta_node', 'cluster', 'stop'],
            input='y\n',
            capture_output=True,
            text=True
        )
        return result.returncode == 0

    def get_cluster_nodes(self):
        """Get the list of cluster node IDs."""
        result = subprocess.run(
            ['manta_node', 'status', '--plain'],
            capture_output=True,
            text=True
        )
        nodes = []
        for line in result.stdout.split('\n'):
            # Cluster nodes carry "-cluster-" in their name
            if 'cluster' in line and 'Instance:' in line:
                node_id = line.split(': ')[1]
                nodes.append(node_id)
        return nodes

    def scale_cluster(self, new_count):
        """Scale cluster to new size."""
        current = len(self.nodes)
        if new_count > current:
            additional = new_count - current
            print(f"Scaling up by {additional} nodes")
            # Start additional nodes
        elif new_count < current:
            remove = current - new_count
            print(f"Scaling down by {remove} nodes")
            # Stop specific nodes


# Usage
manager = ClusterManager()
manager.start_cluster(5, 'production')
time.sleep(10)
manager.stop_cluster()
Troubleshooting¶
Cluster Start Failures¶
Some nodes fail to start:
Starting nodes...
Starting node 1/3: node-1
✓ Started node-1
Starting node 2/3: node-2
✗ Failed to start node-2: Port already in use
Starting node 3/3: node-3
✓ Started node-3
Failed to start 1 nodes:
- node-2
Solutions:
Check port conflicts (see the sketch below)
Verify configurations
Check resource availability
Review log files
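To check for a port conflict before retrying, a quick probe on Linux (the port number here is hypothetical; use whatever the failing node's configuration binds):
PORT=8080   # hypothetical; take it from the failing node's config
if ss -tlnp 2>/dev/null | grep -q ":${PORT} "; then
    echo "Port ${PORT} is already in use:"
    ss -tln | grep ":${PORT} "
fi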
Resource Exhaustion¶
System runs out of resources:
# Check resource usage
free -h
top -bn1 | head -20
# Reduce cluster size
manta_node cluster stop
manta_node cluster start 2 --config lightweight
Cluster Node Identification¶
Can’t distinguish cluster nodes:
# List only cluster nodes
manta_node status | grep '\-cluster\-'
# Get cluster node PIDs
for pid in $(manta_node status --plain | \
             grep cluster | \
             grep -oP 'PID: \K\d+'); do
    echo "Cluster node PID: $pid"
done
Performance Optimization¶
Cluster Configuration¶
Optimize for performance:
# cluster-optimized.toml
[tasks]
max_concurrent = 1 # Reduce per node
[resources]
reserve_cpu_percent = 5 # Lower reservation
reserve_memory_mb = 256
[logging]
level = "WARNING" # Reduce log overhead
log_to_console = false
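Assuming this file is registered under the config name cluster-optimized, start the cluster with it as usual:
# Start the cluster with the tuned configuration
manta_node cluster start 5 --config cluster-optimized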
Load Balancing¶
Distribute work evenly:
Use similar configurations: Ensures uniform capacity
Monitor node load: Check CPU/memory regularly
Adjust task distribution: Configure task limits
Stagger startup: Add delays between starts (see the sketch below)
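A staggered-startup sketch (the node aliases are hypothetical; nodes are backgrounded as in the Resource Isolation example below):
# Give each node a head start before launching the next
for i in 1 2 3; do
    manta_node start "worker-$i" &
    sleep 5   # let the node bind its ports and register before the next starts
done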
Resource Isolation¶
Prevent resource conflicts:
# CPU affinity for nodes
taskset -c 0-3 manta_node start node1 &
taskset -c 4-7 manta_node start node2 &
# Memory limits (systemd-run needs root; MemoryLimit is the legacy name for MemoryMax)
systemd-run --uid=$USER \
    --property=MemoryLimit=4G \
    manta_node start limited
Best Practices¶
Development Clusters¶
Start small: Begin with 2-3 nodes
Use lightweight configs: Reduce resource usage
Quick iteration: Stop/start frequently
Monitor logs: Watch for errors
Clean shutdown: Always use cluster stop (see the sketch below)
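A teardown guard for test runs, as a minimal sketch (the piped 'y' answers the confirmation prompt shown in the Stop Cluster example above):
# Tear the cluster down even if the run is interrupted
trap "printf 'y\n' | manta_node cluster stop" EXIT
manta_node cluster start 3 --config dev
python run_tests.py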
Production Clusters¶
Resource planning: Calculate total needs
Gradual scaling: Start nodes incrementally
Health monitoring: Check status regularly
Failover planning: Handle node failures
Maintenance windows: Schedule restarts
Testing Clusters¶
Consistent configs: Use same config for reproducibility
Automated setup: Script cluster creation
Baseline metrics: Record normal resource usage
Stress testing: Run at maximum capacity
Clean environment: Stop all nodes between tests
See Also¶
Start Command - Start individual nodes
Stop Command - Stop nodes
Status Command - Check cluster status
Configuration Command - Configure nodes
Identity Configuration - Configuration reference