Dataset Configuration¶
The datasets section configures named path mappings that let your node provide local data access to running tasks.
Overview¶
Dataset configuration enables:
Local dataset registration through path mappings
Named dataset access for tasks
Secure read-only data mounting
Multiple dataset support per node
Configuration Example¶
[datasets]
base_path = "/data/datasets"
mount_readonly = true
[datasets.mappings]
mnist = "/data/datasets/mnist"
cifar10 = "/data/datasets/cifar10"
custom_data = "/mnt/storage/project_data"
Configuration Fields¶
base_path¶
Type: string
Default: "/data/datasets"
Description: Base directory for dataset organization (informational)
[datasets]
base_path = "/data/datasets"
mount_readonly¶
Type: boolean
Default: true
Description: Whether to mount datasets as read-only in task containers
Details:
true: Prevents tasks from modifying datasets (recommended)
false: Allows tasks to write to datasets
# Protect datasets from modification
mount_readonly = true
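Inside the task container, this setting is enforced by the mount itself. A minimal sketch of what a task would observe with mount_readonly = true, assuming a dataset mounted at /datasets/mnist:
# In task container: writes to a read-only mount raise OSError
try:
    with open("/datasets/mnist/should_fail.txt", "w") as f:
        f.write("test")
except OSError as e:
    print(f"Write blocked as expected: {e}")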
mappings¶
Type: table (dictionary)
Default: {} (empty)
Description: Named dataset paths that tasks can access
Format: name = "path"
Details:
Maps friendly names to filesystem paths
Paths must be absolute
Names are used by tasks to request datasets
Supports any directory or file path that exists
Examples:
[datasets.mappings]
# Machine learning datasets
mnist = "/data/ml/mnist"
cifar10 = "/data/ml/cifar10"
imagenet = "/data/ml/imagenet"
# Project-specific data
training_data = "/projects/current/train"
validation_data = "/projects/current/val"
test_data = "/projects/current/test"
# Shared resources
pretrained_models = "/models/pretrained"
embeddings = "/data/embeddings"
# Time-series data
sensor_data = "/data/sensors/2024"
logs = "/var/log/application"
How Dataset Mappings Work¶
Path Mapping¶
When you configure dataset mappings:
Node registers the dataset paths with their names
Tasks request datasets by name (e.g., “mnist”)
Node mounts the local path into the task container
Container sees the dataset at /datasets/<name> (see the sketch below)
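As a rough illustration of this translation (not the node's actual implementation), each entry in the mapping table can be thought of as producing one mount spec per dataset. The Docker-style host:container:mode volume strings below are an assumption for illustration only:
# Hypothetical sketch of the name-to-mount translation
mappings = {
    "mnist": "/data/datasets/mnist",
    "models": "/data/models",
}

def build_mounts(mappings, readonly=True):
    mode = "ro" if readonly else "rw"
    # Docker-style volume spec: host_path:container_path:mode
    return [f"{host}:/datasets/{name}:{mode}" for name, host in mappings.items()]

print(build_mounts(mappings))
# ['/data/datasets/mnist:/datasets/mnist:ro', '/data/models:/datasets/models:ro']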
Example Flow¶
Configuration:
[datasets.mappings]
mnist = "/data/datasets/mnist"
models = "/data/models"
Task container sees:
/datasets/
├── mnist/ # Mounted from /data/datasets/mnist
└── models/ # Mounted from /data/models
Task code accesses:
# In task container
import os
# Access mapped datasets
mnist_path = "/datasets/mnist"
models_path = "/datasets/models"
# List files
mnist_files = os.listdir(mnist_path)
print(f"MNIST files: {mnist_files}")
Common Dataset Configurations¶
Machine Learning Datasets¶
Standard ML dataset setup:
[datasets]
mount_readonly = true
[datasets.mappings]
# Popular ML datasets
mnist = "/data/ml/mnist"
cifar10 = "/data/ml/cifar10"
cifar100 = "/data/ml/cifar100"
fashion_mnist = "/data/ml/fashion_mnist"
# Preprocessed versions
mnist_normalized = "/data/ml/processed/mnist_norm"
cifar_augmented = "/data/ml/processed/cifar_aug"
# Model checkpoints
checkpoints = "/data/ml/checkpoints"
pretrained = "/data/ml/pretrained"
Federated Learning Setup¶
For federated learning with local data:
[datasets]
mount_readonly = true # Protect local data
[datasets.mappings]
# Node's local data partition
local_data = "/data/federated/client_001"
# Shared validation set
validation = "/data/federated/validation"
# Local model storage
local_models = "/data/federated/models"
Development Environment¶
For development and experimentation:
[datasets]
mount_readonly = false # Allow writes for development
[datasets.mappings]
# Project structure
raw_data = "/home/user/project/data/raw"
processed = "/home/user/project/data/processed"
features = "/home/user/project/data/features"
results = "/home/user/project/results"
# Development resources
notebooks = "/home/user/project/notebooks"
scripts = "/home/user/project/scripts"
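Because mount_readonly = false here, tasks can write back through these mappings. A minimal sketch, assuming the results mapping above (the file name and metric values are hypothetical):
# In task container: mapped paths are writable in this configuration
import json

results = {"accuracy": 0.93, "epoch": 10}  # hypothetical example values
with open("/datasets/results/run_001.json", "w") as f:
    json.dump(results, f)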
Edge Device Configuration¶
For resource-constrained edge devices:
[datasets]
mount_readonly = false # Allow local caching
[datasets.mappings]
# Limited local storage
sensor_buffer = "/storage/sensor_data"
model_cache = "/storage/models"
config = "/storage/config"
Production Deployment¶
For production environments:
[datasets]
mount_readonly = true # Always read-only in production
[datasets.mappings]
# Production datasets
production_data = "/mnt/nfs/production/data"
models = "/mnt/nfs/production/models"
# Reference data
reference = "/mnt/nfs/reference"
# Logs for analysis
logs = "/var/log/application"
Directory Structure Examples¶
Recommended Organization¶
Organize your datasets for clarity:
/data/
├── datasets/ # Raw datasets
│ ├── mnist/
│ │ ├── train/
│ │ ├── test/
│ │ └── README.md
│ ├── cifar10/
│ │ ├── train/
│ │ ├── test/
│ │ └── README.md
│ └── custom/
│ ├── data.csv
│ └── metadata.json
├── models/ # Model files
│ ├── checkpoints/
│ └── production/
└── processed/ # Preprocessed data
├── normalized/
└── augmented/
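To bootstrap this layout on a node host, a one-off script such as the following sketch can create the directories (the paths mirror the tree above):
# Create the recommended directory layout (run once on the node host)
import os

layout = [
    "datasets/mnist/train", "datasets/mnist/test",
    "datasets/cifar10/train", "datasets/cifar10/test",
    "datasets/custom",
    "models/checkpoints", "models/production",
    "processed/normalized", "processed/augmented",
]
for d in layout:
    os.makedirs(os.path.join("/data", d), exist_ok=True)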
Validation and Troubleshooting¶
Validating Dataset Paths¶
Before starting a node, verify datasets exist:
# Check if dataset paths exist
ls -la /data/datasets/mnist
ls -la /data/datasets/cifar10
# Verify permissions
stat /data/datasets/mnist
# Test read access
head -n 1 /data/datasets/mnist/train.csv
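The same checks can be scripted as a pre-flight step. The sketch below assumes the node configuration lives in a TOML file (the /etc/manta_node/config.toml path is only an example; substitute your actual config location) and uses tomllib, which requires Python 3.11+:
# Pre-flight check: verify every configured mapping before starting the node
import os
import tomllib  # Python 3.11+

with open("/etc/manta_node/config.toml", "rb") as f:  # example path
    config = tomllib.load(f)

for name, path in config.get("datasets", {}).get("mappings", {}).items():
    if not os.path.isabs(path):
        print(f"{name}: path is not absolute: {path}")
    elif not os.path.exists(path):
        print(f"{name}: path does not exist: {path}")
    elif not os.access(path, os.R_OK):
        print(f"{name}: path is not readable: {path}")
    else:
        print(f"{name}: OK ({path})")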
Common Issues¶
Dataset not found:
Error: Dataset path does not exist: /data/datasets/mnist
Solution: Ensure the path exists and is accessible:
# Create directory if needed
mkdir -p /data/datasets/mnist
# Check path in configuration
manta_node config show default | grep mnist
Permission denied:
Error: Permission denied accessing dataset
Solution: Fix permissions:
# Check current permissions
ls -la /data/datasets/
# Make readable by node process
chmod -R 755 /data/datasets/
# Fix ownership if needed
sudo chown -R $USER:$USER /data/datasets/
Path not absolute:
Error: Dataset path must be absolute
Solution: Use full paths starting with /:
[datasets.mappings]
# Correct - absolute path
mnist = "/data/datasets/mnist"
# Incorrect - relative path
# mnist = "datasets/mnist" # Don't use this
Best Practices¶
Use Descriptive Names
Choose clear, meaningful dataset names
Use consistent naming conventions
Document what each dataset contains
Organize Datasets Logically
Group related datasets together
Separate raw and processed data
Keep different projects isolated
Security Considerations
Always use mount_readonly = true in production
Don't expose sensitive data unnecessarily
Use appropriate file permissions
Path Management
Always use absolute paths
Verify paths exist before starting nodes
Document dataset locations
Version Control
Track dataset versions when possible
Document dataset changes
Keep dataset metadata
Environment Variables¶
You can use environment variables in dataset paths:
# Set environment variable
export DATA_ROOT="/mnt/storage"
export PROJECT_DATA="/home/user/project"
[datasets.mappings]
# Reference environment variables
mnist = "${DATA_ROOT}/datasets/mnist"
project = "${PROJECT_DATA}/data"
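To preview how a path will expand before starting the node, os.path.expandvars performs a comparable ${VAR} substitution (whether this matches the node's expansion exactly is an assumption); note that it leaves unset variables untouched rather than failing:
# Preview environment-variable expansion in a mapped path
import os

os.environ.setdefault("DATA_ROOT", "/mnt/storage")
print(os.path.expandvars("${DATA_ROOT}/datasets/mnist"))
# /mnt/storage/datasets/mnist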
See Also¶
Configuration Command - Configuration management
Task Configuration - Task execution settings
Identity Configuration - Node identity configuration
Start Command - Starting nodes with datasets