Dataset Configuration

The [datasets] section defines named dataset mappings that let your node expose local data to running tasks.

Overview

Dataset configuration enables:

  • Local dataset registration through path mappings

  • Named dataset access for tasks

  • Secure read-only data mounting

  • Multiple dataset support per node

Configuration Example

[datasets]
base_path = "/data/datasets"
mount_readonly = true

[datasets.mappings]
mnist = "/data/datasets/mnist"
cifar10 = "/data/datasets/cifar10"
custom_data = "/mnt/storage/project_data"

Configuration Fields

base_path

Type: string

Default: "/data/datasets"

Description: Base directory for dataset organization. Informational only; mappings may point outside it (see custom_data in the example above)

[datasets]
base_path = "/data/datasets"

mount_readonly

Type: boolean

Default: true

Description: Whether to mount datasets as read-only in task containers

Details:

  • true: Prevents tasks from modifying datasets (recommended)

  • false: Allows tasks to write to datasets

# Protect datasets from modification
mount_readonly = true
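
If a task does attempt a write against a read-only mount, the operation fails inside the container. A minimal Python sketch of how task code might probe this; the /datasets/mnist mount point is illustrative:

# Inside a task container: probe whether a mounted dataset is writable.
# The /datasets/mnist path is an illustrative mount point.
import os

probe_file = os.path.join("/datasets/mnist", ".write_probe")

try:
    with open(probe_file, "w") as f:
        f.write("probe")
    os.remove(probe_file)  # Clean up if the mount turned out to be writable
    print("Dataset mount is writable")
except OSError as exc:
    # A read-only mount typically fails with EROFS ("Read-only file system")
    print(f"Dataset mount is read-only: {exc}")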

mappings

Type: table (dictionary)

Default: {} (empty)

Description: Named dataset paths that tasks can access

Format: name = "path"

Details:

  • Maps friendly names to filesystem paths

  • Paths must be absolute

  • Names are used by tasks to request datasets

  • Supports any directory or file path that exists

Examples:

[datasets.mappings]
# Machine learning datasets
mnist = "/data/ml/mnist"
cifar10 = "/data/ml/cifar10"
imagenet = "/data/ml/imagenet"

# Project-specific data
training_data = "/projects/current/train"
validation_data = "/projects/current/val"
test_data = "/projects/current/test"

# Shared resources
pretrained_models = "/models/pretrained"
embeddings = "/data/embeddings"

# Time-series data
sensor_data = "/data/sensors/2024"
logs = "/var/log/application"

How Dataset Mappings Work

Path Mapping

When you configure dataset mappings:

  1. Node registers the dataset paths with their names

  2. Tasks request datasets by name (e.g., "mnist")

  3. Node mounts the local path into the task container

  4. Container sees the dataset at /datasets/<name>

Example Flow

Configuration:

[datasets.mappings]
mnist = "/data/datasets/mnist"
models = "/data/models"

Task container sees:

/datasets/
├── mnist/     # Mounted from /data/datasets/mnist
└── models/    # Mounted from /data/models

Task code accesses:

# In task container
import os

# Access mapped datasets
mnist_path = "/datasets/mnist"
models_path = "/datasets/models"

# List files
mnist_files = os.listdir(mnist_path)
print(f"MNIST files: {mnist_files}")

Common Dataset Configurations

Machine Learning Datasets

Standard ML dataset setup:

[datasets]
mount_readonly = true

[datasets.mappings]
# Popular ML datasets
mnist = "/data/ml/mnist"
cifar10 = "/data/ml/cifar10"
cifar100 = "/data/ml/cifar100"
fashion_mnist = "/data/ml/fashion_mnist"

# Preprocessed versions
mnist_normalized = "/data/ml/processed/mnist_norm"
cifar_augmented = "/data/ml/processed/cifar_aug"

# Model checkpoints
checkpoints = "/data/ml/checkpoints"
pretrained = "/data/ml/pretrained"
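
At runtime a task resolves each of these mappings under /datasets/<name>. As one example, a sketch of resuming from the newest file in the checkpoints mapping; the .pt extension is an assumption, not something the node requires:

# Inside a task container: pick the most recent checkpoint from the
# "checkpoints" mapping, mounted at /datasets/checkpoints.
from pathlib import Path

checkpoint_dir = Path("/datasets/checkpoints")

# Sort by modification time; *.pt is an illustrative naming convention.
checkpoints = sorted(checkpoint_dir.glob("*.pt"), key=lambda p: p.stat().st_mtime)

if checkpoints:
    print(f"Resuming from {checkpoints[-1]}")
else:
    print("No checkpoints found; starting fresh")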

Federated Learning Setup

For federated learning with local data:

[datasets]
mount_readonly = true  # Protect local data

[datasets.mappings]
# Node's local data partition
local_data = "/data/federated/client_001"

# Shared validation set
validation = "/data/federated/validation"

# Local model storage
local_models = "/data/federated/models"
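
Inside a federated task, the local partition and the shared validation set then sit side by side under /datasets. A hypothetical sketch, assuming the partitions are stored as NumPy .npy files (the filenames are illustrative):

# Inside a federated task: load this node's partition and the shared
# validation set by their mapped names.
import numpy as np

# These filenames are assumptions for illustration; match your actual layout.
train_x = np.load("/datasets/local_data/features.npy")
train_y = np.load("/datasets/local_data/labels.npy")
val_x = np.load("/datasets/validation/features.npy")
val_y = np.load("/datasets/validation/labels.npy")

print(f"Local partition: {train_x.shape[0]} samples")
print(f"Validation set: {val_x.shape[0]} samples")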

Development Environment

For development and experimentation:

[datasets]
mount_readonly = false  # Allow writes for development

[datasets.mappings]
# Project structure
raw_data = "/home/user/project/data/raw"
processed = "/home/user/project/data/processed"
features = "/home/user/project/data/features"
results = "/home/user/project/results"

# Development resources
notebooks = "/home/user/project/notebooks"
scripts = "/home/user/project/scripts"

Edge Device Configuration

For resource-constrained edge devices:

[datasets]
mount_readonly = false  # Allow local caching

[datasets.mappings]
# Limited local storage
sensor_buffer = "/storage/sensor_data"
model_cache = "/storage/models"
config = "/storage/config"
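
With mount_readonly = false, a task can treat model_cache as a writable download cache. A minimal download-if-missing sketch; the URL and filename are placeholders:

# Inside an edge task: fetch a model only if it is not already cached
# under the writable "model_cache" mapping.
import urllib.request
from pathlib import Path

model_file = Path("/datasets/model_cache") / "model.onnx"  # Illustrative name
model_url = "https://example.com/models/model.onnx"        # Placeholder URL

if not model_file.exists():
    model_file.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(model_url, str(model_file))
    print(f"Downloaded model to {model_file}")
else:
    print(f"Using cached model at {model_file}")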

Production Deployment

For production environments:

[datasets]
mount_readonly = true  # Always read-only in production

[datasets.mappings]
# Production datasets
production_data = "/mnt/nfs/production/data"
models = "/mnt/nfs/production/models"

# Reference data
reference = "/mnt/nfs/reference"

# Logs for analysis
logs = "/var/log/application"

Validation and Troubleshooting

Validating Dataset Paths

Before starting a node, verify datasets exist:

# Check if dataset paths exist
ls -la /data/datasets/mnist
ls -la /data/datasets/cifar10

# Verify permissions
stat /data/datasets/mnist

# Test read access
head -n 1 /data/datasets/mnist/train.csv
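
The same checks can be scripted as a pre-flight pass over the node config. A sketch, assuming the config lives at ~/.manta/config.toml (adjust to your actual location) and Python 3.11+ for the standard-library tomllib:

# Pre-flight check: verify every dataset mapping in the node config.
import os
import tomllib  # Standard library in Python 3.11+

CONFIG_PATH = os.path.expanduser("~/.manta/config.toml")  # Hypothetical location

with open(CONFIG_PATH, "rb") as f:
    config = tomllib.load(f)

for name, path in config.get("datasets", {}).get("mappings", {}).items():
    if not os.path.isabs(path):
        print(f"[FAIL] {name}: path is not absolute: {path}")
    elif not os.path.exists(path):
        print(f"[FAIL] {name}: path does not exist: {path}")
    elif not os.access(path, os.R_OK):
        print(f"[FAIL] {name}: path is not readable: {path}")
    else:
        print(f"[OK]   {name} -> {path}")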

Common Issues

Dataset not found:

Error: Dataset path does not exist: /data/datasets/mnist

Solution: Ensure the path exists and is accessible:

# Create directory if needed
mkdir -p /data/datasets/mnist

# Check path in configuration
manta_node config show default | grep mnist

Permission denied:

Error: Permission denied accessing dataset

Solution: Fix permissions:

# Check current permissions
ls -la /data/datasets/

# Make readable by node process
chmod -R 755 /data/datasets/

# Fix ownership if needed
sudo chown -R $USER:$USER /data/datasets/

Path not absolute:

Error: Dataset path must be absolute

Solution: Use full paths starting with /:

[datasets.mappings]
# Correct - absolute path
mnist = "/data/datasets/mnist"

# Incorrect - relative path
# mnist = "datasets/mnist"  # Don't use this

Best Practices

  1. Use Descriptive Names

    • Choose clear, meaningful dataset names

    • Use consistent naming conventions

    • Document what each dataset contains

  2. Organize Datasets Logically

    • Group related datasets together

    • Separate raw and processed data

    • Keep different projects isolated

  3. Security Considerations

    • Always use mount_readonly = true in production

    • Don't expose sensitive data unnecessarily

    • Use appropriate file permissions

  4. Path Management

    • Always use absolute paths

    • Verify paths exist before starting nodes

    • Document dataset locations

  5. Version Control

    • Track dataset versions when possible

    • Document dataset changes

    • Keep dataset metadata (see the sketch after this list)
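
For the metadata practice, one hypothetical approach is a small JSON sidecar written next to each dataset; the fields shown are suggestions, not a format the node requires:

# Write a simple metadata sidecar next to a dataset.
# Field names and values are illustrative only.
import datetime
import json

metadata = {
    "name": "mnist",
    "version": "1.0",
    "source": "https://example.com/datasets/mnist",
    "updated": datetime.date.today().isoformat(),
    "notes": "Raw files, unmodified",
}

with open("/data/datasets/mnist/DATASET.json", "w") as f:
    json.dump(metadata, f, indent=2)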

Environment Variables

You can use environment variables in dataset paths:

# Set environment variables
export DATA_ROOT="/mnt/storage"
export PROJECT_DATA="/home/user/project"

[datasets.mappings]
# Reference environment variables
mnist = "${DATA_ROOT}/datasets/mnist"
project = "${PROJECT_DATA}/data"
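
To preview what a variable-based path expands to, Python's os.path.expandvars handles the same ${VAR} syntax:

# Preview environment-variable expansion for a dataset path.
import os

os.environ.setdefault("DATA_ROOT", "/mnt/storage")
print(os.path.expandvars("${DATA_ROOT}/datasets/mnist"))
# -> /mnt/storage/datasets/mnist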
