Dataset Configuration

The [datasets] section defines named dataset mappings that let your node expose local data to running tasks.

Overview

Dataset configuration enables:

  • Local dataset registration through path mappings

  • Named dataset access for tasks

  • Secure read-only data mounting

  • Multiple dataset support per node

Configuration Example

[datasets]
base_path = "/data/datasets"
mount_readonly = true

[datasets.mappings]
mnist = "/data/datasets/mnist"
cifar10 = "/data/datasets/cifar10"
custom_data = "/mnt/storage/project_data"

Configuration Fields

base_path

Type: string

Default: "/data/datasets"

Description: Base directory for dataset organization. Informational only; mappings may point outside it (see custom_data in the example above)

[datasets]
base_path = "/data/datasets"

mount_readonly

Type: boolean

Default: true

Description: Whether to mount datasets as read-only in task containers

Details:

  • true: Prevents tasks from modifying datasets (recommended)

  • false: Allows tasks to write to datasets

# Protect datasets from modification
mount_readonly = true
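
If a task does attempt a write against a read-only mount, the operation fails inside the container. A minimal Python sketch of how task code might probe this; the /datasets/mnist mount point is illustrative:

# Inside a task container: probe whether a mounted dataset is writable.
# The /datasets/mnist path is an illustrative mount point.
import os

probe_file = os.path.join("/datasets/mnist", ".write_probe")

try:
    with open(probe_file, "w") as f:
        f.write("probe")
    os.remove(probe_file)  # Clean up if the mount turned out to be writable
    print("Dataset mount is writable")
except OSError as exc:
    # A read-only mount typically fails with EROFS ("Read-only file system")
    print(f"Dataset mount is read-only: {exc}")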

mappings

Type: table (dictionary)

Default: {} (empty)

Description: Named dataset paths that tasks can access

Format: name = "path"

Details:

  • Maps friendly names to filesystem paths

  • Paths must be absolute

  • Names are used by tasks to request datasets

  • Supports any directory or file path that exists

Examples:

[datasets.mappings]
# Machine learning datasets
mnist = "/data/ml/mnist"
cifar10 = "/data/ml/cifar10"
imagenet = "/data/ml/imagenet"

# Project-specific data
training_data = "/projects/current/train"
validation_data = "/projects/current/val"
test_data = "/projects/current/test"

# Shared resources
pretrained_models = "/models/pretrained"
embeddings = "/data/embeddings"

# Time-series data
sensor_data = "/data/sensors/2024"
logs = "/var/log/application"

How Dataset Mappings Work

Path Mapping

When you configure dataset mappings:

  1. Node registers the dataset paths with their names

  2. Tasks request datasets by name (e.g., "mnist")

  3. Node mounts the local path into the task container

  4. Container sees the dataset at /datasets/<name>

Example Flow

Configuration:

[datasets.mappings]
mnist = "/data/datasets/mnist"
models = "/data/models"

Task container sees:

/datasets/
├── mnist/     # Mounted from /data/datasets/mnist
└── models/    # Mounted from /data/models

Task code accesses:

# In task container
import os

# Access mapped datasets
mnist_path = "/datasets/mnist"
models_path = "/datasets/models"

# List files
mnist_files = os.listdir(mnist_path)
print(f"MNIST files: {mnist_files}")

Common Dataset Configurations

Machine Learning Datasets

Standard ML dataset setup:

[datasets]
mount_readonly = true

[datasets.mappings]
# Popular ML datasets
mnist = "/data/ml/mnist"
cifar10 = "/data/ml/cifar10"
cifar100 = "/data/ml/cifar100"
fashion_mnist = "/data/ml/fashion_mnist"

# Preprocessed versions
mnist_normalized = "/data/ml/processed/mnist_norm"
cifar_augmented = "/data/ml/processed/cifar_aug"

# Model checkpoints
checkpoints = "/data/ml/checkpoints"
pretrained = "/data/ml/pretrained"
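
At runtime a task resolves each of these mappings under /datasets/<name>. As one example, a sketch of resuming from the newest file in the checkpoints mapping; the .pt extension is an assumption, not something the node requires:

# Inside a task container: pick the most recent checkpoint from the
# "checkpoints" mapping, mounted at /datasets/checkpoints.
from pathlib import Path

checkpoint_dir = Path("/datasets/checkpoints")

# Sort by modification time; *.pt is an illustrative naming convention.
checkpoints = sorted(checkpoint_dir.glob("*.pt"), key=lambda p: p.stat().st_mtime)

if checkpoints:
    print(f"Resuming from {checkpoints[-1]}")
else:
    print("No checkpoints found; starting fresh")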

Federated Learning Setup

For federated learning with local data:

[datasets]
mount_readonly = true  # Protect local data

[datasets.mappings]
# Node's local data partition
local_data = "/data/federated/client_001"

# Shared validation set
validation = "/data/federated/validation"

# Local model storage
local_models = "/data/federated/models"
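
Inside a federated task, the local partition and the shared validation set then sit side by side under /datasets. A hypothetical sketch, assuming the partitions are stored as NumPy .npy files (the filenames are illustrative):

# Inside a federated task: load this node's partition and the shared
# validation set by their mapped names.
import numpy as np

# These filenames are assumptions for illustration; match your actual layout.
train_x = np.load("/datasets/local_data/features.npy")
train_y = np.load("/datasets/local_data/labels.npy")
val_x = np.load("/datasets/validation/features.npy")
val_y = np.load("/datasets/validation/labels.npy")

print(f"Local partition: {train_x.shape[0]} samples")
print(f"Validation set: {val_x.shape[0]} samples")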

Development Environment

For development and experimentation:

[datasets]
mount_readonly = false  # Allow writes for development

[datasets.mappings]
# Project structure
raw_data = "/home/user/project/data/raw"
processed = "/home/user/project/data/processed"
features = "/home/user/project/data/features"
results = "/home/user/project/results"

# Development resources
notebooks = "/home/user/project/notebooks"
scripts = "/home/user/project/scripts"

Edge Device Configuration

For resource-constrained edge devices:

[datasets]
mount_readonly = false  # Allow local caching

[datasets.mappings]
# Limited local storage
sensor_buffer = "/storage/sensor_data"
model_cache = "/storage/models"
config = "/storage/config"
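
With mount_readonly = false, a task can treat model_cache as a writable download cache. A minimal download-if-missing sketch; the URL and filename are placeholders:

# Inside an edge task: fetch a model only if it is not already cached
# under the writable "model_cache" mapping.
import urllib.request
from pathlib import Path

model_file = Path("/datasets/model_cache") / "model.onnx"  # Illustrative name
model_url = "https://example.com/models/model.onnx"        # Placeholder URL

if not model_file.exists():
    model_file.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(model_url, str(model_file))
    print(f"Downloaded model to {model_file}")
else:
    print(f"Using cached model at {model_file}")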

Production Deployment

For production environments:

[datasets]
mount_readonly = true  # Always read-only in production

[datasets.mappings]
# Production datasets
production_data = "/mnt/nfs/production/data"
models = "/mnt/nfs/production/models"

# Reference data
reference = "/mnt/nfs/reference"

# Logs for analysis
logs = "/var/log/application"

Validation and Troubleshooting

Validating Dataset Paths

Before starting a node, verify datasets exist:

# Check if dataset paths exist
ls -la /data/datasets/mnist
ls -la /data/datasets/cifar10

# Verify permissions
stat /data/datasets/mnist

# Test read access
head -n 1 /data/datasets/mnist/train.csv
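
The same checks can be scripted as a pre-flight pass over the node config. A sketch, assuming the config lives at ~/.manta/config.toml (adjust to your actual location) and Python 3.11+ for the standard-library tomllib:

# Pre-flight check: verify every dataset mapping in the node config.
import os
import tomllib  # Standard library in Python 3.11+

CONFIG_PATH = os.path.expanduser("~/.manta/config.toml")  # Hypothetical location

with open(CONFIG_PATH, "rb") as f:
    config = tomllib.load(f)

for name, path in config.get("datasets", {}).get("mappings", {}).items():
    if not os.path.isabs(path):
        print(f"[FAIL] {name}: path is not absolute: {path}")
    elif not os.path.exists(path):
        print(f"[FAIL] {name}: path does not exist: {path}")
    elif not os.access(path, os.R_OK):
        print(f"[FAIL] {name}: path is not readable: {path}")
    else:
        print(f"[OK]   {name} -> {path}")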

Common Issues

Dataset not found:

Error: Dataset path does not exist: /data/datasets/mnist

Solution: Ensure the path exists and is accessible:

# Create directory if needed
mkdir -p /data/datasets/mnist

# Check path in configuration
manta_node config show default | grep mnist

Permission denied:

Error: Permission denied accessing dataset

Solution: Fix permissions:

# Check current permissions
ls -la /data/datasets/

# Make readable by node process
chmod -R 755 /data/datasets/

# Fix ownership if needed
sudo chown -R $USER:$USER /data/datasets/

Path not absolute:

Error: Dataset path must be absolute

Solution: Use full paths starting with /:

[datasets.mappings]
# Correct - absolute path
mnist = "/data/datasets/mnist"

# Incorrect - relative path
# mnist = "datasets/mnist"  # Don't use this

Best Practices

  1. Use Descriptive Names

    • Choose clear, meaningful dataset names

    • Use consistent naming conventions

    • Document what each dataset contains

  2. Organize Datasets Logically

    • Group related datasets together

    • Separate raw and processed data

    • Keep different projects isolated

  3. Security Considerations

    • Always use mount_readonly = true in production

    • Don't expose sensitive data unnecessarily

    • Use appropriate file permissions

  4. Path Management

    • Always use absolute paths

    • Verify paths exist before starting nodes

    • Document dataset locations

  5. Version Control

    • Track dataset versions when possible

    • Document dataset changes

    • Keep dataset metadata (see the sketch after this list)
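
For the metadata practice, one hypothetical approach is a small JSON sidecar written next to each dataset; the fields shown are suggestions, not a format the node requires:

# Write a simple metadata sidecar next to a dataset.
# Field names and values are illustrative only.
import datetime
import json

metadata = {
    "name": "mnist",
    "version": "1.0",
    "source": "https://example.com/datasets/mnist",
    "updated": datetime.date.today().isoformat(),
    "notes": "Raw files, unmodified",
}

with open("/data/datasets/mnist/DATASET.json", "w") as f:
    json.dump(metadata, f, indent=2)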

Environment Variables

You can use environment variables in dataset paths:

# Set environment variables
export DATA_ROOT="/mnt/storage"
export PROJECT_DATA="/home/user/project"

[datasets.mappings]
# Reference environment variables
mnist = "${DATA_ROOT}/datasets/mnist"
project = "${PROJECT_DATA}/data"
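
To preview what a variable-based path expands to, Python's os.path.expandvars handles the same ${VAR} syntax:

# Preview environment-variable expansion for a dataset path.
import os

os.environ.setdefault("DATA_ROOT", "/mnt/storage")
print(os.path.expandvars("${DATA_ROOT}/datasets/mnist"))
# -> /mnt/storage/datasets/mnist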
