Dataset Discovery
The photon-mosaic pipeline includes a flexible dataset discovery system that uses regular expressions to find datasets and glob patterns to collect their TIFF files. This lets you process data whose naming conventions or folder structure do not follow the NeuroBlueprint specification.
Configuration
Configure dataset discovery in your config.yaml file:
dataset_discovery:
  # Match directories starting with '2'
  pattern: "^2.*"
  # Find tiffs. Different patterns will correspond to different sessions
  tiff_patterns: ["*.tif"]
  # Exclude specific dataset patterns (folder names) and session patterns
  exclude_datasets:
    - ".*_test$"
  exclude_sessions: []
Key Parameters
pattern: Regular expression that identifies dataset directories.
tiff_patterns: List of glob patterns for TIFF files. Each pattern corresponds to one session (numbered from 0). TIFFs from different sessions are stored in different folders; TIFFs in the same session are analysed together by Suite2p.
exclude_datasets: List of regex patterns for dataset folder names to skip during processing.
exclude_sessions: List of regex patterns for session folder names to skip during processing.
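How these parameters interact can be sketched in Python. This is a minimal illustration, not photon-mosaic's actual implementation: the function name `discover_datasets`, the use of `re.search` on folder names, and the non-recursive `glob` are all assumptions made for the example, and `exclude_sessions` is left out for brevity.

```python
import re
from pathlib import Path


def discover_datasets(base_dir, pattern, tiff_patterns, exclude_datasets=()):
    """Hypothetical helper: map dataset name -> {session index: TIFF names}."""
    dataset_re = re.compile(pattern)
    exclude_res = [re.compile(p) for p in exclude_datasets]
    found = {}
    for folder in sorted(Path(base_dir).iterdir()):
        # Keep only directories whose name matches `pattern` (assumed re.search).
        if not folder.is_dir() or not dataset_re.search(folder.name):
            continue
        # Skip anything matched by an exclusion regex.
        if any(rx.search(folder.name) for rx in exclude_res):
            continue
        # One session per glob pattern, numbered from 0.
        found[folder.name] = {
            index: sorted(p.name for p in folder.glob(glob_pattern))
            for index, glob_pattern in enumerate(tiff_patterns)
        }
    return found
```

Calling `discover_datasets(root, "^2.*", ["*.tif"], [".*_test$"])` on a folder tree would keep directories starting with `2` and drop those ending in `_test`, mirroring the configuration shown earlier.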
Examples
Single Session
dataset_discovery:
  pattern: "^2.*"
  tiff_patterns: ["*.tif"] # Single session with all .tif files
Multiple Sessions
dataset_discovery:
  pattern: "^2.*"
  tiff_patterns: ["session1_*.tif", "session2_*.tif"] # Session 0, Session 1
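To see how a list of glob patterns could split files into numbered sessions, here is a small sketch. The helper `group_by_session` is hypothetical, and the rule that a file goes to the first pattern it matches is an assumption for the example, not documented pipeline behaviour.

```python
from fnmatch import fnmatchcase


def group_by_session(file_names, tiff_patterns):
    """Hypothetical helper: assign each file to the first glob pattern it matches."""
    sessions = {index: [] for index in range(len(tiff_patterns))}
    for name in sorted(file_names):
        for index, glob_pattern in enumerate(tiff_patterns):
            if fnmatchcase(name, glob_pattern):
                sessions[index].append(name)
                break  # assumed: a file belongs to at most one session
    return sessions


files = ["session1_a.tif", "session2_a.tif", "session1_b.tif", "notes.txt"]
grouped = group_by_session(files, ["session1_*.tif", "session2_*.tif"])
```

Files that match no pattern, such as `notes.txt` here, are simply not assigned to any session.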
Date-based Datasets
dataset_discovery:
  pattern: "^2\\d{3}_\\d{2}_\\d{2}" # Match YYYY_MM_DD format
  tiff_patterns: ["*.tif"]
  # Note: regex substitutions have been removed; use `exclude_datasets`/`exclude_sessions` instead
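A pattern like this can be checked against sample folder names before running the pipeline. In a double-quoted YAML string, `\\d` is read as the regex `\d`, so the equivalent Python raw string is used below. The folder names are made up for illustration.

```python
import re

# Same regex as the YAML value "^2\\d{3}_\\d{2}_\\d{2}" after YAML unescaping.
DATE_PATTERN = re.compile(r"^2\d{3}_\d{2}_\d{2}")

# Hypothetical folder names used only to exercise the pattern.
names = ["2024_03_15", "2024_03_15_test", "1999_01_01", "20240315"]
matches = [name for name in names if DATE_PATTERN.search(name)]
```

Because the pattern has no `$` anchor, a name such as `2024_03_15_test` also matches as long as the regex is not applied as a full match; an entry like `".*_test$"` in `exclude_datasets` is one way to filter such folders out.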
Experiment-based Datasets
dataset_discovery:
  pattern: "exp_\\d+"
  tiff_patterns: ["*.tif"]
  # Use `exclude_datasets` or `exclude_sessions` instead of name substitutions
Animal ID-based Datasets
dataset_discovery:
  pattern: "mouse_[A-Z]\\d{3}" # Match mouse IDs like mouse_A123
  tiff_patterns: ["*.tif"]
  # Use `exclude_datasets` or `exclude_sessions` instead of name substitutions
Session-based Datasets
dataset_discovery:
  pattern: "session_\\d{3}" # Match session_001, session_002, etc.
  tiff_patterns: ["*.tif"]
  # Use `exclude_datasets` or `exclude_sessions` instead of name substitutions
Multi-level Directory Structure
dataset_discovery:
  pattern: "subject_\\d+/session_\\d+" # Match subject_1/session_1, etc.
  tiff_patterns: ["*.tif"]
  # Use `exclude_datasets` and `exclude_sessions` to avoid test/backup directories
  exclude_datasets:
    - ".*/test/.*"
    - ".*/backup/.*"
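The following sketch shows one way patterns containing `/` could be evaluated. It assumes the regexes are searched against directory paths relative to the data root, written with forward slashes; how photon-mosaic actually handles nested paths is not stated here, so treat the helper `select_nested` and the sample paths as hypothetical.

```python
import re


def select_nested(relative_dirs, pattern, exclude_datasets=()):
    """Hypothetical helper: keep relative paths matching `pattern`, minus exclusions."""
    dataset_re = re.compile(pattern)
    exclude_res = [re.compile(p) for p in exclude_datasets]
    selected = []
    for rel_path in relative_dirs:
        if not dataset_re.search(rel_path):
            continue  # path does not contain subject_N/session_M
        if any(rx.search(rel_path) for rx in exclude_res):
            continue  # path passes through an excluded directory
        selected.append(rel_path)
    return selected


# Made-up relative paths for illustration.
dirs = [
    "subject_1/session_1",
    "archive/test/subject_1/session_2",
    "old/backup/subject_2/session_1",
    "subject_2/notes",
]
kept = select_nested(
    dirs,
    r"subject_\d+/session_\d+",
    [r".*/test/.*", r".*/backup/.*"],
)
```

Under these assumptions only `subject_1/session_1` is kept: two paths pass through `test` or `backup` directories, and `subject_2/notes` has no session component.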
Complex Pattern Matching
dataset_discovery:
  pattern: "^(?:raw|processed)_\\d{8}_[A-Z]{2}" # Match raw_20240315_AB or processed_20240315_AB
  tiff_patterns: ["*.tif"]
  # Use `exclude_datasets`/`exclude_sessions`; name substitutions are not supported
The discovered datasets are automatically used in the Snakemake workflow.