Dataset Discovery#
The photon-mosaic
pipeline includes a flexible dataset discovery system that uses regex patterns to find and process datasets. This functionality allows users to process datasets with different naming conventions and structures other than the NeuroBlueprint specification.
Configuration#
Configure dataset discovery in your config.yaml
file:
dataset_discovery:
# Match directories starting with '2'
pattern: "^2.*"
# Find tiffs. Different patterns will correspond to different sessions
tiff_patterns: ["*.tif"]
# Exclude specific patterns
exclude_patterns:
- ".*_test$"
# Transform names using regex substitutions
substitutions:
- pattern: "_"
repl: "" # The string "a_b" will be replaced with "ab"
Key Parameters#
pattern: Regular expression to identify dataset directories
tiff_patterns: List of glob patterns for TIFF files. Each pattern corresponds to a session (numbered starting from 0). Tiffs from differest sessions will be stored in different folders. Tiffs in a same session will be analysed together by Suite2p.
exclude_patterns: List of regex patterns for datasets to skip during processing
substitutions: List of regex substitution rules to transform dataset names
Examples#
Single Session#
dataset_discovery:
pattern: "^2.*"
tiff_patterns: ["*.tif"] # Single session with all .tif files
Multiple Sessions#
dataset_discovery:
pattern: "^2.*"
tiff_patterns: ["session1_*.tif", "session2_*.tif"] # Session 0, Session 1
Date-based Datasets#
dataset_discovery:
pattern: "^2\\d{3}_\\d{2}_\\d{2}" # Match YYYY_MM_DD format
tiff_patterns: ["*.tif"]
substitutions:
- pattern: "_"
repl: ""
Experiment-based Datasets#
dataset_discovery:
pattern: "exp_\\d+"
tiff_patterns: ["*.tif"]
substitutions:
- pattern: "exp_(\\d+)"
repl: "experiment\\\\1"
Animal ID-based Datasets#
dataset_discovery:
pattern: "mouse_[A-Z]\\d{3}" # Match mouse IDs like mouse_A123
tiff_patterns: ["*.tif"]
substitutions:
- pattern: "mouse_([A-Z]\\d{3})"
repl: "subject_\\\\1"
Session-based Datasets#
dataset_discovery:
pattern: "session_\\d{3}" # Match session_001, session_002, etc.
tiff_patterns: ["*.tif"]
substitutions:
- pattern: "session_(\\d{3})"
repl: "s\\\\1" # Convert to shorter format like s001
Multi-level Directory Structure#
dataset_discovery:
pattern: "subject_\\d+/session_\\d+" # Match subject_1/session_1, etc.
tiff_patterns: ["*.tif"]
substitutions:
- pattern: "subject_(\\d+)/session_(\\d+)"
repl: "s\\\\1_s\\\\2" # Convert to s1_s1 format
exclude_patterns:
- ".*/test/.*" # Exclude test directories
- ".*/backup/.*" # Exclude backup directories
Complex Pattern Matching#
dataset_discovery:
pattern: "^(?:raw|processed)_\\d{8}_[A-Z]{2}" # Match raw_20240315_AB or processed_20240315_AB
tiff_patterns: ["*.tif"]
substitutions:
- pattern: "^(raw|processed)_(\\d{8})_([A-Z]{2})"
repl: "\\\\2_\\\\3_\\\\1" # Reorder to 20240315_AB_raw
The discovered datasets are automatically used in the Snakemake workflow.