avid.common.artefact.crawler.DirectoryCrawler
- class avid.common.artefact.crawler.DirectoryCrawler(root_path: str | PathLike, file_functor: Callable | Callable[[], Callable], replace_existing_artefacts: bool = False, n_processes: int = 1, scan_directory_break_delegate: Callable[[DirEntry], bool] | None = None)
Bases: object
Helper class that crawls a directory tree starting from the given root_path.
The crawler assumes that every file found is a potential artefact and calls the provided file functor to interpret each file. If the functor returns an artefact, it is added to the result collection. Crawling is distributed to multiple parallel processes for improved performance on large directory trees.
- Parameters:
root_path – Path to the root directory. All subdirectories will be recursively crawled.
file_functor – A callable or factory for callables which processes each file. If file_functor is a factory, a new callable will be generated for each subdirectory.
replace_existing_artefacts – If True, newly found artefacts will overwrite similar existing ones. If False, duplicates will be dropped.
n_processes – Number of parallel processes to use for crawling.
scan_directory_break_delegate (Optional[Callable[[os.DirEntry], bool]]) – Optional delegate to control directory scanning. It is called for each directory entry; if it returns True, scanning stops for that directory.
Example break delegate for DICOM optimization:
    import os

    def dicom_break_delegate(entry: os.DirEntry) -> bool:
        # Stop scanning subdirs if we found a .dcm file (assumes one series per folder)
        return entry.is_file() and entry.name.endswith('.dcm')
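To illustrate how such a break delegate might interact with directory scanning, here is a minimal, self-contained sketch. It does not use or reproduce DirectoryCrawler's internals: the helper collect_folders is hypothetical, and the assumption that the folder itself is still kept for processing while its subfolders are skipped is ours, based on the comment in the example above.

```python
import os
from typing import Callable, List, Optional


def dicom_break_delegate(entry: os.DirEntry) -> bool:
    # True as soon as a .dcm file is seen (assumes one series per folder)
    return entry.is_file() and entry.name.lower().endswith(".dcm")


def collect_folders(
    root: str,
    break_delegate: Optional[Callable[[os.DirEntry], bool]] = None,
) -> List[str]:
    """Hypothetical helper: list root and its subfolders, honouring the delegate.

    Assumption: when the delegate returns True for any entry of a folder,
    the folder itself is kept but none of its subfolders are visited.
    """
    folders = [root]
    subdirs = []
    with os.scandir(root) as entries:
        for entry in entries:
            if break_delegate is not None and break_delegate(entry):
                # Delegate fired: stop scanning this folder, skip its subfolders
                return folders
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
    # Descend only after the whole folder was scanned without a break
    for sub in subdirs:
        folders.extend(collect_folders(sub, break_delegate))
    return folders
```

Calling collect_folders(some_root, dicom_break_delegate) on a tree where each series folder holds .dcm files directly would return the series folders without descending below them, which is the kind of saving the delegate is meant to enable.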
- __init__(root_path: str | PathLike, file_functor: Callable | Callable[[], Callable], replace_existing_artefacts: bool = False, n_processes: int = 1, scan_directory_break_delegate: Callable[[DirEntry], bool] | None = None)
Methods
__init__(root_path, file_functor[, ...])
getArtefacts() – Execute the crawling operation and return collected artefacts.
Attributes
number_of_last_added – Returns the number of artefacts that were added in the last crawl and neither overwritten nor dropped, i.e. the number of artefacts that ended up in the resulting collection.
number_of_last_dropped – Returns the number of artefacts that were dropped in the last crawl because they duplicated artefacts already found.
number_of_last_irrelevant – Returns the number of irrelevant files (files that did not end up as artefacts) in the last crawl.
number_of_last_overwites – Returns the number of artefacts that were overwritten by a similar artefact in the course of crawling.
- getArtefacts() ArtefactCollection
Execute the crawling operation and return collected artefacts.
This method orchestrates the entire crawling process:
1. Scans directories to find all folders to process
2. Distributes folder processing across multiple processes
3. Collects and merges results while handling duplicates
4. Updates internal statistics
- Returns:
Collection of all discovered artefacts
- Raises:
OSError – If root directory cannot be accessed
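The merging and statistics steps can be pictured with a small standalone sketch. This is not the library's implementation: merge_results, MergeStats, and the use of a plain hashable key to decide whether two artefacts are "similar" are all hypothetical stand-ins, and the counting rules are an assumption derived from the property descriptions on this page.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, Optional, Tuple


@dataclass
class MergeStats:
    # Hypothetical counters mirroring the number_of_last_* properties
    added: int = 0
    dropped: int = 0
    overwritten: int = 0
    irrelevant: int = 0


def merge_results(
    results: Iterable[Optional[Tuple[Hashable, object]]],
    replace_existing: bool = False,
) -> Tuple[Dict[Hashable, object], MergeStats]:
    """Merge (key, artefact) pairs; None stands for a file that yielded no artefact."""
    collection: Dict[Hashable, object] = {}
    stats = MergeStats()
    for result in results:
        if result is None:
            # The functor did not recognise the file as an artefact
            stats.irrelevant += 1
            continue
        key, artefact = result
        if key not in collection:
            collection[key] = artefact
            stats.added += 1
        elif replace_existing:
            # A similar artefact exists: the newer one replaces it
            collection[key] = artefact
            stats.overwritten += 1
        else:
            # A similar artefact exists: the newer one is discarded
            stats.dropped += 1
    return collection, stats
```

Under these assumptions, the added counter always equals the size of the final collection, regardless of how many duplicates were dropped or overwritten along the way.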
- property number_of_last_added
Returns the number of artefacts that were added in the last crawl and neither overwritten nor dropped, i.e. the number of artefacts that ended up in the resulting collection.
- property number_of_last_dropped
Returns the number of artefacts that were dropped in the last crawl because they duplicated artefacts already found.
- property number_of_last_irrelevant
Returns the number of irrelevant files (files that did not end up as artefacts) in the last crawl.
- property number_of_last_overwites
Returns the number of artefacts that were overwritten by a similar artefact in the course of crawling.