Datumaro is:
- a tool to build composite datasets and iterate over them
- a tool to create and maintain datasets
  - Version control of annotations and images
  - Publication (with removal of sensitive information)
  - Editing
  - Joining and splitting
  - Exporting, format changing
  - Image preprocessing
- a dataset storage
- a tool to debug datasets
  - A network can be used to generate informative data subsets (e.g. with false-positives) to be analyzed further
- User interfaces
  - a library
  - a console tool with visualization means
- Targets: single datasets, composite datasets, single images / videos
- Built-in support for well-known annotation formats and datasets: CVAT, COCO, PASCAL VOC, Cityscapes, ImageNet
- Extensibility with user-provided components
- Lightweightness: it should be easy to start working with Datumaro
  - Minimal dependency on environment and configuration
  - Using Datumaro should be easier than writing your own code to compute statistics or manipulate datasets
- Blur sensitive areas on dataset images
- Dataset annotation filters, relabelling etc.
- Dataset augmentation
- Calculation of statistics:
  - Mean & std, custom stats
- "Edit" command to modify annotations
- Versioning (for images, annotations, subsets, sources etc., comparison)
- Documentation generation
- Provision of iterators for user code
- Dataset downloading
- Dataset generation
- Dataset building (export in a specific format, indexation, statistics, documentation)
- Dataset exporting to other formats
- Dataset debugging (run inference, generate dataset slices, compute statistics)
- "Explainable AI" - highlight network attention areas (paper)
  - Black-box approach
    - Classification, Detection, Segmentation, Captioning
  - White-box approach
- exploration of network prediction uncertainty (aka Bayesian approach). Use case: explanation of network "quality", "stability", "certainty"
- adversarial attacks on networks
- dataset minification / reduction. Use case: removal of redundant information to reach the same network quality with less training time
- dataset expansion and filtration of additions. Use case: add only important data
- guidance for key frame selection for tracking (paper). Use case: more effective annotation, better predictions
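
To illustrate the "Mean & std" statistic listed above, here is a minimal sketch of a streaming per-channel computation. It assumes each image is available as an iterable of RGB pixel tuples. The `PixelStats` class and its method names are hypothetical and are not part of Datumaro's API.

```python
import math
from typing import Iterable, List, Sequence, Tuple


class PixelStats:
    """Accumulates per-channel pixel mean and standard deviation.

    Hypothetical helper for illustration only; not a Datumaro class.
    """

    def __init__(self, channels: int = 3) -> None:
        self.count = 0                      # pixels seen so far
        self.total = [0.0] * channels       # per-channel sum
        self.total_sq = [0.0] * channels    # per-channel sum of squares

    def update(self, pixels: Iterable[Sequence[float]]) -> None:
        """Add all pixels of one image to the running totals."""
        for pixel in pixels:
            self.count += 1
            for c, value in enumerate(pixel):
                self.total[c] += value
                self.total_sq[c] += value * value

    def result(self) -> Tuple[List[float], List[float]]:
        """Return (mean, std) per channel, using the population variance."""
        if self.count == 0:
            raise ValueError("no pixels were accumulated")
        mean = [s / self.count for s in self.total]
        std = [
            math.sqrt(max(sq / self.count - m * m, 0.0))
            for sq, m in zip(self.total_sq, mean)
        ]
        return mean, std


# Two tiny "images", each a flat list of RGB pixels.
stats = PixelStats()
stats.update([(0, 0, 0), (2, 4, 6)])
stats.update([(4, 8, 12), (6, 12, 18)])
mean, std = stats.result()
```

Because only running sums are kept, images can be processed one at a time without loading the whole dataset. For very large datasets, a numerically more stable update (for example Welford's algorithm) may be preferable to the sum-of-squares form used here.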
In the first version, Datumaro should be a project manager for CVAT and should consume data only from CVAT. The user can download the collected dataset and operate on it with the Datumaro CLI.
        User
         |
         v
+------------------+
|       CVAT       |
+--------v---------+       +------------------+       +--------------+
| Datumaro module  | ----> | Datumaro project | <---> | Datumaro CLI | <--- User
+------------------+       +------------------+       +--------------+
- Python API for user code
- Installation as a package
- A command-line tool for dataset manipulations
- Dataset format support (reading, writing)
  - Own format
  - CVAT
  - COCO
  - PASCAL VOC
  - YOLO
  - TF Detection API
  - Cityscapes
  - ImageNet
- Dataset visualization (`show`)
  - Ability to visualize a dataset
    - with TensorBoard
- Calculation of statistics for datasets
  - Pixel mean, std
  - Object counts (detection scenario)
  - Image-Class distribution (classification scenario)
  - Pixel-Class distribution (segmentation scenario)
  - Image similarity clusters
  - Custom statistics
- Dataset building
  - Composite dataset building
  - Class remapping
  - Subset splitting
  - Dataset filtering (`extract`)
  - Dataset merging (`merge`)
  - Dataset item editing (`edit`)
- Dataset comparison (`diff`)
  - Annotation-annotation comparison
  - Annotation-inference comparison
  - Annotation quality estimation (for CVAT)
    - Provide a simple method to check annotation quality with a model and generate summary
- Dataset and model debugging
  - Inference explanation (`explain`)
    - Black-box approach (RISE paper)
  - Ability to run a model on a dataset and read the results
- CVAT-integration features
  - Task export
    - Datumaro project export
    - Dataset export
    - Original raw data (images, a video file) can be downloaded (exported) together with annotations, or the export can just contain links to the data on the CVAT server (in the future, support S3, etc.)
      - Be able to use local files instead of remote links
        - Specify cache directory
  - Use case "annotate for model training"
    - create a task
    - annotate
    - export the task
    - convert to a training format
    - train a DL model
  - Use case "annotate - reannotate problematic images - merge"
  - Use case "annotate and estimate quality"
    - create a task
    - annotate
    - estimate quality of annotations
- Dataset publishing
  - Versioning (for annotations, subsets, sources, etc.)
  - Blur sensitive areas on images
  - Tracking of legal information
  - Documentation generation
- Dataset building
  - Dataset minification / Extraction of the most representative subset
    - Use case: generate low-precision calibration dataset
- Dataset and model debugging
  - Training visualization
  - Inference explanation (`explain`)
    - White-box approach
- Lightweightness
- Modularity
- Extensibility
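
The annotation-annotation and annotation-inference comparisons listed under dataset comparison can be illustrated with a small sketch for bounding boxes. It assumes boxes are given as `(x, y, width, height)` tuples and uses greedy matching by intersection-over-union (IoU) with a threshold of 0.5. The function names, the report layout, and the matching strategy are assumptions made for illustration; they do not describe how Datumaro's `diff` command is implemented.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def compare_boxes(
    first: List[Box], second: List[Box], threshold: float = 0.5
) -> Dict[str, list]:
    """Greedily match each box in `first` to the best unused box in `second`."""
    unused = list(range(len(second)))
    matched, only_first = [], []
    for i, box in enumerate(first):
        # Pick the unused candidate with the highest overlap, if any remain.
        best = max(unused, key=lambda j: iou(box, second[j]), default=None)
        if best is not None and iou(box, second[best]) >= threshold:
            matched.append((i, best))
            unused.remove(best)
        else:
            only_first.append(i)
    return {"matched": matched, "only_first": only_first, "only_second": unused}


# Hypothetical data: ground-truth annotations vs. model predictions.
annotations = [(0, 0, 10, 10), (20, 20, 10, 10)]
predictions = [(1, 1, 10, 10), (50, 50, 5, 5)]
report = compare_boxes(annotations, predictions)
```

In this example the first annotation overlaps the first prediction well enough to match, while the second annotation and the second prediction are reported as unmatched. Greedy matching is simple but order-dependent; an optimal assignment (for example the Hungarian algorithm) is an alternative when boxes are densely packed.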