This repository contains Dockerfiles and Python scripts to build and run LLaVA models for image-based wildfire detection. It uses variants of `vip-llava-13b-hf` and `llava-1.5-7b-hf` that can be run in 4-bit or 8-bit precision on NVIDIA GPUs, enabling faster inference while conserving GPU memory.
- Overview
- Contents of This Repository
- How to Build and Run
- File-Specific Usage Notes
- Directory Structure
- Contributing
- License
- Acknowledgments
## Overview

The goal of this project is to detect indications of wildfire (e.g., smoke or flames) from images using LLaVA-based vision-language models (a minimal loading-and-inference sketch follows the list below). The scripts can:
- Classify images as likely containing wildfire smoke/fire or not.
- Split the input image (e.g., into an optical camera image and an IR pseudo-color image) and process each half separately.
- Generate outputs and internal hidden representations for advanced analysis.
- Save checkpoints and outputs to persistent storage for further review.
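
To make this concrete, the snippet below is a minimal sketch of loading one of these models in 4-bit precision with `bitsandbytes` and asking a single yes/no question. It illustrates the standard `transformers` workflow rather than reproducing the repository's scripts; the image path and prompt wording are placeholders.

```python
# Minimal sketch: load llava-1.5-7b-hf in 4-bit and ask a single yes/no question.
# Illustrative of the general workflow; not the exact code used by these scripts.
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # switch to load_in_8bit=True for 8-bit
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 to save memory
)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                     # place the quantized weights on the GPU(s)
)

image = Image.open("/images/example.jpg")  # placeholder file name
prompt = "USER: <image>\nDoes this image contain wildfire smoke or fire? Answer yes or no. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```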
## Contents of This Repository

- `Dockerfile`
  - Based on `nvcr.io/nvidia/pytorch:24.01-py3`.
  - Installs the required libraries from `requirements.txt`.
  - Sets environment variables for the Hugging Face Transformers cache.
  - Copies all repository files into the container.
  - (Commented-out lines show examples of alternative base images and usage hints.)
- `Dockerfile_polaris`
  - Similar to `Dockerfile`, but uses `nvcr.io/nvidia/pytorch:22.06-py3` as the base image.
  - Installs libraries specified in `requirements_polaris.txt`.
- `requirements.txt`
  - Contains the standard Python dependencies: `numpy transformers[torch] accelerate bitsandbytes`.
- `requirements_polaris.txt`
  - Contains additional dependencies for the Polaris environment: `numpy transformers[torch] accelerate bitsandbytes sentencepiece mpi4py`.
Below is a quick summary of each script. All scripts assume the `/images` folder contains the images to process, and `/RESULTS` is where outputs are stored. Many of these scripts share a similar structure: they load a model, loop over images, run inference, and save results/checkpoints (a simplified sketch of this loop appears after the per-script notes below).
- `konza_run_vip-llava-13b-hf_model.py`
  - Uses the VIP LLaVA 13B model (`llava-hf/vip-llava-13b-hf`) to analyze images for wildfire indicators.
  - Splits the image into two halves (RGB + IR), processes them separately, and logs results to `/RESULTS` (see the sketch after this list).
- `run_llava-1.5-7b-hf_model.py`
  - Uses `llava-hf/llava-1.5-7b-hf` for classification.
  - Scans the `/images` directory, classifies each image with a single prompt, and writes results in CSV format to `/RESULTS`.
  - Copies images predicted to contain fire into `/RESULTS/FIRE_IMAGES/`.
- `run_vip-llava-13b-hf_model.py`
  - Similar to `konza_run_vip-llava-13b-hf_model.py`, but with different prompts and logic flow.
  - If it detects smoke (or potential wildfire), it copies the file to `/RESULTS/FIRE_IMAGES/` and logs extended details for those images.
- `run_internal_llava-1.5-7b-hf_model.py`
  - Also uses `llava-hf/llava-1.5-7b-hf`.
  - In addition to generating a textual response, it captures and saves internal hidden-state representations from the model.
  - Writes them out as tensors (`.pt` files) for later analysis.
- `run_internal_vip-llava-13b-hf_model.py`
  - Similar to `run_internal_llava-1.5-7b-hf_model.py`, but based on the VIP LLaVA 13B model.
  - Captures internal feature representations for the original, RGB-cropped, and IR-cropped images.
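
For the scripts that split each frame and capture internal representations (`konza_run_vip-llava-13b-hf_model.py` and the `run_internal_*` scripts), the core idea looks roughly like the sketch below. This is a hedged illustration, not the repository's actual code: the crop assumes the optical and IR views sit side by side (left/right), the prompt wording is a placeholder, and `model`/`processor` are the objects from the loading sketch in the Overview section.

```python
# Hedged sketch: crop the two halves, run a forward pass with
# output_hidden_states=True, and save the last hidden layer as a .pt tensor.
# `model` and `processor` come from the loading sketch in the Overview above.
import torch
from PIL import Image

def split_halves(image_path: str):
    """Assumes the optical view is the left half and the IR view the right half."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    return img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))

def save_hidden_states(image, tag: str, out_dir: str = "/RESULTS"):
    prompt = "USER: <image>\nIs there any wildfire smoke or fire in this image? ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1].squeeze(0).cpu()  # (sequence_length, hidden_dim)
    torch.save(last_hidden, f"{out_dir}/{tag}_hidden_states.pt")

rgb_half, ir_half = split_halves("/images/example.jpg")  # placeholder file name
save_hidden_states(rgb_half, "example_rgb")
save_hidden_states(ir_half, "example_ir")
```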
Each script includes functions to:
- Load existing checkpoints and partial outputs (so you can resume from the last processed image).
- Save updated checkpoints, textual results, and internal representations.
- Perform inference on the entire image or on specific sub-regions (RGB vs. IR).
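
As a rough illustration of that shared load → loop → infer → checkpoint structure, a simplified version might look like the following. It is not the repository's exact code: `run_inference` is a placeholder for the real model call, and the file names simply match the defaults mentioned in this README (`/images`, `/RESULTS/output.csv`, `/RESULTS/checkpoint.txt`).

```python
# Hedged sketch of the shared process-and-checkpoint loop; simplified, not the
# repository's exact code. run_inference() stands in for the real model call.
import csv
import os

IMAGE_DIR, RESULTS_DIR = "/images", "/RESULTS"
CHECKPOINT = os.path.join(RESULTS_DIR, "checkpoint.txt")

def run_inference(image_path: str) -> str:
    """Placeholder for the real LLaVA call (see the loading sketch above)."""
    return "no"

# Resume: collect the names of images that were already processed.
done = set()
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        done = {line.strip() for line in f if line.strip()}

with open(os.path.join(RESULTS_DIR, "output.csv"), "a", newline="") as out_file, \
     open(CHECKPOINT, "a") as ckpt:
    writer = csv.writer(out_file)
    for name in sorted(os.listdir(IMAGE_DIR)):
        if not name.lower().endswith(".jpg") or name in done:
            continue
        answer = run_inference(os.path.join(IMAGE_DIR, name))
        writer.writerow([name, answer])
        ckpt.write(name + "\n")  # checkpoint after every image so a rerun can resume
        out_file.flush()
        ckpt.flush()
```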
## How to Build and Run

To build the Docker image from the `Dockerfile` (using the `nvcr.io/nvidia/pytorch:24.01-py3` base), run:

```bash
# From the repository root:
docker build -t hfsandbox:latest -f Dockerfile .
```
If you want to use the Polaris-compatible image (using the `nvcr.io/nvidia/pytorch:22.06-py3` base):

```bash
docker build -t hfsandbox:polaris -f Dockerfile_polaris .
```
You can run the container with GPU access and mount local directories for images, the HF cache, and results:

```bash
sudo docker run --gpus all -it --rm \
  -v /path/to/images:/images \
  -v /path/to/huggingface/cache:/hf_cache \
  -v /path/to/results:/RESULTS \
  hfsandbox:latest
```
Within the container, you can run any of the scripts (e.g., `konza_run_vip-llava-13b-hf_model.py`) to process `/images` and store outputs in `/RESULTS`.
## File-Specific Usage Notes

If you plan to run this on the Polaris supercomputer (or another HPC environment) where you need `mpi4py` and a slightly older CUDA/PyTorch stack, use the container built from `Dockerfile_polaris` and the `requirements_polaris.txt` file.

- `requirements_polaris.txt` includes `mpi4py` and `sentencepiece`, which might be unnecessary for local runs but required on HPC systems like Polaris (a sharding sketch follows this list).
- In each script, you can adjust the prompts or classification logic (e.g., changing from "Is there fire?" to "Is there smoke?") according to your application needs.
- By default, the scripts look for `checkpoint.txt` in `/RESULTS/` to detect already processed files. If you remove or rename it, the script will reprocess everything from scratch.
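
On multi-node or multi-GPU HPC runs, `mpi4py` is typically used to shard the image list across ranks. The sketch below is only an illustrative pattern under that assumption; the repository's scripts may distribute work differently.

```python
# Illustrative only: split the /images workload across MPI ranks with mpi4py.
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

images = sorted(f for f in os.listdir("/images") if f.lower().endswith(".jpg"))
my_images = images[rank::size]  # each rank takes every `size`-th image
print(f"rank {rank}/{size}: {len(my_images)} images to process")
```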
## Directory Structure

A typical layout after cloning this repository and building the container could look like:
```text
.
├── Dockerfile
├── Dockerfile_polaris
├── requirements.txt
├── requirements_polaris.txt
├── konza_run_vip-llava-13b-hf_model.py
├── run_internal_llava-1.5-7b-hf_model.py
├── run_internal_vip-llava-13b-hf_model.py
├── run_llava-1.5-7b-hf_model.py
├── run_vip-llava-13b-hf_model.py
├── README.md
└── ...
```
- `/images` (mounted at runtime)
  - The directory containing images (`.jpg`) for inference.
- `/RESULTS` (mounted at runtime)
  - Where the scripts will store results (`output.csv`), checkpoints (`checkpoint.txt`), subfolders like `FIRE_IMAGES/`, and `.pt` files with hidden states if applicable (a loading example follows this list).
- `/hf_cache` (mounted at runtime)
  - Caches model weights from Hugging Face so you don't redownload them each time.
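
If you later want to analyze the saved hidden-state tensors offline, they can be read back with `torch.load`. The file name below is only an example of what the `run_internal_*` scripts might produce; check the scripts for the actual naming scheme.

```python
# Load a saved hidden-state tensor for offline analysis
# (the file name is illustrative; actual names depend on the script that wrote it).
import torch

states = torch.load("/RESULTS/example_rgb_hidden_states.pt", map_location="cpu")
print(type(states))
if torch.is_tensor(states):
    print(states.shape, states.dtype)  # e.g. (sequence_length, hidden_dim)
```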
## Contributing

If you want to contribute:
- Fork this repository.
- Create a new feature branch.
- Make your changes and test them.
- Submit a pull request describing your changes.
Feel free to open an issue if you find bugs or want to request a feature!
## License

This project does not have a specified license. Please note that the underlying LLaVA and VIP-LLaVA models have their own licenses; refer to their respective repositories for more details.
## Acknowledgments

- LLaVA team for open-sourcing the vision-language alignment models.
- Hugging Face for hosting model checkpoints.
- NVIDIA NGC for providing base Docker images with PyTorch support.