NAACL 2024

OSCaR

Object State Captioning and State Change Representation for egocentric video. The public release includes training code, inference and evaluation code, a Hugging Face dataset, and model weights for the OSCaR benchmark.

Annotated Segments
14,084
Annotated segments reported in the paper, spanning EPIC-KITCHENS and Ego4D.
Benchmark Videos
500
Human-verified benchmark set with four captions per video.
Fine-Tune Entries
28,308
Image-level LLaVA conversation entries in the local OSCaR fine-tune manifest.
Backbone
LLaVA 1.5
Vicuna 7B and 13B with CLIP ViT-L/336 and LoRA rank 128.
OSCaR teaser showing source datasets, pipeline, and public release artifacts

Release Components

Code

Public training, inference, evaluation, and data-preparation code, packaged with uv-based install instructions.

Dataset

OSCaR assets, manifests, benchmark splits, and metadata prepared for the Hugging Face dataset release.

Weights

Released projector checkpoints and LoRA adapter repos, plus code to build merged models locally if needed.
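The local merge those scripts perform is the standard LoRA update; the following is an illustrative sketch of that arithmetic (toy sizes, not the released merge code):

```python
import numpy as np

# Illustrative LoRA merge sketch (not the released merge script):
# merged weight = base weight + (alpha / r) * B @ A for each adapted layer.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 16, 4, 8   # toy sizes; OSCaR trains with rank 128
W = rng.standard_normal((d_out, d_in)) # frozen base weight
A = rng.standard_normal((r, d_in))     # low-rank factor A
B = np.zeros((d_out, r))               # B is zero-initialized before training
W_merged = W + (alpha / r) * B @ A
# With untrained B the update is a no-op, which is LoRA's starting point.
print(np.allclose(W_merged, W))  # True
```

Because only the small A and B factors are trained, the adapter repos stay far lighter than a full merged checkpoint.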

Most Common Workflows

Run A Released Model

huggingface-cli download ali-vosoughi/oscar-llava-v1.5-13b-oscar-adapter --local-dir ../oscar-llava-v1.5-13b-oscar-adapter

Then load the adapter with --model-base lmsys/vicuna-13b-v1.5.
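As a sketch, the pieces line up like this; the llava.serve.cli entrypoint and flags follow upstream LLaVA conventions, so check INFERENCE.md for the exact command the release documents:

```python
# Hedged sketch: composes the inference invocation without running it.
# llava.serve.cli and its flags are upstream LLaVA conventions, assumed here;
# INFERENCE.md is the authoritative source for the released command.
def inference_command(adapter_dir: str,
                      base: str = "lmsys/vicuna-13b-v1.5") -> list[str]:
    return [
        "python", "-m", "llava.serve.cli",
        "--model-path", adapter_dir,   # downloaded LoRA adapter directory
        "--model-base", base,          # frozen Vicuna base weights
    ]

cmd = inference_command("../oscar-llava-v1.5-13b-oscar-adapter")
```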

Use The Dataset

huggingface-cli download ali-vosoughi/oscar-dataset --repo-type dataset --local-dir ../oscar-dataset

Set DATASET_ROOT=../oscar-dataset and PATH_PREFIX=$DATASET_ROOT/data.

Reproduce Training

Download a released projector repo, point MM_PROJECTOR_PATH to mm_projector.bin, then run the public fine-tune scripts.
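A minimal sketch of that environment wiring, assuming a placeholder directory name for the downloaded projector repo:

```python
import os

# Hedged sketch: point MM_PROJECTOR_PATH at the downloaded checkpoint.
# "../oscar-projector" is a placeholder, not the released repo name.
projector_dir = os.path.join("..", "oscar-projector")
os.environ["MM_PROJECTOR_PATH"] = os.path.join(projector_dir, "mm_projector.bin")
```

The public fine-tune scripts can then read MM_PROJECTOR_PATH from the environment.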


Source Datasets

Provenance

The released OSCaR frames and clip-level assets are derived from EPIC-KITCHENS and Ego4D source videos.

Use The Dataset

Download the Hugging Face dataset locally, set DATASET_ROOT, then use the preserved manifests and splits with the public train and eval scripts.

huggingface-cli download ali-vosoughi/oscar-dataset --repo-type dataset --local-dir ../oscar-dataset

export DATASET_ROOT=../oscar-dataset

export PATH_PREFIX="$DATASET_ROOT/data"

The main entry files are manifests/llava_data.json, splits/data_mapping_final_EK_test.csv, and metadata/segment_index.csv.
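A rough sketch of handling one conversation entry, assuming the common LLaVA llava_data.json layout ("id", "image", "conversations") and a made-up frame path; verify the field names against the released manifest:

```python
import os

# Hedged sketch: a synthetic entry in the common LLaVA manifest layout.
# The field names and frame path are illustrative, not taken from the release.
entry = {
    "id": "demo-0",
    "image": "frames/demo/frame_000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the state change of the onion."},
        {"from": "gpt", "value": "The onion goes from whole to peeled and halved."},
    ],
}
# Relative image paths resolve under PATH_PREFIX (default matches the docs above).
path_prefix = os.environ.get("PATH_PREFIX", "../oscar-dataset/data")
image_path = os.path.join(path_prefix, entry["image"])
```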

See TRAIN.md for the training guide and EVAL.md for the evaluation guide.

Quickstart

Install the public environment with uv:

uv venv --python 3.10 .venv && source .venv/bin/activate && uv pip install -e ".[train,inference,eval,release]"

Training, inference, and evaluation entrypoints are documented in the repository root:

TRAIN.md, INFERENCE.md, EVAL.md.

Authors

Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu

Acknowledgments

OSCaR builds on the LLaVA codebase and training stack. We thank the LLaVA team for their open-source release and the strong baseline it provided for this work.

OSCaR also builds on source video data from EPIC-KITCHENS and Ego4D. We thank those teams for releasing the datasets from which the OSCaR frames and clips are derived.

Approved for public release; distribution is unlimited.

This work has been supported by the Defense Advanced Research Projects Agency (DARPA) under Contract HR00112220003. The content of the information does not necessarily reflect the position of the Government, and no official endorsement should be inferred.

OSCaR is released as a coordinated GitHub + Hugging Face project: code lives on GitHub, while the dataset and weights live on Hugging Face. The code repository intentionally excludes large data and weight payloads.