NAACL 2024

OSCaR

Object State Captioning and State Change Representation for egocentric video. The public release includes training code, inference and evaluation code, a Hugging Face dataset, and model weights for the OSCaR benchmark.

Annotated Segments
14,084
Annotated segments reported in the paper, spanning EPIC-KITCHENS and Ego4D.
Benchmark Videos
500
Human-verified benchmark set with four captions per video.
Fine-Tune Entries
28,308
Image-level LLaVA conversation entries in the local OSCaR fine-tune manifest.
Backbone
LLaVA 1.5
Vicuna 7B and 13B with CLIP ViT-L/336 and LoRA rank 128.
OSCaR teaser showing source datasets, pipeline, and public release artifacts

Release Components

Code

Public training, inference, evaluation, and data-preparation code, packaged with uv-based install instructions.

Dataset

OSCaR assets, manifests, benchmark splits, and metadata prepared for the Hugging Face dataset release.

Weights

Released projector checkpoints and LoRA adapter repos, plus code to build merged models locally if needed.
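The local merge those scripts perform is the standard LoRA update; the following is an illustrative sketch of that arithmetic (toy sizes, not the released merge code):

```python
import numpy as np

# Illustrative LoRA merge sketch (not the released merge script):
# merged weight = base weight + (alpha / r) * B @ A for each adapted layer.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 16, 4, 8   # toy sizes; OSCaR trains with rank 128
W = rng.standard_normal((d_out, d_in)) # frozen base weight
A = rng.standard_normal((r, d_in))     # low-rank factor A
B = np.zeros((d_out, r))               # B is zero-initialized before training
W_merged = W + (alpha / r) * B @ A
# With untrained B the update is a no-op, which is LoRA's starting point.
print(np.allclose(W_merged, W))  # True
```

Because only the small A and B factors are trained, the adapter repos stay far lighter than a full merged checkpoint.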

Most Common Workflows

Run A Released Model

huggingface-cli download ali-vosoughi/oscar-llava-v1.5-13b-oscar-adapter --local-dir ../oscar-llava-v1.5-13b-oscar-adapter

Then load the adapter with --model-base lmsys/vicuna-13b-v1.5.
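As a sketch, the pieces line up like this; the llava.serve.cli entrypoint and flags follow upstream LLaVA conventions, so check INFERENCE.md for the exact command the release documents:

```python
# Hedged sketch: composes the inference invocation without running it.
# llava.serve.cli and its flags are upstream LLaVA conventions, assumed here;
# INFERENCE.md is the authoritative source for the released command.
def inference_command(adapter_dir: str,
                      base: str = "lmsys/vicuna-13b-v1.5") -> list[str]:
    return [
        "python", "-m", "llava.serve.cli",
        "--model-path", adapter_dir,   # downloaded LoRA adapter directory
        "--model-base", base,          # frozen Vicuna base weights
    ]

cmd = inference_command("../oscar-llava-v1.5-13b-oscar-adapter")
```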

Use The Dataset

huggingface-cli download ali-vosoughi/oscar-dataset --repo-type dataset --local-dir ../oscar-dataset

Set DATASET_ROOT=../oscar-dataset and PATH_PREFIX=$DATASET_ROOT/data.

Reproduce Training

Download a released projector repo, point MM_PROJECTOR_PATH to mm_projector.bin, then run the public fine-tune scripts.
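A minimal sketch of that environment wiring, assuming a placeholder directory name for the downloaded projector repo:

```python
import os

# Hedged sketch: point MM_PROJECTOR_PATH at the downloaded checkpoint.
# "../oscar-projector" is a placeholder, not the released repo name.
projector_dir = os.path.join("..", "oscar-projector")
os.environ["MM_PROJECTOR_PATH"] = os.path.join(projector_dir, "mm_projector.bin")
```

The public fine-tune scripts can then read MM_PROJECTOR_PATH from the environment.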


Source Datasets

Provenance

The released OSCaR frames and clip-level assets are derived from EPIC-KITCHENS and Ego4D source videos.

Use The Dataset

Download the Hugging Face dataset locally, set DATASET_ROOT, then use the preserved manifests and splits with the public train and eval scripts.

huggingface-cli download ali-vosoughi/oscar-dataset --repo-type dataset --local-dir ../oscar-dataset

export DATASET_ROOT=../oscar-dataset

export PATH_PREFIX="$DATASET_ROOT/data"

The main entry files are manifests/llava_data.json, splits/data_mapping_final_EK_test.csv, and metadata/segment_index.csv.
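A rough sketch of handling one conversation entry, assuming the common LLaVA llava_data.json layout ("id", "image", "conversations") and a made-up frame path; verify the field names against the released manifest:

```python
import os

# Hedged sketch: a synthetic entry in the common LLaVA manifest layout.
# The field names and frame path are illustrative, not taken from the release.
entry = {
    "id": "demo-0",
    "image": "frames/demo/frame_000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the state change of the onion."},
        {"from": "gpt", "value": "The onion goes from whole to peeled and halved."},
    ],
}
# Relative image paths resolve under PATH_PREFIX (default matches the docs above).
path_prefix = os.environ.get("PATH_PREFIX", "../oscar-dataset/data")
image_path = os.path.join(path_prefix, entry["image"])
```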

See TRAIN.md for the training guide and EVAL.md for the evaluation guide.

Quickstart

Install the public environment with uv:

uv venv --python 3.10 .venv && source .venv/bin/activate && uv pip install -e ".[train,inference,eval,release]"

Training, inference, and evaluation entrypoints are documented in the repository root:

TRAIN.md, INFERENCE.md, EVAL.md.

Authors

Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu

Acknowledgments

OSCaR builds on the LLaVA codebase and training stack. We thank the LLaVA team for their open-source release and the strong baseline it provided for this work.

OSCaR also builds on source video data from EPIC-KITCHENS and Ego4D. We thank those teams for releasing the datasets from which the OSCaR frames and clips are derived.

Approved for public release; distribution is unlimited.

This work has been supported by the Defense Advanced Research Projects Agency (DARPA) under Contract HR00112220003. The content of the information does not necessarily reflect the position of the Government, and no official endorsement should be inferred.

OSCaR is released as a coordinated GitHub + Hugging Face project: code lives on GitHub, while the dataset and weights live on Hugging Face. The code repository intentionally excludes large data and weight payloads.