Tags: Collaboration Call, Vision-Language Models, Human Pose Estimation, Efficient Models

Call: Promptable and Compact Human-Centric Foundation Models

Muhammad Saif Ullah Khan

Abstract

This call focuses on promptable human-understanding models that combine vision, language, and pose signals while remaining compact enough for practical deployment.

Scope

This area combines work on promptable pose reasoning, multimodal supervision, and compact model design for human-centric tasks.

The main emphasis is on retaining strong transfer behavior while reducing model size, training cost, and adaptation overhead.

Open Collaboration Tasks

  1. Promptable keypoint reasoning across categories and datasets.
  2. Vision-language adaptation for human activity, posture, and context understanding.
  3. Distillation and parameter-efficient tuning for compact deployment.
  4. Unified evaluation across zero-shot transfer, robustness, and efficiency.

What Is Already Available

  • Preliminary multimodal and zero-shot pipelines from prior work.
  • Existing pose and vision-language baselines that can be adapted.
  • Public datasets and project pages for reproducible starting points.

Collaboration Format

  • Short exploratory studies to validate hypotheses quickly.
  • Joint model-development and benchmarking tracks for promising directions.
  • Co-authored releases and papers once results are stable and reproducible.

If you want to collaborate in this area, reach out at mukh07@dfki.de with a brief note on your preferred track (prompting, multimodal learning, or efficiency optimization).

Maintained by saifkhichi96 on GitHub.
