ICLR 2026 Submission

DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs

Xuan Qi, Luxi He, Dan Roth, Xingyu Fu

DataProphet predicts which supervision datasets will help a target benchmark before any training. It combines multimodal similarity, perplexity, and diversity into a training-free transfer score.

At a glance:

  • 14 source/target datasets
  • 7 task families
  • 0.860 average Kendall's tau
  • +6.9% synthetic data gain

Figure: DataProphet overview illustration.

Abstract

Conventional data selection for multimodal LLMs often follows intuitive task similarity, but this paper shows that intuition is unreliable for predicting transfer gains. The authors evaluate 14 vision-language datasets across 7 task families and find that influence is asymmetric and dataset-specific. DataProphet introduces a simple training-free metric that integrates question/answer/image similarity, multimodal perplexity, and source diversity to predict transfer ranking. The predicted rankings strongly correlate with actual fine-tuning outcomes (Kendall's tau 0.860), and DataProphet-guided selection improves average performance over uniform and training-based baselines under fixed compute budgets.

Core Takeaways

Task similarity is a weak proxy

OCR supervision can improve spatial reasoning more than chart-focused supervision does. Transfer cannot be inferred from high-level task labels alone.

Influence is directional

Transfer gains are asymmetric: for a dataset pair (s, t), Δ(s->t) and Δ(t->s) can differ substantially.

Training-free selection can win

DataProphet reaches +3.4% average improvement on real-data reweighting and +6.9% on synthetic data selection versus uniform sampling.

Data Influence Analysis

Controlled fine-tuning with InternVL3-2B on each source dataset (20K samples) reveals non-intuitive cross-task transfer patterns.

Figure: Three major takeaways from the paper.
Figure: Relative improvement heatmap across the 14 train and test datasets (train dataset on y-axis, test dataset on x-axis).

The DataProphet Metric

M(s->t) = (QSim * ASim * ISim * PPL(s) * (Sil + H)) / PPL(t)

The metric is directional and training-free. It rewards source datasets that are aligned with the target in text and vision space, challenging enough to teach new capability, and diverse in question coverage.
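As a minimal sketch, the score can be assembled from its stated components like this. The helper function and all input values are illustrative stand-ins; the paper's actual similarity, perplexity, and diversity implementations are not reproduced here.

```python
# Sketch of the DataProphet transfer score
#   M(s->t) = QSim * ASim * ISim * PPL(s) * (Sil + H) / PPL(t)
# All inputs are made-up placeholders, not the paper's measured values.

def dataprophet_score(q_sim, a_sim, i_sim, ppl_src, ppl_tgt,
                      silhouette, entropy):
    """Directional, training-free transfer score from source s to target t."""
    alignment = q_sim * a_sim * i_sim   # question/answer/image similarity
    difficulty = ppl_src / ppl_tgt      # relative multimodal perplexity
    diversity = silhouette + entropy    # source diversity term (Sil + H)
    return alignment * difficulty * diversity

# Example with made-up component values:
score = dataprophet_score(0.8, 0.7, 0.9, ppl_src=5.2, ppl_tgt=4.1,
                          silhouette=0.3, entropy=1.2)
```

Because PPL(s) sits in the numerator and PPL(t) in the denominator, swapping source and target changes the score, which is what makes the metric directional.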

Figure: Metric score matrix between all source-target pairs.

Ranking Quality

  • Average τ_Tgt = 0.863
  • Average τ_Src = 0.857
  • Overall average τ = 0.860
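To make the reported correlation concrete, here is a minimal pure-Python Kendall's tau-a over two rankings. The rankings below are invented for illustration; in the paper's setup the comparison would be between the ordering predicted by the metric and the ordering by actual fine-tuning gains.

```python
from itertools import combinations

def kendall_tau(pred, actual):
    """Kendall's tau-a: (concordant - discordant) / total pairs (no ties)."""
    pairs = list(combinations(range(len(pred)), 2))
    concordant = sum(1 for i, j in pairs
                     if (pred[i] - pred[j]) * (actual[i] - actual[j]) > 0)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)

# Made-up ranks of five source datasets: predicted by the metric vs.
# measured after fine-tuning, with one swapped pair.
tau = kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 4, 5])  # -> 0.8
```

A tau of 1.0 means the predicted and measured orderings agree on every pair, so an average of 0.860 indicates the vast majority of pairwise orderings are predicted correctly.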

Ablation Signal

  • Removing perplexity: 0.860 -> 0.487
  • Removing image similarity: 0.860 -> 0.625
  • Removing diversity: 0.860 -> 0.659
Figure: Kendall's tau across target datasets.

Data Selection Results

Under a fixed budget of 280K samples, DataProphet-guided selection outperforms both uniform and training-based methods in real and synthetic settings.

Setting                          Uniform  ICONS  Oracle  DataProphet
Real Data Reweighting (Avg)      67.6     69.6   70.8    71.0
  Improvement over Uniform       -        +2.0   +3.2    +3.4
Synthetic Data Selection (Avg)   55.1     60.8   -       62.0
  Improvement over Uniform       -        +5.7   -       +6.9
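One plausible way to turn per-source scores into a selection under the fixed 280K budget is to allocate samples in proportion to score. This proportional rule is an assumption for illustration; the paper's exact reweighting scheme may differ, and the dataset names below are hypothetical.

```python
def allocate_budget(scores, budget=280_000):
    """Split a fixed sample budget across sources proportionally to score.

    `scores` maps source dataset name -> DataProphet score for the target.
    Proportional allocation is an illustrative assumption, not the
    paper's stated selection rule.
    """
    total = sum(scores.values())
    alloc = {name: int(budget * s / total) for name, s in scores.items()}
    # hand the integer rounding remainder to the highest-scoring sources
    leftover = budget - sum(alloc.values())
    for name in sorted(scores, key=scores.get, reverse=True)[:leftover]:
        alloc[name] += 1
    return alloc

# Hypothetical scores for three source datasets:
mix = allocate_budget({"ocr_data": 2.0, "chart_data": 1.0, "vqa_data": 1.0})
```

Under this rule a source with twice the score receives twice the samples, and the total always sums exactly to the budget.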

Synthetic Source Mix

Among selected synthetic samples, approximately 38% come from GPT-5 and 62% from Gemini 2.5 Pro.

RL Post-training

DataProphet allocation improves average score from 0.583 -> 0.595 (real RL data) and 0.564 -> 0.577 (synthetic RL data).

Citation

If this work is useful, please cite:

@inproceedings{qi2026dataprophet,
  title={DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs},
  author={Qi, Xuan and He, Luxi and Roth, Dan and Fu, Xingyu},
  booktitle={International Conference on Learning Representations},
  year={2026}
}