ICLR 2026 Submission

DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs

Xuan Qi, Luxi He, Dan Roth, Xingyu Fu

DataProphet predicts which supervision datasets will help a target benchmark before any training. It combines multimodal similarity, perplexity, and diversity into a training-free transfer score.

At a glance:

  • 14 source/target datasets
  • 7 task families
  • 0.860 average Kendall's tau
  • +6.9% synthetic data gain

Figure: DataProphet overview illustration.

Abstract

Conventional data selection for multimodal LLMs often follows intuitive task similarity, but this paper shows that intuition is unreliable for predicting transfer gains. The authors evaluate 14 vision-language datasets across 7 task families and find that influence is asymmetric and dataset-specific. DataProphet introduces a simple training-free metric that integrates question/answer/image similarity, multimodal perplexity, and source diversity to predict transfer ranking. The predicted rankings strongly correlate with actual fine-tuning outcomes (Kendall's tau 0.860), and DataProphet-guided selection improves average performance over uniform and training-based baselines under fixed compute budgets.

Core Takeaways

Task similarity is a weak proxy

OCR supervision can improve spatial reasoning more than chart-focused supervision does. Transfer cannot be inferred from high-level task labels alone.

Influence is directional

Transfer gains are asymmetric: for a dataset pair (s, t), Δ(s->t) and Δ(t->s) can differ substantially.

Training-free selection can win

DataProphet reaches +3.4% average improvement on real-data reweighting and +6.9% on synthetic data selection versus uniform sampling.

Data Influence Analysis

Controlled fine-tuning with InternVL3-2B on each source dataset (20K samples) reveals non-intuitive cross-task transfer patterns.

Figure: Three major takeaways from the paper.
Figure: Relative improvement heatmap across the 14 train and test datasets (train dataset on y-axis, test dataset on x-axis).

The DataProphet Metric

M(s->t) = (QSim * ASim * ISim * PPL(s) * (Sil + H)) / PPL(t)

The metric is directional and training-free. It rewards source datasets that are aligned with the target in text and vision space, challenging enough to teach new capability, and diverse in question coverage.
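As a minimal sketch, the score can be assembled from its stated components like this. The helper function and all input values are illustrative stand-ins; the paper's actual similarity, perplexity, and diversity implementations are not reproduced here.

```python
# Sketch of the DataProphet transfer score
#   M(s->t) = QSim * ASim * ISim * PPL(s) * (Sil + H) / PPL(t)
# All inputs are made-up placeholders, not the paper's measured values.

def dataprophet_score(q_sim, a_sim, i_sim, ppl_src, ppl_tgt,
                      silhouette, entropy):
    """Directional, training-free transfer score from source s to target t."""
    alignment = q_sim * a_sim * i_sim   # question/answer/image similarity
    difficulty = ppl_src / ppl_tgt      # relative multimodal perplexity
    diversity = silhouette + entropy    # source diversity term (Sil + H)
    return alignment * difficulty * diversity

# Example with made-up component values:
score = dataprophet_score(0.8, 0.7, 0.9, ppl_src=5.2, ppl_tgt=4.1,
                          silhouette=0.3, entropy=1.2)
```

Because PPL(s) sits in the numerator and PPL(t) in the denominator, swapping source and target changes the score, which is what makes the metric directional.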

Figure: Metric score matrix between all source-target pairs.

Ranking Quality

  • Average τ_Tgt = 0.863
  • Average τ_Src = 0.857
  • Overall average τ = 0.860
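To make the reported correlation concrete, here is a minimal pure-Python Kendall's tau-a over two rankings. The rankings below are invented for illustration; in the paper's setup the comparison would be between the ordering predicted by the metric and the ordering by actual fine-tuning gains.

```python
from itertools import combinations

def kendall_tau(pred, actual):
    """Kendall's tau-a: (concordant - discordant) / total pairs (no ties)."""
    pairs = list(combinations(range(len(pred)), 2))
    concordant = sum(1 for i, j in pairs
                     if (pred[i] - pred[j]) * (actual[i] - actual[j]) > 0)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)

# Made-up ranks of five source datasets: predicted by the metric vs.
# measured after fine-tuning, with one swapped pair.
tau = kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 4, 5])  # -> 0.8
```

A tau of 1.0 means the predicted and measured orderings agree on every pair, so an average of 0.860 indicates the vast majority of pairwise orderings are predicted correctly.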

Ablation Signal

  • Removing perplexity: 0.860 -> 0.487
  • Removing image similarity: 0.860 -> 0.625
  • Removing diversity: 0.860 -> 0.659
Figure: Kendall's tau across target datasets.

Data Selection Results

Under a fixed budget of 280K samples, DataProphet-guided selection outperforms both uniform and training-based methods in real and synthetic settings.

Setting                          Uniform  ICONS  Oracle  DataProphet
Real Data Reweighting (Avg)      67.6     69.6   70.8    71.0
  Improvement over Uniform       -        +2.0   +3.2    +3.4
Synthetic Data Selection (Avg)   55.1     60.8   -       62.0
  Improvement over Uniform       -        +5.7   -       +6.9
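One plausible way to turn per-source scores into a selection under the fixed 280K budget is to allocate samples in proportion to score. This proportional rule is an assumption for illustration; the paper's exact reweighting scheme may differ, and the dataset names below are hypothetical.

```python
def allocate_budget(scores, budget=280_000):
    """Split a fixed sample budget across sources proportionally to score.

    `scores` maps source dataset name -> DataProphet score for the target.
    Proportional allocation is an illustrative assumption, not the
    paper's stated selection rule.
    """
    total = sum(scores.values())
    alloc = {name: int(budget * s / total) for name, s in scores.items()}
    # hand the integer rounding remainder to the highest-scoring sources
    leftover = budget - sum(alloc.values())
    for name in sorted(scores, key=scores.get, reverse=True)[:leftover]:
        alloc[name] += 1
    return alloc

# Hypothetical scores for three source datasets:
mix = allocate_budget({"ocr_data": 2.0, "chart_data": 1.0, "vqa_data": 1.0})
```

Under this rule a source with twice the score receives twice the samples, and the total always sums exactly to the budget.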

Synthetic Source Mix

Among selected synthetic samples, approximately 38% come from GPT-5 and 62% from Gemini 2.5 Pro.

RL Post-training

DataProphet allocation improves average score from 0.583 -> 0.595 (real RL data) and 0.564 -> 0.577 (synthetic RL data).

Citation

If this work is useful, please cite:

@inproceedings{qi2026dataprophet,
  title={DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs},
  author={Qi, Xuan and He, Luxi and Roth, Dan and Fu, Xingyu},
  booktitle={International Conference on Learning Representations},
  year={2026}
}