Bora Kargi

Tübingen, Germany

kargibora gmail.com

bora.kargi tue.ellis.eu

I am a Research Engineer at the ELLIS Institute Tübingen, where I work on the OpenEuroLLM project, focusing on the evaluation of large language models.

I hold an MSc in Machine Learning from the University of Tübingen and a BSc in Computer Engineering from Middle East Technical University (METU). During my master’s, I worked as a research assistant (HiWi), contributing to Scholar Inbox — a paper-recommendation platform — in Prof. Andreas Geiger’s Autonomous Vision Group. For my thesis, supervised by Prof. Seong Joon Oh in the Scalable Trustworthy AI group, I studied a fragility of CLIP-based vision–language models — how they can be misled by plausible but incorrect details (“half-truths”).

My research interests are broad — more than any single topic, I enjoy picking up new concepts and reading widely across different fields. Recently, I have been especially drawn to interpretability and language diffusion models.

My selected publications appear below — see the publications page for the full list, or take a look at my CV.

news

Apr 16, 2026	Graduated from the University of Tübingen with distinction. 🎓
Feb 01, 2026	Started as a Research Engineer at the ELLIS Institute Tübingen.
Jan 30, 2026	Submitted my Master’s thesis.
Jul 25, 2025	My first main-author paper was accepted to BMVC 2025.
May 15, 2025	Our Scholar Inbox paper was accepted to ACL 2025 (System Demonstrations Track).
Apr 01, 2024	Started as a Student Assistant on Scholar Inbox.

selected publications

Preprint
From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

Bora Kargi and David Salinas

arXiv preprint arXiv:2606.13221, 2026

Turns biased LLM-as-a-judge votes into calibrated model rankings with honest, distribution-free uncertainty bounds — at a fraction of the cost of human evaluation.

LLM-as-a-judge evaluation is cheap but biased, which can badly miscalibrate model rankings. We address this at two levels: locally, by propagating calibrated win probabilities (instead of hard labels) into the Bradley–Terry procedure, bringing LLM-derived Elo ratings to a mean absolute error (MAE) of 17.9 against human ratings on LMArena; and globally, by using split conformal prediction to place distribution-free uncertainty intervals around the LLM-vs-human rating gap. The result is a low-cost evaluation tool that gives developers calibrated Elo estimates with honest uncertainty bounds, without large-scale human annotation.
@article{kargi2026softelo,
  title = {From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation},
  author = {Kargi, Bora and Salinas, David},
  journal = {arXiv preprint arXiv:2606.13221},
  year = {2026},
  note = {Turns biased LLM-as-a-judge votes into calibrated model rankings with honest, distribution-free uncertainty bounds --- at a fraction of the cost of human evaluation.},
}
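
As a rough illustration of the two steps described in the abstract above, the sketch below fits Bradley–Terry ratings from soft win probabilities and then computes a split conformal radius. It is not the paper's implementation: the function names, the plain gradient-ascent fit, the zero-centered Elo scale, and the absolute-residual conformity score are all assumptions made for brevity.

```python
import math


def fit_soft_bradley_terry(comparisons, n_models, lr=1.0, steps=2000):
    """Fit Bradley-Terry ratings from soft labels (hypothetical helper).

    comparisons: list of (i, j, p), where p is a calibrated probability
    that model i beats model j (instead of a hard 0/1 vote).
    Returns zero-centered ratings on the Elo scale.
    """
    r = [0.0] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for i, j, p in comparisons:
            pred = 1.0 / (1.0 + math.exp(-(r[i] - r[j])))
            # Gradient of p*log(pred) + (1-p)*log(1-pred) w.r.t. r[i] and r[j].
            grad[i] += p - pred
            grad[j] -= p - pred
        r = [ri + lr * g / len(comparisons) for ri, g in zip(r, grad)]
        mean = sum(r) / n_models  # ratings are only identified up to a shift
        r = [ri - mean for ri in r]
    # Convert natural-log odds to the Elo scale (400 points per factor of 10).
    return [ri * 400.0 / math.log(10) for ri in r]


def split_conformal_radius(abs_residuals, alpha=0.1):
    """Split conformal quantile of |human - LLM| rating gaps (assumed score).

    abs_residuals come from a held-out calibration set of models with
    both human and LLM-derived ratings. Returns the interval half-width.
    """
    n = len(abs_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(abs_residuals)[k - 1]


# Usage: three models, soft pairwise judgments, then an interval for model 0.
elo = fit_soft_bradley_terry([(0, 1, 0.7), (1, 2, 0.6), (0, 2, 0.8)], n_models=3)
radius = split_conformal_radius([5, 10, 15, 20, 25, 30, 35, 40, 45], alpha=0.25)
interval = (elo[0] - radius, elo[0] + radius)
```

Under exchangeability of the calibration and test models, an interval built this way covers the human rating with probability at least 1 - alpha; the toy residuals here are placeholders, not results from the paper.
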
Preprint
Half-Truths Break Similarity-Based Retrieval

Bora Kargi, Arnas Uselis, and Seong Joon Oh

arXiv preprint arXiv:2602.23906, 2026

CLIP-style models often score a caption as more similar after a plausible but wrong detail is added; we fix this by supervising the individual entities and relations within each caption.

CLIP-style image-text models can be fooled by "half-truths": appending a plausible but incorrect detail to an otherwise correct caption often increases the similarity score instead of lowering it. We introduce CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units and trains the model to score each correct unit above a minimally edited incorrect foil, while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy from 40.6% to 69.3% and improves average performance on established compositional benchmarks by 5.7 points.
@article{kargi2026halftruths,
  title = {Half-Truths Break Similarity-Based Retrieval},
  author = {Kargi, Bora and Uselis, Arnas and Oh, Seong Joon},
  journal = {arXiv preprint arXiv:2602.23906},
  year = {2026},
  note = {CLIP-style models often score a caption as more similar after a plausible but wrong detail is added; we fix this by supervising the individual entities and relations within each caption.},
}
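
To make the half-truth test and the unit-versus-foil objective from the abstract above more concrete, here is a small sketch on toy embedding vectors. It does not reproduce CS-CLIP: the function names, the logistic margin loss, and the temperature value are assumptions, and real embeddings would come from a dual-encoder model rather than hand-written lists.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def half_truth_accuracy(image_embs, caption_embs, half_truth_embs):
    """Fraction of images whose correct caption outscores its half-truth.

    A half-truth is the correct caption with a plausible but wrong detail
    appended; the model passes when the correct caption is more similar.
    """
    passed = sum(
        cosine(img, cap) > cosine(img, half)
        for img, cap, half in zip(image_embs, caption_embs, half_truth_embs)
    )
    return passed / len(image_embs)


def unit_vs_foil_loss(image_emb, unit_emb, foil_emb, temperature=0.07):
    """Logistic loss pushing a correct unit above its foil (assumed form)."""
    margin = (cosine(image_emb, unit_emb) - cosine(image_emb, foil_emb)) / temperature
    return math.log1p(math.exp(-margin))


# Usage with toy 2-D embeddings: the first image passes, the second fails.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [1.0, 0.0]]
half_truths = [[0.9, 0.1], [0.0, 1.0]]
accuracy = half_truth_accuracy(images, captions, half_truths)
```

The loss is near zero when the correct unit already scores well above the foil and grows roughly linearly when the foil wins, which is the ordering the abstract describes; inference is unchanged because only cosine similarities between whole embeddings are needed.
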