active research

Closing the Sim-to-Real Gap for Purely-Tactile Dexterous In-Hand Manipulation with Multi-Fingered Robotic Hands 🦾

AIDX Lab - Advised by Leon Sievers, under Prof. Berthold Baeuml (TUM & DLR)

Purely tactile in-hand manipulation is sensitive to discrepancies between simulation and reality because it is dominated by high-dimensional, non-linear, time-varying, multi-contact interactions that are difficult to model perfectly for each task. I am working on policies that are adaptively robust to inaccuracies in simulated contact dynamics and that generalize across tasks.

Technical details are confidential; targeted for publication at CoRL 2026
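Since the method itself is confidential, here is only a generic illustration of one standard ingredient for closing sim-to-real gaps: domain randomization over contact-dynamics parameters. All names and parameter ranges below are illustrative assumptions, not the lab's actual approach.

```python
# Generic sketch of contact-parameter domain randomization for sim-to-real.
# sample_contact_params and the ranges below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Plausible per-episode ranges; real values are tuned per task and simulator.
CONTACT_RANGES = {
    "friction":      (0.5, 1.5),   # Coulomb friction coefficient
    "restitution":   (0.0, 0.2),   # bounciness of contacts
    "object_mass":   (0.8, 1.2),   # multiplicative scale on nominal mass
    "joint_damping": (0.9, 1.1),   # multiplicative scale on nominal damping
}

def sample_contact_params() -> dict:
    """Draw one randomized set of contact-dynamics parameters per episode."""
    return {k: float(rng.uniform(lo, hi)) for k, (lo, hi) in CONTACT_RANGES.items()}

# Usage: at every episode reset, push the sample into the simulator so the
# policy never overfits to a single (inevitably imperfect) contact model.
for episode in range(3):
    print(f"episode {episode}: {sample_contact_params()}")
```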

past research

graduate research projects as part of advanced coursework at TUM

October 2024 - March 2025

SAC2: Rapid Adaptation to Random Disturbances in Partial Observability via Off-Policy Meta-Reinforcement Learning 🤖

AIDX Lab - Advised by Leon Sievers, under Prof. Berthold Baeuml (TUM & DLR)
Scope: Individual research (originally scoped for teams of 2-3)

I built an asymmetric, meta-learning variant of Soft Actor-Critic: a recurrent (LSTM) actor that sees only minimal observations, paired with full-state critics to keep off-policy training stable. In a disturbance-rich, partially observed control task, the policy learns to infer hidden dynamics from history and adapts online. Under heavy disturbances, only the SAC2 variant succeeded; the baseline plateaued and failed entirely. Under full observability, the method matched baseline SAC's returns in roughly half the samples.
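The sketch below shows the asymmetric structure described above: an LSTM actor over partial-observation histories and a critic conditioned on privileged full state. Dimensions and names are illustrative, and the full SAC2 training loop (twin critics, target networks, entropy tuning) is omitted.

```python
# Minimal sketch of the asymmetric actor-critic split, assuming PyTorch.
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Gaussian policy over a history of minimal observations."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq, hc=None):
        # obs_seq: (batch, time, obs_dim); the LSTM state carries the
        # policy's online estimate of the hidden dynamics.
        h, hc = self.lstm(obs_seq, hc)
        std = self.log_std(h).clamp(-5, 2).exp()
        return torch.distributions.Normal(self.mu(h), std), hc

class FullStateCritic(nn.Module):
    """Q-function on privileged full state, available in simulation only."""
    def __init__(self, state_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Shapes only: 16-dim partial obs for the actor, 48-dim full state for the critic.
actor = RecurrentActor(obs_dim=16, act_dim=4)
critic = FullStateCritic(state_dim=48, act_dim=4)
dist, _ = actor(torch.randn(8, 32, 16))               # 8 histories, 32 steps
q = critic(torch.randn(8, 48), dist.sample()[:, -1])  # Q on last-step action
```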

October 2024 - March 2025

PromptScene: Adaptive Prompt Learning for Open-Vocabulary 3D Instance Segmentation 🏡

3D AI Lab - Advised by Mohamed El Amine Boudjoghra, under Prof. Angela Dai (TUM)
Scope: Team of 3 (Ayaka Nanri, Simon Blessmann), Led development & implementation

We built adaptive prompt learning on top of OpenScene's CLIP-aligned 3D point features with Mask3D's class-agnostic instance grouping. The learnable tokens (suffix placement worked best) reshape the text embedding space to better match 3D instance features, raising mAP by over 60% relative to a fixed-prompt baseline at negligible inference cost. When grouping isn't the bottleneck (using ground-truth masks), the gains grow to roughly 3x, implicating instance grouping as the limiting factor. The approach preserves open-vocabulary behavior for zero-shot, text-prompted categories.
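Below is a minimal sketch of suffix-style prompt learning in the spirit of this approach: a few learnable tokens appended after each class name's token embeddings, pushed through a frozen text encoder, and matched against 3D instance features by cosine similarity. The encoder here is a stand-in; in PromptScene the frozen CLIP text tower and OpenScene features fill these roles.

```python
# Sketch of learnable suffix-prompt tokens; all dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuffixPromptLearner(nn.Module):
    def __init__(self, num_classes, n_ctx=8, dim=512):
        super().__init__()
        # Only these suffix tokens are trained; everything else stays frozen.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Stand-ins for tokenized class-name embeddings and a frozen encoder.
        self.class_embs = nn.Parameter(
            torch.randn(num_classes, 4, dim), requires_grad=False)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2)
        for p in self.encoder.parameters():
            p.requires_grad = False

    def forward(self):
        # [class tokens | learnable suffix] per class, then encode and pool.
        ctx = self.ctx.unsqueeze(0).expand(self.class_embs.size(0), -1, -1)
        prompts = torch.cat([self.class_embs, ctx], dim=1)
        return self.encoder(prompts).mean(dim=1)          # (num_classes, dim)

def classify_instances(inst_feats, text_feats, tau=0.07):
    """Cosine-similarity logits between 3D instance features and prompts."""
    a = F.normalize(inst_feats, dim=-1)
    b = F.normalize(text_feats, dim=-1)
    return a @ b.t() / tau

learner = SuffixPromptLearner(num_classes=20)
logits = classify_instances(torch.randn(5, 512), learner())  # (5, 20)
```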

October 2024 - March 2025

Dif-fused DINO-tracker: Zero-shot Point Tracking Enhanced with Video Diffusion Features 🎥

Visual Computing & AI Lab - Advised by Dr. Lei Li, under Prof. Matthias Niessner (TUM)
Scope: Individual research (originally scoped for teams of 2-3)

I repurposed temporal representations from a text-to-video diffusion backbone (CogVideoX) to enhance zero-shot point tracking without generating video. Intermediate transformer-block features are extracted, projected into DINO-Tracker's space, and fused via cross-attention so DINO retains spatial precision while borrowing temporal context. On TAP-Vid-DAVIS short sequences, the method improved temporal-coherence metrics while keeping positional accuracy comparable to the baseline. Though the gains were modest, this exploratory work showed that generative video models can be leveraged for discriminative tracking.
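The fusion step can be sketched as follows: per-frame DINO features act as queries attending over projected video-diffusion features, so spatial precision is kept while temporal context is borrowed via a residual path. Dimensions are illustrative assumptions; in the project, the keys and values come from intermediate CogVideoX transformer blocks.

```python
# Sketch of cross-attention fusion of diffusion features into DINO features.
import torch
import torch.nn as nn

class DiffusionFeatureFusion(nn.Module):
    def __init__(self, dino_dim=768, diff_dim=1920, heads=8):
        super().__init__()
        # Project diffusion features into DINO's feature space first.
        self.proj = nn.Linear(diff_dim, dino_dim)
        self.attn = nn.MultiheadAttention(dino_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dino_dim)

    def forward(self, dino_feats, diff_feats):
        # dino_feats: (B, N_patches, dino_dim) for the current frame
        # diff_feats: (B, N_tokens, diff_dim) spanning a temporal window
        kv = self.proj(diff_feats)
        fused, _ = self.attn(query=dino_feats, key=kv, value=kv)
        # Residual keeps the original spatial features dominant.
        return self.norm(dino_feats + fused)

fusion = DiffusionFeatureFusion()
out = fusion(torch.randn(2, 196, 768), torch.randn(2, 1024, 1920))  # (2, 196, 768)
```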