Kaiyue Wen

Ph.D. Student at Stanford University


Stanford University

Stanford, CA

Hello! I am Kaiyue Wen. I am a second-year Ph.D. student at Stanford University, where I am grateful to be advised by Tengyu Ma and Percy Liang. I graduated from Tsinghua University, where I was a member of Yao’s pilot class. During my undergraduate studies, I was fortunate to be advised by Tengyu Ma, Zhiyuan Liu, Andrej Risteski, Jingzhao Zhang, Yuhao Wang, and Zhiyuan Li.

My research interests span deep learning broadly. My long-term goal is to understand the physics behind deep learning, and I believe a combination of theoretical analysis and empirical study is essential to that end.

Recently, I’ve become fascinated by two fundamental axes of scaling in deep learning.

  1. Demystifying pretraining: Pretraining has been the driving force behind the evolution of large language models, yet many foundational algorithmic choices remain poorly understood. Key aspects such as optimizers, architectures, and hyperparameter scaling strategies still lack consensus. My goal is to clarify these choices through rigorous benchmarking (e.g., of modern optimizers) and theoretical analysis (e.g., exploring the representational limitations of RNNs, architectures beyond $\mathrm{TC}^0$, and the river-valley loss landscape). Most of my research in this direction is carried out within the open-source project Marin.

  2. New algorithmic paradigms in reasoning: With the recent progress in reasoning reinforcement learning (RL), particularly innovations like long-chain-of-thought RL, there is growing potential to push the limits of model reasoning. While I am new to this field, my aim is to design end-to-end trainable multi-agent RL systems that build upon and extend the capabilities of current long-CoT RL paradigms.

News

Sep 01, 2025 New preprint (Fantastic Pretraining Optimizers and Where to Find Them) on arXiv!
May 01, 2025 WSD-S was used to train Marin 8B, the best open-source 8B model.
Jan 20, 2025 3 papers (River Valley Landscape, RNNs are not Transformers (Yet), Optimization Analysis on Chain-of-Thought) accepted at ICLR 2025!
Jan 20, 2025 New preprint (Global Load Balancing Helps Expert Specialization) on arXiv!
Dec 01, 2024 Residual Permutation Test accepted at the Annals of Statistics (AoS)!

Selected Publications

  1. EMNLP
    Finding Skill Neurons in Pre-trained Transformers via Prompt Tuning
    Xiaozhi Wang, Kaiyue Wen, Zhengyan Zhang, Lei Hou, Zhiyuan Liu, and Juanzi Li
    In EMNLP, 2022
  2. ICLR
    How Does Sharpness-Aware Minimization Minimize Sharpness?
    Kaiyue Wen, Tengyu Ma, and Zhiyuan Li
    In ICLR, 2023
  3. arXiv
    Fantastic Pretraining Optimizers and Where to Find Them
    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang
    arXiv preprint, 2025
  4. ICLR
    Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
    Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma
    In ICLR, 2025