Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2025
- arXiv
- ACL. Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models. 2025.
- ICLR. From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency. 2025.
- NeurIPS. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free. 2025.
- COLM
- NeurIPS
- ICML. Task Generalization With AutoRegressive Compositional Structure: Can Learning From D Tasks Generalize to D^T Tasks? 2025.
2024
- ICLR. Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective. 2024.
- ICLR
2023
- ICLR
- ICLR
- arXiv
- NeurIPS. Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization. 2023.
- NeurIPS
2022
- NAACL
- EMNLP