LLaMA Transformer Block (stacked × N)

Per-layer structure:
  input token representation
  → RMSNorm → multi-head self-attention with RoPE on q, k → residual add
  → RMSNorm → MLP (SwiGLU: gate_proj / up_proj / down_proj) → residual add
  → output to the next LLaMA block
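In code, each layer is just two pre-norm residual branches. A minimal PyTorch sketch of RMSNorm and this wiring, with the attention and MLP sub-layers passed in as callables; the eps value and the sizes in the usage example are illustrative, not taken from a specific LLaMA checkpoint:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by 1 / RMS(x), then apply a learned gain."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)


def block_forward(x, attn, mlp, attn_norm, mlp_norm):
    """Pre-norm residual wiring of one block: normalize, transform, add back."""
    x = x + attn(attn_norm(x))  # RMSNorm -> self-attention -> residual add
    x = x + mlp(mlp_norm(x))    # RMSNorm -> SwiGLU MLP     -> residual add
    return x


# Exercise the wiring with identity sub-layers (shapes: batch, seq, dim).
x = torch.randn(1, 8, 64)
y = block_forward(x, attn=nn.Identity(), mlp=nn.Identity(),
                  attn_norm=RMSNorm(64), mlp_norm=RMSNorm(64))
```

Sketches of the attention and MLP sub-layers follow below.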
Self-Attention (inside each block)

Pre-norm: a single shared RMSNorm feeds the three projections.
  RMSNorm → Linear Wq → RoPE → q
  RMSNorm → Linear Wk → RoPE → k
  RMSNorm → Linear Wv → v
  scaled dot-product attention over q, k, v (heads merged)
  → output projection Wo
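A sketch of this path in PyTorch, assuming the half-split RoPE convention, bias-free projections, a causal mask, and plain multi-head attention (no grouped-query or KV-cache details), as in common LLaMA implementations; `F.scaled_dot_product_attention` requires PyTorch ≥ 2.0, and exact conventions vary by checkpoint:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding for x of shape (batch, heads, seq, head_dim)."""
    _, _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(t, dtype=torch.float32, device=x.device)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()           # (seq, half), broadcast over batch/heads
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position- and frequency-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class RopeSelfAttention(nn.Module):
    """Multi-head self-attention with RoPE applied to q and k only."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # The shared pre-normed input is projected to q, k, v and split into heads.
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)                     # RoPE on q and k; v is untouched
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, d)  # merge heads back into the model dim
        return self.wo(out)                         # output projection Wo


attn = RopeSelfAttention(dim=64, n_heads=4)
y = attn(torch.randn(1, 8, 64))                     # (batch, seq, dim) in, same shape out
```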
MLP with SwiGLU (inside each block)

Pre-norm followed by a gated feed-forward network:
  RMSNorm → Linear gate_proj (gate) and Linear up_proj (up), in parallel
  → SwiGLU: SiLU(gate) ⊙ up
  → Linear down_proj
The down_proj output is added back to the sub-layer input through the residual path.
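A minimal PyTorch sketch of this gated feed-forward path; the layer names follow the diagram, while the intermediate size is a placeholder (LLaMA-family models typically use an intermediate dimension of roughly 8/3 × the model dimension, but the exact value varies by model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUMLP(nn.Module):
    """Gated feed-forward network: down_proj(SiLU(gate_proj(x)) * up_proj(x))."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(gate) ⊙ up, then project back to the model dimension.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


mlp = SwiGLUMLP(dim=64, hidden_dim=172)   # placeholder sizes
y = mlp(torch.randn(1, 8, 64))            # same shape in and out along the last dim
```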
Diagram: Standard LLaMA transformer architecture – stacked blocks with RMSNorm, multi-head self-attention with RoPE, and a SwiGLU MLP, drawn in a style similar to the original figure.