LLaMA Transformer Block (stacked × N)

Per-layer structure:
  input token representation
  → RMSNorm → multi-head self-attention with RoPE on q, k → residual add
  → RMSNorm → MLP (SwiGLU: gate_proj / up_proj / down_proj) → residual add
  → output to the next LLaMA block
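In code, each layer is just two pre-norm residual branches. A minimal PyTorch sketch of RMSNorm and this wiring, with the attention and MLP sub-layers passed in as callables; the eps value and the sizes in the usage example are illustrative, not taken from a specific LLaMA checkpoint:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by 1 / RMS(x), then apply a learned gain."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)


def block_forward(x, attn, mlp, attn_norm, mlp_norm):
    """Pre-norm residual wiring of one block: normalize, transform, add back."""
    x = x + attn(attn_norm(x))  # RMSNorm -> self-attention -> residual add
    x = x + mlp(mlp_norm(x))    # RMSNorm -> SwiGLU MLP     -> residual add
    return x


# Exercise the wiring with identity sub-layers (shapes: batch, seq, dim).
x = torch.randn(1, 8, 64)
y = block_forward(x, attn=nn.Identity(), mlp=nn.Identity(),
                  attn_norm=RMSNorm(64), mlp_norm=RMSNorm(64))
```

Sketches of the attention and MLP sub-layers follow below.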
Self-Attention (inside each block)

Pre-norm: a single shared RMSNorm feeds the three projections.
  RMSNorm → Linear Wq → RoPE → q
  RMSNorm → Linear Wk → RoPE → k
  RMSNorm → Linear Wv → v
  scaled dot-product attention over q, k, v (heads merged)
  → output projection Wo
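A sketch of this path in PyTorch, assuming the half-split RoPE convention, bias-free projections, a causal mask, and plain multi-head attention (no grouped-query or KV-cache details), as in common LLaMA implementations; `F.scaled_dot_product_attention` requires PyTorch ≥ 2.0, and exact conventions vary by checkpoint:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding for x of shape (batch, heads, seq, head_dim)."""
    _, _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(t, dtype=torch.float32, device=x.device)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()           # (seq, half), broadcast over batch/heads
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position- and frequency-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class RopeSelfAttention(nn.Module):
    """Multi-head self-attention with RoPE applied to q and k only."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # The shared pre-normed input is projected to q, k, v and split into heads.
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)                     # RoPE on q and k; v is untouched
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, d)  # merge heads back into the model dim
        return self.wo(out)                         # output projection Wo


attn = RopeSelfAttention(dim=64, n_heads=4)
y = attn(torch.randn(1, 8, 64))                     # (batch, seq, dim) in, same shape out
```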
MLP with SwiGLU (inside each block)

Pre-norm followed by a gated feed-forward network:
  RMSNorm → Linear gate_proj (gate) and Linear up_proj (up), in parallel
  → SwiGLU: SiLU(gate) ⊙ up
  → Linear down_proj
The down_proj output is added back to the sub-layer input through the residual path.
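A minimal PyTorch sketch of this gated feed-forward path; the layer names follow the diagram, while the intermediate size is a placeholder (LLaMA-family models typically use an intermediate dimension of roughly 8/3 × the model dimension, but the exact value varies by model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUMLP(nn.Module):
    """Gated feed-forward network: down_proj(SiLU(gate_proj(x)) * up_proj(x))."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(gate) ⊙ up, then project back to the model dimension.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


mlp = SwiGLUMLP(dim=64, hidden_dim=172)   # placeholder sizes
y = mlp(torch.randn(1, 8, 64))            # same shape in and out along the last dim
```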
Diagram: Standard LLaMA transformer architecture – stacked blocks with RMSNorm, multi-head self-attention with RoPE, and a SwiGLU MLP, drawn in a style similar to the original figure.