gchat Scaling

Explore rough memory requirements as the gchat model and training configuration change. Type exact values or use the sliders; the estimate updates immediately.

Estimated total training memory: 38.66 GiB

Parameters: 833,866,752

Total FLOPs: 35.02 EFLOPs (3.50e+19)

Tokens per step: 32,768

Parameter memory: 1.55 GiB

Gradient memory: 1.55 GiB

AdamW state: 6.21 GiB

Hidden activations: 2.34 GiB

Attention workspace: 27.00 GiB
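The parameter, gradient, and AdamW figures above can be reproduced from the parameter count alone. The sketch below is an illustration, not the page's own code. It assumes 2 bytes per parameter (bf16), gradients stored in the same dtype, and 8 bytes of AdamW state per parameter (two fp32 moments), which matches the displayed numbers. The function name and its arguments are hypothetical.

```python
GIB = 2**30  # bytes per GiB


def static_training_memory_gib(num_params, param_bytes=2, adamw_bytes_per_param=8):
    """Return (parameter, gradient, AdamW state) memory in GiB.

    Assumptions (not confirmed by the page): gradients use the same dtype
    as the parameters, and AdamW keeps two fp32 moments per parameter.
    """
    params = num_params * param_bytes / GIB
    grads = num_params * param_bytes / GIB
    adamw = num_params * adamw_bytes_per_param / GIB
    return params, grads, adamw


params, grads, adamw = static_training_memory_gib(833_866_752)

# Activation terms are taken from the displayed values rather than derived,
# because the slider settings that produce them are not shown.
hidden_activations = 2.34
attention_workspace = 27.00
total = params + grads + adamw + hidden_activations + attention_workspace

print(f"parameters {params:.2f} GiB, gradients {grads:.2f} GiB, AdamW {adamw:.2f} GiB")
print(f"total {total:.2f} GiB")
```

The attention workspace dominates the total in this configuration, so sequence length and head count are likely the settings that move the estimate most.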

Batch size: training examples per step.

Sequence length: tokens per example.

Vocabulary size: tokenizer vocabulary entries.

Layers: transformer blocks.

Query heads: attention query heads.

KV heads: key/value heads for GQA.

Embedding width: hidden dimension.

Parameter bytes: 2 = bf16/fp16, 4 = fp32.

Activation bytes: 2 = bf16/fp16, 4 = fp32.

Training tokens: total tokens in the training dataset, in billions.

Data parallelism: chips used by one model replica; remaining chips form data-parallel replicas.

TPU HBM Fit Grid

Uses the Data parallelism slider as chips per model replica. Each column shows whether that TPU type has enough HBM across one replica at the selected chip count; red cells either need more chips for a replica or have insufficient per-replica HBM.

Chip-count columns: 1, 2, 4, 8, 16, 32 chips. Values shown are for a 1-chip replica.

| TPU type | HBM per chip | Max chips | Topology | 1-chip replica HBM |
|---|---|---|---|---|
| TPU 8i | 288 GB | 1,152 | Boardfly (Inference) | 288 GB |
| TPU 8t | 216 GB | 9,600 | 3D Torus (Training) | 216 GB |
| TPU v7 (Ironwood) | 192 GB | 9,216 | 3D Torus | 192 GB |
| TPU v6e (Trillium) | 32 GB | 256 | 2D Torus | 32 GB |
| TPU v5p | 95 GB | 8,960 | 3D Torus | 95 GB |
| TPU v5e | 16 GB | 256 | 2D Torus | 16 GB |
| TPU v4 | 32 GB | 4,096 | 3D Torus | 32 GB |
| TPU v3 | 32 GB | 2,048 | 2D Torus | 32 GB |
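The fit check described for this grid can be sketched as a comparison between the estimated training memory and the pooled HBM of one replica. This is a minimal sketch under stated assumptions: the estimate is in GiB (2^30 bytes), the per-chip HBM figures are decimal GB (10^9 bytes), and no headroom is reserved for runtime overhead. The page does not state its exact rule, and `replica_fits` is a hypothetical helper.

```python
GIB = 2**30   # bytes per GiB (unit of the memory estimate)
GB = 10**9    # bytes per GB (assumed unit of the HBM figures)


def replica_fits(required_gib, hbm_gb_per_chip, chips_per_replica=1):
    """True if one replica's pooled HBM covers the estimated training memory."""
    required_bytes = required_gib * GIB
    available_bytes = hbm_gb_per_chip * GB * chips_per_replica
    return required_bytes <= available_bytes


# HBM per chip in GB, as listed in the grid.
hbm_gb_per_chip = {
    "TPU 8i": 288, "TPU 8t": 216, "TPU v7 (Ironwood)": 192,
    "TPU v6e (Trillium)": 32, "TPU v5p": 95, "TPU v5e": 16,
    "TPU v4": 32, "TPU v3": 32,
}

required_gib = 38.66  # estimated total training memory from the page
fits = {name: replica_fits(required_gib, gb) for name, gb in hbm_gb_per_chip.items()}
for name, ok in fits.items():
    print(f"{name}: {'fits' if ok else 'insufficient HBM'} at 1 chip per replica")
```

Under these assumptions the 1-chip result agrees with the training time estimate, where TPU v6e, v5e, v4, and v3 are reported as not fitting in HBM.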

TPU Training Time Estimate

Estimates total training compute as 6 × parameters × training tokens, then divides by the BF16 peak FLOP/s of the selected chips. Assumes 50% MFU (model FLOPs utilization).

| TPU type | BF16 peak TFLOPS per chip | 1 chip (1 replica) | 2 chips (2 replicas) | 4 chips (4 replicas) | 8 chips (8 replicas) | 16 chips (16 replicas) | 32 chips (32 replicas) |
|---|---|---|---|---|---|---|---|
| TPU 8i | 5,050 | 3 hr 51 min | 1 hr 56 min | 0 hr 58 min | 0 hr 29 min | 0 hr 14 min | 0 hr 07 min |
| TPU 8t | 6,300 | 3 hr 05 min | 1 hr 33 min | 0 hr 46 min | 0 hr 23 min | 0 hr 12 min | 0 hr 06 min |
| TPU v7 (Ironwood) | 2,307 | 8 hr 26 min | 4 hr 13 min | 2 hr 07 min | 1 hr 03 min | 0 hr 32 min | 0 hr 16 min |
| TPU v6e (Trillium) | 918 | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM |
| TPU v5p | 459 | 42 hr 23 min | 21 hr 12 min | 10 hr 36 min | 5 hr 18 min | 2 hr 39 min | 1 hr 19 min |
| TPU v5e | 197 | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM |
| TPU v4 | 275 | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM |
| TPU v3 | 123 | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM | insufficient HBM |
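The training time figures follow from the 6 × parameters × tokens rule and the 50% MFU assumption. The sketch below reproduces a few of them. The 7 billion training tokens are an assumption back-solved from the displayed total of 35.02 EFLOPs, since the slider value is not shown. Linear scaling across chips and rounding to the nearest minute are also assumptions that happen to match the displayed values. Function names are hypothetical.

```python
def training_seconds(num_params, num_tokens, peak_tflops_per_chip, chips=1, mfu=0.5):
    """Estimated wall-clock seconds: 6 * N * D FLOPs over sustained FLOP/s."""
    total_flops = 6 * num_params * num_tokens
    # Sustained throughput: peak BF16 rate scaled by MFU, assuming linear
    # scaling across chips (no communication overhead is modeled).
    sustained_flops_per_second = peak_tflops_per_chip * 1e12 * mfu * chips
    return total_flops / sustained_flops_per_second


def format_hr_min(seconds):
    """Format seconds as 'H hr MM min', rounded to the nearest minute."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hr {minutes:02d} min"


NUM_PARAMS = 833_866_752
NUM_TOKENS = 7_000_000_000  # assumption inferred from the 35.02 EFLOPs total

print(f"total FLOPs: {6 * NUM_PARAMS * NUM_TOKENS:.2e}")

for name, tflops in [("TPU 8i", 5050), ("TPU v7 (Ironwood)", 2307), ("TPU v5p", 459)]:
    for chips in (1, 32):
        seconds = training_seconds(NUM_PARAMS, NUM_TOKENS, tflops, chips=chips)
        print(f"{name}, {chips} chip(s): {format_hr_min(seconds)}")
```

Because the estimate ignores communication and input pipeline costs, it reads best as a lower bound scaled by the chosen MFU rather than a schedule prediction.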