Nemotron Ultra 253B

Name: Nemotron Ultra 253B
Author: NVIDIA

NVIDIA Open Model License

NVIDIA · 253B · transformer-decoder

🤗 HuggingFace Ollama Official Paper

2025-04-07 131K context 253B params

Use Cases

chat code reasoning multilingual tools math writing summary

Quantization Options

Quant	Bits	VRAM	Quality	Status
Q4_K_Mrec	4	155.0 GB	Good	—
Q8_0	8	275.0 GB	Excellent	—
F16	16	508.0 GB	Excellent	—

About this model

Nemotron Ultra 253B is NVIDIA's most capable open-weight reasoning model, derived from Llama 3.1 405B and compressed to 253B parameters using Neural Architecture Search (NAS). It delivers state-of-the-art performance on math, coding, and complex reasoning benchmarks while fitting on a single 8xH100 node at FP8 precision. The model features a dual-mode system supporting both standard chat and explicit chain-of-thought reasoning, toggled via system prompt. It supports a 128K context window and excels at tool calling, RAG, and agentic workflows. With multilingual support for English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, it is one of the most versatile open-weight models available.

Benchmarks

88.0

mmlu

97.0

math500

76.0

gpqa

Your Hardware

DevicePick…

VRAM—

Bandwidth—

Detecting…

Install

Ollama

ollama run nemotron-ultra:253b-q4_K_M

llama.cpp / GGUF

Download GGUF from HuggingFace

Specs

Parameters: 253B
Architecture: transformer-decoder
Context: 131K tokens
Min VRAM: 155.0 GB
Recommended: 155.0 GB
Family: Nemotron
Released: 2025-04-07
License: NVIDIA Open Model License