Measuring Open-Source Llama Nemotron Models on DeepResearch Bench

Jay Rodge, NVIDIA
Published August 4, 2025 on the Hugging Face Blog

Contributors: David Austin, Raja Biswas, Gilberto Titericz Junior, NVIDIA

NVIDIA’s AI-Q Blueprint, the leading portable, open deep research agent, recently climbed to the top of the Hugging Face “LLM with Search” leaderboard on DeepResearch Bench. This is a significant step forward for the open-source AI stack, proving that developer-accessible models can power advanced agentic workflows that rival or surpass closed alternatives.

What sets AI-Q apart? It fuses two high-performance open LLMs—Llama 3.3-70B Instruct and Llama-3.3-Nemotron-Super-49B-v1.5—to orchestrate long-context retrieval, agentic reasoning, and robust synthesis.

Core Stack: Model Choices and Technical Innovations

  • Llama 3.3-70B Instruct: The foundation for fluent, structured report generation, derived from Meta’s Llama series and openly licensed for broad deployment.
  • Llama-3.3-Nemotron-Super-49B-v1.5: An optimized, reasoning-focused variant. Built via Neural Architecture Search (NAS), knowledge distillation, and successive rounds of supervised and reinforcement learning, it excels at multi-step reasoning, query planning, tool use, and reflection, all with a reduced memory footprint for efficient deployment on standard GPUs.

The AI-Q reference example also includes:

  • NVIDIA NeMo Retriever for scalable, multimodal search across internal and external sources.
  • NVIDIA NeMo Agent toolkit for orchestrating complex, multistep agentic workflows.

The architecture supports parallel, low-latency search over local and web data, making it well suited to use cases that demand privacy, compliance, or on-premises deployment.
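The fan-out pattern described above can be sketched in a few lines. This is a minimal illustration using Python's standard library, not the actual NeMo Retriever API; `search_local` and `search_web` are hypothetical stand-ins for the blueprint's retrieval paths.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two retrieval paths; the real stack
# would call NVIDIA NeMo Retriever and a web-search tool instead.
def search_local(query: str) -> list[str]:
    return [f"local result for {query!r}"]

def search_web(query: str) -> list[str]:
    return [f"web result for {query!r}"]

def parallel_search(query: str) -> list[str]:
    # Fan the query out to both sources concurrently, then merge results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fn, query) for fn in (search_local, search_web)]
        results: list[str] = []
        for future in futures:
            results.extend(future.result())
    return results
```

Because the two searches run concurrently, end-to-end latency is bounded by the slower source rather than the sum of both.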

Deep Reasoning with Llama Nemotron

NVIDIA Llama Nemotron Super isn’t just a fine-tuned instruct model—it’s post-trained for explicit agentic reasoning and supports reasoning ON/OFF toggles via system prompts. You can use it in standard chat LLM mode or switch to deep, chain-of-thought reasoning for agent pipelines—enabling dynamic, context-sensitive workflows.
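A sketch of how that system-prompt toggle might be wired into a pipeline follows. The toggle string shown ("detailed thinking on/off") follows the convention published for earlier Nemotron model cards; verify the exact string for your model revision before relying on it.

```python
def build_messages(query: str, reasoning: bool) -> list[dict]:
    # The system prompt switches the model between standard chat mode and
    # deep chain-of-thought reasoning. The exact toggle string is defined
    # by the model card and assumed here for illustration.
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]
```

An agent pipeline can then flip `reasoning` per step, enabling deep reasoning only for planning and reflection phases where it pays off.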

Key highlights:

  • Multi-phase post-training: Combines instruction following, mathematical/programmatic reasoning, and tool-calling skills.
  • Transparent model lineage: Directly traceable from open Meta weights, with additional openness around synthetic data and tuning datasets.
  • Efficiency: 49B parameters with context windows up to 128K tokens fit on a single H100 GPU, keeping inference fast and costs predictable.

Evaluation: Transparency and Robustness in Metrics

One of the core strengths of AI-Q is transparency—not just in outputs, but in reasoning traces and intermediate steps. During development, the NVIDIA team leveraged both standard and new metrics, such as:

  • Hallucination detection: Each factual claim is checked at generation time.
  • Multi-source synthesis: Synthesis of new insights from disparate evidence.
  • Citation trustworthiness: Automated assessment of claim-evidence links.
  • RAGAS metrics: Automated scoring of retrieval-augmented generation accuracy.
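To make the citation-trustworthiness idea concrete, here is a deliberately simple lexical baseline: score a claim by the fraction of its tokens that appear in the cited evidence. The production pipeline uses stronger methods (LLM judges, RAGAS); this toy function only illustrates the shape of a claim-evidence check.

```python
def claim_support_score(claim: str, evidence: str) -> float:
    # Fraction of the claim's tokens that also appear in the evidence
    # passage. A low score flags a potentially unsupported citation.
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(evidence.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)
```

Running such a check per claim is what enables the granular, stepwise debugging described below: each low-scoring claim points directly at the retrieval or synthesis step that produced it.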

The architecture lends itself perfectly to granular, stepwise evaluation and debugging—one of the biggest pain points in agentic pipeline development.

Benchmark Results: DeepResearch Bench

DeepResearch Bench evaluates agent stacks using a set of 100+ long-context, real-world research tasks (across science, finance, art, history, software, and more). Unlike traditional QA, tasks require report-length synthesis and complex multi-hop reasoning:

  • AI-Q achieved an overall score of 40.52 in the LLM with Search category as of August 2025, currently holding the top spot for any fully open-licensed stack.
  • Strongest metrics: comprehensiveness (depth of report), insightfulness (quality of analysis), and citation quality.

For the Hugging Face Developer Community

  • Both Llama-3.3-Nemotron-Super-49B-v1.5 and Llama 3.3-70B Instruct are available for direct use/download on Hugging Face. Try them in your own pipelines using a few lines of Python, or deploy with vLLM for fast inference and tool-calling support (see the model card for code/serving examples).
  • Open post-training data, transparent evaluation methods, and permissive licensing enable experimentation and reproducibility.
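As a starting point, here is a sketch of a client call against a vLLM OpenAI-compatible endpoint, built with only the standard library. The model ID and local URL are assumptions; check the Hugging Face model card for the exact repository name and recommended serving flags.

```python
import json
import urllib.request

# Assumed model repository ID; verify against the Hugging Face model card.
MODEL_ID = "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"

def chat_request(prompt: str,
                 base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build a POST request for a vLLM OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With a server running, execute the call with:
#   urllib.request.urlopen(chat_request("Summarize DeepResearch Bench"))
```

Any OpenAI-compatible client library works against the same endpoint; the raw-request form above just makes the payload explicit.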

Takeaways

The open-source ecosystem is rapidly closing the gap—and, in some areas, leading—on real-world agent tasks that matter. AI-Q, built on Llama Nemotron, demonstrates that you don’t need to compromise on transparency or control to achieve state-of-the-art results.

Try the stack or adapt it to your own research agent projects from Hugging Face or build.nvidia.com.


Source: Hugging Face Blog (huggingface.co)