# Models Grade Report

This document presents the evaluation report for various Large Language Models (LLMs) graded by Spice AI. The models are assessed on their basic capabilities, the quality of their tool calls, and the accuracy of their output when integrated with Spice.

For more details on how model grades are evaluated in Spice, refer to the model grading criteria.

In addition to the grade, the report records whether each model supports Chat Completion, Response Format (Structured Outputs), Tools, Recursive Tool Calling, Reasoning, and Streaming.

| Model | Spice Grade | Model Provider | Context Window | Max Output Tokens | Model Release Date | Spice Version |
| --- | --- | --- | --- | --- | --- | --- |
| o3-mini-2025-01-31 (Reasoning effort: high) | A | openai | 200k tokens | 100k tokens | 2025-01-31 | v1.0.2 |
| o3-mini-2025-01-31 (Reasoning effort: medium) | B | openai | 200k tokens | 100k tokens | 2025-01-31 | v1.0.2 |
| o3-mini-2025-01-31 (Reasoning effort: low) | C | openai | 200k tokens | 100k tokens | 2025-01-31 | v1.0.2 |
| o1-2024-12-17 (Reasoning effort: high) | C | openai | 200k tokens | 100k tokens | 2024-12-17 | v1.0.2 |
| o1-2024-12-17 (Reasoning effort: medium) | C | openai | 200k tokens | 100k tokens | 2024-12-17 | v1.0.2 |
| o1-2024-12-17 (Reasoning effort: low) | C | openai | 200k tokens | 100k tokens | 2024-12-17 | v1.0.2 |
| gpt-4o-2024-08-06 | B | openai | 128k tokens | 16,384 tokens | 2024-08-06 | v1.0.2 |
| claude-3-5-sonnet-20241022 | C | anthropic | 200k tokens | 8,192 tokens | 2024-10-22 | v1.0.2 |
| grok-2-1212 | Ungraded | xai | Not Available | Not Available | Not Available | v1.0.2 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Ungraded | huggingface | Not Available | Not Available | Not Available | v1.0.2 |
| meta-llama/Llama-3.2-3B-Instruct | Ungraded | huggingface | Not Available | Not Available | Not Available | v1.0.2 |
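
For reference, a graded model is made available to the Spice runtime through a Spicepod model definition. The following is a minimal sketch, assuming the `spicepod.yaml` models syntax from the Spice documentation; the component name `graded-model` and the secret reference are illustrative placeholders:

```yaml
# spicepod.yaml (sketch): registers one of the graded models with the Spice runtime
models:
  - from: openai:gpt-4o-2024-08-06                       # one of the graded models above
    name: graded-model                                   # placeholder component name
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }  # supplied via Spice secrets
```

Once registered, the runtime serves the model through an OpenAI-compatible chat completions endpoint, which is how the capabilities recorded above (chat completion, tool calls, structured outputs, streaming) are exercised during grading.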