# Models Grade Report
This document presents the evaluation report for Large Language Models (LLMs) graded by Spice AI. Models are assessed on their basic capabilities, the quality of their tool calls, and the accuracy of their output when integrated with Spice.
For more details on how model grades are evaluated in Spice, refer to the model grading criteria.
Model | Spice Grade | Model Provider | Context Window | Max Output Tokens | Chat Completion | Response Format (Structured Outputs) | Tools | Recursive Tool Calling | Reasoning | Streaming | Model Release Date | Spice Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
o3-mini-2025-01-31 (Reasoning effort: high) | A | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 2025-01-31 | v1.0.2 |
o3-mini-2025-01-31 (Reasoning effort: medium) | B | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 2025-01-31 | v1.0.2 |
o3-mini-2025-01-31 (Reasoning effort: low) | C | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 2025-01-31 | v1.0.2 |
o1-2024-12-17 (Reasoning effort: high) | C | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | 2024-12-17 | v1.0.2 |
o1-2024-12-17 (Reasoning effort: medium) | C | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | 2024-12-17 | v1.0.2 |
o1-2024-12-17 (Reasoning effort: low) | C | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | 2024-12-17 | v1.0.2 |
gpt-4o-2024-08-06 | B | openai | 128k tokens | 16,384 tokens | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | 2024-08-06 | v1.0.2 |
claude-3-5-sonnet-20241022 | C | anthropic | 200k tokens | 8,192 tokens | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | 2024-10-22 | v1.0.2 |
grok-2-1212 | Ungraded | xai | − | − | ✅ | − | − | − | ❌ | − | Not Available | v1.0.2 |
deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Ungraded | huggingface | − | − | ✅ | − | − | − | ✅ | − | Not Available | v1.0.2 |
meta-llama/Llama-3.2-3B-Instruct | Ungraded | huggingface | − | − | ✅ | − | − | − | ❌ | − | Not Available | v1.0.2 |
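To try one of the graded models yourself, add it to your spicepod. Below is a minimal sketch, assuming the standard `spicepod.yaml` model configuration format and an OpenAI API key stored as a Spice secret; the pod name and secret name are illustrative, and the `from` value can be swapped for any model listed above.

```yaml
# spicepod.yaml: a minimal sketch; adjust the pod name and secret to your setup.
version: v1beta1
kind: Spicepod
name: model-grades-demo

models:
  # `from` selects the provider and model id, matching the table above.
  - from: openai:gpt-4o-2024-08-06
    name: gpt-4o
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
```

Once the runtime is started, the model is served through Spice's OpenAI-compatible chat completions endpoint, so capabilities from the table such as tool calling and streaming can be exercised with any standard OpenAI client pointed at the local runtime.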