Benchmarks
When choosing which LLM to use, it's important to consider the model's "intelligence", which the benchmark results below approximate. Use them to determine what model size and quality your use case requires.
Model | Params (in billions) | Function Calling ↓ | MMLU (5-shot) | GPQA (0-shot) | GSM-8K (8-shot, CoT) | MATH (4-shot, CoT) | MT-bench |
---|---|---|---|---|---|---|---|
Claude-3.5 Sonnet | - | 98.57% | 88.7 | 59.4 | - | - | - |
GPT-4o | - | 98.57% | - | 53.6 | - | - | - |
Rubra Llama-3 70B Instruct | 70.6 | 97.85% | 75.90 | 33.93 | 82.26 | 34.24 | 8.36 |
Rubra Llama-3 8B Instruct | 8.9 | 89.28% | 64.39 | 31.70 | 68.99 | 23.76 | 8.03 |
Rubra Qwen2-7B-Instruct | 8.55 | 85.71% | 68.88 | 30.36 | 75.82 | 28.72 | 8.08 |
groq/Llama-3-Groq-70B-Tool-Use | 70.6 | 74.29% | - | - | - | - | - |
Rubra Mistral 7B Instruct v0.3 | 8.12 | 73.57% | 59.12 | 29.91 | 43.29 | 11.14 | 7.69 |
Rubra Phi-3 Mini 128k Instruct | 4.73 | 70.00% | 67.87 | 29.69 | 79.45 | 30.80 | 8.21 |
Rubra Mistral 7B Instruct v0.2 | 8.11 | 69.28% | 58.90 | 29.91 | 34.12 | 8.36 | 7.36 |
Meta/Llama-3.1-70B-Instruct | 70.6 | 63.75% | - | - | - | - | -
meetkai/functionary-small-v2.5 | 8.03 | 57.14% | 63.92 | 32.14 | 66.11 | 20.54 | 7.09 |
Nexusflow/NexusRaven-V2-13B | 13 | 53.75% ∔ | 43.23 | 28.79 | 22.67 | 7.12 | 5.36 |
Mistral Large (closed-source) | - | 48.60% | - | - | 91.21 | 45.0 | - |
meetkai/functionary-medium-v3.0 | 70.6 | 46.43% | 79.85 | 38.39 | 89.54 | 43.02 | 5.49 |
groq/Llama-3-Groq-8B-Tool-Use | 8.03 | 45.70% | - | - | - | - | - |
Rubra Gemma-1.1 2B Instruct | 2.84 | 45.00% | 38.85 | 24.55 | 6.14 | 2.38 | 5.75 |
gorilla-llm/gorilla-openfunctions-v2 | 6.91 | 41.25% ∔ | 49.14 | 23.66 | 48.29 | 17.54 | 5.13 |
NousResearch/Hermes-2-Pro-Llama-3-8B | 8.03 | 41.25% | 64.16 | 31.92 | 73.92 | 21.58 | 7.83 |
Meta/Llama-3.1-8B-Instruct | 8.03 | 32.50% | - | - | - | - | - |
Mistral 7B Instruct v0.3 | 7.25 | 22.50% | 62.10 | 30.58 | 53.07 | 12.98 | 7.50 |
Qwen2-7B-Instruct | 7.62 | - | 70.78 | 32.14 | 78.54 | 30.10 | 8.29 |
Phi-3 Mini 128k Instruct | 3.82 | - | 69.36 | 27.01 | 83.7 | 32.92 | 8.02 |
Mistral 7B Instruct v0.2 | 7.24 | - | 59.27 | 27.68 | 43.21 | 10.30 | 7.50 |
Llama-3 70B Instruct | 70.6 | - | 79.90 | 38.17 | 90.67 | 44.24 | 8.88 |
Llama-3 8B Instruct | 8.03 | - | 65.69 | 31.47 | 77.41 | 27.58 | 8.07 |
Gemma-1.1 2B Instruct | 2.51 | - | 37.84 | 22.99 | 6.29 | 6.14 | 5.82 |
MT-bench scores for all models were generated in June 2024 using GPT-4 as the judge.
MMLU, GPQA, GSM-8K, and MATH were all computed with the Language Model Evaluation Harness.
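If you want to reproduce or extend those columns, the harness exposes a Python entry point. Below is a minimal sketch; the model ID, task names, and settings are illustrative assumptions rather than our exact configuration, and task identifiers vary between harness releases.

```python
# Minimal sketch of scoring one model on one benchmark with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). The model ID and settings are
# illustrative; adjust tasks/num_fewshot to match the table's columns
# (e.g. MMLU 5-shot, GSM-8K 8-shot CoT).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=rubra-ai/Meta-Llama-3-8B-Instruct,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["mmlu"])
```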
Our proprietary function calling benchmark will be open sourced in the coming months; half of it is composed of the quickstart examples found in GPTScript.
Some of the LLMs above require custom libraries to post-process the tool calls they generate. We followed those models' recommendations and guidelines in our evaluation:
- `mistralai/Mistral-7B-Instruct-v0.3` required the `mistral-inference` library to extract function calls.
- `NousResearch/Hermes-2-Pro-Llama-3-8B` required `hermes-function-calling`.
- `gorilla-llm/gorilla-openfunctions-v2` required the special prompting detailed in their GitHub repo.
- `Nexusflow/NexusRaven-V2-13B` required `nexusraven-pip`.
- `functionary-small-v2.5` and `functionary-medium-v3.0` were tested using MeetKai's `functionary` with the vLLM framework. For each model, we compared the results with functionary's Grammar Sampling feature enabled and disabled, taking the highest score from either configuration. The `functionary-small-v2.5` model achieved a higher score than the `functionary-medium-v3.0` model, primarily due to the medium model exhibiting more hallucinations in some of our more advanced test cases.
- `groq/Llama-3-Groq-8B-Tool-Use` and `groq/Llama-3-Groq-70B-Tool-Use` were tested using Groq's API.
- `Meta/Llama-3.1-8B-Instruct` and `Meta/Llama-3.1-70B-Instruct` were tested following Meta's official Llama 3.1 documentation on user-defined custom tool calling.
- ∔ `Nexusflow/NexusRaven-V2-13B` and `gorilla-llm/gorilla-openfunctions-v2` don't accept tool observations (the result of running a tool or function once the LLM calls it), so we appended the observation to the prompt.
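Because those two models cannot consume a dedicated tool-result message, the observation has to be folded back into plain prompt text before the next generation. The sketch below illustrates that pattern; the template and helper name are illustrative assumptions, not the exact formatting used in our evaluation, which follows each model's own prompting guidelines.

```python
# Hedged sketch: feeding a tool result back to a model that has no dedicated
# tool-observation role. The template below is illustrative only.
import json

def append_observation(prompt: str, call: dict, observation: str) -> str:
    """Serialize the executed call and its result, then append them to the prompt."""
    return (
        f"{prompt}\n"
        f"Function call: {json.dumps(call)}\n"
        f"Observation: {observation}\n"
        "Continue the conversation using the observation above.\n"
    )

# Example: the model asked for the weather, we ran the tool, and now we
# re-prompt with the observation appended so it can produce a final answer.
call = {"name": "get_weather", "arguments": {"city": "Paris"}}
observation = '{"temperature_c": 21, "condition": "clear"}'
next_prompt = append_observation("What is the weather in Paris?", call, observation)
print(next_prompt)
```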