Benchmarks
When choosing which LLM to use, it's important to consider the model's "intelligence", which the benchmark results below approximate. Use them to determine what model size and quality your use case requires.
Model | Params (in billions) | Function Calling ↓ | MMLU (5-shot) | GPQA (0-shot) | GSM-8K (8-shot, CoT) | MATH (4-shot, CoT) | MT-bench |
---|---|---|---|---|---|---|---|
Claude-3.5 Sonnet | - | 98.57% | 88.7 | 59.4 | - | - | - |
GPT-4o | - | 98.57% | - | 53.6 | - | - | - |
Rubra Llama-3 70B Instruct | 70.6 | 97.85% | 75.90 | 33.93 | 82.26 | 34.24 | 8.36 |
Rubra Llama-3 8B Instruct | 8.9 | 89.28% | 64.39 | 31.70 | 68.99 | 23.76 | 8.03 |
Rubra Qwen2-7B-Instruct | 8.55 | 85.71% | 68.88 | 30.36 | 75.82 | 28.72 | 8.08 |
groq/Llama-3-Groq-70B-Tool-Use | 70.6 | 74.29% | - | - | - | - | - |
Rubra Mistral 7B Instruct v0.3 | 8.12 | 73.57% | 59.12 | 29.91 | 43.29 | 11.14 | 7.69 |
Rubra Phi-3 Mini 128k Instruct | 4.73 | 70.00% | 67.87 | 29.69 | 79.45 | 30.80 | 8.21 |
Rubra Mistral 7B Instruct v0.2 | 8.11 | 69.28% | 58.90 | 29.91 | 34.12 | 8.36 | 7.36 |
Meta/Llama-3.1-70B-Instruct | 70.6 | 63.75% | - | - | - | - | -
meetkai/functionary-small-v2.5 | 8.03 | 57.14% | 63.92 | 32.14 | 66.11 | 20.54 | 7.09 |
Nexusflow/NexusRaven-V2-13B | 13 | 53.75% ∔ | 43.23 | 28.79 | 22.67 | 7.12 | 5.36 |
Mistral Large (closed-source) | - | 48.60% | - | - | 91.21 | 45.0 | - |
meetkai/functionary-medium-v3.0 | 70.6 | 46.43% | 79.85 | 38.39 | 89.54 | 43.02 | 5.49 |
groq/Llama-3-Groq-8B-Tool-Use | 8.03 | 45.70% | - | - | - | - | - |
Rubra Gemma-1.1 2B Instruct | 2.84 | 45.00% | 38.85 | 24.55 | 6.14 | 2.38 | 5.75 |
gorilla-llm/gorilla-openfunctions-v2 | 6.91 | 41.25% ∔ | 49.14 | 23.66 | 48.29 | 17.54 | 5.13 |
NousResearch/Hermes-2-Pro-Llama-3-8B | 8.03 | 41.25% | 64.16 | 31.92 | 73.92 | 21.58 | 7.83 |
Meta/Llama-3.1-8B-Instruct | 8.03 | 32.50% | - | - | - | - | - |
Mistral 7B Instruct v0.3 | 7.25 | 22.50% | 62.10 | 30.58 | 53.07 | 12.98 | 7.50 |
Qwen2-7B-Instruct | 7.62 | - | 70.78 | 32.14 | 78.54 | 30.10 | 8.29 |
Phi-3 Mini 128k Instruct | 3.82 | - | 69.36 | 27.01 | 83.7 | 32.92 | 8.02 |
Mistral 7B Instruct v0.2 | 7.24 | - | 59.27 | 27.68 | 43.21 | 10.30 | 7.50 |
Llama-3 70B Instruct | 70.6 | - | 79.90 | 38.17 | 90.67 | 44.24 | 8.88 |
Llama-3 8B Instruct | 8.03 | - | 65.69 | 31.47 | 77.41 | 27.58 | 8.07 |
Gemma-1.1 2B Instruct | 2.51 | - | 37.84 | 22.99 | 6.29 | 6.14 | 5.82 |
MT-bench scores for all models were generated in June 2024 using GPT-4 as the judge.
MMLU, GPQA, GSM-8K, and MATH were all computed with the Language Model Evaluation Harness.
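If you want to reproduce or extend those columns, the harness exposes a Python entry point. Below is a minimal sketch; the model ID, task names, and settings are illustrative assumptions rather than our exact configuration, and task identifiers vary between harness releases.

```python
# Minimal sketch of scoring one model on one benchmark with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). The model ID and settings are
# illustrative; adjust tasks/num_fewshot to match the table's columns
# (e.g. MMLU 5-shot, GSM-8K 8-shot CoT).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=rubra-ai/Meta-Llama-3-8B-Instruct,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["mmlu"])
```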
Our proprietary function calling benchmark will be open sourced in the coming months; half of it is composed of the quickstart examples found in GPTScript.
Some of the LLMs above require custom libraries to post-process the tool calls they generate. We followed those models' recommendations and guidelines in our evaluation:
- `mistralai/Mistral-7B-Instruct-v0.3` required the `mistral-inference` library to extract function calls.
- `NousResearch/Hermes-2-Pro-Llama-3-8B` required `hermes-function-calling`.
- `gorilla-llm/gorilla-openfunctions-v2` required the special prompting detailed in their GitHub repo.
- `Nexusflow/NexusRaven-V2-13B` required `nexusraven-pip`.
- `functionary-small-v2.5` and `functionary-medium-v3.0` were tested using MeetKai's `functionary` with the vLLM framework. For each model, we compared the results with functionary's Grammar Sampling feature enabled and disabled, taking the highest score from either configuration. The `functionary-small-v2.5` model achieved a higher score than the `functionary-medium-v3.0` model, primarily due to the medium model exhibiting more hallucinations in some of our more advanced test cases.
- `groq/Llama-3-Groq-8B-Tool-Use` and `groq/Llama-3-Groq-70B-Tool-Use` were tested using Groq's API.
- `Meta/Llama-3.1-8B-Instruct` and `Meta/Llama-3.1-70B-Instruct` were tested following Meta's official Llama 3.1 documentation on user-defined custom tool calling.
- ∔ `Nexusflow/NexusRaven-V2-13B` and `gorilla-llm/gorilla-openfunctions-v2` don't accept tool observations (the result of running a tool or function once the LLM calls it), so we appended the observation to the prompt.
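Because those two models cannot consume a dedicated tool-result message, the observation has to be folded back into plain prompt text before the next generation. The sketch below illustrates that pattern; the template and helper name are illustrative assumptions, not the exact formatting used in our evaluation, which follows each model's own prompting guidelines.

```python
# Hedged sketch: feeding a tool result back to a model that has no dedicated
# tool-observation role. The template below is illustrative only.
import json

def append_observation(prompt: str, call: dict, observation: str) -> str:
    """Serialize the executed call and its result, then append them to the prompt."""
    return (
        f"{prompt}\n"
        f"Function call: {json.dumps(call)}\n"
        f"Observation: {observation}\n"
        "Continue the conversation using the observation above.\n"
    )

# Example: the model asked for the weather, we ran the tool, and now we
# re-prompt with the observation appended so it can produce a final answer.
call = {"name": "get_weather", "arguments": {"city": "Paris"}}
observation = '{"temperature_c": 21, "condition": "clear"}'
next_prompt = append_observation("What is the weather in Paris?", call, observation)
print(next_prompt)
```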