llama.cpp

From llama.cpp: The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

tools.cpp

tools.cpp is Rubra's fork of llama.cpp, offering inference of Rubra's function calling models (and others) in pure C/C++. This guide will walk you through how to install and set up tools.cpp to serve Rubra's models for inference, along with a simple Python function calling example.

Quickstart

1. Clone the Repository

git clone https://github.com/rubra-ai/tools.cpp.git
cd tools.cpp

2. Build from Source

Mac Users:

make

Nvidia GPU (CUDA) Users:

make LLAMA_CUDA=1

3. Install a Helper Package to Fix Rare Edge Cases

info

Assumes you have Node.js and npm installed

npm install jsonrepair --no-bin-links
  • You may need to run the above with sudo, depending on user permissions

4. Download a Compatible Rubra GGUF Model

For example:

wget https://huggingface.co/rubra-ai/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/rubra-meta-llama-3-8b-instruct.Q8_0.gguf
info

For large multi-part model files, such as rubra-meta-llama-3-70b-instruct_Q6_K-0000*-of-00003.gguf, use the following command to merge them before proceeding to the next step:

./llama-gguf-split --merge rubra-meta-llama-3-70b-instruct_Q6_K-0000*-of-00003.gguf rubra-meta-llama-3-70b-instruct_Q6_K.gguf

This merges the multi-part model files into a single GGUF file, rubra-meta-llama-3-70b-instruct_Q6_K.gguf.

5. Start the OpenAI Compatible Server

./llama-server -ngl 37 -m rubra-meta-llama-3-8b-instruct.Q8_0.gguf --port 1234 --host 0.0.0.0 -c 8000 --chat-template llama3

Here -ngl sets how many model layers are offloaded to the GPU, -c sets the context size in tokens, and --chat-template selects the prompt format (see the chat template table at the end of this guide).

6. Test the Server to Ensure Availability

curl localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenabc-123" \
  -d '{
    "model": "rubra-model",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "hello"
      }
    ]
  }'

You should see a response like this:

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":" Hello! How can I assist you today? If you have any questions or need information on a particular topic, feel free to ask.","role":"assistant"}}],"created":1719608774,"model":"rubra-model","object":"chat.completion","usage":{"completion_tokens":28,"prompt_tokens":13,"total_tokens":41},"id":"chatcmpl-2Pr8BAD6b5Gc7sQyLWv7i6l8Sh3QMeI3"}

7. Try a Python Function Calling Example

# If openai is not installed, run `pip install openai`
from openai import OpenAI

client = OpenAI(api_key="xyz", base_url="http://localhost:1234/v1/")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    }
]

messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]
completion = client.chat.completions.create(
    model="rubra-model",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

print(completion)

The output should look like this:

ChatCompletion(id='chatcmpl-EmHd8kai4DVwBUOyim054GmfcyUbjiLf', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='e885974b', function=Function(arguments='{"location":"Boston"}', name='get_current_weather'), type='function')]))], created=1719528056, model='rubra-model', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=29, prompt_tokens=241, total_tokens=270))
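
To complete the loop, your application executes the requested function and sends the result back as a tool message so the model can produce a final answer. Below is a minimal sketch continuing from the snippet above; the local get_current_weather function is a hypothetical stand-in for a real weather lookup:

import json

# Hypothetical local implementation of the tool the model requested.
def get_current_weather(location, unit="fahrenheit"):
    return {"location": location, "temperature": "72", "unit": unit}

# Read the tool call the model returned and run the matching function.
tool_call = completion.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = get_current_weather(**args)

# Append the assistant's tool call and the tool result, then ask the model to finish.
messages.append(completion.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(result),
})

followup = client.chat.completions.create(
    model="rubra-model",
    messages=messages,
    tools=tools,
)
print(followup.choices[0].message.content)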

That's it! For more function calling examples, check out the test_llamacpp.ipynb or test_llamacpp_streaming.ipynb notebooks.
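
The streaming notebook covers streamed responses in detail; as a quick reference, here is a minimal sketch of a plain streaming request (no tools) against the same server, using only the standard OpenAI client streaming interface:

from openai import OpenAI

client = OpenAI(api_key="xyz", base_url="http://localhost:1234/v1/")

# Request a streamed completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="rubra-model",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()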

Choosing a Chat Template for Different Models

Model      Chat Template
Llama3     llama3
Mistral    llama2
Phi3       phi3
Gemma      gemma
Qwen2      chatml

For example, to run Rubra's enhanced Phi3 model, use the following command:

./llama-server -ngl 37 -m rubra-phi-3-mini-128k-instruct.Q8_0.gguf --port 1234 --host 0.0.0.0 -c 32000 --chat-template phi3