llama.cpp

From llama.cpp: The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

tools.cpp

tools.cpp is Rubra's fork of llama.cpp, offering inference of Rubra's function calling models (and others) in pure C/C++. This guide will walk you through how to install and set up tools.cpp to serve Rubra's models for inference, along with a simple Python function calling example.

Quickstart

1. Clone the Repository

git clone https://github.com/rubra-ai/tools.cpp.git
cd tools.cpp

2. Build from Source

Mac Users:

make

Nvidia GPU (CUDA) Users:

make LLAMA_CUDA=1

3. Install a Helper Package to Fix Rare Edge Cases

info

Assumes you have Node.js and npm installed

npm install jsonrepair --no-bin-links
  • You may need to run the above with sudo, depending on user permissions

4. Download a Compatible Rubra GGUF Model

For example:

wget https://huggingface.co/rubra-ai/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/rubra-meta-llama-3-8b-instruct.Q8_0.gguf
info

For large multi-part model files, such as rubra-meta-llama-3-70b-instruct_Q6_K-0000*-of-00003.gguf, use the following command to merge them before proceeding to the next step:

./llama-gguf-split --merge rubra-meta-llama-3-70b-instruct_Q6_K-0000*-of-00003.gguf rubra-meta-llama-3-70b-instruct_Q6_K.gguf

This merges the multi-part model files into a single GGUF file, rubra-meta-llama-3-70b-instruct_Q6_K.gguf.

5. Start the OpenAI Compatible Server

./llama-server -ngl 37 -m rubra-meta-llama-3-8b-instruct.Q8_0.gguf --port 1234 --host 0.0.0.0 -c 8000 --chat-template llama3

Here -ngl sets how many model layers are offloaded to the GPU, -c sets the context size in tokens, and --chat-template selects the prompt format (see the chat template table at the end of this guide).

6. Test the Server to Ensure Availability

curl localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenabc-123" \
  -d '{
    "model": "rubra-model",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "hello"
      }
    ]
  }'

You should see a response like this:

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":" Hello! How can I assist you today? If you have any questions or need information on a particular topic, feel free to ask.","role":"assistant"}}],"created":1719608774,"model":"rubra-model","object":"chat.completion","usage":{"completion_tokens":28,"prompt_tokens":13,"total_tokens":41},"id":"chatcmpl-2Pr8BAD6b5Gc7sQyLWv7i6l8Sh3QMeI3"}

7. Try a Python Function Calling Example

# If openai is not installed, run `pip install openai`
from openai import OpenAI

client = OpenAI(api_key="xyz", base_url="http://localhost:1234/v1/")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    }
]

messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]
completion = client.chat.completions.create(
    model="rubra-model",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

print(completion)

The output should look like this:

ChatCompletion(id='chatcmpl-EmHd8kai4DVwBUOyim054GmfcyUbjiLf', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='e885974b', function=Function(arguments='{"location":"Boston"}', name='get_current_weather'), type='function')]))], created=1719528056, model='rubra-model', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=29, prompt_tokens=241, total_tokens=270))
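
To complete the loop, your application executes the requested function and sends the result back as a tool message so the model can produce a final answer. Below is a minimal sketch continuing from the snippet above; the local get_current_weather function is a hypothetical stand-in for a real weather lookup:

import json

# Hypothetical local implementation of the tool the model requested.
def get_current_weather(location, unit="fahrenheit"):
    return {"location": location, "temperature": "72", "unit": unit}

# Read the tool call the model returned and run the matching function.
tool_call = completion.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = get_current_weather(**args)

# Append the assistant's tool call and the tool result, then ask the model to finish.
messages.append(completion.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(result),
})

followup = client.chat.completions.create(
    model="rubra-model",
    messages=messages,
    tools=tools,
)
print(followup.choices[0].message.content)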

That's it! For more function calling examples, check out the test_llamacpp.ipynb or test_llamacpp_streaming.ipynb notebooks.
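
The streaming notebook covers streamed responses in detail; as a quick reference, here is a minimal sketch of a plain streaming request (no tools) against the same server, using only the standard OpenAI client streaming interface:

from openai import OpenAI

client = OpenAI(api_key="xyz", base_url="http://localhost:1234/v1/")

# Request a streamed completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="rubra-model",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()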

Choosing a Chat Template for Different Models

Model      Chat Template
Llama3     llama3
Mistral    llama2
Phi3       phi3
Gemma      gemma
Qwen2      chatml

For example, to run Rubra's enhanced Phi3 model, use the following command:

./llama-server -ngl 37 -m rubra-phi-3-mini-128k-instruct.Q8_0.gguf --port 1234 --host 0.0.0.0 -c 32000 --chat-template phi3