A llama (Lama glama) in front of the Machu Picchu archeological site, Peru; Photo by Alexandre Buisse; Licensed under CC BY-SA 3.0

In my previous article I showed you how to install Ollama and run your first LLM/LVM, recommended some models, gave examples of how to script with it and finally showed how to describe images. This article builds on that and is the second in a series on how to run AI in-house.

There we focused on using Ollama – specifically on the CLI – and today we’ll take a look at exposing Ollama over HTTP(S) so that we can securely accept remote connections or use tools which rely on the HTTP API.

Now, here’s the thing: if you have enough resources on your local machine, you can keep your Ollama installation as it is and just use the Ollama API locally without any further setup. You simply use tools which speak the Ollama API and point them to http://localhost:11434/ for your enjoyment. For example, if you’re on a Mac, you can use the Enchanted app for a nice chat UI (I recommend the model mistral for this). If you’re a developer and want an AI assistant for your JetBrains or VS Code IDE, you can use Continue (I recommend the model codellama for this).
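If you want to quickly check that the local API is reachable and see which models are available, you can ask the /api/tags endpoint. A minimal sketch, assuming curl and jq are installed:

curl -s http://localhost:11434/api/tags | jq -r '.models[].name'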

However, if you don’t have enough resources locally and want to offload things to a server in a secure manner, read on.

⚠️ In the rest of this article, the expectation is that we’re running Ollama on a Linux server.

The Ollama Service

After installing Ollama on Linux, you should have a systemd service running. You can verify this with:

systemctl status ollama

If this is – for whatever reason – not enabled, you can manually enable and start it with:

systemctl enable --now ollama

However, we can fine-tune the service a little.

systemctl edit ollama

And we write into it:

[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"

This makes sure that models are not automatically unloaded and instead stay in memory. That’s useful if you’re building a dedicated LLM service and want responses to stay fast; otherwise Ollama unloads a model after 5 minutes of inactivity and then has to load it back into memory for new requests, which wastes time. If you have enough memory, keep the models in it!
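To check which models are currently loaded and how long they will stay in memory, you can query the /api/ps endpoint; a small sketch (the exact fields may vary between Ollama versions), where with OLLAMA_KEEP_ALIVE=-1 the expiry should lie far in the future:

curl -s http://localhost:11434/api/ps | jq '.models[] | {name, expires_at}'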

Some more tweaks:

Environment="OLLAMA_MAX_LOADED_MODELS=8"
Environment="OLLAMA_NUM_PARALLEL=6"
Environment="OLLAMA_MAX_QUEUE=512"

Here we specify that up to 8 models may stay loaded in memory – if enough memory is available, that is – meaning that loading a 9th model would unload a previous one. We also specify that up to 6 requests are processed in parallel, and finally we allow up to 512 requests to be queued before the API starts responding with an error due to overload.

The idea here is simple: if you have enough RAM, mostly use smaller models, and want to use them semi-parallel (for example running Codellama in your IDE while also chatting with Mistral), switching between them becomes quicker because you don’t have to wait for Ollama to unload one model and then load another. In simple terms: we skip the constant re-reading from disk by keeping the models on standby in RAM.
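Putting it all together, the systemd drop-in created by systemctl edit (typically /etc/systemd/system/ollama.service.d/override.conf) might look like this sketch; treat the numbers as a starting point to adjust to your hardware:

[Service]
# Keep models in memory indefinitely instead of unloading after 5 minutes
Environment="OLLAMA_KEEP_ALIVE=-1"
# Allow up to 8 models resident in memory at once
Environment="OLLAMA_MAX_LOADED_MODELS=8"
# Process up to 6 requests in parallel
Environment="OLLAMA_NUM_PARALLEL=6"
# Queue up to 512 requests before returning an overload error
Environment="OLLAMA_MAX_QUEUE=512"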

Once the override is in place, reload systemd and restart Ollama so the changes take effect:

systemctl daemon-reload
systemctl restart ollama

Your first HTTP request

Assuming you have curl and jq installed, we can run this now:

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello World",
  "stream": false
}' | jq '[.response]'

That’s nice! Now any application which supports the Ollama API can contact Ollama on localhost. But that’s not all: Ollama also – experimentally – supports the OpenAI API, meaning tools expecting OpenAI can use it too!
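At the time of writing, the OpenAI-compatible endpoints live under /v1; a quick sketch of the same “Hello World” request against the chat completions endpoint (the response follows the OpenAI format):

curl -s http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello World"}]
}' | jq -r '.choices[0].message.content'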

But … limiting things to localhost isn’t very useful on a server. If you’re running this on your local machine, this is fine, but we want to keep our local machine lightweight and offload Ollama to a server! :D

Nginx + HTTPS

To be able to make use of Ollama from our local machine, e.g. a weak laptop, we’ll want to put Ollama on a beefy server and use it over a secure connection to that server.

For this, we’ll install Nginx (apt install nginx) as a reverse proxy, secure it with HTTPS and finally make sure that some sort of “secret” (API token) is required to talk to Ollama.

I’ll assume you already know how to use Certbot to get a TLS certificate from Let’s Encrypt; if not, you can follow my guide on using Certbot + the Cloudflare DNS API here.

I recommend getting a wildcard certificate! We’ll use it for other things in future guides.

I recommend setting up a subdomain for the Ollama API which points to your server. So if your domain is example.net, set up ollama.example.net; your Nginx config at /etc/nginx/sites-enabled/ollama.example.net might then look something like this:

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name ollama.example.net;
    ssl_certificate /etc/letsencrypt/live/example.net/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.net/privkey.pem;

    location / {
        proxy_http_version 1.1;
        # Reject any request that does not carry the exact bearer token
        if ($http_authorization != "Bearer CHANGE_THIS_SECRET") {
            return 401;
        }
        # Forward everything else to the local Ollama instance
        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
    }
}

You’ll of course have to change the server name and the paths to your certificate files, and don’t forget to replace CHANGE_THIS_SECRET. You can quickly generate a secret with:

head -c32 /dev/urandom | xxd -p -c 0
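If xxd isn’t available on your system, openssl can do the same job:

openssl rand -hex 32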

Finally, check the config with nginx -t and, if all is OK, reload Nginx with systemctl reload nginx. Then try to reach Ollama over HTTPS at the server name you set:

curl https://ollama.example.net

Should give you something like:

<html>
<head><title>401 Authorization Required</title></head>
<body>
<center><h1>401 Authorization Required</h1></center>
<hr><center>nginx</center>
</body>
</html>

Which means that you’re not authorized (good!).

So, assuming our bearer token in the Nginx config is 123, the curl would look something like this:

curl https://ollama.example.net -H "Authorization: Bearer 123"

Which should return:

Ollama is running

Perfect!
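A full generation request over the secured endpoint looks just like the local one from earlier, plus the Authorization header (again with 123 standing in for your real token):

curl -s https://ollama.example.net/api/generate \
  -H "Authorization: Bearer 123" \
  -d '{
  "model": "llama3.2",
  "prompt": "Hello World",
  "stream": false
}' | jq -r '.response'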

We now have Ollama secured with HTTPS (TLS) and a required Authorization bearer token. You can configure supported applications, like the previously mentioned Enchanted (which should be self-explanatory), but read on to see how to configure Continue.dev.

Ollama in the Code Editor

Okay, so we have Ollama, Nginx with HTTPS and authorization set up. How do we configure Continue.dev? Easy! After installing the Continue plugin, specify a model like this in its config.json:

{
  "title": "Codellama (remote self-hosted)",
  "model": "codellama:latest",
  "completionOptions": {},
  "provider": "ollama",
  "apiBase": "https://ollama.example.net",
  "requestOptions": {
    "headers": {
      "Authorization": "Bearer 123"
    }
  }
}

In newer versions of Continue.dev you can use the apiKey parameter directly:

{
  "title": "Codellama (remote self-hosted)",
  "model": "codellama:latest",
  "completionOptions": {},
  "provider": "ollama",
  "apiBase": "https://ollama.example.net",
  "apiKey": "123"
}

Now, here’s the thing. If your server isn’t that fast, it might get spammed and DoS’d by auto-completion requests. If that’s the case, you probably want to disable autocompletion with:

"tabAutocompleteOptions": {
  "disable": true
}

Troubleshooting web server timeouts

If your server is slow or overwhelmed at times, you might want to increase the proxy timeouts in Nginx. You can do so by adding this to the location block of your Nginx config:

proxy_read_timeout 600s;
proxy_send_timeout 600s;

This gives a request 10 minutes before it times out.
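For reference, this is roughly how the location block from earlier looks with the timeouts added:

location / {
    proxy_http_version 1.1;
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;
    if ($http_authorization != "Bearer CHANGE_THIS_SECRET") {
        return 401;
    }
    proxy_pass http://localhost:11434;
    proxy_set_header Host localhost:11434;
}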

Conclusion

We tested the HTTP API, optimized the systemd service, put Nginx in front of it with HTTPS and made sure authentication is required; finally, we set up our code editor to use our remote instance. There you have it! Follow my blog on the fediverse if you want to be notified about future articles in this series.

I run this blog in my free time, and setting up this article took a lot of time and back and forth with testing and documenting. If I helped you out, consider donating a cup of coffee!

Author’s notes

It may seem like a short article, but it takes a lot of research and testing before I can distill things down to just the essentials that quickly help others bootstrap themselves. ☕️ For example, I looked a lot into how to secure Ollama with an API key, and there were many FOSS solutions and scripts to proxy the request; in the end, after reading through a bunch of Nginx docs and testing some applications, I came up with the simple solution of doing a minimalistic header check in Nginx. These things take time, even if the result doesn’t look like much. 😅

Here are some more relevant things to read: