
In my previous article I showed you how to install Ollama and run your first LLM/LVM, recommended some models, gave you examples of how to script with it, and finally showed you how to describe images. This article builds on that and is the second in a series on how to run AI in-house.
In that article we focused on using Ollama – specifically on the CLI – and today we’ll take a look at exposing Ollama over HTTP(S), so that we can securely accept connections remotely or use tools which talk to the HTTP API.
Now, here’s the thing: if you have enough resources on your local machine, you can keep your Ollama installation and just use the Ollama API locally without requiring any further setup. You’ll simply use tools which use the Ollama API and point them to http://localhost:11434/ for your enjoyment. For example, if you’re on a Mac, you can use the Enchanted app for a nice chat UI (I recommend the model mistral for this). If you’re a developer and want an AI assistant for your JetBrains or VS Code IDE, you can use Continue (I recommend the model codellama for this).
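If you haven’t pulled those recommended models yet (pulling models was covered in the previous article), you can do so with:
ollama pull mistral
ollama pull codellama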
However, if you don’t have enough resources locally and want to offload things to a server in a secure manner, read on.
⚠️ In the rest of this article, the expectation is that we’re running Ollama on Linux on a server.
The Ollama Service
After you have installed Ollama on Linux, you should have a systemd service running. You can verify this with:
systemctl status ollama
If this is – for whatever reason – not enabled, you can manually enable and start it with:
systemctl enable --now ollama
However, we can fine-tune the service a little.
systemctl edit ollama
And we write into it:
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
This makes sure that models are not automatically unloaded from memory and instead stay loaded. This is useful if you’re building a dedicated LLM service and want responses to stay fast; otherwise Ollama would unload models after 5 minutes and then have to load them back into memory for new requests, which wastes time. If you have enough memory, keep them in memory!
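As an aside, the same behaviour can also be requested per call: the Ollama API accepts a keep_alive parameter on requests, where a negative value keeps the model loaded indefinitely (more on the HTTP API in a moment). A minimal sketch, assuming you have the llama3.2 model pulled:
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello World",
  "keep_alive": -1,
  "stream": false
}'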
Some more tweaks:
Environment="OLLAMA_MAX_LOADED_MODELS=8"
Environment="OLLAMA_NUM_PARALLEL=6"
Environment="OLLAMA_MAX_QUEUE=512"
Here we specify that up to 8 models should stay in memory – if enough memory is available, that is – (meaning that a 9th model would cause a previously loaded one to be unloaded), that up to 6 requests should be processed in parallel, and finally that up to 512 requests may be queued before the API starts responding with an error due to overload.
The idea here is simple: if you have enough RAM, are using mostly smaller models, and want to use them semi-parallel (for example running Codellama in your IDE while also chatting with Mistral), the context switching will be quicker and you won’t have to wait for Ollama to first unload one model and then load in another. In simple terms: we skip the constant re-reading from disk by keeping the models on standby in RAM.
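Putting it all together, the systemd drop-in (what systemctl edit ollama writes, typically to /etc/systemd/system/ollama.service.d/override.conf) might look like this; the values are just the ones discussed above and should be tuned to your hardware:
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_MAX_LOADED_MODELS=8"
Environment="OLLAMA_NUM_PARALLEL=6"
Environment="OLLAMA_MAX_QUEUE=512"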
Apply the changes by reloading systemd and restarting Ollama:
systemctl daemon-reload
systemctl restart ollama
Your first HTTP request
Assuming you have curl and jq installed, we can run this now:
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello World",
  "stream": false
}' | jq '[.response]'
That’s nice! Now any application which supports the Ollama API can contact Ollama on localhost. But that’s not all: Ollama also – experimentally – supports the OpenAI API, meaning tools that expect OpenAI can use it too!
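For example, a chat completion against Ollama’s OpenAI-compatible endpoint might look roughly like this (a sketch; /v1/chat/completions is the compatibility path Ollama exposes, and the response follows the usual OpenAI shape):
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello World"}]
}' | jq '.choices[0].message.content'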
But … limiting things to localhost isn’t very useful on a server. If you’re running this on your local machine, this is fine, but we want to keep our local machine lightweight and offload Ollama to a server! :D
Nginx + HTTPS
To be able to make use of Ollama from our local machine, e.g. a weak laptop, we’ll want to put Ollama on a beefy server and use it over a secure connection to that server.
For this, we’ll be installing Nginx (apt install nginx) as a reverse proxy, securing it with HTTPS, and finally making sure that some sort of “secret” (API token) is required to talk to Ollama.
I’ll assume you already know how to use Certbot to get a TLS certificate from Let’s Encrypt; if not, you can follow my guide on using Certbot + Cloudflare DNS API here.
I recommend getting a wildcard certificate! We’ll use it for other things in future guides.
I also recommend setting up a subdomain for the Ollama API which points to your server. So if your domain is example.net, set up ollama.example.net; your Nginx config at /etc/nginx/sites-enabled/ollama.example.net might then look something like this:
server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name ollama.example.net;

    ssl_certificate /etc/letsencrypt/live/example.net/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.net/privkey.pem;

    location / {
        proxy_http_version 1.1;

        if ($http_authorization != "Bearer CHANGE_THIS_SECRET") {
            return 401;
        }

        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
    }
}
You’ll of course have to change the server name and the paths to your certificate files, and don’t forget to change CHANGE_THIS_SECRET. You can quickly generate a secret with:
head -c32 /dev/urandom | xxd -p -c 0
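Optionally, if you don’t already have a catch-all redirect, you can add a plain-HTTP server block that sends everything to HTTPS; a minimal sketch using the same example domain:
server {
    listen 80;
    listen [::]:80;
    server_name ollama.example.net;
    return 301 https://$host$request_uri;
}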
Finally, check the config with nginx -t and, if all is OK, reload Nginx with systemctl reload nginx. Then try to reach Ollama over HTTPS at the server name you set:
curl https://ollama.example.net
Should give you something like:
<html>
<head><title>401 Authorization Required</title></head>
<body>
<center><h1>401 Authorization Required</h1></center>
<hr><center>nginx</center>
</body>
</html>
Which means that you’re not authorized (good!)
So, assuming our bearer token in the Nginx config is 123, the curl would look something like this:
curl https://ollama.example.net -H "Authorization: Bearer 123"
Which should return:
Ollama is running
Perfect!
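As a quick end-to-end check, the generate request from earlier should now also work through the proxy, just with the extra header (again assuming the llama3.2 model and the example token 123):
curl -s https://ollama.example.net/api/generate \
  -H "Authorization: Bearer 123" \
  -d '{
  "model": "llama3.2",
  "prompt": "Hello World",
  "stream": false
}' | jq '.response'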
Now we’ve got Ollama secured with HTTPS (TLS) and a required Authorization bearer token. You can now configure supported applications, like the previously mentioned Enchanted, which should be self-explanatory; but read on to see how to configure Continue.dev.
Ollama in the Code Editor
Okay, now we’ve got Ollama, Nginx with HTTPS, and authorization set up. How do we configure Continue.dev? Easy! After installing the Continue plugin, in its config.json you want to specify a model like this:
{
  "title": "Codellama (remote self-hosted)",
  "model": "codellama:latest",
  "completionOptions": {},
  "provider": "ollama",
  "apiBase": "https://ollama.example.net",
  "requestOptions": {
    "headers": {
      "Authorization": "Bearer 123"
    }
  }
}
In newer versions of Continue.dev you can use the apiKey parameter directly:
{
  "title": "Codellama (remote self-hosted)",
  "model": "codellama:latest",
  "completionOptions": {},
  "provider": "ollama",
  "apiBase": "https://ollama.example.net",
  "apiKey": "123"
}
Now, here’s the thing. If your server isn’t that fast, it might get spammed and DoS’d by auto-completion requests. If that’s the case, you probably want to disable autocompletion with:
"tabAutocompleteOptions": {
"disable": true
}
Troubleshooting web server timeouts
If your server is slow or overwhelmed at times, you might want to increase the proxy timeout in Nginx. You can do so by adding this to your Nginx config’s location block:
proxy_read_timeout 600s;
proxy_send_timeout 600s;
This gives you 10 minutes for a request before it times out.
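For placement, the location block from earlier with the timeouts added might look like this (same CHANGE_THIS_SECRET placeholder as before):
location / {
    proxy_http_version 1.1;
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;

    if ($http_authorization != "Bearer CHANGE_THIS_SECRET") {
        return 401;
    }

    proxy_pass http://localhost:11434;
    proxy_set_header Host localhost:11434;
}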
Conclusion
We tested the HTTP API, optimized the systemd service, put Nginx in front of it with HTTPS, and made sure authentication is required; finally, we set up our code editor config to use our remote instance. There you have it! Follow my blog on the fediverse if you want to be notified about future articles in this series.
I run this blog in my free time, and this article took a lot of time and back and forth with testing and documenting. If I helped you out, consider donating a cup of coffee!
Author’s notes
It may seem like a short article, but it takes a lot of research and testing before I can distill things down to just the essentials to quickly help others bootstrap themselves. ☕️ For example, I looked a lot into how to secure Ollama with an API key, and there were many FOSS solutions and scripts to proxy the request; in the end, after reading through a bunch of Nginx docs and testing some applications, I came up with the simple solution of doing a minimalistic header check in Nginx. These things take time, even if the result doesn’t look like much. 😅
Here are some more relevant things to read: