A few days ago I was bragging here about how great I am at teaching my AI assistant to operate my property gate.
Today I noticed that it no longer works. I asked it again to:
Open the gate
while wondering why I was actually doing it, since I can just press the button on my phone… but well, people are sometimes ridiculous like that.
Anyway, I wrote the message above and got an answer like the one below:
Which means it does not work because the API key is not authorized.
No credentials to open the gate.
Error: Unauthorized – API key is not correct or does not have access.
But it worked before! And after all, when I pressed the button in my mobile app, which uses the same API, it actually worked! Since I had not changed anything lately, none of it made sense.
I did not have time to investigate right then, but when I did a few hours later, the logs explained what was wrong.
Even worse, all of it was explained inside the skill file. The only thing the model got right was the URL of the endpoint.
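For reference, here is a hedged sketch of what a sane skill entry could look like. Apart from the general idea, every name below (the header, the variable, the URL) is my own assumption, not what nanobot actually uses:

```markdown
## open-gate
Endpoint: POST https://gate.example.local/api/open
Auth: send the key in the X-Api-Key header; read it from the
GATE_API_KEY environment variable. Never write the key into this
file, and never invent one: if the key is missing, ask the user.
```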
I changed the skill file and asked it again to try to open the gate. And it refused again. What was most peculiar was that the hallucinated API key was the same in every retry. But I had not provided an API key like that anywhere.
I asked my assistant about it: “Where does the API key come from?”. It responded with something strange.
The API key 5a3b8c1f-9d2e-4f6a-8b7c-3d1e9f2a4b6c was detected in the SKILL.md file for open-gate and close-gate. It is not publicly available — it was loaded from the system’s configuration.
If you want, I can check whether the key is still valid or update it in the file. Would you like me to do that?
This is funny, because I had not put anything like that in any of those files. Nor any others. Then I asked it to show me its memory entries. And now we were getting somewhere…
Saved facts: API key 5a3b8c1f-9d2e-4f6a-8b7c-3d1e9f2a4b6c
It had the key in its memory. But… again, I had not provided anything like that. So it must have hallucinated it at some point and added it to memory.
I asked it to remove the API key from memory, and it worked from that point on.
What does this have to do with API design in the future?
API design for the models
Imagine you are running a big API for millions of users. You may have some legacy code that has been running for ages and is still in use, so it cannot be changed easily – you can't just remove a part of your product that a significant share of your customers use just because it is old and not up to modern standards.
Or maybe there is some new part of your API that has non-standard headers for some reason. Maybe you want to test a new flow, or a design you believe is better than what the rest of the world is doing. You want to give it a try and show it to the world.
Or maybe your API returns responses in a format other than JSON, because it lets you convert strings more efficiently.
But then you see that a significant share of requests are failing because they differ from your design: they try to use what the common standard dictates, while your docs clearly state something else. There was never any version of your system that was meant to be called like that.
What is happening? You reach out to some of the bigger customers among those users and learn that the requests come from automated systems: LLMs trying to use your API. Some requests get corrected and then revert to being invalid because the models ‘forgot about the fix’. Week after week, month after month, the percentage of invalid calls rises. At some point you start to wonder whether it would not just be easier to change your API and make this way of calling it valid. Maybe you can't fight the tide.
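This pressure is easy to see in your logs. A minimal sketch of how you might surface it – the log shape and the paths are made up for illustration:

```python
from collections import Counter

# Hypothetical access-log entries: (path, status). Illustrative only.
log = [
    ("/v2/gates/open", 200),
    ("/v2/gates/open", 200),
    ("/api/v1/gate/open", 404),  # a shape your docs never described
    ("/api/v1/gate/open", 404),
    ("/api/v1/gate/open", 404),
]

# The most common failing path is a strong hint at what clients *expect*
# your API to look like, whether you designed it that way or not.
failing = Counter(path for path, status in log if status >= 400)
print(failing.most_common(1))  # → [('/api/v1/gate/open', 3)]
```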
You roll out a new version and the erroneous calls go away. Until another weird usage pops up, because apparently nobody reads the documentation anymore: they ask their models to read it and write some code to call your API. But models being models, they get it slightly wrong, and some edge case causes requests to fail.
Do you change your API again? Will your users complain that they can't use it? And how will you then explain that this is not *THE RIGHT* way?
What about the case where some functionality does not exist in your API, but you are seeing thousands of requests trying to use it? Maybe it would be worthwhile to actually add it and then charge for it? After all, errors do not bring revenue.
Maybe you will have a model fixing stuff based on errors in your logs, and it will keep adding features and capabilities to your product that nobody really asked for, but that some models hallucinated.
Or maybe your API was written by an AI. And then another AI wrote the documentation. And then another AI wrote the client. And another AI reads whatever that client returns and presents it to the user. And nobody is sure what is going on with anything anymore.
Future design
I see a couple of possibilities here.
The first is that this will significantly hinder how we can evolve software. After all, why bother trying something new and exciting if your users will stick to the ‘standard usage’ that their models hallucinated? ‘There is really no point in doing that. It won't be used by AIs.’ The new thing might even be better, but until a significant part of the world uses that technology, it won't be in the training data. And if it is not in the training data, the technology won't be used.
Another possibility is that we will create a feedback loop of models feeding on themselves. Some models will be writing and improving code, others will be trying to use it, sometimes incorrectly, and that usage will be fed back into the models that modify the software being used incorrectly. Maybe it will cause a rapid evolution of such systems into something very different from what we are used to. Like an entirely new content type that is a binary serialization of the in-memory representation of tokens, normalized for transport over HTTP.
I think the first one is more likely. The second would require a rapid improvement in the capabilities of such automated systems, and so far I have not seen any convincing example of a bigger product that was written by “AI”.
If that is true, then *future designs of APIs* will be rather safe and boring: more of ‘whatever the rest of the world is doing’. Which is how it already is now, with the ‘this is not RESTful’ or ‘do what Google and Meta are doing’ that you hear or read occasionally.
It is possible that innovation will slow down a bit, but on the other hand… maybe it is for the better? Sometimes it feels like everyone in the software development industry is chasing some vague goal of THE perfect software, but nobody really knows what that ideal piece of code would look like. For now, some people may think it will be a new model, a new AI system, or GenAI. I am not sure about that. After all, the term ‘human error’ comes from somewhere: from the flaw in our own ‘design’ that makes us make mistakes, forget things, and take shortcuts hoping that ‘it will be fine’. And now we are teaching our computers to do the same, but much, much worse, slower, and less effectively.
So what will be future design of APIs?
Seems like it will be: whatever works for the models, or you will be DDoSed with wrong calls.
A few days ago I started playing with an AI assistant and decided to buy new hardware dedicated to running LLMs. I bought a Framework Desktop board. So far it is really great, but running more than one model is a bit difficult, and I wanted to test a few things while my nanobot keeps running undisturbed.
I did a quick search on my phone before buying, and people were saying that it is possible to assign up to 96GB to the GPU via BIOS settings.
When I finally got it and installed a basic system on a USB stick (I did not have a spare NVMe disk at the time), I tested vLLM performance first. It was OK.
But when I tried to load another, a bit bigger model, I hit an OOM exception.
I checked the BIOS settings, and there was only an option for assigning up to 64GB of dedicated RAM to the GPU, which is not what I wanted.
❯ amd-ttm
💻 Current TTM pages limit: 16469033 pages (62.82 GB)
💻 Total system memory: 125.65 GB
So far so good!
I tried changing it to another value:
❯ amd-ttm --set 100
🐧 Successfully set TTM pages limit to 26214400 pages (100.00 GB)
🐧 Configuration written to /etc/modprobe.d/ttm.conf
○ NOTE: You need to reboot for changes to take effect.
Would you like to reboot the system now? (y/n): y
And rebooted!
And guess what? It did not work!
I should have known. From the amount of emojis in there, the whole thing smells of vibe coding. I tried a few more times, but constant restarts of a headless machine get annoying real quick.
I uninstalled it and went looking for another solution, one I think I saw on some forum before I bought the Framework Desktop ($2000 is not exactly cheap!), using a modprobe options line.
But I could not find it, so I gave up for a few days.
A few days later I wanted to test another big model, needed more memory, and had to revisit this problem. Luckily, I was able to find this thread, and it actually worked. I created a file called /etc/modprobe.d/amdgpu_llm_optimized.conf with the following content:
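A file along these lines raises the TTM limit; the exact values here are illustrative, not necessarily the ones I used (the limit is counted in 4KiB pages, so 26214400 pages = 100GiB):

```conf
# /etc/modprobe.d/amdgpu_llm_optimized.conf – illustrative values.
# pages_limit caps how much system memory the GPU can take via TTM;
# page_pool_size caps the cached page pool. Both are 4KiB pages.
options ttm pages_limit=26214400 page_pool_size=26214400
```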
though I must say I am not really sure whether this part was necessary, but since it is really quick and does not break anything, I am including it in this solution.
After that I just rebooted the machine, and it worked. Running amd-smi showed:
I had uninstalled amd-ttm, so I do not know what that tool would show now, but I have more trust in amd-smi as of today. And this really works by reading the /sys/module/ttm/parameters/pages_limit file, which you can read yourself:
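A quick sanity check of the arithmetic: TTM counts memory in 4KiB pages, so you can convert the number from that file yourself:

```python
# Convert a TTM pages_limit value (4 KiB pages) to GiB,
# e.g. the number from: cat /sys/module/ttm/parameters/pages_limit
def pages_to_gib(pages: int, page_size: int = 4096) -> float:
    return pages * page_size / 2**30

print(round(pages_to_gib(16469033), 2))  # → 62.82, the default limit above
print(round(pages_to_gib(26214400), 2))  # → 100.0, the value amd-ttm wrote
```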
I ordered the Framework Desktop motherboard last week, and today it arrived. It was neatly packaged and did not come with anything else than a piece of plastic for mounting your own fan on top of the heatsink. But I did not need much else, at least for the first tests.
I had a compatible PSU lying around. I also had an old fan that I was able to connect to the board; even though it was a bit too small, it would cool the board at least a little.
I was able to connect all of it and boot from the Debian installer.
I had one problem that I did not anticipate: I do not have a spare NVMe disk, and I could not connect another USB thumb drive to act as one. The motherboard has only two USB-A connectors, and I do not have any hub I could attach to it. There are also two USB-C sockets, but I do not have any USB drives with that connector. I also do not have any spare USB cables that I could connect directly to the motherboard headers (there are two sockets for that).
So after I booted Debian, installed Docker and a few other tools, and tried to run Docker on the live system (it failed), I decided it was better to order a disk and wait for it to arrive.
I managed to buy a cheap outlet 256GB NVMe disk. With a bit of luck I got it for 231PLN (about $63). That was OK.
It should be here on Thursday, and I cannot wait to finally be able to test the performance of my AI assistant on this device!
Recently I started playing with an AI assistant using nanobot. It is good, but so far I have been running it on my daily work PC, and it has not been a great experience. So I am exploring some other way to do it, especially for when I am not at home but still want my little helper to do stuff. Running a big PC in an office next to your bedroom is not the best idea. And the electricity bill would not be great either.
What else can you do? You can have some other machine tucked away somewhere. I already have one machine hidden in that manner, running all my services like file sharing, backups, and media. But it has no GPU at all, and on a CPU inference is very slow.
Or at least that is what you read everywhere, but I would not be an engineer if I did not test it myself. Maybe it won't be that bad after all, right? That would solve a lot of my issues, and since I have 128GB of RAM in that machine, maybe it would be possible to run a bigger model, or several smaller ones.
Until that point I had been testing my assistant with a Qwen 3 model on my 7900 XTX GPU. To do that I used the vLLM Docker image with ROCm inside. It was really easy to use on my daily driver. But it would not work on my server, because vLLM does not support CPU inference. So I switched to llama.cpp.
llama.cpp has a nice set of releases, one of them built for CPU. I downloaded it and tried to run the model after tweaking the parameters I had copied from vLLM.
Running a 30B model on CPU was… Let's say you have to be a very calm person to chat with it. Assistant usage requires going through a lot of tokens to generate a correct response, so it is very slow. Very, very SLOW.
Here is an example of my conversation with the bigger model. I asked it to open the gate, and it did, but it took a few minutes of thinking.
Maybe it would be fine if your use case were sending emails to people via the assistant, or summarizing your documents to send off later. Or spell-checking your work.
But for agentic tool use, where you expect an answer fairly quickly, it is not usable.
I then swapped the model to Qwen 0.6B. And I must say, this model responded really quickly. But it was a bit dumb. For example, I asked it to open the gate, and instead of opening it, it said something like:
Gate open 🚪!
Yes I can open the gate but I need API key XxSecret123. If you want me to help you with other tasks I am here to help.
That was a bit strange, and I must say I did not understand what it was telling me. It felt like there was something in that message I was missing. I asked it a bit more about it, but the response was the same every time: it kept asking me to provide the API key, even though it had already been provided.
I understand that it is a bit more secure to have your assistant ask for keys or passwords, but on the other hand, it is not secure if this is not a one-time conversation. And if I had to create a Matrix room every time I asked for something that needs a password, provide it in plain text, and then dispose of the room, that would be a terrible experience. It would be much better to have the assistant use OAuth, or one-time keys generated via some helper that you can disable to cut off access. Like disabling an SSH key on a server if it was compromised.
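As a sketch of what I mean – not how nanobot works, just the general idea of one-time, revocable keys:

```python
import secrets


class KeyIssuer:
    """Illustrative sketch: one-time keys you can revoke centrally."""

    def __init__(self) -> None:
        self._valid: set[str] = set()

    def issue(self) -> str:
        # Hand a fresh short-lived key to the assistant instead of
        # pasting a long-lived secret into the chat.
        key = secrets.token_urlsafe(24)
        self._valid.add(key)
        return key

    def redeem(self, key: str) -> bool:
        # A key works exactly once; cutting access is just emptying
        # the set, like removing a compromised SSH key from a server.
        if key in self._valid:
            self._valid.remove(key)
            return True
        return False
```

Usage: issue a key, let the assistant redeem it once, and any replayed or hallucinated key is simply rejected.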
Anyway, this was not the strangest thing. It was also sending a message to the chat every hour or two.
I think it was connected to the HEARTBEAT.MD functionality of nanobot, where this file is checked periodically. Maybe some garbage got in there, which confused the small model I was running, and it forwarded the garbage to the chat. I changed the model, and it got much better.
What does this have to do with running models on a CPU? You can run an agent fairly fast, and it is pretty responsive, when it is based on a really small model. But it is also pretty dumb at reasoning about anything. It gets confused, it spits garbage. You have to be very explicit to make sure it understands what you are saying. And writing prompts that way takes time and experience.
If you do not want to do that, then you have to run a bigger and smarter model. And that will be slow on a CPU.
So better gear up! It is gonna be expensive if you want to self-host one of those things!
Of course, you can get a more powerful device so you will not notice a few models running in the background, but that is like running a fan heater under your desk.
It is better to have it hidden somewhere in your house where you cannot see or hear it, and where it is not sipping 2kW of electrical power.
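The electricity argument is easy to put in numbers. The $0.30/kWh rate below is my assumption; plug in your own:

```python
# Rough monthly energy cost of a box running 24/7.
def monthly_cost_usd(avg_watts: float, price_per_kwh: float = 0.30) -> float:
    hours = 24 * 30  # one month of always-on operation
    return avg_watts / 1000 * hours * price_per_kwh

print(round(monthly_cost_usd(2000), 2))  # → 432.0, a 2 kW rig under your desk
print(round(monthly_cost_usd(300), 2))   # → 64.8, a ~300 W unified-memory box
```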
Of course, everybody knows that you need a GPU, a lot of VRAM, some RAM, and preferably an NVMe SSD with a lot of space. This is why prices of those components are very high. And they are getting pricier. For example, I built my last server on PCIe 4 last year and bought two 8TB NVMe disks. They cost me around $600. Now one costs around $1200. Things are crazy.
GPUs are the most problematic now, though.
I did some research about what people are buying these days, and it seems like it is mostly a Mac Mini or an RTX 3090. Or a bunch of them, in fact.
I am not really an Apple fan, and even if I could install a different OS on it, I think I would still end up with only a 64GB RAM device for over $3000 (12kPLN). Not great. I thought I could do better.
There is also the possibility of buying a Mac Studio. But the price of that is crazy. For example, 128GB of unified RAM is $7344 (27kPLN) and 256GB is $10k (36kPLN). On the upside, I would be able to run even quantizations of bigger models, and all the frameworks and libraries have great support on those devices; on the other hand, this is a massive amount of money.
There is also the possibility of buying a GPU, or multiple GPUs. But again, this is very pricey. The price of the Nvidia RTX 5090, after all these years, is still huge. And it is just one card, with 32GB of RAM. To run two models at the same time I would need at least two of them, and that would still be just 64GB. There is now a card available for workstations and servers from NVIDIA which has 96GB of RAM and 1792GB/s of memory bandwidth. This looks sweet… But it costs $8-16k (36-60kPLN). This is… No, just no.
There are also AMD cards. Previously they were not worth the trouble. But now, with ROCm being more stable and easier to get running, with vLLM Docker images and official ROCm images from AMD, it is much better. You can have your favorite model running on your Radeon in a few minutes. Or not, if your favorite model happens to have audio capabilities, because support for those means compiling the libraries yourself. Still, this seems like a good choice. You could buy, for example, four Radeon AI Pro R9700 cards and have a total of 128GB of VRAM. But those cards are not being sold at their retail price, costing about 10% more plus VAT, so four of them come to about 60% of the price of a Blackwell 6000, with more RAM, though probably about 60% slower. We can't forget about the power requirements in that scenario either, probably around 1200W for the cards alone. Squeezing four cards into any case is a problem here too. And of course the cooling of such a monster…
But when you do custom things like that, you can hardly ask anyone for help when you have a problem. There is no community for that, or I do not know where their forum is located 🙂 Anyway, it does not seem like a sensible thing to do, given both the scale and the complication.
And there is the new APU from AMD that seemed like a really good idea. In theory it can run as a GPU with up to 120GB of RAM, that is, with 8 gigs left for the system. But the GPU performance is not that great:
it is about the level of a GeForce RTX 4070.
But you can run bigger models; it will be a bit slow, yet bearable. A better use case for those systems, I think, would be running multiple smaller models that answer quicker, though then they may not reason that well about what you are saying, or they may hallucinate. But at least, for about twice the price of a 4070, you get a complete, fully functional system that can work with LLMs and draws less than 300W. That is pretty good.
Honestly, spending $20k on hardware to build a cluster and getting less than 6 tokens per second? You might as well burn your money. Maybe this setup would make sense for training, but not for inference.
So what to do?
Basically, there is no winning scenario here.
Either you run really small models, or you run bigger models reaaaaaaaly slowly, on a CPU
Or you spend less money on a slow device with unified memory and run a bigger set of models, as many as fit in your RAM (like the AMD Ryzen AI Max+ 395)
Or you spend less money on a slow device with unified memory and get better support from frameworks and libraries
Or you spend less money on some mid-range GPU, possibly used, and run some smaller models fast (like a GeForce 3090)
Or you buy a few cards like that used 3090, connect them somehow with bifurcation to one motherboard, and pray you do not go bankrupt when the electricity bill comes
Or, if you are not concerned about money, you buy a bigger device with unified memory (Mac Studio), so you will be able to run a large set of models on any framework
Or, if you really have too much money, you just buy four server-grade Nvidia cards and burn through electricity and your wallet, running giant models really fast
Or… you forget about running it on your own and just buy a subscription or rent a server! This is pretty cost effective, since it is not like you will burn through your token credits by adding groceries to your TODO list
Thinking about it, I realized the last option would be the most reasonable thing to do. But this is not about being reasonable. I like my privacy. That is why I have my own e-mail server. That is why I have my own cloud solution. That is why I have de-Googled my phones. That is why I do not have Windows. I do not plan to sell my data just to have some bot send me a summarization of my own calendar, which I self-host too. This is why I did the next not-so-bad thing and decided to buy a Strix Halo machine, a PC based on the AMD Ryzen AI Max+ 395. With Linux installed and unified memory configured mostly for the GPU, it is a bit slow but should still give me decent inference speed.
Prices of those machines are not great either, but at least they should be power efficient. I considered:
GMKTEC EVO-X2, which is a bit pricey at around 15kPLN (~4kUSD), and people were complaining about some problems with it
There is also the Framework Desktop, which is priced a bit better, and Framework seems like a more trusted brand. This unit also looks pretty slick. Too bad it only has lousy 2.5Gb/s networking
There is the Beelink GTR9 Pro, which looks really nice with 10Gb/s networking and USB4… But it can only be pre-ordered, and you have to wait 35 days
And there is the Minisforum MS-S1 MAX, which looks like the best option there, with a better price, dual 10Gb/s NICs, and USB4. Its price is 13,500PLN (~3672USD)
Minisforum really seemed like the best option. I would have to wait a few days for it, though. The Framework Desktop was available right away, and… I could buy just the motherboard. The design is nice, but they ask you to pay a few dollars each for pieces of plastic.
I could print them myself, though. Or not buy them at all. I decided on the latter and just bought the motherboard.
With predicted delivery in 3 working days and a total price of 8,804.06PLN (~2400USD), it looked like a fairly sensible solution.
I ordered one, and it will arrive on Wednesday! Can't wait!