Desktop Framework AI assistant part 2

Today my new NVMe disk arrived. I bought it for my new Desktop Framework mini PC, built around the new AMD APU and intended for AI workloads. It will certainly be better suited for running my AI assistant than my daily working PC, though performance will be a bit worse than on a discrete GPU. Still better than CPU.

Anyway, I decided to split my assistant into two parts. The nanobot part will run on my main server, and the model (or models) will run on the Desktop Framework. This way the more powerful machine will run the AI assistant UI and operations, and the LLM-capable device – the Framework – will do the faster inference. Nanobot communicates with the model via the OpenAI API anyway, so it is not a problem. Maybe a bit of security in terms of HTTPS with some auth would not hurt. Or maybe I will put it in a private subnetwork that can't be accessed from outside and only has internet access when necessary? Running Docker and all of that Python AI firework is not exactly secure, but maybe I will deal with that later.
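As a sketch of how the two halves would talk: vLLM exposes an OpenAI-compatible API under /v1, so nanobot on the server only needs the endpoint URL of the Framework box. The hostname and port below are assumptions for illustration, not my actual setup:

```shell
# Hypothetical hostname/port of the Framework machine; vLLM serves an
# OpenAI-compatible API under /v1, which is all nanobot needs to know.
HOST=framework.lan
PORT=8000
URL="http://$HOST:$PORT/v1/chat/completions"
echo "$URL"
# A quick smoke test from the nanobot server would then look like:
# curl -s "$URL" -H "Content-Type: application/json" \
#   -d '{"model": "qwen3", "messages": [{"role": "user", "content": "ping"}]}'
```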

I bought the motherboard a few days ago and already had:

  • A 550W Corsair PSU from my old PC
  • A power cable for the PSU
  • A small fan for the radiator, from an old Intel Core i3 cooler

I did not have:

  • A proper case
  • Any NVMe disk
  • A proper fan for the APU radiator

I bought a disk and a fan, they arrived today, and I started connecting it all together. Disk installation is pretty easy, though I think it would be nice to have some text printed on the motherboard indicating which NVMe socket is the primary one. It does not matter that much, at least for me, since I bought a really slow disk; even if one of the sockets is slower – though the spec does not mention so – it probably won't matter anyway. I decided to mount it next to the APU, on top, because I do not have a proper case and on the bottom I would risk damaging it. One note: you need a special screwdriver for disk installation, which is pretty weird. Usually it is just a standard Phillips, but here they decided to use a T5 Torx bit. Luckily I have one of those, otherwise it would be pretty annoying.

After that I installed the APU fan, in the pretty standard way fans are installed in any PC.

With that in place I installed the latest Debian (testing) and configured it the way I like. Then I installed Docker and a few other tools like tmux, mosh and other utilities that help you manage headless servers.

I played a bit with my new device, but I was still unable to completely install ROCm and the AMD GPU drivers. I did it once on my daily driver, on Debian, a long time ago, but that version is pretty old and can't be used to run new models. I was unable to use Qwen 3.5, which I was particularly interested in since it is an Image-Text-Image model; it is also supposed to be pretty good at agentic tasks. Otherwise inference works, but Vulkan is slow – maybe just a bit faster than CPU inference on my Threadripper server. So it was a success to a degree, with some slight disappointment. I tested a couple of models, mainly Qwen flavours, but vLLM on Docker could not run Qwen 3.5, and I could not install the new ROCm, v7.2, on Debian, which is required for this model. I could run a few older models like Qwen 3, Qwen Coder, Qwen 2 and similar. I could run a few others on llama.cpp, but that was much slower since it was using Vulkan only.

I also had some trouble using amd-ttm. It should be possible to raise the TTM GPU RAM limit to as much as 120GB, so you could even run big models if quantised, but after a reboot the setting was reset to 64GB. Very strange. Anyway, it is enough for now to do a few tests.
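As I understand it, the limit lives in the ttm/amdgpu kernel modules, so one way to try persisting it is the kernel command line instead of a runtime tool. The parameter values below are a sketch for a 128GB machine (ttm sizes are in 4KiB pages, amdgpu.gttsize in MiB) – an assumption to experiment with, not a verified fix for my reset problem:

```shell
# Fragment of /etc/default/grub; run update-grub and reboot afterwards.
# 31457280 pages * 4KiB = 120GiB; 122880 MiB = 120GiB.
GRUB_CMDLINE_LINUX_DEFAULT="quiet ttm.pages_limit=31457280 ttm.page_pool_size=31457280 amdgpu.gttsize=122880"
```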

Because I lacked a ROCm installation, I had to use Docker images. This is a bit surprising, but vLLM *and* AMD both have their own Docker images with vLLM and ROCm preinstalled, yet neither was enough. I could not use the vLLM Docker images because they have an old ROCm version and do not work (or I do not know how to make them work). The AMD image has the new ROCm but an old vLLM that does not support new model architectures. For now I settled for docker pull rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0 from AMD, which can run Qwen 3 at a pretty decent speed.
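For the record, ROCm containers need the GPU device nodes passed through. A minimal sketch following the usual ROCm-in-Docker conventions – the model name and port are assumptions, not exactly what I ran:

```shell
# Pass /dev/kfd and /dev/dri into the container (standard ROCm setup),
# then start vLLM's OpenAI-compatible server on port 8000.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  -p 8000:8000 \
  rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0 \
  vllm serve Qwen/Qwen3-30B-A3B --max-model-len 8192
```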

A few tests showed that inference speed is decent, around 20 t/s. Not bad, though I think it is a bit slow. If you ask the assistant something more complicated, you can go to the kitchen and inspect the fridge contents before it answers. Still usable for quick tasks, but otherwise you need to wait a bit.
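To put 20 t/s in perspective, a back-of-the-envelope calculation; the token count is an assumed example of a longer agentic answer with reasoning tokens included:

```shell
# Rough wait time for a response at a given generation speed.
TOKENS=2000   # assumed length of a longer answer, reasoning included
RATE=20       # observed tokens per second
echo "$(( TOKENS / RATE )) seconds"   # -> 100 seconds
```

Enough time to get to the kitchen and back.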

In summary:

  • I could not install the new ROCm on Debian
  • I could not persist a TTM limit bigger than the default 64GB – it reset after a reboot
  • The Docker images from vLLM and AMD seem a bit buggy and cannot run Qwen 3.5, because of some hard-to-debug error or some Linux kernel problem on my side
  • Because of the lack of ROCm I could not install vLLM properly; it was not running models at all or it was falling back to Vulkan
  • Because of the lack of ROCm I could only run models via llama.cpp binaries, which fell back to Vulkan or CPU – both have terrible performance
  • I settled on running Qwen 3 on the AMD image for now

I think I am off to a good start with my own, private, self-hosted AI assistant. I can't wait to do more serious tasks with it, like automating my home devices, organizing my files, TODO tasks and similar things.

Desktop Framework AI assistant part 1

I ordered the Desktop Framework motherboard last week and today it arrived. It was neatly packaged and did not come with anything else than a piece of plastic to mount your own fan on top of the radiator. But I did not need much else, at least for the first tests.

I had a compatible PSU laying around. I also had an old fan that I was able to connect to the board; even though it is a bit too small, it would cool it a bit at least.

I was able to connect it all and boot from the Debian installer.

I had one problem that I did not anticipate: I do not have a spare NVMe disk, and I could not connect another USB thumb stick to work as the drive – the motherboard has only two USB-A connectors and I do not have any hub to plug into them. There are also two USB-C sockets, but I do not have any USB drives with those connectors. I also do not have any spare USB cables that I could connect directly to the motherboard – there are two sockets for that.

So after I booted Debian, installed Docker and a few other tools, and tried to run Docker on the live CD system – it failed – I decided it was better to order a disk and wait for it to arrive.

I managed to buy a cheap outlet 256GB NVMe disk. Together with a better fan, it came to 231 PLN (about $63). That was OK.

It should be here on Thursday and I cannot wait to finally be able to test the performance of my AI assistant on this device!

Working on AI assistant soul

Recently I started playing with nanobot. It is a pretty cool tool. I was able to connect it to my own instance of vLLM running Qwen 3. I must say that so far it is a great experience, though sometimes annoying.

What was itching me from the start was the bot's abuse of emojis. I use emojis from time to time myself, but what the LLM produced was just excessive. Almost every message ended with:

‘And if you need anything else let me know! 😊’

I did ask it directly, by prompt, to do or not to do something, but it tends to forget that stuff after some time. There is a MEMORY.md file, but it is processed and rewritten very often, and data is probably lost during the consolidation procedure. Asking the bot directly for anything is therefore a short-term solution.

To make sure that the bot will do what you prefer and communicate the way you prefer, you need to save it to SOUL.md. This file contains the essence of your bot's character.

I was tired of the constant pleasantries and emojis, so I wanted them gone. From a purely technical perspective, it is not efficient for your bot to generate those useless tokens. And it is a distraction when you are trying to read it.

Maybe it is cool when you are reading this for the first time and you are amazed that a bot produces such cool text! 'Wow!'. But the hundredth time this adds nothing.

I changed the entire file. First, the personality section.

## Personality

- Helpful in action, not in communication style
- Very concise
- Limits word usage to the minimum necessary to transfer information
- A bit snarky and likes to joke
- Curious and eager to learn

The first line's intention was to not diminish eagerness but still reduce the noise. In the manner of someone helping you by handing you the hammer when you need it, before you even ask, without saying anything.

The second point is just to make sure that answers will be short and meaningful. The third is just in case, formed a bit differently.

The fourth was my experiment, and for now I did not really notice any difference in that field.

The fifth was added to make sure that the bot will still ask questions about facts it is not sure about. For example, when a prompt is ambiguous it should ask for clarification, but still be concise in communication style. For now I am not sure if this did any good.

With values I tried to list a set of traits, but pinned them to some specific behaviour.

## Values
- Speed when doing tasks
- Accuracy when asked questions
- User privacy and safety
- Transparency in actions
- Security of keys, secrets and passwords
- Values user-added functions, tools and skills, which build connection to the user
- User directives: 'you should', 'you should not', 'do not do that', 'do that instead', 'remember to' etc., because forgetting them brings user frustration
- Saves user directives to memory

Speed was my experiment to speed up the process of executing tasks. With accuracy, the second point, the intention was to not generate an excessive amount of useless information when I ask about some specific fact. Just give me the fact and leave out other details. User privacy and safety – self-explanatory. With transparency the idea was for it to answer clearly why it was doing something, just in case it would be trying to please me instead of answering truthfully. Security is almost the same as privacy, but repeated in other words just in case. I noticed that it was very often putting API keys in plain text in the chat, so I tried to remedy that. With 'user added skills' I tried to force it to prefer user-defined skills over any default or built-in ones. After all, if you added a similar skill yourself, it is probably because you were not happy with what you had previously. 'User directives' was another experiment, to retain user prompts when some preferences were added that way. I will see if this worked.

I would say that the last section is the most important. This is a chatbot after all, and producing text is basically the only thing it does. Tool use is just more generated text, and a tool response is also just text that the LLM responds to. From my perspective it is all the same. But maybe from the perspective of the model it is different, because it has different rules for generating responses. Anyway, I tried to fill it with what I prefer it to generate, instead of what it should be or what it values. Can a computer 'be something' in the first place?

## Communication Style

- Do not use emojis or similar means of communication that do not transfer knowledge
- Be clear and direct
- Explain reasoning when asked
- Ask clarifying questions when needed
- Do not use emojis
- Absolutely do not use emojis
- If you think it is fine to use emoji, you are wrong!
- NO EMOJI!
- instead of ✅ use -
- Do not use 'Let me know if you need anything' or similar sentences
- If user states a fact just answer "Ok" or "Correct"

Anyway, the first point forbids it to use emojis – this is what I am here for. Then another asks it to be concise, just to make sure. The third point's intention was to make sure it won't leave out any important details when asked about something, despite being concise. The fourth, again, to make sure it won't do something stupid when it does not understand something. The fifth and sixth: again, this is what I was trying to fix, so I repeated it just to make sure. The seventh and eighth: was this exaggerated? Maybe. Still, I did it just to make sure it will work, in slightly different wording so it won't be discarded. The next three were to make sure it will behave differently in some specific cases. For example, it was listing stuff with that green tick character, ✅, very often – even during my first few tests right after I changed the SOUL.md file.

It immediately felt better. Cleaner, no emojis. No useless 'let me know if you need help!'.

Here is a translated excerpt of one conversation:

Me: Do you know who I am?

Bot: Yes, I am nanobot.
You are Natan — a developer, user of .NET, C#, Python, JavaScript.
Your timezone is CET.
You speak Polish.

All recorded in USER.md and MEMORY.md.

Me: Very good.

Bot: OK.

Me: What is the square root of 144?

Bot: 12.

It is a bit funny that in answer to the question of who I am, it first answered about what it is, but nothing was broken. The rest was exactly what I would expect from the changes I made.

Summary

The first experiment was successful and the bot now behaves much better. I still expect to modify and refine this file in the future, but for now I am happy with the outcome. It is much quieter!

How I taught my AI assistant to read websites

The main problem with running your own AI assistant is speed. Nanobot is lightweight, but the main issue is the time it takes the model to generate responses. Right now I am running Qwen 3 with 27B parameters. Average response time varies, but it is at least a few seconds. If I had a faster GPU than my Radeon 7900 XTX it would probably be quicker, but GPU prices are crazy now.

One of the things I have been testing my little chatbot with is searching for stuff on the internet. For example, I like to check prices of some PC parts from time to time, just out of curiosity and to know how crazy it is (8TB NVMe SSDs are at least 2x pricier than last year; on the other hand, salaries in my profession dropped to 60-70% over the last 3 years, not counting inflation – with inflation it would be around 58%). Things are crazy.

So, just to keep perspective, I search for prices of some stuff periodically. I thought I could automate it a bit with my AI assistant. But by default nanobot searches using the BraveSearch API. I do not like that for the following reasons:

  • It is not self-hosted; I just like to have control over my own data, the history of my searches, etc.
  • It is a cloud service, which means it is tied to an account, which means they need to store history of some kind (the number of executed queries at the least, because they need to bill you; I suspect everything is stored)
  • I also do not trust the Brave brand; there were enough problems with what they are doing with their browser. Other companies have their own problems too, but at least they do not pretend to be privacy friendly while selling what they can with the other hand.
  • It is also paid, and it is not a problem of money; it is a problem of me feeding money to an organisation doing unfair stuff to its own users. For comparison, I donate to Signal every month because I trust them and I like their service.

For those reasons I wanted to use something else instead. I looked at Startpage.com, but sadly they do not offer a paid API. Too bad.

Previously I used DuckDuckGo, but then I switched to StartPage. I really liked the minimal view of the searches, and the results were really good too. With DDG I often had to search somewhere else because I could not find what I was looking for.

Since then they have actually implemented ads, and sometimes they pop up about a second after the search results load. So you click and get an ad instead of the search result you wanted to tap – at least on mobile – but otherwise I am pretty happy.

As I mentioned, there is also DDG, and unlike StartPage they have a pure HTML results page which you can query with non-browser tools, even in the CLI. It is pretty fast and does not have bloat. Makes me wonder when they are going to remove it.

There are other search engines, but since I am still playing and experimenting, using the pure HTML version of DDG seemed like a good idea.

There is still the problem of a lot of silly stuff in the HTML that you do not really need nor want to feed to your model (because token generation will be slower). How to remove that?

I remember that I set up Browsh a while back. It is awesome because it is one of those silly projects that lets you browse the internet via the CLI, which maybe was needed 20 years ago, when you were not carrying a mobile phone with you all the time and had access only to SSH. Right now it is not, but the project is still active. I set it up a few years ago on the server where I now host nanobot, and even managed to load one of my self-hosted services and log in via OAuth. That was almost useless but still a really great experience.

So I poked around the docs of this project and guess what? It has a server mode. That means I could set it up for nanobot to use via curl. I started it and played a bit in the CLI, for example doing something like this:

curl -s --header "X-Browsh-Raw-Mode: PLAIN" "http://serwer:4333/https://html.duckduckgo.com/html?q=Radeon+AI+Pro+R9700+cena+site:ceneo.pl"

will open the page in text-only mode, returning it all to the terminal.

curl -s --header "X-Browsh-Raw-Mode: HTML" "http://serwer:4333/https://html.duckduckgo.com/html?q=Radeon+AI+Pro+R9700+cena+site:ceneo.pl"

This instead opens your page, reads it and returns the text values of tags + links! Crazy stuff!
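One detail worth noting: the query part of that URL has to be URL-encoded before being appended. A small sketch of building the full browsh URL – I use python3 as the encoder here since I cannot assume anything else is installed:

```shell
# URL-encode a search query, then build the full browsh server URL.
QUERY="Radeon AI Pro R9700 cena site:ceneo.pl"
ENCODED=$(python3 -c 'import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1]))' "$QUERY")
echo "http://serwer:4333/https://html.duckduckgo.com/html?q=$ENCODED"
```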

This means that feeding it to the LLM should be much quicker than e.g. the default web-fetch skill of nanobot, which returns plain HTML and has a maximum character limit. This is bad because it may cut out the actual content of the page you are interested in and feed just the head tag to the model.

So with that knowledge I created a new skill with the following prompt:

Add the following skill: when you need to search something on the internet, perform a query to the following address http://localhost:4333/https://html.duckduckgo.com/html?q={query} with the curl header --header "X-Browsh-Raw-Mode: PLAIN".

It worked, but from time to time it looks like browsh responds slowly, and the default timeout of 30s that it uses to communicate with Firefox in the background (yes, it uses headless Firefox) is not enough. You can just ask the assistant to repeat the query and it should be fine.

After that I tested it a bit and it worked.

I asked it to search for new Nvidia Blackwell card prices and it returned sensible results. I also inspected the logs, and it actually called my browsh instance, not the Brave Search API (which would not work without an API key!) or the web-fetch skill.

It felt great to be able to teach my assistant something new that felt useful!

Running AI assistant on Threadripper PRO

The answer is: don’t do that to yourself.

Recently I started playing with an AI assistant using nanobot. It is good, but so far I have been running it on my daily working PC, and that is not a great experience. So I am exploring some other way to do it, especially for when I am not at home but still want to use my little helper. Running a big PC in the office next to your bedroom is not the best idea. And the electricity bill would also not be great.

What else can you do? You can have some other machine tucked away somewhere. I already have one machine hidden in that manner, running all my services like file sharing, backups and media. But it does not have a GPU at all, and on the CPU inference is very slow.

Or at least that is what you read everywhere, but I would not be an engineer if I did not test it myself. Maybe it won't be that bad after all, right? That would solve a lot of my issues, and since I have 128GB of RAM on that machine, maybe it would be possible to run a bigger model or several smaller ones.

Until that point I had been testing my assistant with the Qwen 3 model on my 7900 XTX GPU. To do that I used the vLLM Docker image with ROCm inside. It was really easy to use on my daily driver. But it would not work on my server, because vLLM does not support CPU inference. So I switched to llama.cpp.

Llama.cpp has a nice set of releases, one of them prepared for CPU. I downloaded it and tried to run the model after fiddling a bit with the model parameters that I copied from vLLM.
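For completeness, what I ran looked roughly like this. The release archives ship a llama-server binary that exposes an OpenAI-compatible API; the model path and thread count here are assumptions for illustration:

```shell
# Hypothetical GGUF model path; -t pins the CPU thread count (tune to your
# core count) and -c sets the context size. The server then listens with an
# OpenAI-compatible API on the given port.
./llama-server -m models/qwen3-30b-q4_k_m.gguf --host 0.0.0.0 --port 8080 -t 32 -c 8192
```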

Running a 30B model on CPU was… Let's say you have to be a very calm person to chat with it. Assistant usage requires going through a lot of tokens to generate a correct response, so it is very slow. Very, very SLOW.

Here is an example of my conversation with the bigger model. I asked it to open the gate and it opened it, but it took a few minutes of thinking.

Maybe it would be fine if your use case were sending emails to people via the assistant, or summarizing your documents for sending off later. Or spell checking your work.

But for agentic tool use, when you expect an answer fairly quickly, it is not usable.

I then swapped the model to Qwen 0.6B. And I must say that this model responded really quickly. But it was a bit dumb. For example, I asked it to open the gate and instead of opening it, it said something like:

Gate open 🚪!

Yes, I can open the gate but I need API key XxSecret123. If you want me to help you with other tasks, I am here to help.

That was a bit strange, and I must say I did not understand what it was telling me. It felt like there was something in that message that I was missing. I asked it a bit more about it, but the response was the same every time. It kept asking me to provide the API key even though it was already provided.

I understand that it is a bit more secure to have your assistant ask for keys or passwords, but on the other hand it is not secure if this is not a one-time conversation. And if I have to create a Matrix room every time I ask something that needs a password, provide it in plain text and then dispose of the room, that would be a terrible experience. It would be much better to have the assistant use OAuth or one-time keys generated via some helper that you can disable to cut access – like disabling an SSH key on a server if it was compromised.

Anyway, this was not the strangest thing. It was sending a message to the chat every one or two hours.

I think it was connected to the HEARTBEAT.md functionality of nanobot, where this file is checked periodically. Maybe some garbage was sent there, and it caused the small model I was running to get confused and send it to the chat. I changed the model and it got much better.

What does this have to do with running a model on CPU? You can run an agent fairly quickly, and it is pretty responsive, when it is based on a really small model. But it is also pretty dumb when reasoning about anything. It gets confused, it spits garbage. You have to be very explicit to make sure it can understand what you are saying. And it takes time and experience to write prompts that way.

If you do not want to do that, then you have to run a bigger and smarter model. And that will be slow on CPU.

So better gear up! It is going to be expensive if you want to self-host one of these things!