My first PR to Nanobot

I have been blogging for the last few weeks about my new adventure with nanobot. I am running the model for that little assistant of mine on the Desktop Framework motherboard I bought a couple of weeks ago.

A few days ago I was working on my own fork of nanobot. I was not happy that a few things are not implemented entirely correctly in this project, for example command support in the Matrix channel. I created a PR for that one here. Unfortunately it has not been merged yet. I am not sure why; maybe it is missing a few unit tests.

Yesterday another PR of mine was merged though! I created a change request that adds a kind of streaming support to the nanobot Matrix channel. It is not true streaming; it is a series of edits to the original message, which is sent as soon as the first set of tokens arrives from the LLM.

After the first set of tokens the message is created, and as the next tokens are generated nanobot sends another Matrix event that changes the original one. This is not really an edit either, as Matrix does not edit messages; it is just another event that updates the previous event's data. When I was testing this with a small model, Qwen 3.5 0.6B, which often generates repeating or incorrect messages but does it very fast, on one occasion it generated a lot of tokens, which produced a lot of edits. I think it was about 90 lines, each edited a few times, so the entire message was probably edited about 1000 times. This caused my mobile client to hang and stop getting updates altogether. It was just too much for it to handle.

I had to remove this message entirely 🙂

So it is not ideal, but I still think it is beneficial: it is better to get a quicker response and read through the updates than to wait for the entire LLM response to appear in Element.
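For reference, a Matrix "edit" is just a new m.room.message event carrying an m.replace relation, as defined in the Matrix Client-Server spec. A minimal sketch of the payload this kind of streaming would send (the field names come from the spec; the helper function itself is mine, not nanobot's code):

```python
def make_edit_event(original_event_id: str, new_text: str) -> dict:
    """Build the content of a Matrix message-edit event (m.replace relation).

    Clients that understand edits render m.new_content in place of the
    original message; the top-level body is a fallback for clients that
    do not, conventionally prefixed with "* ".
    """
    return {
        "msgtype": "m.text",
        "body": f"* {new_text}",  # fallback rendering for old clients
        "m.new_content": {"msgtype": "m.text", "body": new_text},
        "m.relates_to": {"rel_type": "m.replace", "event_id": original_event_id},
    }

# Each new batch of tokens would produce one such event replacing the original:
content = make_edit_event("$original_event_id", "partial LLM answer so far")
print(content["m.relates_to"]["rel_type"])  # → m.replace
```

So a long answer streamed this way is one original event plus one replacement event per token batch, which is exactly why a very chatty model can flood a client with hundreds of them.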

I think it is a great success to be part of a new and interesting project. And it is great to be able to benefit from this project by running my self-hosted AI assistant.

Swapping vLLM for Llama.cpp

Yesterday I posted about vLLM being nice but optimized for a different usage than the one I am really interested in. I commented on a Llama Swap GH issue asking for help with my problem of vLLM being shut down before it actually started. While writing that comment I was thinking about Llama.cpp and its inability to run Qwen 3.5. It was at the beginning of March when I started (again) experimenting with LLMs. I tried to run Qwen models with a release version of Llama.cpp and it was failing with ‘model architecture of Qwen3 is not found’ (or something like that; I do not really remember). Because of that I switched to using vLLM; even if it could not run Qwen 3.5, it could run Qwen 3 just fine from the Docker image. SGLang did not seem like a good choice for my needs.

Because of that I was still using vLLM when I switched to running models on the Desktop Framework motherboard. It was totally fine for about two weeks, until I started experimenting with Llama Swap yesterday and it started to pain me that it is so slow to start.

I was writing the comment on the Llama-swap issue and thought: “it has actually been a few weeks, maybe it works now…”. I decided to try it out again.

I downloaded a new release of llama.cpp for ROCm and tried to run the main model I am using now for inference: Qwen/Qwen3.5-35B-A3B. It worked.

And it was quick to start! Really quick!

And it was working almost the same!

And it was using less memory!

Somehow chatting with the model felt a bit worse. Maybe it is how llama.cpp works, or the quantisation was worse, or the settings were a bit different, but… it was starting in several seconds! Outstanding.

I looked for some benchmarking tool for OpenAI-API-compatible models and I found “LLM API Throughput Benchmark“. It does not seem to be the best thing out there, but I just wanted a rough idea of the performance of vLLM and llama.cpp running the same models. I ran Qwen in the 2B version in vLLM and llama.cpp:

vLLM:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-25 21:27:39 UTC+0
################################################################################
Model: unsloth/Qwen3.5-2B        | Latency: 0.60 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   68.49 |     224.32 |        0.17 |        0.17 | 100.00% |     3.18 |
|    2 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.02 |
|    4 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     1.54 |
|    8 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.45 |
|   16 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     5.62 |
|   32 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    12.43 |
|   64 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    24.44 |
|  128 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    48.58 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

And the same in llama.cpp:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-25 21:27:39 UTC+0
################################################################################
Model: unsloth/Qwen3.5-2B        | Latency: 0.60 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   68.49 |     224.32 |        0.17 |        0.17 | 100.00% |     3.18 |
|    2 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.02 |
|    4 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     1.54 |
|    8 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.45 |
|   16 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     5.62 |
|   32 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    12.43 |
|   64 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    24.44 |
|  128 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    48.58 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
================================================================================

Ok. You can see the problem immediately: llama.cpp can’t run *any* queries in parallel at all. But some parameter juggling fixed that, to some extent. I added the parameter -np 4 and it made things much better.

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark                                                    Time: 2026-03-25 21:34:54 UTC+0
################################################################################
Model: unsloth/Qwen3.5-2B        | Latency: 0.60 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   68.46 |     224.32 |        0.17 |        0.17 | 100.00% |     3.19 |
|    2 |   98.97 |     237.95 |        0.32 |        0.32 | 100.00% |     4.41 |
|    4 |  126.60 |     249.43 |        0.61 |        0.61 | 100.00% |     6.89 |
|    8 |  113.77 |      36.54 |        0.65 |        8.32 | 100.00% |    15.33 |
|   16 |  112.97 |      25.55 |        0.64 |       23.80 | 100.00% |    30.88 |
|   32 |  112.79 |      22.17 |        0.64 |       54.86 | 100.00% |    61.85 |
|   64 |  112.36 |      20.76 |        0.66 |      117.14 | 100.00% |   124.17 |
|  128 |  109.85 |      19.69 |        0.67 |      247.08 | 100.00% |   254.03 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

I ran the same tests but with Qwen 0.6B, a smaller model, and the results were even more surprising.

Llama.cpp:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-26 18:51:52 UTC+0
################################################################################
Model: tiny                      | Latency: 0.00 ms
Input: 36                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |  125.41 |     360.00 |        0.10 |        0.10 | 100.00% |     4.08 |
|    2 |  166.04 |     360.00 |        0.20 |        0.20 | 100.00% |     6.17 |
|    4 |  189.06 |     288.00 |        0.50 |        0.50 | 100.00% |    10.83 |
|    8 |  180.52 |      24.55 |        0.48 |       11.73 | 100.00% |    22.69 |
|   16 |  169.31 |      15.71 |        0.47 |       22.91 |  62.50% |    30.24 |
|   32 |  169.91 |      15.76 |        0.44 |       22.84 |  31.25% |    30.13 |
|   64 |  168.29 |      15.59 |        0.49 |       23.09 |  15.62% |    30.42 |
|  128 |  168.84 |      15.62 |        0.48 |       23.05 |   7.81% |    30.32 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

vLLM:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-26 18:55:38 UTC+0
################################################################################
Model: tiny-vllm                 | Latency: 0.20 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   60.45 |     635.45 |        0.06 |        0.06 | 100.00% |     8.47 |
|    2 |  101.16 |     330.72 |        0.23 |        0.23 | 100.00% |    10.12 |
|    4 |  195.48 |    1904.76 |        0.08 |        0.08 | 100.00% |    10.48 |
|    8 |  364.67 |    1790.34 |        0.13 |        0.17 | 100.00% |    11.23 |
|   16 |  418.69 |    3460.84 |        0.07 |        0.11 |  62.50% |    12.23 |
|   32 |  420.42 |    3807.62 |        0.10 |        0.10 |  31.25% |    12.18 |
|   64 |  418.69 |    3460.84 |        0.07 |        0.11 |  15.62% |    12.23 |
|  128 |  433.56 |    3460.84 |        0.07 |        0.11 |   7.81% |    11.81 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

It seems that for a lower number of concurrent users llama.cpp was actually doing a lot better. Of course, at 4 requests it is almost the same, and above that it performs a lot worse, but that is not the use case I am worried about at all.

Then I checked memory consumption. It is important to note that, if you do not forbid it, vLLM will assign all the memory to itself. You could run one model with 2B parameters and it would consume the entire VRAM, all 120GB of it. This makes no sense for my scenario. I changed that when I was working on voice recognition for my assistant, as it requires two models. I fixed it by assigning the lowest possible value to --gpu-memory-utilization that would still work. For Qwen 35B it was about 33%, which is about 40GB. But for llama.cpp it is much less.

+------------------------------------------------------------------------------+
| AMD-SMI 26.2.1+fc0010cf6a    amdgpu version: Linuxver ROCm version: 7.2.0    |
| VBIOS version: 023.011.000.039.000001                                        |
| Platform: Linux Baremetal                                                    |
|-------------------------------------+----------------------------------------|
| BDF                        GPU-Name | Mem-Uti   Temp   UEC       Power-Usage |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti    Fan               Mem-Usage |
|=====================================+========================================|
| 0000:c1:00.0    AMD Radeon Graphics | N/A        N/A   0                 N/A |
|   0       0     N/A             N/A | N/A        N/A              189/512 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU        PID  Process Name          GTT_MEM  VRAM_MEM  MEM_USAGE     CU % |
|==============================================================================|
|    0     148812  llama-server           1.9 GB   40.4 MB     1.9 GB  N/A     |
+------------------------------------------------------------------------------+

2GB! How?! Why?! I need to do more testing, as this may be only some bare minimum and during actual usage it may be much more. But if this is true, combined with the fact that the startup time is much better, it convinces me that vLLM is probably not worth it.

This is how the memory usage looks while running Qwen 35B via vLLM:

+------------------------------------------------------------------------------+
| AMD-SMI 26.2.1+fc0010cf6a    amdgpu version: Linuxver ROCm version: 7.2.0    |
| VBIOS version: 023.011.000.039.000001                                        |
| Platform: Linux Baremetal                                                    |
|-------------------------------------+----------------------------------------|
| BDF                        GPU-Name | Mem-Uti   Temp   UEC       Power-Usage |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti    Fan               Mem-Usage |
|=====================================+========================================|
| 0000:c1:00.0    AMD Radeon Graphics | N/A        N/A   0                 N/A |
|   0       0     N/A             N/A | N/A        N/A              151/512 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU        PID  Process Name          GTT_MEM  VRAM_MEM  MEM_USAGE     CU % |
|==============================================================================|
|    0     149044  python3.12             5.9 MB   62.5 KB    16.0 EB  N/A     |
|    0     149216  python3.12            37.6 GB    3.2 MB    38.5 GB  N/A     |
+------------------------------------------------------------------------------+
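As a back-of-envelope check on that ~33% figure: roughly 35GB of weights plus some headroom for KV cache on a 120GB GPU allocation lands close to the value I found by trial and error. A tiny sketch (the helper and the 15% headroom are my own assumptions, not vLLM's actual accounting):

```python
def min_gpu_utilization(model_vram_gb: float, total_vram_gb: float,
                        headroom: float = 0.15) -> float:
    """Rough lowest --gpu-memory-utilization that still fits the model
    weights plus some headroom for KV cache and activations.
    This is a guesstimate, not how vLLM computes anything internally."""
    return round(model_vram_gb * (1 + headroom) / total_vram_gb, 2)

# ~35 GB of weights on a 120 GB GPU allocation:
print(min_gpu_utilization(35, 120))  # → 0.34
```

In practice I just lowered the flag until vLLM stopped crashing on startup, but the arithmetic agrees with the ~40GB it ends up holding.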

So llama.cpp starts much quicker and uses less memory? There is no discussion here. That is it. I am switching to llama.cpp.

Of course I need to do more testing and more benchmarks, but this seems to be an obvious decision: llama.cpp is the better choice for me for now.

vLLM is fast but kind of slow

I have been using vLLM to run my AI assistant models for a few weeks now. It is a quite nice, production-ready framework that has a few bugs. ROCm support, and by extension Strix Halo support, is quite nice, even if the AMD GPU Docker image does not work on Strix Halo without a custom version of aiter.

Inference speeds are quite nice. You can get 20-30 t/s out of the new AMD APU, which is totally usable. Here is a sample test I did on this device via LLM API Throughput Benchmark:

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   28.48 |      52.79 |        0.72 |        0.72 | 100.00% |    17.98 |
|    2 |   37.82 |      36.37 |        2.09 |        2.09 | 100.00% |    27.08 |
|    4 |   44.40 |      15.06 |       10.08 |       10.09 | 100.00% |    46.13 |

There is one big issue though: vLLM startup time is very slow. Initially I was planning to use one model for everything text-related, maybe for images too, which is totally fine for an ImageText-to-Text model like Qwen 3.5. Another one for Speech-to-Text. Another one for Text-to-Speech. Another one for image generation. Maybe a few others for other uses. But the Desktop Framework, with its 128GB of unified RAM, even if I assigned 120GB of it to the GPU, is still too little to run all of those models at once.

The solution is to run only a few of them at once, or maybe just the one I am currently using. This is a totally sensible thing to do, and it can even be used to run the same model in several different modes. For example, the Hugging Face page for Qwen 3.5 specifies several sampling presets for running Qwen for different purposes:

We recommend using the following set of sampling parameters for generation

    Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
    Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that the support for sampling parameters varies according to inference frameworks.

The ideal thing would be to run one model and specify the ‘mode’ by some API switch. And it happens that some people are doing exactly that using Llama Swap.
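Something like this is what I mean by an ‘API switch’: keep one server running and pick a sampling preset per request. The preset values are copied from the Qwen card quoted above; the request shape follows the OpenAI-compatible chat completions API that both vLLM and llama.cpp expose, and the mode names are my own (note that top_k and min_p are server-side extensions, not part of the official OpenAI schema):

```python
# Sampling presets from the Qwen 3.5 model card:
MODES = {
    "thinking-general":   dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                               presence_penalty=1.5, repetition_penalty=1.0),
    "thinking-coding":    dict(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
                               presence_penalty=0.0, repetition_penalty=1.0),
    "instruct-general":   dict(temperature=0.7, top_p=0.8,  top_k=20, min_p=0.0,
                               presence_penalty=1.5, repetition_penalty=1.0),
    "instruct-reasoning": dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                               presence_penalty=1.5, repetition_penalty=1.0),
}

def build_request(mode: str, prompt: str) -> dict:
    """Payload for a POST to /v1/chat/completions on the local server."""
    return {
        "model": "Qwen/Qwen3.5-35B-A3B",
        "messages": [{"role": "user", "content": prompt}],
        **MODES[mode],
    }

payload = build_request("instruct-general", "Hello!")
print(payload["temperature"], payload["top_p"])  # → 0.7 0.8
```

With this approach only the client-side parameters change between modes, so there is nothing to reload on the server; swapping entire model processes is only needed when the weights themselves differ.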

Llama Swap seems like a very interesting project. I have been using my new AMD Ryzen AI Max+ 395 motherboard as a 24/7 headless server for models lately. And I was wondering what this software is actually doing all that time. How does this impact power usage? Most probably it is constantly doing something, sipping power even when I am asleep.

The logical thing to do would be to run models only when I need them. Maybe swap the assistant model for a coding model (e.g. Qwen Coder) when I am working… But…

vLLM is so slow at startup!

It can take as much as 3 minutes to start Qwen/Qwen3.5-35B-A3B.

During that time you can make yourself some tea.

Swapping models like that would be a bit annoying. Still, I wanted to test it with the Llama Swap project. Otherwise, using the same vLLM model from multiple unconnected sources may cause it to run itself into a corner: it apparently hits some kind of deadlock, generating hundreds of tokens for some requests and nothing for others, and being totally unresponsive while doing that. That was very strange, and the only solution I was able to find was to kill the process entirely.

Running multiple instances would probably help a lot, even if the initial startup time would be annoying.

I am experimenting right now with vLLM started via Llama Swap and it can’t really do that: Llama Swap kills the process after two minutes or so. Right now I can’t force it to run anything bigger than 2B parameters using vLLM – it is just too slow. There should be a configuration switch to change the wait time to something longer, but I can’t find it right now. There is an issue on the Llama Swap GitHub where somebody else was experiencing something very similar. I hope I will be able to solve it and swap running models on the fly, even if this will be very slow.

Integrating my AI assistant with StartPage

Today I was fixing an issue with my assistant having a hard time accessing web search. I fixed it, but I was not entirely satisfied with the result. Browsh is great for that use, but I am using StartPage on my PC and on my phone. The ability to see similar, or even the same, web search results for me and my assistant would be great.

How hard can it be to actually fetch search results via CLI? Actually, it is not that hard. I was able to figure it out, though it requires a very specific set of parameters and headers being sent to a very specific address.

Inspecting a StartPage search I noticed that there is a POST HTTP call being made to the address https://www.startpage.com/sp/search with form parameters like below.

"query={query}&t=device&lui=polski&sc=mgAAkVBCMhaz20&cat=web&abd=0&abe=0&qsr=all&qadf=moderate&with_date="

I was able to recreate the result of this call in the Rider HTTP client with the same headers as in the browser. This means that it should usually be OK to run it programmatically. I am a bit wary about the sc form parameter. But StartPage does not require you to log in or anything like that, so it is most probably a settings cookie, if anything.

I was playing at the same time with Junie, the JetBrains code assistant. I asked it to write a script for me that calls that endpoint and then extracts the data from the HTML. It was quite capable, but I am a bit sad that I was unable to force it to work with my local model instead.

This is what it was able to come up with in the end:

import urllib.request
import urllib.parse
import re
import html
import json
import sys

def clean_html(text):
    # Remove style blocks
    text = re.sub(r'<style.*?>.*?</style>', '', text, flags=re.DOTALL)
    # Remove all HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Decode HTML entities
    text = html.unescape(text)
    # Trim whitespace
    text = text.strip()
    return text

def extract_links(query):
    url = "https://www.startpage.com/sp/search"
    headers = {
        "Host": "www.startpage.com",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "pl,en-US;q=0.7,en;q=0.3",
        "Referer": "https://www.startpage.com/",
        "Content-Type": "application/x-www-form-urlencoded",
        "Origin": "https://www.startpage.com",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-User": "?1",
        "Startpage-Extension": "ext-ff",
        "Startpage-Extension-Version": "2.0.3",
        "Startpage-Extension-Segment": "startpage.defaultffx",
        "Priority": "u=0, i",
        "TE": "trailers"
    }

    # URL-encode the query
    query_encoded = urllib.parse.quote_plus(query)
    body = f"query={query_encoded}&t=device&lui=polski&sc=mgAAkVBCMhaz20&cat=web&abd=0&abe=0&qsr=all&qadf=moderate&with_date="
    data = body.encode('utf-8')

    req = urllib.request.Request(url, data=data, headers=headers, method='POST')
    
    try:
        with urllib.request.urlopen(req) as response:
            content = response.read().decode('utf-8', errors='replace')
    except Exception as e:
        print(f"Error fetching page: {e}")
        return []

    results = []
    
    # We want to find result containers to match title, link and description correctly.
    # Looking at the HTML, each result seems to be in a div with class "result"
    # But regex might be easier if we look for the title link and then the following description.
    
    # Pattern to find the title link and its content
    title_pattern = re.compile(r'<a[^>]+class=[^>]*result-title[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)</a>', re.DOTALL)
    # Pattern to find the description after the title link
    desc_pattern = re.compile(r'<p[^>]+class=[^>]*description[^>]*>(.*?)</p>', re.DOTALL)
    
    # Let's find all occurrences of result-title links
    for match in title_pattern.finditer(content):
        link = match.group(1)
        title_raw = match.group(2)
        title = clean_html(title_raw)
        
        # Look for description starting from the end of the current title match
        search_start = match.end()
        desc_match = desc_pattern.search(content, search_start)
        
        description = ""
        if desc_match:
            # Check if this description belongs to this result (not the next one)
            # Typically descriptions follow titles closely.
            # We can also check if there's another result-title between them.
            next_title_match = title_pattern.search(content, search_start)
            if not next_title_match or desc_match.start() < next_title_match.start():
                description = clean_html(desc_match.group(1))
        
        results.append({
            "link": link,
            "title": title,
            "description": description
        })
        
    return results

if __name__ == "__main__":
    if len(sys.argv) > 1:
        search_query = sys.argv[1]
    else:
        search_query = "nvidia blackwell cena"
        
    links = extract_links(search_query)
    print(json.dumps(links, indent=2, ensure_ascii=False))

A bit messy, I would say, but the flow is as below:

  • read the HTML from the endpoint
  • find a tags with the result-title class
  • extract the title of the link and the address of the link
  • extract the description from the p tag with the description class
  • glue it all together into JSON

After that the chatbot sends the results to the chat, where they can be further refined or used for something else.

Teaching my AI assistant to surf the web part 2

A few days ago I taught my assistant to do web search via Browsh and DuckDuckGo. It was working for some time with small problems; for example, I changed the host name of the machine hosting nanobot and forgot to adjust the host name in the skill URL. Another small pain point is that sometimes it returns a timeout instead of the web page content and then falls back to the Brave Search API, and since I do not like that service there is no API key set, so it falls back to DDG via Python code. So it is mostly slow, but I am still happy with it.
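The fallback order, as I understand it from nanobot’s logs (“BRAVE_API_KEY not set, falling back to DuckDuckGo”), boils down to a check like the sketch below. The function and parameter names here are mine for illustration, not nanobot’s actual internals:

```python
import os

def web_search(query: str, search_brave, search_ddg) -> str:
    """Sketch of the fallback chain I observe in the logs: Brave is used
    only when an API key is configured, otherwise DuckDuckGo.
    The backend callables are injected so the logic is easy to test."""
    if os.environ.get("BRAVE_API_KEY"):
        return search_brave(query)
    # matches the log line: "BRAVE_API_KEY not set, falling back to DuckDuckGo"
    return search_ddg(query)

# Example with stub backends:
result = web_search("rtx r9700 cena",
                    search_brave=lambda q: f"brave:{q}",
                    search_ddg=lambda q: f"ddg:{q}")
```

Since I never set BRAVE_API_KEY, every web_search call in my setup takes the slower DDG path, which is consistent with the timings below.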

Today though I had another problem. This skill did not work at all. I asked about the price of a GPU card and it was slow. It responded in 2 minutes. I immediately thought that something was wrong. I inspected the logs and found:

2026-03-20 07:02:21.773 | INFO     | nanobot.agent.loop:_run_agent_loop:227 - Tool call: exec({"command": "curl -s --header \"X-Browsh-Raw-Mode: PLAIN\" \"http://localhost:4333/https://html.duckduckgo.com/html?q=rtx+r9700+cena\""})
2026-03-20 07:02:41.462 | INFO     | nanobot.agent.loop:_run_agent_loop:227 - Tool call: web_search({"query": "rtx r9700 cena", "count": 5})
2026-03-20 07:02:41.462 | WARNING  | nanobot.agent.tools.web:_search_brave:114 - BRAVE_API_KEY not set, falling back to DuckDuckGo
2026-03-20 07:03:09.058 | INFO     | nanobot.agent.loop:_run_agent_loop:227 - Tool call: web_search({"query": "Radeon AI PRO R9700 cena", "count": 5})
2026-03-20 07:03:09.058 | WARNING  | nanobot.agent.tools.web:_search_brave:114 - BRAVE_API_KEY not set, falling back to DuckDuckGo
2026-03-20 07:03:35.222 | INFO     | nanobot.agent.loop:_process_message:452 - Response to matrix:@natan:m.np0.pl: Radeon AI PRO R9700:

It was slow because:

  • the search was retried 6 times.
  • I entered a nonsensical search term.

The second thing was a bit deliberate, since I am testing the output and reasoning of my assistant this way, and I had not tested this skill since I switched to running Qwen 3.5. Because of that I wanted to see what Qwen 3.5 would do with such a search. It was able to correct it on the fly and did not even comment on it. This is good, but I think it would be better if this were mentioned in the final message, i.e. ‘I changed the search term because the original was incorrect’. But maybe I am just picky.

The first thing was bad, and I was concerned about why this was not working. I tested the output myself in the browser and on the command line via cURL. Everything seemed to be working OK.

I retyped the same message and waited for a response, again for longer than I should have. And the result was exactly the same.

I asked the assistant why it was not using my custom search skill. It answered that the tool call was blocked by a safety guard.

What is the safety guard? I honestly do not know. I do know that there is some protection baked into nanobot to prevent it from executing, for example, rm -rf ~/. But it could also just be hallucinating. I do not know. The answer lacked details, and I do not remember nanobot having any settings for this protective layer. Also, blocking cURL calls to localhost seems like a very questionable decision. Anyway, I asked it to change localhost to the host name, but it said that was blocked too. This looked even more suspicious. But fine, I can ask it to change the skill to include a bash script instead.

Update the web-search-custom skill by adding an execute script that performs the query to the same address and returns the response from that address.

The bot changed the skill and this time it worked, after I asked it to fix the script by removing an incorrect query string parameter from the DDG search.

After that, asking the bot again to search for the price yielded correct results and used the correct skill.

2026-03-20 07:52:29.334 | INFO     | nanobot.agent.loop:_run_agent_loop:227 - Tool call: exec({"command": "~/.nanobot/workspace/skills/web-search-custom/search.sh \"Radeon AI Pro R9700 cena site:ceneo.pl\""})
2026-03-20 07:53:16.425 | INFO     | nanobot.agent.loop:_process_message:452 - Response to matrix:@natan:m.np0.pl: Linki do Ceneo:
1. **PowerColor AI PRO 9700 32GB**: https://www.ceneo.pl/190947446
2. **Gigabyte AI PRO 9700 32GB**: ht...

Less than a minute is not the best, but it is still better than the original 2 minutes.

I need to dig deeper into that safety guard mechanism and disable it for localhost. The previous day I asked the bot about something and it tried to run killall to stop some process. It failed of course, because this user does not have privileges to kill processes on this server, but it still seems really strange that it was able to send such a command while it could not send an HTTP query to a localhost port.