Monday, 11 May 2026

Getting there, testing backend services

 

Testing the Components: From Video to Summary

Before building the full application, I wanted to verify the individual components. The goal is to create a tool that downloads a YouTube video's audio, transcribes it, and then summarizes the content.

1. Environment Setup

The application server will run on a Linux VM. I logged into my development station (Dev-Station) and began by installing yt-dlp and ffmpeg:

sudo apt update

sudo apt install yt-dlp ffmpeg


After installation, I noticed the version provided by the package manager was quite outdated:

nobait@Dev-Station:~$ yt-dlp --version

2024.04.09

nobait@Dev-Station:~$ 


To ensure compatibility, I uninstalled the repo version and reinstalled the latest build directly from GitHub:

sudo wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O /usr/local/bin/yt-dlp

sudo chmod a+rx /usr/local/bin/yt-dlp

hash -r

Now we’re up to date:

nobait@Dev-Station:~$ yt-dlp --version

2026.03.17

nobait@Dev-Station:~$ 
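Since yt-dlp versions are date-stamped, the application could later check at startup how stale the installed build is. Here is a minimal Python sketch, assuming the `YYYY.MM.DD` layout seen above; the helper names are hypothetical and not part of yt-dlp.

```python
from datetime import date


def parse_ytdlp_version(version: str) -> date:
    """Turn a date-style version string such as '2026.03.17' into a date.

    Only the first three dot-separated fields are used, so a trailing
    build suffix (if one is present) is ignored.
    """
    year, month, day = (int(part) for part in version.strip().split(".")[:3])
    return date(year, month, day)


def days_between(old: str, new: str) -> int:
    """Number of days separating two date-style version strings."""
    return (parse_ytdlp_version(new) - parse_ytdlp_version(old)).days


# The two versions reported in this post: packaged build vs. GitHub build.
print(days_between("2024.04.09", "2026.03.17"))
```

The gap between the packaged build and the GitHub build comes out at nearly two years, which is a long time for a tool that has to track YouTube changes.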


2. Audio Extraction

Next, I needed to download the audio from a YouTube video. However, since the audio will be processed by the Whisper large-v3-turbo model, it must meet specific requirements: 16-bit, 16 kHz, mono WAV.

I used the following command to download and convert the audio in one go:


yt-dlp -x --audio-format wav --audio-quality 0 \
    --postprocessor-args "ffmpeg:-ar 16000 -ac 1" \
    -o "whisper_input.wav" "VIDEO_URL"



Let's try!

nobait@Dev-Station:~$ yt-dlp -x --audio-format wav --audio-quality 0 \

    --postprocessor-args "ffmpeg:-ar 16000 -ac 1" \

    -o "whisper_input.wav" "https://youtu.be/YBp_PXBbe80"

[youtube] Extracting URL: https://youtu.be/YBp_PXBbe80

[youtube] YBp_PXBbe80: Downloading webpage

WARNING: [youtube] No supported JavaScript runtime could be found. Only deno is enabled by default; to use another runtime add  --js-runtimes RUNTIME[:PATH]  to your command/config. YouTube extraction without a JS runtime has been deprecated, and some formats may be missing. See  https://github.com/yt-dlp/yt-dlp/wiki/EJS  for details on installing one

[youtube] YBp_PXBbe80: Downloading android vr player API JSON

[info] YBp_PXBbe80: Downloading 1 format(s): 251

[download] Destination: whisper_input.webm

[download] 100% of    9.20MiB in 00:00:01 at 6.43MiB/s

[ExtractAudio] Destination: whisper_input.wav

Deleting original file whisper_input.webm (pass -k to keep)

nobait@Dev-Station:~$ ls -ltrh *wav

-rw-rw-r-- 1 nobait nobait 19M May 11 20:41 whisper_input.wav

nobait@Dev-Station:~$ 


Yeah! Great!
The download and conversion worked perfectly, resulting in a 19MB WAV file for a ~10-minute video.
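For the application itself, the same step can be driven from Python and the result checked before it is sent to Whisper. The sketch below is only an illustration: the function names are hypothetical, and it assumes the WAV is plain PCM so that the standard `wave` module can read its header.

```python
import wave


def build_ytdlp_command(url: str, output: str = "whisper_input.wav") -> list[str]:
    """Argument list equivalent to the shell command used above."""
    return [
        "yt-dlp", "-x",
        "--audio-format", "wav",
        "--audio-quality", "0",
        "--postprocessor-args", "ffmpeg:-ar 16000 -ac 1",
        "-o", output,
        url,
    ]


def describe_wav(path: str) -> dict:
    """Read the WAV header and report the properties Whisper cares about."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits": wav.getsampwidth() * 8,
            "rate_hz": rate,
            "seconds": frames / rate,
        }


def is_whisper_ready(info: dict) -> bool:
    """True for 16-bit, 16 kHz, mono audio."""
    return (info["channels"], info["bits"], info["rate_hz"]) == (1, 16, 16000)


def expected_size_mib(seconds: float, rate_hz: int = 16000,
                      bytes_per_sample: int = 2) -> float:
    """Approximate size of the PCM payload in MiB (header ignored)."""
    return seconds * rate_hz * bytes_per_sample / 2**20
```

A typical use would be `subprocess.run(build_ytdlp_command(url), check=True)` followed by `is_whisper_ready(describe_wav("whisper_input.wav"))`. For a 593-second clip, `expected_size_mib(593)` gives about 18.1 MiB, which is in line with the 19M that `ls -h` reported.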


3. Transcription with Whisper

My transcription server is running on a physical host (IP 100.92.17.43) via Tailscale. I verified the service was active:


llama@mf-x1:~$ ps ax | grep whisper-server

1464907 pts/0    Sl     3:08 ./whisper.cpp/build/bin/whisper-server -m models/whisper-large-v3-turbo-q6_k.gguf --host 0.0.0.0 --port 8090

1604391 pts/1    S+     0:00 grep --color=auto whisper-server

llama@mf-x1:~$ ip a | grep 100.92

    inet 100.92.17.43/32 scope global tailscale0

llama@mf-x1:~$ 

I then sent the audio file from the VM to the physical server for inference:

nobait@Dev-Station:~$ time curl http://100.92.17.43:8090/inference   -H "Content-Type: multipart/form-data"   -F file="@whisper_input.wav"   -F language="auto"   -F beam_size="5"   -F best_of="5"   -F response_format="text"


 Hermes Agent is one of the most interesting open-source AI

 projects right now, and honestly,

 it makes sense why it's climbing above tools like OpenClaw,

.

.

.

.

 thought, guys, have an amazing day for positivity. And I'll

 see you guys really shortly. Peace out, fellas.


real 0m28.094s

user 0m0.010s

sys 0m0.014s

nobait@Dev-Station:~$ 


Results: The 9:53 video took roughly 28 seconds to process. That works out to a transcription speed of roughly 21x real-time (a 1-hour video would take about 3 minutes).
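The speed figure is easy to re-derive from the numbers in the transcript above. A small Python sketch follows; note that the wall-clock time from `time curl` also includes the upload, so this is an end-to-end figure rather than pure model speed.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


video_seconds = 9 * 60 + 53   # the 9:53 video
elapsed = 28.094              # "real" time reported by `time`

factor = realtime_factor(video_seconds, elapsed)
one_hour_minutes = 3600 / factor / 60

print(f"{factor:.1f}x real-time; 1 hour of audio in about {one_hour_minutes:.1f} min")
# prints: 21.1x real-time; 1 hour of audio in about 2.8 min
```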

I also compared the performance of the Tailscale IP versus the local LAN IP. Both resulted in identical processing times (~28s), which suggests that Tailscale adds no noticeable latency to the communication in this setup.
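The curl request can also be reproduced from Python using only the standard library. This is a sketch rather than the final application code: the form fields mirror the curl command above, while the function names and the hand-rolled multipart encoder are my own assumptions.

```python
import urllib.request
import uuid


def encode_multipart(fields: dict[str, str], filename: str,
                     audio: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one file part as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    lines: list[bytes] = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: audio/wav",
        b"",
        audio,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def transcribe(wav_path: str,
               url: str = "http://100.92.17.43:8090/inference",
               response_format: str = "text",
               timeout: float = 600.0) -> str:
    """POST a WAV file to whisper-server and return the response body."""
    fields = {
        "language": "auto",
        "beam_size": "5",
        "best_of": "5",
        "response_format": response_format,
    }
    with open(wav_path, "rb") as handle:
        body, content_type = encode_multipart(fields, "whisper_input.wav", handle.read())
    request = urllib.request.Request(url, data=body,
                                     headers={"Content-Type": content_type})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `transcribe("whisper_input.wav")` should then return the same plain-text transcript as the curl call. The generous timeout is deliberate, since a long video keeps the connection open for the whole inference run.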

It is also worth noting that we can create an SRT (subtitle) file straight from the audio file simply by setting the response format to srt:


nobait@Dev-Station:~$ time curl http://100.92.17.43:8090/inference   -H "Content-Type: multipart/form-data"   -F file="@whisper_input.wav"   -F language="auto"   -F beam_size="5"   -F best_of="5"   -F response_format="srt"

1

00:00:00,000 --> 00:00:03,430

 Hermes Agent is one of the most interesting open-source AI


2

00:00:03,430 --> 00:00:05,500

 projects right now, and honestly,

.

.

.

.

205

00:09:48,440 --> 00:09:51,140

 that thought guys, have an amazing day, spread positivity,


206

00:09:51,140 --> 00:09:53,640

 and I'll see you guys really shortly. Peace out, fellas.



real 0m28.662s

user 0m0.007s

sys 0m0.019s

nobait@Dev-Station:~$
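If the application later needs timestamps (for chaptered summaries, for example), the SRT text can be turned into structured cues. Below is a minimal parser sketch, assuming the standard SRT layout of index line, timing line, text lines, and a blank line between cues; the `Cue` class and function names are hypothetical.

```python
import re
from dataclasses import dataclass

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def to_seconds(stamp: str) -> float:
    """Convert 'HH:MM:SS,mmm' into seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt: str) -> list[Cue]:
    """Split SRT text into cues; blocks that do not look like cues are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        text = " ".join(line.strip() for line in lines[2:])
        cues.append(Cue(int(lines[0].strip()), to_seconds(start), to_seconds(end), text))
    return cues
```

With the output above, the last cue's end time (`00:09:53,640`) converts to roughly 593.6 seconds, which matches the 9:53 video length.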

4. Summarization with Qwen 3.6

Finally, I tested the summarization model. On the same physical host, I have a llama-server instance running Qwen 3.6 (35B):


llama@mf-x1:~$ ip a | grep 100.92

    inet 100.92.17.43/32 scope global tailscale0

llama@mf-x1:~$ ps ax | grep llama-server

 932133 ?        Sl     3:18 llama-server -m /home/llama/models/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf --mmproj /home/llama/models/mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf --alias qwen3.6 --host 0.0.0.0 --port 8080 -ngl 99 -fa on -c 240000 --cache-type-k q4_0 --cache-type-v q4_0 -b 512 -ub 512 --no-mmap --mlock --image-min-tokens 1024 --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 --tools all --parallel 2 --reasoning on --reasoning-budget 400 --offline

llama@mf-x1:~$ 

Using jq to wrap the transcript (saved as output.txt) in a JSON payload, I requested a summary:

nobait@Dev-Station:~$ time curl http://100.92.17.43:8080/v1/chat/completions   -H "Content-Type: application/json"   -d "$(jq -n --arg content "$(cat output.txt)" '{

    "model": "qwen3.6",

    "messages": [

      {"role": "system", "content": "You are a helpful assistant that summarizes text concisely."},

      {"role": "user", "content": ("Please summarize the following text:\n\n" + $content)}

    ],

    "temperature": 0.7

  }')"

{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"**Summary of Hermes Agent**\n\nHermes is an open-source, autonomous AI agent that stands out for its **persistent memory, self-improving capabilities, and deep user modeling**. Unlike typical chatbots, it runs continuously on local infrastructure, learns from past tasks, refines reusable skills, and adapts to user preferences over time. It is often compared to OpenClaw but is preferred for its reliability and focus on depth over breadth.\n\nA major recent upgrade is the launch of an **official desktop app** (Windows, macOS, Linux), which replaces the previously CLI-heavy experience with a visual interface for easier multi-agent management, workflow orchestration, and system configuration.\n\n**Key Features & Setup:**\n- **Easy Setup:** Install via platform-specific installer. Supports multiple AI providers (OpenAI\n- **Highly Configurable UI:** Manage sessions, personas, cron jobs, memory, tools (web, browser, terminal), and integrations (Discord, Telegram, etc.).\n- **Seamless Migration:** Easily import existing OpenClaw configurations, API keys, and sessions.\n- **Use Cases:** Functions as a 24/5 assistant, second brain, financial analyst, app builder, and content creator (e.g., autonomously generating HTML-native videos).\n\n**Bottom Line:** Hermes bridges the gap between research-grade AI and everyday usability, offering a robust, self-learning agent that grows smarter with use. For best results, pair it with a premium AI provider and explore its extensive tool/skill ecosystem.","reasoning_content":"Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Input Text:** A lengthy transcript/article about \"Hermes Agent,\" an open-source AI project. 
It covers its features, comparison to competitors (OpenClaw, CloudCode, Kilo), key capabilities (persistent memory, self-improving loop, user modeling, autonomous video generation), setup experience (historically CLI-focused, now has a desktop app), installation/setup process, UI features, integrations, migration from OpenClaw, use cases, and a call to action (newsletter, Discord, subscribe).\n   - **Task:** Summarize the text concisely.\n   - **Key Constraints:** Be concise, capture main points, maintain accuracy.\n\n2.  **Identify Key Information (Mental Extraction):**\n   - *What is Hermes Agent?* An open-source, autonomous AI agent designed for long-term, continuous use on local infrastructure.\n   - *Key Features:* Persistent cross-session memory, self-improving learning loop, user modeling (Poncho), reusable skill generation, deep understanding over time, focuses on depth over breadth.\n   - *Comparison:* Often compared to OpenClaw; preferred for reliability, memory, and self-improvement. Also competes with CloudCode/Kilo.\n   - *Recent Development:* Historically CLI-heavy, now offers an official open-source desktop app (Windows, macOS, Linux) to improve usability and accessibility.\n   - *Setup/Usage:* Easy installation via `.dmg`/`.exe`. Supports multiple AI providers (OpenAI, Anthropic, local models, free tiers). 
Configurable via UI: sessions, profiles, 3D office view, model/provider management, tools (web search, browser, terminal, etc.), cron jobs, personas, memory, gateway integrations (Telegram, Discord, etc.), and OpenClaw migration.\n   - *Capabilities/Use"}}],"created":1778534134,"model":"qwen3.6","system_fingerprint":"b9010-d05fe1d7d","object":"chat.completion","usage":{"completion_tokens":716,"prompt_tokens":2310,"total_tokens":3026,"prompt_tokens_details":{"cached_tokens":2306}},"id":"chatcmpl-tAlVeJsD6K2K69MTkE7KypospjxNv65l","timings":{"cache_n":2306,"prompt_n":4,"prompt_ms":89.455,"prompt_per_token_ms":22.36375,"prompt_per_second":44.71521994298809,"predicted_n":716,"predicted_ms":29352.035,"predicted_per_token_ms":40.994462290502796,"predicted_per_second":24.393538642209986}}

real 0m29.515s

user 0m0.006s

sys 0m0.009s

nobait@Dev-Station:~$


OK!
The model returned a high-quality, structured summary of the "Hermes Agent" project in about 29 seconds.
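In the application, the jq and curl combination can be replaced by a few lines of Python. This sketch mirrors the request shown above; the function names are hypothetical. It returns only `message.content`, because the response above also carries a separate `reasoning_content` field that should not be shown to the user.

```python
import json
import urllib.request


def build_summary_request(transcript: str, model: str = "qwen3.6") -> dict:
    """Same payload that the jq expression above produces."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are a helpful assistant that summarizes text concisely."},
            {"role": "user",
             "content": "Please summarize the following text:\n\n" + transcript},
        ],
        "temperature": 0.7,
    }


def extract_summary(response: dict) -> str:
    """Return the final answer and ignore the reasoning_content field."""
    return response["choices"][0]["message"]["content"]


def summarize(transcript: str,
              url: str = "http://100.92.17.43:8080/v1/chat/completions",
              timeout: float = 600.0) -> str:
    """Send the transcript to llama-server and return the summary text."""
    body = json.dumps(build_summary_request(transcript)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_summary(json.load(response))
```

Letting `json.dumps` build the payload does the same job as `jq --arg`: quotes and newlines in the transcript are escaped correctly instead of breaking the request body.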

Conclusion

Between transcription (~28s) and summarization (~29.5s), the total processing time is roughly 1/10th of the video's duration.

The backend services are verified and operating according to specifications. With the plumbing out of the way, I’m ready to start building the app!

Stay tuned for the next post where I'll dive into the application code.

Thanks for reading!



AI Lab on a budget

The Mouse That Roared: Building a Collapsed AI Lab on a Limited Budget

Many AI enthusiasts assume that meaningful development requires at least one NVIDIA RTX 5090. While these cards are undeniably fast, they are often "VRAM-starved" for large-scale work, effectively limiting their utility to models and contexts that fit within the card's 32GB of VRAM. Supporting a full stack (including embedding models, rerankers, and transcription engines like the one discussed here) would typically necessitate additional cards and host systems, which drastically inflates cost, complexity, noise, and power consumption.

In response, I have developed the Collapsed AI Lab, a setup I call the "Roaring Mouse". This system is low-cost, has a minimal physical footprint, and maintains low power consumption. Crucially, it can run massive models that would exceed the memory capacity of an RTX 5090, forcing that high-end hardware into a performance crawl by comparison. It serves as the ideal home-lab environment for development and Proof-of-Concept (PoC) configurations.


======================================================

       NODE PROFILE: mf-x1 | STRIX POINT

======================================================

[ CPU ]

Model:      AMD Ryzen AI 9 HX 470 w/ Radeon 890M

Topology:   24 Threads

Load:       0.01, 0.03, 0.00


[ MEMORY ]

System:     54Gi Total (37Gi Used)

(actually 64Gi total, 8GB set as dedicated VRAM)

[ GPU & COMPUTE ]

Hardware:   AMD Radeon 890M (GFX1150)

VRAM/GTT:   8192 MiB Dedicated / 49152 MiB Shared

Vulkan API: 1.4.318

NPU Status: XDNA Driver Loaded (AI Ready)


[ STORAGE ]

Root FS:    773G / 1.9T (43% used)

Disk:       zd0 (50G) 

Disk:       zd16 (50G) 

Disk:       zd32 (150G) 

Disk:       zd48 (300G) 

Disk:       nvme0n1 (1.9T) KINGSTON


[ NETWORK & OS ]

Local IP:   172.17.212.252/24 on enp194s0

Kernel:     6.17.0-23-generic

Uptime:     up 1 week, 2 days, 22 hours, 7 minutes

======================================================

llama@mf-x1:~$ 


The Philosophy: Physical Models, Virtual Incubator

To maximize performance, the lab is designed with a strict architectural split:

  • The Physical Layer (The Muscle): The heavyweight models—transcription (Whisper) and reasoning/summarization (Qwen)—run directly on the host OS (Ubuntu 24.04). This gives them bare-metal access to the GPU and RAM without virtualization overhead.

  • The Virtual Layer (The Incubator): The applications run within isolated environments managed by LXC. This acts as a software incubator, housing various projects under different accounts while they "call out" to the physical layer for AI processing.

What is LXC?

LXC (Linux Containers) is a system-level virtualization method for running multiple isolated Linux systems on a single control host. Unlike traditional Virtual Machines (VMs) that emulate hardware and run their own kernels, containers share the host's kernel, making them incredibly lightweight and efficient. My setup uses a management layer that handles both high-density Containers and full Virtual Machines, allowing me to toggle environments as needed.

Current Incubator Status:

llama@mf-x1:~$ lxc list
+-------------+---------+----------------------------+-----------------+
|    NAME     |  STATE  |            IPV4            |      TYPE       |
+-------------+---------+----------------------------+-----------------+
| Dev-Station | RUNNING | 172.17.212.246 (enp5s0)    | VIRTUAL-MACHINE |
| cognivault  | RUNNING | 172.17.212.247 (enp5s0)    | VIRTUAL-MACHINE |
| Atrus       | STOPPED |                            | VIRTUAL-MACHINE |
| WeCa        | STOPPED |                            | VIRTUAL-MACHINE |
| WebRAG      | STOPPED |                            | VIRTUAL-MACHINE |
| sd-forge    | STOPPED |                            | CONTAINER       |
+-------------+---------+----------------------------+-----------------+

Note: Active projects like Dev-Station and cognivault run as VMs within this layer, while specialized tools like sd-forge are kept in lightweight containers.


Hardware: Minisforum AI X1 Pro-470

The "Roaring Mouse" is built on a headless workstation profile tuned for high-capacity memory tasks:

Component | Specification
Model     | AMD Ryzen AI 9 HX 470 (24 Threads)
GPU       | AMD Radeon 890M (GFX 11.5.0)
Memory    | 64GB, VRAM up to 54Gi Total (8GB Dedicated / 48GB Shared GTT)
Storage   | 2TB NVMe
OS/Kernel | Ubuntu 24.04.4 LTS / Kernel 6.17.0

The Engine Room: Physical Layer Services

Currently, the transcription and generation engines are hosted directly on the metal to give them direct access to the Radeon 890M:

1. High-Precision Transcription (Port 8090)

I run whisper-server using the large-v3-turbo model, quantized to 6 bits (Q6_K):

whisper-server -m models/whisper-large-v3-turbo-q6_k.gguf --host 0.0.0.0 --port 8090

The general-purpose generative AI model I use is an abliterated version of Qwen 3.6 35B, also quantized to 6 bits:

llama-server -m /home/llama/models/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf --mmproj /home/llama/models/mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf --alias qwen3.6 --host 0.0.0.0 --port 8080 -ngl 99 -fa on -c 240000 --cache-type-k q4_0 --cache-type-v q4_0 -b 512 -ub 512 --no-mmap --mlock --image-min-tokens 1024 --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 --tools all --parallel 2 --reasoning on --reasoning-budget 400 --offline
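Because every project in the incubator depends on these two services, a quick reachability check before starting a job can save some debugging. The sketch below only tests that the TCP ports accept connections and says nothing about whether the models are loaded. The ports come from the commands above, the address is the Tailscale IP of mf-x1, and the function names are hypothetical.

```python
import socket

# Tailscale address of mf-x1 and the ports from the two server commands.
SERVICES = {
    "whisper-server": ("100.92.17.43", 8090),
    "llama-server": ("100.92.17.43", 8080),
}


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services: dict[str, tuple[str, int]] = SERVICES) -> dict[str, bool]:
    """Map each service name to whether its port currently accepts connections."""
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}
```

Calling `check_services()` from a VM returns a dictionary such as `{"whisper-server": True, "llama-server": True}` when both servers are up.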

2. Real-World Performance

In a recent test, I processed a ~39.5-minute audio file (2373.7 seconds) using high-accuracy settings:

  • Settings: Auto-language detection, Beam Size 5, Best of 5.

  • Result: The system finished in 112 seconds.

  • Efficiency: This is roughly 21x real-time speed while maintaining elite "5-beam" precision.

Inference with the Qwen 3.6 model mentioned above runs at a usable and respectable rate of about 25 tokens per second.


New Resident: The Anti-Clickbait Utility

The latest project in the incubator is the Anti-Clickbait Utility. It leverages the lab's distinct layers to act as a guardian of your time:

UI & Orchestration: The front-end resides in the Dev-Station VM, where it manages user requests and retrieves YouTube audio. It orchestrates the entire workflow: sending audio to the host's Whisper server, receiving the transcript, forwarding it to the Qwen 3.6 35B engine for analysis, and finally displaying the results to the user.

Transcription: This process is offloaded to the Physical Layer, utilizing whisper-server and the Whisper model for high-speed, bare-metal transcription.

Critique: The raw transcript is analyzed by the Qwen 35B reasoning engine, also running on the physical layer to ensure maximum performance during complex semantic evaluations.

Verdict: If the AI determines the video is overdramatized fluff, it issues a clickbait warning. If the content is legitimate, the system produces a detailed, chaptered summary, allowing the user to extract the most important information at a glance.

System Status: Stability Under Load

Even with these massive models resident in memory, the system maintains plenty of breathing room:

llama@mf-x1:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           54Gi        37Gi       8.6Gi       3.4Gi        12Gi        17Gi

Currently using 37Gi of memory to provide high-end AI services to the entire virtual incubator.


Next up: I'll be sharing the specific logic and prompts used to determine if a video is actually worth your 10 minutes or just chasing the algorithm.