Monday, 11 May 2026

AI Lab on a budget

The Mouse That Roared: Building a Collapsed AI Lab on Limited Budget

Many AI enthusiasts assume that meaningful development requires at least one NVIDIA RTX 5090. While these cards are undeniably fast, they are often "VRAM-starved" for large-scale work, effectively limiting their utility to models and contexts that can fit within approximately 30GB of VRAM. Supporting a full stack—including embedding models, rerankers, and transcription engines like the one discussed here—would typically necessitate additional cards and host systems, which drastically inflates cost, complexity, noise, and power consumption.

In response, I have developed the Collapsed AI Lab, a setup I call the "Roaring Mouse". This system is low-cost, has a minimal physical footprint, and maintains low power consumption. Crucially, it can run massive models that would exceed the memory capacity of an RTX 5090, forcing that high-end hardware into a performance crawl by comparison. It serves as the ideal home-lab environment for development and Proof-of-Concept (PoC) configurations.


======================================================

       NODE PROFILE: mf-x1 | STRIX POINT

======================================================

[ CPU ]

Model:      AMD Ryzen AI 9 HX 470 w/ Radeon 890M

Topology:   24 Threads

Load:       0.01, 0.03, 0.00


[ MEMORY ]

System:     54Gi Total (37Gi Used)

(actually 64Gi total, 8GB set as dedicated VRAM)

[ GPU & COMPUTE ]

Hardware:   AMD Radeon 890M (GFX1150)

VRAM/GTT:   8192 MiB Dedicated / 49152 MiB Shared

Vulkan API: 1.4.318

NPU Status: XDNA Driver Loaded (AI Ready)


[ STORAGE ]

Root FS:    773G / 1.9T (43% used)

Disk:       zd0 (50G) 

Disk:       zd16 (50G) 

Disk:       zd32 (150G) 

Disk:       zd48 (300G) 

Disk:       nvme0n1 (1.9T) KINGSTON


[ NETWORK & OS ]

Local IP:   172.17.212.252/24 on enp194s0

Kernel:     6.17.0-23-generic

Uptime:     up 1 week, 2 days, 22 hours, 7 minutes

======================================================

llama@mf-x1:~$ 


The Philosophy: Physical Models, Virtual Incubator

To maximize performance, the lab is designed with a strict architectural split:

  • The Physical Layer (The Muscle): The heavyweight models—transcription (Whisper) and reasoning/summarization (Qwen)—run directly on the host OS (Ubuntu 24.04). This gives them bare-metal access to the GPU and RAM without virtualization overhead.

  • The Virtual Layer (The Incubator): The applications run within isolated environments managed by LXC. This acts as a software incubator, housing various projects under different accounts while they "call out" to the physical layer for AI processing.

What is LXC?

LXC (Linux Containers) is a system-level virtualization method for running multiple isolated Linux systems on a single control host. Unlike traditional Virtual Machines (VMs) that emulate hardware and run their own kernels, containers share the host's kernel, making them incredibly lightweight and efficient. My setup uses a management layer that handles both high-density Containers and full Virtual Machines, allowing me to toggle environments as needed.

Current Incubator Status:

llama@mf-x1:~$ lxc list
+-------------+---------+----------------------------+-----------------+
|    NAME     |  STATE  |            IPV4            |      TYPE       |
+-------------+---------+----------------------------+-----------------+
| Dev-Station | RUNNING | 172.17.212.246 (enp5s0)    | VIRTUAL-MACHINE |
| cognivault  | RUNNING | 172.17.212.247 (enp5s0)    | VIRTUAL-MACHINE |
| Atrus       | STOPPED |                            | VIRTUAL-MACHINE |
| WeCa        | STOPPED |                            | VIRTUAL-MACHINE |
| WebRAG      | STOPPED |                            | VIRTUAL-MACHINE |
| sd-forge    | STOPPED |                            | CONTAINER       |
+-------------+---------+----------------------------+-----------------+

Note: Active projects like Dev-Station and cognivault run as VMs within this layer, while specialized tools like sd-forge are kept in lightweight containers.


Hardware: Minisforum AI X1 Pro-470

The "Roaring Mouse" is built on a headless workstation profile tuned for high-capacity memory tasks:

ComponentSpecification
ModelAMD Ryzen AI 9 HX 470 (24 Threads)
GPUAMD Radeon 890M (GFX 11.5.0)
Memory64GB, VRAM up to 54Gi Total (8GB Dedicated / 48GB Shared GTT)
Storage2TB NVMe
OS/KernelUbuntu 24.04.4 LTS / Kernel 6.17.0

The Engine Room: Physical Layer Services

Currently, the transcription engine is hosted directly on the metal to provide instant access to the Radeon 890M:

1. High-Precision Transcription (Port 8090)

I run whisper-server using the large-v3-turbo model, at 6 bit.

whisper-server -m models/whisper-large-v3-turbo-q6_k.gguf --host 0.0.0.0 --port 8090

The basic generative AI model I use is an abliterated version of qwen3.6 35B. also at 6 bit:

llama-server -m /home/llama/models/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf --mmproj /home/llama/models/mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf --alias qwen3.6 --host 0.0.0.0 --port 8080 -ngl 99 -fa on -c 240000 --cache-type-k q4_0 --cache-type-v q4_0 -b 512 -ub 512 --no-mmap --mlock --image-min-tokens 1024 --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 --tools all --parallel 2 --reasoning on --reasoning-budget 400 --offline

2. Real-World Performance

In a recent test, I processed a ~39.5-minute audio file (2373.7 seconds) using high-accuracy settings:

  • Settings: Auto-language detection, Beam Size 5, Best of 5.

  • Result: The system finished in 112 seconds

  • Efficiency: This is roughly 20x real-time speed while maintaining elite "5-beam" precision.

Iinference using the above mentioned qwen3.6model is at usable and respectable  rate of 25 tokens per second.


New Resident: The Anti-Clickbait Utility

The latest project in th incubator is the Anti-Clickbait Utility. It leverages the lab's distinct layers to act as a guardian of your time:

UI & Orchestration: The front-end resides in the Dev-Station VM, where it manages user requests and retrieves YouTube audio. It orchestrates the entire workflow: sending audio to the host's Whisper server, receiving the transcript, forwarding it to the Qwen 3.6 35B engine for analysis, and finally displaying the results to the user.

Transcription: This process is offloaded to the Physical Layer, utilizing whisper-server and the Whisper model for high-speed, bare-metal transcription.\

Critique: The raw transcript is analyzed by the Qwen 35B reasoning engine, also running on the physical layer to ensure maximum performance during complex semantic evaluations.

Verdict: If the AI determines the video is overdramatized fluff, it issues a clickbait warning. If the content is legitimate, the system produces a detailed, chaptered summary, allowing the user to extract the most important information at a glance.System Status: Stability Under Load

Even with these massive models resident in memory, the system maintains plenty of breathing room:

llama@mf-x1:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           54Gi        37Gi       8.6Gi       3.4Gi        12Gi        17Gi

Currently using 37Gi of memory to provide high-end AI services to the entire virtual incubator.


Next up: I'll be sharing the specific logic and prompts used to determine if a video is actually worth your 10 minutes or just chasing the algorithm.


UPDATE

Pairing this machine with a 

NVIDIA DGX Spark Founders Edition EU

would create the ideal AI Lab.
But this is another 5K so no more low budget!
In any case, if anyone feels like giving me a present for some reason (nameday, birthday, I would think of a reason if one is needed!) I would not mind one! 


No comments:

Post a Comment