Monday, 11 May 2026

Getting there, testing backend services

 

Testing the Components: From Video to Summary

Before building the full application, I wanted to verify the individual components. The goal is to create a tool that downloads a YouTube video's audio, transcribes it, and then summarizes the content.

1. Environment Setup

The application server will run on a Linux VM. I logged into my development station (Dev-Station) and began by installing yt-dlp and ffmpeg:

sudo apt update

sudo apt install yt-dlp ffmpeg


After installation, I noticed the version provided by the package manager was quite outdated:

nobait@Dev-Station:~$ yt-dlp --version

2024.04.09

nobait@Dev-Station:~$ 


To get a current build, I uninstalled the repo version and installed the latest release directly from GitHub:

sudo wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O /usr/local/bin/yt-dlp

sudo chmod a+rx /usr/local/bin/yt-dlp

hash -r

Now we’re up to date:

nobait@Dev-Station:~$ yt-dlp --version

2026.03.17

nobait@Dev-Station:~$ 


2. Audio Extraction

Next, I needed to download the audio from a YouTube video. Since the audio will be processed by the Whisper large-v3-turbo model through whisper.cpp, it must meet specific requirements: 16-bit, 16 kHz, mono WAV.

I used the following command to download and convert the audio in one go:


yt-dlp -x --audio-format wav --audio-quality 0 \
    --postprocessor-args "ffmpeg:-ar 16000 -ac 1" \
    -o "whisper_input.wav" "VIDEO_URL"
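Once the download has finished, the resulting file could be checked against the requirements above. This is only a minimal sketch: it assumes ffprobe (shipped with ffmpeg) is available, and check_whisper_wav is a hypothetical helper name, not something provided by yt-dlp or whisper.cpp.

```shell
# Hypothetical helper: reads ffprobe "key=value" lines on stdin and checks
# for 16-bit PCM (pcm_s16le), a 16000 Hz sample rate and a single channel.
check_whisper_wav() {
  codec="" rate="" channels=""
  while IFS='=' read -r key value; do
    case "$key" in
      codec_name)  codec=$value ;;
      sample_rate) rate=$value ;;
      channels)    channels=$value ;;
    esac
  done
  if [ "$codec" = "pcm_s16le" ] && [ "$rate" = "16000" ] && [ "$channels" = "1" ]; then
    echo "ok"
  else
    echo "mismatch: codec=$codec rate=$rate channels=$channels"
    return 1
  fi
}

# Intended use (needs ffprobe and the downloaded file):
#   ffprobe -v error -select_streams a:0 \
#     -show_entries stream=codec_name,sample_rate,channels \
#     -of default=noprint_wrappers=1 whisper_input.wav | check_whisper_wav
```

Parsing key=value pairs rather than a CSV line keeps the check independent of the order in which ffprobe prints the fields.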



Let's try!

nobait@Dev-Station:~$ yt-dlp -x --audio-format wav --audio-quality 0 \
    --postprocessor-args "ffmpeg:-ar 16000 -ac 1" \
    -o "whisper_input.wav" "https://youtu.be/YBp_PXBbe80"

[youtube] Extracting URL: https://youtu.be/YBp_PXBbe80

[youtube] YBp_PXBbe80: Downloading webpage

WARNING: [youtube] No supported JavaScript runtime could be found. Only deno is enabled by default; to use another runtime add  --js-runtimes RUNTIME[:PATH]  to your command/config. YouTube extraction without a JS runtime has been deprecated, and some formats may be missing. See  https://github.com/yt-dlp/yt-dlp/wiki/EJS  for details on installing one

[youtube] YBp_PXBbe80: Downloading android vr player API JSON

[info] YBp_PXBbe80: Downloading 1 format(s): 251

[download] Destination: whisper_input.webm

[download] 100% of    9.20MiB in 00:00:01 at 6.43MiB/s

[ExtractAudio] Destination: whisper_input.wav

Deleting original file whisper_input.webm (pass -k to keep)

nobait@Dev-Station:~$ ls -ltrh *wav

-rw-rw-r-- 1 nobait nobait 19M May 11 20:41 whisper_input.wav

nobait@Dev-Station:~$ 


Yeah! Great!
The download and conversion worked perfectly, resulting in a 19MB WAV file for a ~10-minute video.
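As a sanity check, the file size can be predicted from the audio parameters alone, since uncompressed PCM has a fixed byte rate. The sketch below uses a hypothetical helper, pcm_bytes, and ignores the small WAV header. For 593 seconds (9:53) of 16 kHz, 16-bit mono audio it gives about 18.1 MiB, which appears consistent with the 19M shown above (GNU ls -h rounds up).

```shell
# Uncompressed PCM size = seconds * sample_rate * bytes_per_sample * channels.
# The WAV header adds a few dozen bytes on top and is ignored here.
pcm_bytes() {
  # $1 = duration in seconds, $2 = sample rate (Hz), $3 = bits per sample, $4 = channels
  echo $(( $1 * $2 * ($3 / 8) * $4 ))
}

pcm_bytes 593 16000 16 1   # 9 min 53 s of 16 kHz, 16-bit mono audio
```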


3. Transcription with Whisper

My transcription server runs on a physical host, reachable over Tailscale at 100.92.17.43. I verified the service was active:


llama@mf-x1:~$ ps ax | grep whisper-server

1464907 pts/0    Sl     3:08 ./whisper.cpp/build/bin/whisper-server -m models/whisper-large-v3-turbo-q6_k.gguf --host 0.0.0.0 --port 8090

1604391 pts/1    S+     0:00 grep --color=auto whisper-server

llama@mf-x1:~$ ip a | grep 100.92

    inet 100.92.17.43/32 scope global tailscale0

llama@mf-x1:~$ 

I then sent the audio file from the VM to the physical server for inference:

nobait@Dev-Station:~$ time curl http://100.92.17.43:8090/inference   -H "Content-Type: multipart/form-data"   -F file="@whisper_input.wav"   -F language="auto"   -F beam_size="5"   -F best_of="5"   -F response_format="text"


 Hermes Agent is one of the most interesting open-source AI

 projects right now, and honestly,

 it makes sense why it's climbing above tools like OpenClaw,

.

.

.

.

 thought, guys, have an amazing day for positivity. And I'll

 see you guys really shortly. Peace out, fellas.


real 0m28.094s

user 0m0.010s

sys 0m0.014s

nobait@Dev-Station:~$ 


Results: The 9:53 video (593 seconds) took roughly 28 seconds to process, upload included. That works out to a transcription speed of about 21x real time, so a 1-hour video would take about 3 minutes.
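The arithmetic behind that estimate can be reproduced with a couple of small helpers. This is a sketch only: realtime_factor and estimate_seconds are hypothetical names, and the inputs are the video length and the "real" time from the run above.

```shell
# Speedup = audio duration / wall-clock processing time.
realtime_factor() {
  # $1 = audio length in seconds, $2 = processing time in seconds
  awk -v audio="$1" -v wall="$2" 'BEGIN { printf "%.1f\n", audio / wall }'
}

# Estimated processing time for another clip at the same speed.
estimate_seconds() {
  # $1 = clip length in seconds, $2 = speedup factor
  awk -v clip="$1" -v factor="$2" 'BEGIN { printf "%.0f\n", clip / factor }'
}

realtime_factor 593 28.094    # 9:53 of audio, processed in 28.094 s
estimate_seconds 3600 21.1    # a 1-hour video at that speed
```

A single run is a rough measurement, so the factor should be read as a ballpark figure rather than a benchmark.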

I also compared the Tailscale IP against the local LAN IP. Both gave the same processing time (~28 s), which suggests that Tailscale adds no noticeable overhead in this setup; the run is presumably dominated by inference rather than by transferring the 19 MB file.

It is worth noting that we can create an SRT (subtitle) file straight from the audio file simply by setting the response format to srt:


nobait@Dev-Station:~$ time curl http://100.92.17.43:8090/inference   -H "Content-Type: multipart/form-data"   -F file="@whisper_input.wav"   -F language="auto"   -F beam_size="5"   -F best_of="5"   -F response_format="srt"

1

00:00:00,000 --> 00:00:03,430

 Hermes Agent is one of the most interesting open-source AI


2

00:00:03,430 --> 00:00:05,500

 projects right now, and honestly,

.

.

.

.

205

00:09:48,440 --> 00:09:51,140

 that thought guys, have an amazing day, spread positivity,


206

00:09:51,140 --> 00:09:53,640

 and I'll see you guys really shortly. Peace out, fellas.



real 0m28.662s

user 0m0.007s

sys 0m0.019s

nobait@Dev-Station:~$
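The SRT timestamps also make it easy to recover the audio length from the transcript itself: the end time of the last cue above, 00:09:53,640, matches the 9:53 video. A minimal sketch, with last_cue_end and srt_to_seconds as hypothetical helper names:

```shell
# Print the end timestamp of the last cue in an SRT stream read on stdin.
last_cue_end() {
  awk '/-->/ { end = $3 } END { print end }'
}

# Convert an SRT timestamp (HH:MM:SS,mmm) to seconds.
srt_to_seconds() {
  echo "$1" | awk -F'[:,]' '{ printf "%.3f\n", $1 * 3600 + $2 * 60 + $3 + $4 / 1000 }'
}

srt_to_seconds "00:09:53,640"   # end time of cue 206 above

# Intended use on a saved subtitle file (hypothetical name):
#   srt_to_seconds "$(last_cue_end < whisper_input.srt)"
```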

4. Summarization with Qwen 3.6

Finally, I tested the summarization model. On the same physical host, I have a llama-server instance running Qwen 3.6 (35B):


llama@mf-x1:~$ ip a | grep 100.92

    inet 100.92.17.43/32 scope global tailscale0

llama@mf-x1:~$ ps ax | grep llama-server

 932133 ?        Sl     3:18 llama-server -m /home/llama/models/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf --mmproj /home/llama/models/mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf --alias qwen3.6 --host 0.0.0.0 --port 8080 -ngl 99 -fa on -c 240000 --cache-type-k q4_0 --cache-type-v q4_0 -b 512 -ub 512 --no-mmap --mlock --image-min-tokens 1024 --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 --tools all --parallel 2 --reasoning on --reasoning-budget 400 --offline

llama@mf-x1:~$ 

Using jq to embed the transcript from the previous step (saved as output.txt) into a JSON payload, I requested a summary:

nobait@Dev-Station:~$ time curl http://100.92.17.43:8080/v1/chat/completions   -H "Content-Type: application/json"   -d "$(jq -n --arg content "$(cat output.txt)" '{

    "model": "qwen3.6",

    "messages": [

      {"role": "system", "content": "You are a helpful assistant that summarizes text concisely."},

      {"role": "user", "content": ("Please summarize the following text:\n\n" + $content)}

    ],

    "temperature": 0.7

  }')"

{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"**Summary of Hermes Agent**\n\nHermes is an open-source, autonomous AI agent that stands out for its **persistent memory, self-improving capabilities, and deep user modeling**. Unlike typical chatbots, it runs continuously on local infrastructure, learns from past tasks, refines reusable skills, and adapts to user preferences over time. It is often compared to OpenClaw but is preferred for its reliability and focus on depth over breadth.\n\nA major recent upgrade is the launch of an **official desktop app** (Windows, macOS, Linux), which replaces the previously CLI-heavy experience with a visual interface for easier multi-agent management, workflow orchestration, and system configuration.\n\n**Key Features & Setup:**\n- **Easy Setup:** Install via platform-specific installer. Supports multiple AI providers (OpenAI\n- **Highly Configurable UI:** Manage sessions, personas, cron jobs, memory, tools (web, browser, terminal), and integrations (Discord, Telegram, etc.).\n- **Seamless Migration:** Easily import existing OpenClaw configurations, API keys, and sessions.\n- **Use Cases:** Functions as a 24/5 assistant, second brain, financial analyst, app builder, and content creator (e.g., autonomously generating HTML-native videos).\n\n**Bottom Line:** Hermes bridges the gap between research-grade AI and everyday usability, offering a robust, self-learning agent that grows smarter with use. For best results, pair it with a premium AI provider and explore its extensive tool/skill ecosystem.","reasoning_content":"Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Input Text:** A lengthy transcript/article about \"Hermes Agent,\" an open-source AI project. 
It covers its features, comparison to competitors (OpenClaw, CloudCode, Kilo), key capabilities (persistent memory, self-improving loop, user modeling, autonomous video generation), setup experience (historically CLI-focused, now has a desktop app), installation/setup process, UI features, integrations, migration from OpenClaw, use cases, and a call to action (newsletter, Discord, subscribe).\n   - **Task:** Summarize the text concisely.\n   - **Key Constraints:** Be concise, capture main points, maintain accuracy.\n\n2.  **Identify Key Information (Mental Extraction):**\n   - *What is Hermes Agent?* An open-source, autonomous AI agent designed for long-term, continuous use on local infrastructure.\n   - *Key Features:* Persistent cross-session memory, self-improving learning loop, user modeling (Poncho), reusable skill generation, deep understanding over time, focuses on depth over breadth.\n   - *Comparison:* Often compared to OpenClaw; preferred for reliability, memory, and self-improvement. Also competes with CloudCode/Kilo.\n   - *Recent Development:* Historically CLI-heavy, now offers an official open-source desktop app (Windows, macOS, Linux) to improve usability and accessibility.\n   - *Setup/Usage:* Easy installation via `.dmg`/`.exe`. Supports multiple AI providers (OpenAI, Anthropic, local models, free tiers). 
Configurable via UI: sessions, profiles, 3D office view, model/provider management, tools (web search, browser, terminal, etc.), cron jobs, personas, memory, gateway integrations (Telegram, Discord, etc.), and OpenClaw migration.\n   - *Capabilities/Use"}}],"created":1778534134,"model":"qwen3.6","system_fingerprint":"b9010-d05fe1d7d","object":"chat.completion","usage":{"completion_tokens":716,"prompt_tokens":2310,"total_tokens":3026,"prompt_tokens_details":{"cached_tokens":2306}},"id":"chatcmpl-tAlVeJsD6K2K69MTkE7KypospjxNv65l","timings":{"cache_n":2306,"prompt_n":4,"prompt_ms":89.455,"prompt_per_token_ms":22.36375,"prompt_per_second":44.71521994298809,"predicted_n":716,"predicted_ms":29352.035,"predicted_per_token_ms":40.994462290502796,"predicted_per_second":24.393538642209986}}

real 0m29.515s

user 0m0.006s

sys 0m0.009s

nobait@Dev-Station:~$


OK!
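The model returned a high-quality, structured summary of the "Hermes Agent" project in about 29 seconds, generating 716 tokens at roughly 24 tokens per second. Note that 2306 of the 2310 prompt tokens came from the cache, so a first-time request would take somewhat longer.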
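For the application, only the answer itself is needed, not the surrounding JSON or the reasoning_content field. A minimal sketch of pulling it out with jq, assuming the response shape shown above; extract_summary and payload.json are hypothetical names:

```shell
# Hypothetical helper: print only the assistant's answer from a
# /v1/chat/completions response read on stdin (reasoning_content is ignored).
extract_summary() {
  jq -r '.choices[0].message.content'
}

# Intended use, with the request body stored in a hypothetical payload.json:
#   curl -s http://100.92.17.43:8080/v1/chat/completions \
#     -H "Content-Type: application/json" -d @payload.json | extract_summary
```

The -r flag makes jq print the raw string, so the escaped \n sequences in the response come out as real line breaks.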

Conclusion

Between transcription (~28 s) and summarization (~29 s), the total processing time of about 58 seconds is roughly 1/10th of the video's 9:53 duration.
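That ratio can be checked from the two measured "real" times. The sketch below uses a hypothetical helper, processing_ratio, and covers this one run only: download time is not included, and summarization time depends on how much text the model generates rather than strictly on video length.

```shell
# Share of the video's duration spent on transcription plus summarization.
processing_ratio() {
  # $1 = video length in seconds, $2 = transcription time, $3 = summarization time
  awk -v video="$1" -v stt="$2" -v llm="$3" 'BEGIN { printf "%.3f\n", (stt + llm) / video }'
}

processing_ratio 593 28.094 29.515   # 9:53 video, times from the runs above
```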

The backend services are verified and operating according to specifications. With the plumbing out of the way, I’m ready to start building the app!

Stay tuned for the next post where I'll dive into the application code.

Thanks for reading!


