Testing the Components: From Video to Summary
Before building the full application, I wanted to verify the individual components. The goal is to create a tool that downloads a YouTube video's audio, transcribes it, and then summarizes the content.
1. Environment Setup
The application server will run on a Linux VM. I logged into my development station (Dev-Station) and began by installing yt-dlp and ffmpeg:
sudo apt update
sudo apt install yt-dlp ffmpeg
After installation, I noticed the version provided by the package manager was quite outdated:
nobait@Dev-Station:~$ yt-dlp --version
2024.04.09
nobait@Dev-Station:~$
To ensure compatibility, I uninstalled the repo version and reinstalled the latest build directly from GitHub:
sudo wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O /usr/local/bin/yt-dlp
sudo chmod a+rx /usr/local/bin/yt-dlp
hash -r
Now we’re up to date:
nobait@Dev-Station:~$ yt-dlp --version
2026.03.17
nobait@Dev-Station:~$
2. Audio Extraction
Next, I needed to download the audio from a YouTube video. Since the audio will be processed by the Whisper large-v3-turbo model, it has to be in a specific format: 16-bit, 16 kHz, mono WAV.
I used the following command to download and convert the audio in one go (-ar 16000 sets the sample rate, -ac 1 downmixes to mono):
yt-dlp -x --audio-format wav --audio-quality 0 \
--postprocessor-args "ffmpeg:-ar 16000 -ac 1" \
-o "whisper_input.wav" "VIDEO_URL"
Let's try it:
nobait@Dev-Station:~$ yt-dlp -x --audio-format wav --audio-quality 0 \
--postprocessor-args "ffmpeg:-ar 16000 -ac 1" \
-o "whisper_input.wav" "https://youtu.be/YBp_PXBbe80"
[youtube] Extracting URL: https://youtu.be/YBp_PXBbe80
[youtube] YBp_PXBbe80: Downloading webpage
WARNING: [youtube] No supported JavaScript runtime could be found. Only deno is enabled by default; to use another runtime add --js-runtimes RUNTIME[:PATH] to your command/config. YouTube extraction without a JS runtime has been deprecated, and some formats may be missing. See https://github.com/yt-dlp/yt-dlp/wiki/EJS for details on installing one
[youtube] YBp_PXBbe80: Downloading android vr player API JSON
[info] YBp_PXBbe80: Downloading 1 format(s): 251
[download] Destination: whisper_input.webm
[download] 100% of 9.20MiB in 00:00:01 at 6.43MiB/s
[ExtractAudio] Destination: whisper_input.wav
Deleting original file whisper_input.webm (pass -k to keep)
nobait@Dev-Station:~$ ls -ltrh *wav
-rw-rw-r-- 1 nobait nobait 19M May 11 20:41 whisper_input.wav
nobait@Dev-Station:~$
The download and conversion worked perfectly, resulting in a 19MB WAV file for a ~10-minute video.
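As a quick sanity check on that size: uncompressed PCM at 16 kHz, 16-bit, mono is 32,000 bytes per second, so the file size should follow directly from the video length. Here is a minimal sketch, assuming the 9:53 duration of this video and ignoring the small WAV header. The helper name pcm_bytes is made up for illustration.

```shell
#!/usr/bin/env bash
# Expected size in bytes of raw PCM audio:
#   seconds * sample_rate * bytes_per_sample * channels
# pcm_bytes is a hypothetical helper; the WAV header (a few dozen bytes) is ignored.
pcm_bytes() {
  local seconds=$1 rate=${2:-16000} bits=${3:-16} channels=${4:-1}
  echo $(( seconds * rate * (bits / 8) * channels ))
}

duration=$(( 9 * 60 + 53 ))          # 9:53 video -> 593 seconds
bytes=$(pcm_bytes "$duration")
echo "expected bytes: $bytes"
# Convert to MiB with one decimal place (awk handles the floating point).
awk -v b="$bytes" 'BEGIN { printf "expected MiB: %.1f\n", b / 1048576 }'
```

This gives about 18.1 MiB, which is consistent with the 19M that ls -h reports for the file (GNU ls rounds sizes up). A file several times larger would suggest the resampling or mono downmix did not take effect.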
3. Transcription with Whisper
My transcription server (whisper.cpp's whisper-server) runs on a physical host that I reach over Tailscale at 100.92.17.43. I verified the service was active:
llama@mf-x1:~$ ps ax | grep whisper-server
1464907 pts/0 Sl 3:08 ./whisper.cpp/build/bin/whisper-server -m models/whisper-large-v3-turbo-q6_k.gguf --host 0.0.0.0 --port 8090
1604391 pts/1 S+ 0:00 grep --color=auto whisper-server
llama@mf-x1:~$ ip a | grep 100.92
inet 100.92.17.43/32 scope global tailscale0
llama@mf-x1:~$
I then sent the audio file from the VM to the physical server for inference:
nobait@Dev-Station:~$ time curl http://100.92.17.43:8090/inference -H "Content-Type: multipart/form-data" -F file="@whisper_input.wav" -F language="auto" -F beam_size="5" -F best_of="5" -F response_format="text"
Hermes Agent is one of the most interesting open-source AI
projects right now, and honestly,
it makes sense why it's climbing above tools like OpenClaw,
.
.
.
.
thought, guys, have an amazing day for positivity. And I'll
see you guys really shortly. Peace out, fellas.
real 0m28.094s
user 0m0.010s
sys 0m0.014s
nobait@Dev-Station:~$
Results: The 9:53 video took roughly 28 seconds to process. That works out to a transcription speed of about 21x real time (593 s of audio / 28 s), so a 1-hour video should take about 3 minutes if throughput stays the same.
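The arithmetic behind that estimate can be sketched as follows. The helper names rtf and eta_seconds are made up for this example, and the projection assumes the real-time factor stays constant for longer audio, which I have not measured.

```shell
#!/usr/bin/env bash
# Real-time factor: audio duration divided by wall-clock processing time.
# rtf and eta_seconds are hypothetical helper names used only for this sketch.
rtf() {
  awk -v audio="$1" -v wall="$2" 'BEGIN { printf "%.1f\n", audio / wall }'
}

# Estimated processing time for another clip, assuming the same factor holds.
eta_seconds() {
  awk -v audio="$1" -v factor="$2" 'BEGIN { printf "%.0f\n", audio / factor }'
}

audio_s=593        # 9:53 of audio
wall_s=28.094      # "real" time reported by the time command
factor=$(rtf "$audio_s" "$wall_s")
echo "real-time factor: ${factor}x"
echo "1-hour video: ~$(eta_seconds 3600 "$factor") s"
```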
I also compared the performance of the Tailscale IP versus the local LAN IP. Both gave the same processing time (~28s), which suggests that Tailscale adds no noticeable overhead in this setup.
It's also worth noting that the server can produce an SRT subtitle file directly from the same audio file, simply by setting response_format to srt:
nobait@Dev-Station:~$ time curl http://100.92.17.43:8090/inference -H "Content-Type: multipart/form-data" -F file="@whisper_input.wav" -F language="auto" -F beam_size="5" -F best_of="5" -F response_format="srt"
1
00:00:00,000 --> 00:00:03,430
Hermes Agent is one of the most interesting open-source AI
2
00:00:03,430 --> 00:00:05,500
projects right now, and honestly,
.
.
.
.
205
00:09:48,440 --> 00:09:51,140
that thought guys, have an amazing day, spread positivity,
206
00:09:51,140 --> 00:09:53,640
and I'll see you guys really shortly. Peace out, fellas.
real 0m28.662s
user 0m0.007s
sys 0m0.019s
nobait@Dev-Station:~$
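A side benefit of the SRT output is that the end timestamp of the last cue gives the audio length without looking it up elsewhere. Below is a minimal sketch, using the last two cues from the output above as sample input. The helper names srt_to_seconds and last_cue_end are made up, and the parsing assumes plain HH:MM:SS,mmm timestamps with Unix line endings.

```shell
#!/usr/bin/env bash
# Convert an SRT timestamp (HH:MM:SS,mmm) to seconds.
srt_to_seconds() {
  awk -v ts="$1" 'BEGIN {
    split(ts, p, /[:,]/)   # p[1]=hours p[2]=minutes p[3]=seconds p[4]=millis
    printf "%.3f\n", p[1] * 3600 + p[2] * 60 + p[3] + p[4] / 1000
  }'
}

# Print the end timestamp of the last cue read from stdin.
# Timing lines look like: "START --> END", so END is the third field.
last_cue_end() {
  awk '/-->/ { end = $3 } END { print end }'
}

# Sample input: the final two cues from the SRT response.
sample=$(cat <<'EOF'
205
00:09:48,440 --> 00:09:51,140
that thought guys, have an amazing day, spread positivity,
206
00:09:51,140 --> 00:09:53,640
and I'll see you guys really shortly. Peace out, fellas.
EOF
)

end=$(printf '%s\n' "$sample" | last_cue_end)
echo "last cue ends at $end = $(srt_to_seconds "$end") s"
```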
4. Summarization with Qwen 3.6
Finally, I tested the summarization model. On the same physical host, I have a llama-server running Qwen 3.6 (35B):
llama@mf-x1:~$ ip a | grep 100.92
inet 100.92.17.43/32 scope global tailscale0
llama@mf-x1:~$ ps ax | grep llama-server
932133 ? Sl 3:18 llama-server -m /home/llama/models/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf --mmproj /home/llama/models/mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf --alias qwen3.6 --host 0.0.0.0 --port 8080 -ngl 99 -fa on -c 240000 --cache-type-k q4_0 --cache-type-v q4_0 -b 512 -ub 512 --no-mmap --mlock --image-min-tokens 1024 --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 --tools all --parallel 2 --reasoning on --reasoning-budget 400 --offline
llama@mf-x1:~$
Using jq to build the JSON payload from the transcript of the previous step (saved as output.txt), I requested a summary:
nobait@Dev-Station:~$ time curl http://100.92.17.43:8080/v1/chat/completions -H "Content-Type: application/json" -d "$(jq -n --arg content "$(cat output.txt)" '{
"model": "qwen3.6",
"messages": [
{"role": "system", "content": "You are a helpful assistant that summarizes text concisely."},
{"role": "user", "content": ("Please summarize the following text:\n\n" + $content)}
],
"temperature": 0.7
}')"
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"**Summary of Hermes Agent**\n\nHermes is an open-source, autonomous AI agent that stands out for its **persistent memory, self-improving capabilities, and deep user modeling**. Unlike typical chatbots, it runs continuously on local infrastructure, learns from past tasks, refines reusable skills, and adapts to user preferences over time. It is often compared to OpenClaw but is preferred for its reliability and focus on depth over breadth.\n\nA major recent upgrade is the launch of an **official desktop app** (Windows, macOS, Linux), which replaces the previously CLI-heavy experience with a visual interface for easier multi-agent management, workflow orchestration, and system configuration.\n\n**Key Features & Setup:**\n- **Easy Setup:** Install via platform-specific installer. Supports multiple AI providers (OpenAI\n- **Highly Configurable UI:** Manage sessions, personas, cron jobs, memory, tools (web, browser, terminal), and integrations (Discord, Telegram, etc.).\n- **Seamless Migration:** Easily import existing OpenClaw configurations, API keys, and sessions.\n- **Use Cases:** Functions as a 24/5 assistant, second brain, financial analyst, app builder, and content creator (e.g., autonomously generating HTML-native videos).\n\n**Bottom Line:** Hermes bridges the gap between research-grade AI and everyday usability, offering a robust, self-learning agent that grows smarter with use. For best results, pair it with a premium AI provider and explore its extensive tool/skill ecosystem.","reasoning_content":"Here's a thinking process:\n\n1. **Analyze User Input:**\n - **Input Text:** A lengthy transcript/article about \"Hermes Agent,\" an open-source AI project. 
It covers its features, comparison to competitors (OpenClaw, CloudCode, Kilo), key capabilities (persistent memory, self-improving loop, user modeling, autonomous video generation), setup experience (historically CLI-focused, now has a desktop app), installation/setup process, UI features, integrations, migration from OpenClaw, use cases, and a call to action (newsletter, Discord, subscribe).\n - **Task:** Summarize the text concisely.\n - **Key Constraints:** Be concise, capture main points, maintain accuracy.\n\n2. **Identify Key Information (Mental Extraction):**\n - *What is Hermes Agent?* An open-source, autonomous AI agent designed for long-term, continuous use on local infrastructure.\n - *Key Features:* Persistent cross-session memory, self-improving learning loop, user modeling (Poncho), reusable skill generation, deep understanding over time, focuses on depth over breadth.\n - *Comparison:* Often compared to OpenClaw; preferred for reliability, memory, and self-improvement. Also competes with CloudCode/Kilo.\n - *Recent Development:* Historically CLI-heavy, now offers an official open-source desktop app (Windows, macOS, Linux) to improve usability and accessibility.\n - *Setup/Usage:* Easy installation via `.dmg`/`.exe`. Supports multiple AI providers (OpenAI, Anthropic, local models, free tiers). 
Configurable via UI: sessions, profiles, 3D office view, model/provider management, tools (web search, browser, terminal, etc.), cron jobs, personas, memory, gateway integrations (Telegram, Discord, etc.), and OpenClaw migration.\n - *Capabilities/Use"}}],"created":1778534134,"model":"qwen3.6","system_fingerprint":"b9010-d05fe1d7d","object":"chat.completion","usage":{"completion_tokens":716,"prompt_tokens":2310,"total_tokens":3026,"prompt_tokens_details":{"cached_tokens":2306}},"id":"chatcmpl-tAlVeJsD6K2K69MTkE7KypospjxNv65l","timings":{"cache_n":2306,"prompt_n":4,"prompt_ms":89.455,"prompt_per_token_ms":22.36375,"prompt_per_second":44.71521994298809,"predicted_n":716,"predicted_ms":29352.035,"predicted_per_token_ms":40.994462290502796,"predicted_per_second":24.393538642209986}}
real 0m29.515s
user 0m0.006s
sys 0m0.009s
nobait@Dev-Station:~$
OK!
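The model returned a well-structured summary of the "Hermes Agent" video in about 29 seconds: 716 completion tokens at roughly 24 tokens per second, according to the timings in the response. Almost the entire prompt (2,306 of 2,310 tokens) was served from cache, so a first run on a new transcript would also spend some time on prompt processing.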
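For the app, only the message text matters, not the whole JSON envelope. Here is a minimal sketch of pulling it out with jq, using a trimmed stand-in for the response above. The sample content is shortened and the helper names are made up for illustration. The field paths are the ones visible in the output (choices[0].message.content and timings.predicted_per_second).

```shell
#!/usr/bin/env bash
# Extract the assistant's answer from a chat completion response on stdin.
# reasoning_content is a separate field, so it is left out automatically.
summary_text() {
  jq -r '.choices[0].message.content'
}

# Generation speed in whole tokens per second, from the server's timings block.
tokens_per_second() {
  jq -r '.timings.predicted_per_second | floor'
}

# Trimmed stand-in for the real response; the content is shortened for the example.
response='{"choices":[{"finish_reason":"stop","message":{"role":"assistant","content":"**Summary of Hermes Agent**","reasoning_content":"Here is a thinking process..."}}],"usage":{"completion_tokens":716},"timings":{"predicted_per_second":24.393538642209986}}'

printf '%s' "$response" | summary_text
printf '%s' "$response" | tokens_per_second
```

In practice the same helpers could sit at the end of the curl pipeline (with curl -s to suppress the progress meter) instead of reading from a variable.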
Conclusion
Between transcription (~28 s) and summarization (~29.5 s), the total processing time of about 58 seconds is roughly 1/10th of the video's 9:53 duration.
The backend services are verified and operating according to specifications. With the plumbing out of the way, I’m ready to start building the app!
Stay tuned for the next post where I'll dive into the application code.
Thanks for reading!