AI Hacker Daily

Edition 06

6 picks

Local-first AI stops being a hobby project.

The #2 post on HN today (1,210 points) is "Local AI needs to be the norm" — and the slate of releases that actually shipped today reads like an answer to it. A continuous-batching MLX server that lives in your Mac's menu bar. A terminal tool that figures out which model runs on the silicon you already have. Warp open-sourcing its agentic development environment. An autonomous AI team you spin up with one `npx` command and zero API key forms. A meeting transcriber that never phones home. And a self-hosted gateway that fronts every major model behind one URL you control. We dropped today's heaviest agent-framework picks (kagent, ARIS, hermes-agent's messaging fanout) — those are agent-runtime stories, not the local-stack story — and the ProductHunt SaaS firehose, which was 49 vertical agents looking for problems to solve.

01

omlx: continuous-batching MLX server with a menu-bar UI

The piece that was missing between "I downloaded a GGUF" and "this is actually a server I can build against."

A native Apple Silicon LLM inference server (13.7k⭐, v0.3.8, Apr 30) installable as a `.dmg` or via `brew install omlx`. Implements continuous batching and a two-tier KV cache (RAM-resident hot blocks, SSD-tiered cold blocks), exposes the OpenAI API format, and supports text models, vision-language models, embeddings, and rerankers from one process. The macOS side is a native PyObjC menu-bar app — start, stop, swap models, and watch throughput without ever touching a terminal. Reach for it when the Ollama-or-llama.cpp default has hit its ceiling and you actually need the server semantics — concurrent requests, rerankers in the same process as the LLM, real KV reuse — without renting a GPU. Continuous batching is the boring-but-load-bearing piece: it's what turns "this model technically runs" into "this model serves traffic." Delete: the `python -m llama_cpp.server` wrapper, the half-finished FastAPI shim around Ollama, and the screenshots of menu-bar Ollama you saved as inspiration. Tradeoff: Apple Silicon only, M1+, macOS 15+ — none of this helps your Linux box or your team's Windows machines.
github.com/jundot/omlx
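
If it really speaks the OpenAI API format, the stock client should work against it unchanged. A minimal sketch, assuming a locally running omlx instance; the port and model name below are placeholders, not documented omlx defaults:

```python
# Point the stock OpenAI client at a local omlx instance.
# Assumptions: the server is up locally and you know its port and the
# name of a loaded model -- both values here are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder port
    api_key="unused-locally",             # local servers typically ignore this
)

resp = client.chat.completions.create(
    model="your-loaded-model",  # whatever the menu bar shows as loaded
    messages=[{"role": "user", "content": "One line: what is continuous batching?"}],
)
print(resp.choices[0].message.content)
```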

02

llmfit: which model actually runs on your hardware

The Steam hardware survey, but for whether GLM-4 is going to OOM your laptop.

A terminal tool (25.8k⭐, v0.9.23, May 10) installable via `brew install llmfit`, `scoop install llmfit`, `curl ... | sh`, or `uv tool install -U llmfit`. Detects your RAM, CPU, GPU, and quantization-friendliness, then scores hundreds of models on quality, speed, fit, and context for the specific machine. Integrates with Ollama, llama.cpp, and MLX as runtimes; ships a community leaderboard with real-world per-machine benchmarks rather than the model card's marketing numbers. Reach for it before downloading 40GB of weights only to find out they won't fit. The hardware-simulation mode is the real win for buyers: you can ask "what would I get from a 64GB M4 Max vs a 36GB M3" and get a model-by-model answer instead of a forum thread from 2024. Delete: the spreadsheet of "models I think will fit," the open tabs of HuggingFace model cards, and the hour you spend re-quantizing things to find out they were never going to run. Tradeoff: scores depend on community-submitted benchmarks, which means popular models are well-calibrated and the long tail isn't — verify before you trust the leaderboard for an obscure pick.
github.com/AlexsJones/llmfit
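
Worth knowing what a fit score is made of, even if you let the tool do the work. This is not llmfit's algorithm, just the back-of-envelope arithmetic the category rests on, with rule-of-thumb constants you should treat as assumptions:

```python
# Back-of-envelope "will it fit" arithmetic -- an illustration of what a
# fit score rests on, not llmfit's actual scoring code.
def estimated_memory_gb(params_b: float, bits_per_weight: float,
                        context_tokens: int,
                        kv_bytes_per_token: int = 131_072) -> float:
    """Rough memory needed to run a quantized model.

    params_b: parameters in billions (8 for an 8B model)
    bits_per_weight: 4 for Q4 quants, 8 for Q8, 16 for fp16
    kv_bytes_per_token: KV-cache cost per token; ~128 KB suits a
        Llama-3-8B-class model but varies widely by architecture.
    """
    weights_gb = params_b * 1e9 * (bits_per_weight / 8) / 1e9
    kv_gb = context_tokens * kv_bytes_per_token / 1e9
    overhead_gb = 1.5  # runtime, activations, fragmentation (rule of thumb)
    return weights_gb + kv_gb + overhead_gb

# An 8B model at Q4 with an 8k context:
print(f"{estimated_memory_gb(8, 4, 8192):.1f} GB")  # ~6.6 GB on these assumptions
```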

03

Warp goes open source

The terminal that started the agent-IDE category opens its source.

Warp open-sourced its agentic development environment two weeks ago, and the repo cleared 25,000 stars in its first week (`github.com/warpdotdev/warp`). The interesting part isn't the license — it's the operating model: the team runs an "Oz" rig of internal agents that do most of the coding, planning, and testing, and external contributors drive direction and verification. There's a public dashboard at `build.warp.dev` showing what the agents are working on right now. Reach for it if you've been Warp-curious but blocked on the closed-source-terminal-on-my-dev-box concern, or if you want a real-world look at what "agents do the heavy lifting, humans steer" looks like as a contribution model rather than a pitch deck. The dashboard is genuinely worth ten minutes — it's the most concrete public artifact yet of an agent-led codebase being maintained out in the open. Delete: the bookmark on "should I switch from iTerm" you've been sitting on for a year. Tradeoff: "agentic" is now load-bearing in the value prop — if you wanted just a faster terminal, the LLM dependencies and the Oz workflow are overhead you didn't ask for.
github.com/warpdotdev/warp

04

wuphf.team: an autonomous AI team that runs on your laptop

The "no API key web form" line in the README is the whole pitch.

A locally run multi-agent coordination system that spins up with `npx wuphf@latest` — no signup, no Docker Compose, no 47-variable .env. State persists in SQLite on your machine. Drop a goal (the README's example: "Ship onboarding by Friday"), and a CEO/engineer/designer/marketer rig decomposes it, tracks blockers, and resolves dependencies between agents without round-tripping through you. Mixed-model: PM on Claude, engineer on a different provider, all in one config. Claims 7x fewer tokens per session than accumulated-context systems by being aggressive about what each agent gets to see. Reach for it when you've been routing between your own tools by hand — Linear → Cursor → notes → Slack to yourself — and what you actually want is the routing layer to handle itself overnight. The local + SQLite story matters: this is the first multi-agent product in a while where the data-residency question has a real answer. Delete: the ad-hoc cron job that pings Claude every morning with yesterday's TODOs. Tradeoff: it's early — `wuphf.team` ships a hosted demo and a Show HN with 9 points, which means production polish, recovery from agent-deadlock, and the long-tail UX bugs are all still ahead.
wuphf.team
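
The token-economy claim comes down to context scoping: each agent gets its own slice of the SQLite-backed state, not the whole accumulated transcript. A hypothetical sketch of that general technique; the schema and role names are invented for illustration, not taken from wuphf:

```python
# Per-agent context scoping over SQLite -- a hypothetical sketch of the
# general technique, not wuphf's actual schema or code.
import sqlite3

conn = sqlite3.connect("team_state.db")
conn.execute("""CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY, role TEXT, kind TEXT, body TEXT)""")

def context_for(role: str, limit: int = 20) -> list[str]:
    """Only what this agent needs: its own recent events plus anything
    flagged as a blocker, instead of the full shared transcript."""
    rows = conn.execute(
        "SELECT body FROM events WHERE role = ? OR kind = 'blocker' "
        "ORDER BY id DESC LIMIT ?", (role, limit))
    return [body for (body,) in rows]

# The engineer agent's prompt gets built from this slice alone.
print(context_for("engineer"))
```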

05

meetily: privacy-first meeting transcription, fully local

Whisper + Ollama in one Tauri app, everything stays on the laptop.

A Rust + Tauri desktop app (11.9k⭐, v0.3.0, Mar 3) with installers for Windows, macOS, and Linux. Captures system audio, transcribes with Whisper or Parakeet, summarizes via Ollama or your local provider of choice, and does speaker diarization — all on-device. Hardware acceleration via Metal on Apple Silicon, CUDA on Nvidia, Vulkan on AMD/Intel. The community edition is permanently free and open; a PRO tier exists for teams that want enterprise extras. Reach for it the next time you'd otherwise paste a sensitive board call transcript into a hosted tool. Local meeting AI was a "this works in a demo" category six months ago; this is the first one with real diarization, real installer ergonomics, and a maintained release cadence — not a research project with a YouTube link. Delete: the Otter.ai subscription you've been quietly paying for client calls, and the Granola tab you keep open out of inertia. Tradeoff: transcription quality on local Whisper is meaningfully behind hosted services for accents and overlapping speech — fine for your own notes, less fine if you're handing the transcript to legal.
github.com/Zackriya-Solutions/meetily
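
The pipeline itself is reproducible in a few lines if you want to see what "fully local" means mechanically. Not meetily's code, just the general shape it packages, here using the open-source whisper package and Ollama's HTTP API; the model choices are placeholders:

```python
# The general shape of a fully local transcribe-then-summarize pipeline --
# an illustration of the architecture, not meetily's actual code.
import json
import urllib.request

import whisper  # pip install openai-whisper

# 1. Transcribe on-device.
result = whisper.load_model("base").transcribe("meeting.wav")

# 2. Summarize via a local Ollama instance (default port 11434).
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",  # any model you have pulled locally
        "prompt": "Summarize this meeting into action items:\n" + result["text"],
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```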

06

ccx: one self-hosted gateway in front of every model provider

Multi-key rotation, channel failover, and a web admin in a single Go binary you can run yourself.

A Go API gateway (457⭐, v2.6.83 shipped today) that fronts Claude, OpenAI Chat, OpenAI Images, Codex, and Gemini behind one OpenAI-compatible endpoint. Channel orchestration, automatic failover when a provider degrades, multi-key rotation across accounts, and a web admin for routing rules. Install it as a binary, via Docker Compose, or build it from source. Reach for it when "we use three model providers" has metastasized into "every service has its own SDK, every key is in a different env file, and nobody knows which one is rate-limiting today." The interesting move is making the gateway boring infrastructure — one URL, one auth scheme, routing decisions in a UI rather than scattered across application code. Pairs cleanly with whatever model picker your agents already do client-side. Delete: the `if PROVIDER == "anthropic"` branch in your retry logic, the per-team .env files of API keys, and the spreadsheet of which key has quota left. Tradeoff: the gateway becomes a single point of failure between your apps and every LLM you depend on — budget the operational cost (TLS, monitoring, restarts) before you put it in the production path.
github.com/BenedictKing/ccx
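
The payoff of the OpenAI-compatible front is that provider choice collapses to a model string. A sketch assuming a ccx deployment at a URL you control; the host, key, and model names are all placeholders:

```python
# One client, one URL, every provider behind it. Routing, failover, and
# key rotation live in the gateway, not here. All names are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-gateway.example.internal/v1",  # your ccx deployment
    api_key="gateway-issued-key",
)

# Swapping providers is now a model-string change, not an SDK change.
for model in ("claude-model", "openai-model", "gemini-model"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
    )
    print(model, "->", resp.choices[0].message.content[:40])
```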

One of these,
every weekday.

Free. Unsubscribe by replying with one word. No tracking pixels in the email.

2026-05-11 — AI Hacker Daily