Discussion
Anyone else using AI agents for automated code review? What's your setup?
Been experimenting with hooking up an AI agent to our GitHub repo so it can read PRs and post review comments automatically. It works surprisingly well for catching obvious issues like missing error handling and off-by-ones.
Curious what system prompts and tool setups others are using — seems like there's a huge range of approaches and the difference in output quality is massive.
We use a Claude-based agent hooked into GitHub Actions. The workflow triggers on every PR and gives the agent access to three tools:
`read_file`, `list_changed_files`, and `post_review_comment`. The system prompt is kept minimal — basically just tells it what repo style guide to enforce.

One thing I'd strongly recommend: constrain the tool definitions tightly. If you give it a `post_review_comment` with no limits on what it can say, you'll get some very... creative feedback. 😅

That matches what I'm seeing — the tool definitions are doing a lot of the heavy lifting. How long is your system prompt, roughly? Trying to figure out whether more detailed instructions actually help or just increase token usage for diminishing returns.
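To make the "constrain the tool tightly" point concrete, here's a rough sketch of what a constrained `post_review_comment` definition could look like in Anthropic-style tool-use JSON schema. The exact fields and limits (the `enum` values, the 500-character cap) are illustrative, not anyone's actual config:

```python
# Sketch of a tightly constrained post_review_comment tool definition.
# JSON-schema keywords like "enum" and "maxLength" bound what the model
# can emit, instead of allowing free-form commentary.
POST_REVIEW_COMMENT = {
    "name": "post_review_comment",
    "description": (
        "Post a single review comment on the current pull request. "
        "Only use this for issues covered by the repo style guide."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "File path in the repo"},
            "line": {"type": "integer", "minimum": 1},
            "severity": {
                "type": "string",
                "enum": ["nit", "warning", "blocking"],  # fixed severity scale
            },
            "body": {
                "type": "string",
                "maxLength": 500,  # caps rambling / 'creative' feedback
            },
        },
        "required": ["path", "line", "severity", "body"],
    },
}
```

The enum on `severity` in particular tends to do more work than prose instructions, since the model physically can't return anything outside the list.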
Ours is about 400 tokens — covers the code style guide rules, severity levels (nit / warning / blocking), and a note not to comment on whitespace. Anything longer and the model starts getting inconsistent. I think the tool descriptions actually carry more weight than the system prompt in practice.
We tried three different prompting strategies over about two months. The biggest improvement came from switching to a chain-of-thought style where we ask the model to first list all the changed functions, then evaluate each one before deciding whether to comment. Reduced false positives by almost 40%.
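The two-pass flow described above (enumerate changed functions first, evaluate second) can be sketched roughly like this. The prompt wording and the `call_model` callable are hypothetical stand-ins, not the poster's actual setup:

```python
# Two-pass chain-of-thought review sketch: first ask the model to list
# the changed functions, then ask it to evaluate each one before
# deciding whether to comment.

def build_pass_1(diff: str) -> str:
    """Prompt for the enumeration pass: list functions, no comments yet."""
    return (
        "List every function changed in this diff, one per line. "
        "Do not comment on anything yet.\n\n" + diff
    )

def build_pass_2(function_list: str) -> str:
    """Prompt for the evaluation pass: judge each listed function."""
    return (
        "For each function below, decide whether it needs a review comment. "
        "Only comment if you can name a concrete defect.\n\n" + function_list
    )

def review(diff: str, call_model) -> str:
    """Run both passes; call_model is any prompt -> text callable."""
    functions = call_model(build_pass_1(diff))   # pass 1: enumerate
    return call_model(build_pass_2(functions))   # pass 2: evaluate each
```

Splitting the work this way forces the enumeration step to happen before any judgment, which is presumably where the false-positive reduction comes from.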
Also: if anyone is storing their system prompt in a config file checked into the repo — stop. Keep it in secrets management. Had a junior dev accidentally expose ours in a public fork last quarter. Nothing catastrophic, but not great.
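One way to follow that advice: load the prompt from an environment variable populated by your CI secrets store rather than from a committed file. A minimal sketch (the variable name is made up):

```python
import os

def load_system_prompt(env_var: str = "REVIEW_SYSTEM_PROMPT") -> str:
    """Read the review system prompt from an environment variable set by
    the CI secrets manager, so it never lives in a file in the repo."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; check your secrets config")
    return value
```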
That's actually a really useful angle — hadn't considered that approach. Does the output format vary much between runs?
We actually went the opposite direction — minimal tooling, just let the LLM read the diff as plain text and return a Markdown review. No function calling at all. It's less structured but way faster to set up, and for a small team it's good enough.
The big limitation is you can't do anything automated with the output, but if you just want a second set of eyes on PRs it does the job.
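For comparison, the whole minimal-tooling variant is basically one prompt. Here's a sketch, with `call_model` again standing in for whichever LLM API you use:

```python
# Minimal no-function-calling review: diff in as plain text,
# Markdown review out. Nothing machine-parseable, but trivial to set up.

def markdown_review(diff: str, call_model) -> str:
    prompt = (
        "You are reviewing a pull request. Read the unified diff below and "
        "reply with a short Markdown review: a bullet list of issues, "
        "or 'LGTM' if there are none.\n\n"
        "```diff\n" + diff + "\n```"
    )
    return call_model(prompt)
```

The trade-off is exactly as described: the output is for humans only, so there's nothing downstream automation can act on.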
One thing no one mentions: latency. Our PR review agent takes 15–25 seconds per run, which is fine, but if you're triggering on every push you'll burn through API credits fast. We ended up gating it behind a `/ai-review` comment command so it only runs on demand.
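The gating check itself is small. A sketch of the decision logic, based on the shape of GitHub's `issue_comment` webhook payload (where PR comments are issue comments whose `issue` carries a `pull_request` key):

```python
# Decide whether the review agent should run for an incoming
# issue_comment event: only on PR comments starting with /ai-review.

def should_run_review(event: dict) -> bool:
    comment = event.get("comment", {}).get("body", "")
    is_pr_comment = "pull_request" in event.get("issue", {})
    return is_pr_comment and comment.strip().startswith("/ai-review")
```

In GitHub Actions the same idea is usually expressed as an `if:` condition on the job, but the logic is identical: check the event is a PR comment, then match the command prefix.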