AI demos are easy to love, and I get why. The demo is a controlled little world where nobody's uploaded the weird CSV yet, nobody's asked the question with three typos, and nobody's tried to use the thing late on a Friday with real money attached. That last part, the real stakes, is the part the demo never shows you. Production is where the demo has to grow up and cope with imperfect inputs, a busy user, half-there data and an answer that actually matters to someone.
Give it one job
The first thing most people ask is "how do we use AI?", which sounds sensible but is too big to answer in any way you could ever check. The better question is smaller and a bit boring: what single job should this thing do, and what would make it safe enough that you'd trust it with that one job and walk away?
The jobs that have actually worked for me are the unglamorous ones: writing 4,134 product descriptions against a brand's real catalogue, running a parts-finder live on a storefront so a customer can find the right fitting without guessing, or pulling the overnight numbers into a 06:00 briefing so the day starts with a read instead of a hunch. One job first, done properly, against real data, and the rest can wait its turn. I worked that out the slow way, by trying to do too much at once and watching it all go mushy.
The nice thing is that a clear job finally gives you something to grade. Did it find the right product? Did it use verified data? Did it tell you where the answer came from? Did it stop when the account connection dropped instead of inventing something to paper over the gap? You can't really grade "use AI", but you can grade those, and that's most of the difference.
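To make that concrete, here is a minimal sketch of what grading one run could look like. It is not lifted from any real build: `JobResult`, `grade` and every field name are hypothetical, and the four checks simply mirror the four questions above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobResult:
    """One run of the single job, recorded so it can be graded afterwards.

    All field names are hypothetical, chosen for illustration only.
    """
    expected_sku: str             # the product a human says is correct
    returned_sku: Optional[str]   # what the system actually returned, if anything
    data_verified: bool           # did it read from verified data?
    source: Optional[str]         # where it says the answer came from
    connection_ok: bool           # was the account connection up during the run?


def grade(result: JobResult) -> List[str]:
    """Return the names of the checks that failed; an empty list is a pass."""
    failures: List[str] = []
    if not result.connection_ok:
        # With the connection down, the only acceptable behaviour is to stop.
        if result.returned_sku is not None:
            failures.append("answered_while_disconnected")
        return failures
    if result.returned_sku != result.expected_sku:
        failures.append("wrong_product")
    if not result.data_verified:
        failures.append("unverified_data")
    if not result.source:
        failures.append("no_source")
    return failures
```

Run something like this over a batch of real queries and you get a list of named failures per run, which is a lot easier to argue about than a general feeling that it "seems fine".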
Decide where the human comes back in
A good system knows what it's allowed to do, what it isn't, and the exact point where a person picks it up again.
For most business builds I'll start it read-only, or recommend-only—let the thing collect and classify and draft and explain all it likes, then put a hard stop before it writes to a client-facing platform or changes a budget or sends an email or touches a live product page. That stop is just plumbing, honestly, not me being nervous about it. It's the bit that means a bad five minutes stays a bad five minutes instead of quietly becoming a bad fortnight that someone has to sit there and unpick.
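As a rough sketch of that hard stop, assuming the system's actions can be named and sorted into two buckets: the action names, `Action` and `route` below are all made up for illustration, and a real build would have its own list.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical action names; a real build would define its own.
SAFE_ACTIONS = {"collect", "classify", "draft", "explain"}
GATED_ACTIONS = {"write_platform", "change_budget", "send_email", "edit_product_page"}


@dataclass
class Action:
    kind: str
    payload: dict = field(default_factory=dict)


def route(action: Action, approved_by: Optional[str] = None) -> str:
    """Decide what happens to an action before anything is executed."""
    if action.kind in SAFE_ACTIONS:
        return "run"
    if action.kind in GATED_ACTIONS:
        # The hard stop: gated actions only run with a named approver.
        return "run" if approved_by else "hold_for_human"
    # Anything not on either list is refused rather than guessed at.
    return "reject"
```

The choice that matters here is the last line: an action nobody thought to list is rejected by default, so new capabilities have to be added to a list on purpose.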
Make it read from somewhere real
If the system is answering an actual business question, it has to know where its answer came from, which in practice just means it's reading from somewhere specific and checkable—Shopify orders, Google Ads spend, product metafields, review exports, the brand rules, the last few briefs—rather than confidently making something up from the shape of the question. The most useful version of that is dull on purpose: a row in a database, a timestamp, a forgettable name, nothing you'd ever put in a demo.
Source-backed work is just calmer to live with, mostly because it can tell you where it got the answer. And it can tell you "I don't have enough data to answer that", which is the one sentence you most want a machine to be willing to say, and the one it's least naturally inclined to.
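Here is a small sketch of what that can look like in code, assuming the rows have already been fetched from somewhere checkable. The row keys (`id`, `total`, `fetched_at`), the `orders` label and the `min_rows` threshold are assumptions for the example, not any platform's real schema.

```python
from dataclasses import dataclass, field
from typing import List

NOT_ENOUGH = "I don't have enough data to answer that"


@dataclass
class Answer:
    text: str
    sources: List[str] = field(default_factory=list)


def average_order_value(rows: List[dict], min_rows: int = 5) -> Answer:
    """Answer only from rows we can point at; otherwise say so.

    Each row is assumed to be a dict with hypothetical keys
    'id', 'total' and 'fetched_at'.
    """
    usable = [r for r in rows if r.get("total") is not None]
    if len(usable) < min_rows:
        # Refuse rather than guess from the shape of the question.
        return Answer(NOT_ENOUGH)
    avg = sum(r["total"] for r in usable) / len(usable)
    sources = [f"orders#{r['id']} (fetched {r['fetched_at']})" for r in usable]
    return Answer(f"Average order value: {avg:.2f}", sources)
```

Every answer either carries the rows it was built from or is the refusal sentence, and there is no third path where it improvises.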
Plan for the days it falls over
The API will go down, the product data will be missing, someone will ask for something miles outside scope, and the model will hand back a weak answer on a Tuesday for no reason you'll ever actually locate. What decides whether you've built something real is what happens next, and if the honest answer is "we hope it's fine", then it isn't fine and it isn't ready. So you build the boring scaffolding instead:
- log the failure
- show a useful message
- avoid partial writes
- retry when retrying is sensible
- hand back to a human when judgement or risk is involved
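The list above can be sketched in a few lines. This is one possible shape under stated assumptions, not a definitive implementation: `TransientError`, `NeedsHuman` and `run_with_failure_path` are hypothetical names, and `commit` is assumed to be a single all-or-nothing write.

```python
import logging
import time

log = logging.getLogger("job")


class TransientError(Exception):
    """Hypothetical: a failure worth retrying, such as a timeout."""


class NeedsHuman(Exception):
    """Hypothetical: judgement or risk is involved, so stop and hand off."""


def run_with_failure_path(fetch, commit, retries=2, delay=0.0):
    """Run one job with logging, sensible retries and a human handoff.

    `fetch` builds the whole result in memory; `commit` is assumed to
    write it in one all-or-nothing step, so there are no partial writes.
    """
    for attempt in range(retries + 1):
        try:
            staged = fetch()
            commit(staged)
            return {"status": "ok", "message": "Done."}
        except TransientError as exc:
            log.warning("attempt %d failed: %s", attempt + 1, exc)
            if attempt < retries:
                time.sleep(delay)
        except NeedsHuman as exc:
            log.error("handing off to a person: %s", exc)
            return {"status": "handoff", "message": f"Needs a person to review: {exc}"}
    log.error("giving up after %d attempts", retries + 1)
    return {"status": "failed", "message": "The data source is unavailable; nothing was changed."}
```

The caller always gets a status and a message it can show to someone, whichever way the run went.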
I learned all of this the hard way, which is exactly why I say the stuff I build is battle-tested—because it literally is. A broad-match test once bit the SFT account before the guardrail existed to catch it, and that's the kind of thing you only really design around once it's already cost you something.
Keep a human in the loop
The point of all this was never to get people out of the loop entirely, because that's how you end up automating the production of nonsense, and nonsense with a project code and a dashboard is still nonsense.
The point is to move people onto the parts people are genuinely good at—the taste, the risk, the commercial judgement, the empathy, the final yes—while the machine quietly does the reading and the sorting and the drafting and the checking and the remembering, every night, without ever getting bored. Production-ready really just means the thing is still useful the morning after the demo, once the business has gone back to being its usual messy self.
If this is your problem
If you've got a job like this—a parts-finder, a catalogue rewrite, a daily briefing—I build these to one standard: a defined job, a data map, real permissions, logging, QA checks, human review and a clear failure path. Bespoke builds are scoped to your data and start at $5,000. If you'd rather test the water first, a $2,500 roadmap sprint is the low-commitment way in, and the always-on pieces—the 06:00 briefing, the guardrails that watch things overnight—run on the $3,000/mo Intelligence Retainer. Full pricing's on the AI implementation page.
Get those pieces in place and AI quietly stops being a toy you show people and starts being something you just rely on. That's the only version of it I've ever found worth building.