How to evaluate AI tools without falling for the demo

The demo is not the product, the keynote is not the product, the Twitter thread is not the product. The product is what your team is still using six weeks after onboarding, and most AI tools do not survive that long. Here’s how to filter before you commit.

1. Show me the integration, not the output

The output is always impressive. The model is genuinely smart. The question that matters is how this tool gets its inputs and where its outputs go.

Specifically: does it pull from where our content actually lives, or do we have to copy and paste it in? Does its output land somewhere our team already operates, or in a separate dashboard? Does it remember context across sessions, or do we start from scratch each time?

Tools that produce great output in a vacuum but require five steps of context-shuffling in real use are dead on arrival. Your team will use them for a week and stop.

2. What’s the depreciation curve

Some tools are wrappers. The underlying model is GPT-5 or Claude or Gemini, and the company provides the UI, the prompt engineering and the workflow on top. When the next model ships, those tools either keep up (good wrapper) or fall behind (bad wrapper).

The question to ask the vendor: “When [next major model] ships in six months, what’s your timeline to incorporate it?” Vendors with a real engineering team have a clear answer; vendors without one will dodge.

Wrappers can be perfectly fine. The right wrapper is sometimes better than using the model directly, but you need to know what you’re buying.

3. Does it survive when one specific senior person leaves

A lot of AI workflows get set up by one early adopter on the team who has the patience to wire it together. Six months in, that person leaves, and the question is what’s left.

If the answer is “a documented workflow that the next person can pick up”, the tool has survived its first real test. If the answer is “a tangle of prompts, shortcuts and undocumented integrations only Sarah understood”, you’ll be replacing the whole stack.

Ask before you adopt: who owns the documentation, where does it live, and when was it last updated?

4. What does your power user do with it

Vendors should be able to introduce you to their best customer: not a generic case study, but an actual operator using the tool daily. If they can’t, the tool isn’t sticky enough yet.

When you talk to that customer, the questions are: What was your workflow before this tool? What broke when you first deployed it? Three months in, what do you actually use it for, versus what you thought you would? What would you go back to if it disappeared tomorrow?

That last question is the killer. Tools that disappear and don’t get missed weren’t really being used.

5. The four-week rule

Don’t roll a tool out across the team on day one. Pick one person, ideally the most sceptical operator on the team rather than the most enthusiastic, and have them run it solo for four weeks. Real work, daily use. At the end of the four weeks they write a one-page memo with one recommendation: keep, kill, or pilot wider.

Most tools die in week three. The shine wears off, the integration friction adds up, the marginal value of the output drops as the team gets used to it. The tools that survive a sceptical four-week pilot are the ones worth deploying.

This is slower than the standard “let’s all sign up and try it” rollout, but that rollout is why most teams end up paying for fifteen overlapping subscriptions to tools nobody is using. The four-week pilot saves the budget and the attention.
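
If the pilot operator keeps a simple log of the days they actually used the tool, the memo can rest on numbers rather than memory. The sketch below is only an illustration: the function names, the sample dates and the two-days-per-week threshold are all assumptions, not part of any particular tool.

```python
from collections import Counter
from datetime import date
from typing import Optional


def weekly_usage(pilot_start: date, usage_dates: list[date], weeks: int = 4) -> list[int]:
    """Return the number of distinct days the tool was used in each pilot week."""
    counts = Counter()
    for day in set(usage_dates):  # several sessions on one day count once
        offset = (day - pilot_start).days
        if 0 <= offset < weeks * 7:  # ignore anything outside the pilot window
            counts[offset // 7] += 1
    return [counts[week] for week in range(weeks)]


def first_quiet_week(weekly: list[int], min_days: int = 2) -> Optional[int]:
    """Return the 1-based week where usage first fell below min_days, or None.

    min_days is a hypothetical threshold; pick whatever "real daily use"
    means for your team.
    """
    for index, days_used in enumerate(weekly, start=1):
        if days_used < min_days:
            return index
    return None


# Hypothetical log for a pilot starting Monday 6 January 2025.
pilot_start = date(2025, 1, 6)
log = [
    date(2025, 1, 6), date(2025, 1, 7), date(2025, 1, 8), date(2025, 1, 9), date(2025, 1, 10),
    date(2025, 1, 13), date(2025, 1, 14), date(2025, 1, 16), date(2025, 1, 17),
    date(2025, 1, 22),
]

weekly = weekly_usage(pilot_start, log)
print(weekly)                    # [5, 4, 1, 0]
print(first_quiet_week(weekly))  # 3
```

A count like this doesn’t replace the operator’s judgment. It just makes it harder to write “keep” for a tool that quietly went unused after week two.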

The deeper point

The AI tooling market is in a fast-cycling, frothy state. Some of it is genuinely transformative; most of it is incremental at best and noise at worst. The default position should be scepticism, the test should be small, and the bar for keeping a tool in the stack should be high. (For the bigger pattern this falls inside, see the hidden cost of AI adoption nobody talks about: most adoption fails at the layer above the tool, not the tool itself.)

The teams that win don’t have the biggest stack; they have the tightest one: three tools they use deeply, with the workflows wired in, the documentation maintained and the operators expert at running them. That beats fifteen tools used shallowly every time. (When you’re ready to roll a stack across the team, the 90-day AI integration plan is the practical sequence we use.)