Two different answers from GPT-4o - one right, one wrong!?

But that’s not the issue here, surely?

The issue is with the reasoning.

Giving the LLM access to a calculator certainly helps (ChatGPT has had that access for a long time), but it does not make up for poor logic or reasoning: doing the wrong calculation "correctly" is arguably as bad as doing the right calculation incorrectly. Indeed, the former can make the error more convincing, and therefore harder to detect.
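To make the point concrete, here is a minimal sketch (the problem, figures, and "calculator tool" are hypothetical, not from the original exchange): a perfectly reliable arithmetic tool happily evaluates whatever expression it is given, so a flawed plan still produces a confident, precisely computed answer.

```python
def calculator(expression: str) -> float:
    """A perfectly reliable arithmetic tool (restricted eval, numbers only)."""
    return eval(expression, {"__builtins__": {}}, {})

# Hypothetical word problem: "Grow £100 by 10%, then by 10% again."
right_plan = "100 * 1.10 * 1.10"  # sound reasoning: compound the two increases
wrong_plan = "100 * 1.20"         # flawed reasoning: just add the percentages

print(round(calculator(right_plan), 2))  # 121.0 - correct plan, correct arithmetic
print(round(calculator(wrong_plan), 2))  # 120.0 - wrong plan, equally correct arithmetic
```

Both results come back exact to the penny, which is precisely the problem: the tool guarantees the arithmetic, not the reasoning that chose the expression.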