I merged a fix here that included updating the prompt and moving examples away from the system prompt and into proper interaction.
https://github.com/discourse/discourse/pull/35763
Our team is also currently working on evals to improve reliability across various LLMs.