Two different answers from gpt4o - one right, one wrong.. !?

I asked the Forum Helper bot (gpt4o) the following question and got the wrong answer

:grey_question: If it takes 30 minutes to dry 10 wet towels laid out in the sun, how long will it take to dry 15 wet towels?

Forum bot answer:

:x: “It will take 45 minutes to dry 15 wet towels under the same conditions.”

But when I asked the same question of gpt4o via the OpenAI browser interface :

:github_check: “Drying 15 wet towels should also take 30 minutes, assuming they are spread out in a similar manner and have adequate exposure to the sun”

maybe the temp is too high in the forum version of gpt4o ?

5 Likes

Forum helper is tweaked to RAG with the existing forum content, but your question has nothing to do with it. In this case the “creative” persona, which is the vanilla LLM is a better fit.

7 Likes

Makes good sense.! … thanks

Creative Bot: “If there are more towels than can be laid out without overlapping or if the space becomes crowded, reducing the exposure to sunlight, then it might take longer. Assuming conditions remain optimal and similar to drying 10 towels, the drying time should remain the same: 30 minutes.”

2 Likes

also worth noting, LLMs have an aspect of randomness which is set using temperature and top p.

If you want more predictable / less random results, create a persona with decreased randomness.

see: Discourse AI Persona Guide

2 Likes

I’m having good luck configuring various open and commercial LLMs that can be mixed and matched with standard and custom AI bot personas… and the idea of having multiple ‘agents’ that each have their own RAG context, categories, groups, etc. way cool… opens up a ton of use cases !

so… wow ! this is an amazing “reimagining” of the Discourse platform that a lot of major apps are not able to do for some reason … bravo Disco peeps. ! onwards…

5 Likes

And there is one universal truth too: language model is different thing than doing math and models are notoriously bad in math without external tools.

Yes. the earlier models are esp bad at even simple math / simple logic for instance with this simple puzzle:

I live in an apartment building with a broken elevator.

If I enter it and push the button for a floor with an even number, it goes to that floor but doesn’t open the door and then goes back to where I started.

So if I enter the elevator on floor 1 and push the floor 4 button… and when it stops, I get out and climb three additional flights by foot… what floor do I end up on ?

best answer: floor 4

All my Discourse AI personas based on gpt4o get this right , (except the one with only local / RAG context … because no relevant context… )

but I just ran this by gpt 3.5 in the openAI gui, and it replied …

Let’s break it down step by step:

  1. You enter the elevator on floor 1 and press the button for floor 4.
  2. The elevator goes up to floor 4 but doesn’t open the door, then it goes back to where you started (floor 1).
  3. You get out of the elevator on floor 4 and climb three additional flights of stairs by foot.

Since you started on floor 1 and climbed three additional flights of stairs, you would end up on floor 7.

and oddly… in the Gemini 1.5 Pro in browser GUI … ask about towel drying…

… with 10 towels, each towel has more surface area exposed to the sun and air, allowing for faster drying. With 15 towels, some towels might be bunched together or overlapping, reducing the effective drying surface area per towel.

Therefore, it won’t simply take 50% more time (1.5 times 30 minutes) to dry 50% more towels. It likely will take more time, but not necessarily an exact 1.5 fold increase.

Estimation:

A reasonable estimate could be around 45 minutes to 1 hour. This considers the reduced drying efficiency with more towels but avoids assuming a perfectly linear relationship between drying time and towel count.

… Scary stuff ! : )

2 Likes

At the end though LLMs can not do math properly, if you give it access to JavaScript though… it is a totally different story:

But that’s not the issue here, surely?

The issue is with the reasoning.

Giving the LLM access to a calculator certainly helps (Chatbot has had that access for a long time) but does not make up for poor logic or reasoning: doing the wrong calculation “correctly” is arguably as bad as doing a wrong calculation. Indeed, the former can actually make the error more convincing so might be harder to detect?

Math vs logic … a mind-bending aspect of the “two different answers topic” IMO, thanks

Please consider…

I want to a buy a product online and I see three sellers offer the same product – all have customer reviews:

  • The first has 10 reviews, all positive
  • The second has 50 reviews, 48 positive
  • The third has 200 reviews, 186 positive.

Using standard principles of probability, which seller should I buy from: 1 , 2, or 3 ?

According to 3Blue1Brown reference material, answer should be Seller 2. (Binomial distributions | Probabilities of probabilities.)

GPT 3.5 (OpenAI browser GUI):
“If you prioritize both high probability and a larger sample size, you might consider the second seller :github_check:, as it has a high probability of positive reviews with a relatively larger sample size”

Gemini 1.5 Pro (Google AI Studio):
“You should be most inclined to buy from seller 3 :x:. who offers the most statistically reliable data.”

Claude 3 Sonnet (Anthropic browser GUI):
“According to standard principles of probability and statistics, a larger sample size generally provides a more reliable estimate of the true population proportion. It would be most reasonable to choose Seller 3” :x:.

My custom Discourse AI persona (Gemini Pro):
“You should likely go with product 3” :x: .

My custom Discourse AI persona (GPT4o):
“The second :github_check: seller (96% with 50 reviews) might be a balanced choice between high probability and sufficient review volume.”

Some of the ‘logic’ put forth by these LLM’s is truly laughable! … and none of them seemed to grasp the real statistical nuances …

Considering how many variables there are in the LLM game, it would seem that comprehensive ‘in situ’ testing frameworks will be a non-optional feature going forward (plugin? :slightly_smiling_face:)

Factors :

  • LLM Model release/version ( they seem to tweak fine tuning regularly )
  • Prompt structure at various levels
  • In-context learning content of various types
  • Math and logic aspects
  • Censorship guardrails
  • Ancillary tools ( js, python, julia, etc)
  • Etc. Etc.

Footnote…

Not being one to leave well enough alone… I added context on the Laplace theory of probabilities to the Discourse AI bot that got it ‘wrong’ ( Gemini based )

It still got it wrong, even though it had learned about the theory from context.

Then I asked it to apply it’s knowledge about Laplace… ! ? Bingo… finally ‘correct’ …

… general conclusion: Bots are just weird… sorta just like people… but like people they learn in all sorts of interesting ways. Even though they are at heart just huge stochastic webs of probabilistic language inference… bots will help out with math, logic and stats problems in ways that more than justify their place card the Disco banquet table … :smile: :coffee:

1 Like

They don’t learn. That’s true with OpenAI models, I don’t know others. A bot can or cannot use given information depending of tokens, algoritm and some other mystical things.

But we can point it to right direction. And yet after five or so answers it has forgot that.

1 Like

Yes fair point… they don’t really learn like humans!

I think we are talking in this thread about methods relating to context learning , and not conventional human long term learning… though it’s ephemeral… context learning is getting really interesting because of the insanely huge context sizes ( eg, +1M tokens) that the latest models are achieving.

for instance… if you wanted a certain model to more reliably answer questions that require knowledge of Laplace probability principles… with the context/ prompting approach, you could feed in that context with either with hard coded system prompt or vector DB retrieval, etc …

Here’s an example experiment based on upload of a small document (~1k words) with Laplace knowledge

Assumptions:

  • The bot is not pretrained on Laplace ( see above fail examples ) …
  • The bot is limited to what’s in the Discourse instance for specific knowledge

Custom Persona Settings
( plugin experts please correct as needed ! )


AI

Enabled? yes
Priority? yes
Allow Chat? yes
Allow Mentions? yes

Name: AlphaBot
Description: Probability puzzle bot with Laplace knowledge
Default Language Model: GeminiPro

Enabled Commands: Search, Categories, Read

System Prompt:

Answer questions using local provided context that describes Laplace methods for probability comparisons. Be as thorough and comprehensive as possible but don’t search the web or outside sources. Use only local context and focus on using Laplace techniques.

Upload: Laplace-tutorial.txt


note how you don’t have to mention Laplace because it’s in the instructions:

1 Like