AI Summarization Backfill is stuck, keeps regenerating the same topic

I’m trying to backfill the last 90 days. It successfully backfilled about 500 out of 2000 topics, but progress has now stalled because it keeps processing the same topic every time the job runs (every 5 minutes). I have no idea why: the topic already has a valid summary and has had no new posts in the past 12 days. The ai_api_audit_logs table shows each request is successful. Sidekiq status is OK. There is nothing relevant in /logs. How can I debug this?

SELECT request_tokens,
       response_tokens,
       raw_response_payload,
       topic_id,
       created_at AS summary_created_at,
       language_model
FROM ai_api_audit_logs
WHERE raw_request_payload LIKE '%Franklin Lexington%'
ORDER BY created_at DESC
LIMIT 40

All of the raw response payloads look valid. Each is slightly different as expected. Here’s an example (I’ve redacted most of the content text):


{
  "id": "msg_016C2dHZik2Miwe16pRHFe9z",
  "type": "message",
  "role": "assistant",
  "model": "claude-3-5-sonnet-20241022",
  "content": [
    {
      "type": "text",
      "text": "The Franklin Lexington Private Markets Fund (FLEX) is a new private equity secondaries fund... [REDACTED] "
    }
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 13848,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0,
    "output_tokens": 416
  }
}

I don’t know if this is a clue, but does the 103 mean it thinks it has only summarized through post 103, and there are 108 posts, so it needs a new summary?

Query generated by SQL Helper:

-- [params]
-- integer :summary_type = 0
-- integer :max_age_days = 30
-- integer :min_word_count = 100

WITH topic_candidates AS (
  SELECT 
    t.id as topic_id,
    t.title,
    t.created_at,
    t.word_count,
    t.highest_post_number,
    t.last_posted_at,
    ais.id as summary_id,
    ais.content_range,
    ais.created_at as summary_created_at,
    UPPER(ais.content_range) as content_range_upper
  FROM topics t
  LEFT OUTER JOIN ai_summaries ais ON 
    t.id = ais.target_id AND
    ais.target_type = 'Topic' AND
    ais.summary_type = :summary_type
  WHERE 
    -- Word count threshold
    t.word_count >= :min_word_count
    -- Age restriction
    AND t.updated_at > CURRENT_TIMESTAMP - (:max_age_days || ' DAYS')::INTERVAL
    -- Either no summary exists OR summary needs updating
    AND (
      ais.id IS NULL OR 
      (
        UPPER(ais.content_range) < t.highest_post_number + 1
        AND ais.created_at < (CURRENT_TIMESTAMP - INTERVAL '5 minutes')
      )
    )
)
SELECT 
  topic_id,
  title,
  word_count,
  highest_post_number,
  created_at,
  last_posted_at,
  summary_id,
  content_range,
  summary_created_at,
  content_range_upper
FROM topic_candidates
ORDER BY summary_created_at DESC NULLS FIRST, last_posted_at DESC

There are 5 hidden or deleted posts in this topic. 108-103=5. Is it possible it’s not handling hidden and deleted posts correctly? Yes 108 is the highest post, but only 103 are ever passed in, so it continually thinks there are 5 new posts to summarize.

@Falco @sam

I have another topic that also keeps regenerating the summary. That topic has no hidden or deleted posts, but is pinned, which creates a post “this topic was pinned globally…”. So max post is 8, but that last post isn’t really passed in?

Yeah I think you are on to something here @Roman_Rizzi

best_replies is not guaranteed to include the last public post on a topic. We should add the last post unconditionally even when picking best_replies.
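
A minimal sketch of that idea, not the plugin's actual code: the `Post` type, the `score` field and the function names are all hypothetical. It picks the highest-scoring posts and then appends the last post unconditionally, so the range a summary claims to cover always reaches the end of the topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    number: int   # post_number within the topic
    score: float  # hypothetical ranking score used to pick "best" replies


def pick_posts(posts: list[Post], limit: int) -> list[Post]:
    """Pick the `limit` highest-scoring posts, then always add the last post."""
    if not posts:
        return []
    best = sorted(posts, key=lambda p: p.score, reverse=True)[:limit]
    last = max(posts, key=lambda p: p.number)
    if last not in best:
        best.append(last)
    return sorted(best, key=lambda p: p.number)


def covered_through(selected: list[Post]) -> int:
    """Highest post number the summary can claim to cover."""
    return max(p.number for p in selected)


posts = [Post(n, s) for n, s in [(1, 9.0), (2, 7.5), (3, 1.0), (4, 8.0), (5, 0.5)]]
selected = pick_posts(posts, limit=3)
print([p.number for p in selected])  # [1, 2, 4, 5]
print(covered_through(selected))     # 5
```

Without the unconditional append, the selection here would stop at post 4, and a check against the topic's highest post number (5) would keep flagging the summary as outdated.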

Thanks for looking into it. Just to clarify the impact: it’s not only the inefficiency of regenerating the same summary repeatedly, it completely stops the backfill. Say you have backfill set to 24 topics per hour. Every 5 minutes it tries 2 topics. Eventually both topics have an issue (deleted posts, a pinned topic, or something else), and it keeps retrying the same 2 topics every 5 minutes. It doesn’t even try any other topics.
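
The arithmetic and the stall can be illustrated with a small simulation. This is a sketch under assumptions: the queue model, the `stuck` set and the function name are hypothetical and only mimic the behaviour described above, not the real job.

```python
TOPICS_PER_HOUR = 24
RUNS_PER_HOUR = 60 // 5                        # the job runs every 5 minutes
BATCH_SIZE = TOPICS_PER_HOUR // RUNS_PER_HOUR  # 2 topics per run


def run_backfill(candidates: list[str], stuck: set[str], runs: int) -> list[list[str]]:
    """Simulate `runs` job runs and return the batch processed in each run.

    `candidates` is ordered by priority. A topic in `stuck` still looks
    outdated after it is summarized, so it never leaves the queue.
    """
    queue = list(candidates)
    history = []
    for _ in range(runs):
        batch = queue[:BATCH_SIZE]
        history.append(batch)
        # Only topics whose summary now looks current are removed.
        queue = [t for t in queue if t not in batch or t in stuck]
    return history


history = run_backfill(["a", "b", "c", "d", "e"], stuck={"b", "c"}, runs=4)
print(history)  # [['a', 'b'], ['b', 'c'], ['b', 'c'], ['b', 'c']]
```

Once two stuck topics reach the head of the queue they fill every batch, so topics `d` and `e` are never attempted.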

Does it need .where("NOT deleted") also? I don’t understand the difference between hidden and deleted, but I have both hidden and deleted in my problem topic.

Hey Mark,

I’m currently looking into this and trying to figure out what’s going on. What @sam pointed out is true—best_replies not including the last post could cause the job to stall. I’m about to wrap up a fix for that and introduce a fail-safe to turn the job into a no-op and log an error if there’s a bug in the query. This is not ideal and it shouldn’t happen, but it’s better than regenerating the same summary over and over again.
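
One way such a fail-safe could look, as a rough sketch only: the function, its parameters and the logger name are hypothetical, and the real fix may work differently. The idea is that a topic summarized on the previous run should no longer be a candidate, so seeing it again is logged as an error and skipped rather than regenerated.

```python
import logging

logger = logging.getLogger("backfill_sketch")


def select_batch(candidates: list[int], just_summarized: set[int], batch_size: int) -> list[int]:
    """Return up to `batch_size` topic ids, skipping suspicious repeats.

    A topic that was summarized on the previous run should not qualify again.
    If it does, assume a bug in the candidate query: log an error and skip
    the topic instead of regenerating its summary.
    """
    batch = []
    for topic_id in candidates:
        if topic_id in just_summarized:
            logger.error("topic %s is still a candidate right after summarization; skipping", topic_id)
            continue
        batch.append(topic_id)
        if len(batch) == batch_size:
            break
    return batch


print(select_batch([101, 102, 103, 104], just_summarized={101, 102}, batch_size=2))  # [103, 104]
print(select_batch([101, 102], just_summarized={101, 102}, batch_size=2))            # []
```

In the second call every candidate is a repeat, so the run becomes a no-op.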

On the other hand, I’m looking at the Franklin Lexington... example you shared, which I suspect is a different issue because:

  1. It already has a summary (149).
  2. UPPER(ais.content_range) < t.highest_post_number + 1 should be FALSE. UPPER should return 109, which equals highest_post_number + 1.
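
The comparison in point 2 can be checked with a few lines. This sketch assumes content_range is a PostgreSQL integer range, which is stored in canonical form with an inclusive lower bound and an exclusive upper bound, so a summary covering posts 1 through 108 has an upper bound of 109. The helper names are hypothetical.

```python
def canonical_upper(last_post_included: int) -> int:
    """Exclusive upper bound of the half-open range [first, last + 1)."""
    return last_post_included + 1


def looks_outdated(range_upper: int, highest_post_number: int) -> bool:
    """Mirror of: UPPER(ais.content_range) < t.highest_post_number + 1"""
    return range_upper < highest_post_number + 1


print(looks_outdated(canonical_upper(108), 108))  # False: 109 < 109
print(looks_outdated(canonical_upper(103), 108))  # True:  104 < 109
```

The second call shows what the stored range would have to look like (ending at post 103) for the query to keep selecting the topic.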

The end of the content_range interval would need to be lower for the job to stall due to that. We don’t really care about hidden or deleted posts when deciding if a summary is outdated; we only want to avoid summarizing those.

Do you have ai_summary_gists_enabled enabled? If so, can you confirm that you have two summaries for that topic, one with type 0 and one with type 1? Also, in the ai_api_audit_logs query you shared, what is the value in the feature_name column?

I’ll keep you updated with what I find.

Thanks for looking into this. Happy to provide error logs when your fix is ready.

I don’t see how to enable ai_summary_gists_enabled, so I don’t think it’s enabled.

The feature_name column contains “summarize”.

This will unblock the queue and make the job more resilient to these kinds of problems:

Could you please let me know how things look after updating your site?

It’s no longer regenerating summaries for topics that already had a summary, so that’s good.

But it’s not summarizing many topics, even when there are new posts, because of:

So backfill thinks it’s done, but maybe 75% of the Latest topics do not have a summary, even though they have new posts, just because they happened to be created more than 90 days ago. I’m sure that’s not the intent. Please change it or help me understand why.
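
To make the difference concrete, here is a small sketch contrasting the two possible age checks. The function names and the dates are made up for illustration; this is not the plugin's code, only the behaviour described above next to an alternative based on the latest post.

```python
from datetime import datetime, timedelta


def eligible_by_created_at(created_at: datetime, now: datetime, max_age_days: int) -> bool:
    """Age restriction as described above: based on when the topic was created."""
    return created_at > now - timedelta(days=max_age_days)


def eligible_by_last_posted_at(last_posted_at: datetime, now: datetime, max_age_days: int) -> bool:
    """Alternative: based on the most recent post in the topic."""
    return last_posted_at > now - timedelta(days=max_age_days)


now = datetime(2025, 1, 1)                 # arbitrary reference date
created_at = now - timedelta(days=200)     # an old topic...
last_posted_at = now - timedelta(days=1)   # ...with a new post yesterday

print(eligible_by_created_at(created_at, now, 90))          # False
print(eligible_by_last_posted_at(last_posted_at, now, 90))  # True
```

An old but active topic is skipped by the first check and picked up by the second.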