Fix real-time task progress by closing WS on pubsub exit and keeping polling active

Three interconnected bugs prevented progress updates from reaching the frontend:

1. _forward_pubsub could exit silently while _handle_client_messages kept
   the WebSocket alive (responding to pings), so the client never detected
   the broken forwarding path. Replace asyncio.gather with asyncio.wait
   (FIRST_COMPLETED) so that when either coroutine exits, the other is
   cancelled and the WebSocket closes.

2. Polling was stopped on WS connect, with no fallback if forwarding broke.
   Polling now always runs alongside the WebSocket as a safety net.

3. Redis publish failures in task_progress_publisher were logged at DEBUG
   and the broken client was reused forever. Log at WARNING and reset the
   client so the next call reconnects.
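
The pattern behind fix 1 can be sketched roughly as follows. This is a minimal
illustration, not the code from this commit: run_until_first_exit, forwarder
and client_loop are hypothetical names standing in for the real handler and
its two coroutines. Note that asyncio.wait only reports which tasks finished;
the pending ones have to be cancelled explicitly.

```python
import asyncio


async def run_until_first_exit(*coros):
    """Run coroutines together; once any one finishes, cancel the others."""
    tasks = [asyncio.create_task(c) for c in coros]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # asyncio.wait does not cancel anything itself, so do it explicitly.
    for task in pending:
        task.cancel()
    # Let the cancellations finish; collect CancelledError instead of raising.
    await asyncio.gather(*pending, return_exceptions=True)
    # Surface a failure from any task that completed with an exception.
    for task in done:
        if not task.cancelled() and task.exception() is not None:
            raise task.exception()
    return done, pending


async def demo():
    events = []

    async def forwarder():  # hypothetical stand-in for _forward_pubsub
        await asyncio.sleep(0.01)
        events.append("forwarder exited")

    async def client_loop():  # hypothetical stand-in for _handle_client_messages
        try:
            await asyncio.sleep(60)
        except asyncio.CancelledError:
            events.append("client loop cancelled")
            raise

    await run_until_first_exit(forwarder(), client_loop())
    return events
```

With asyncio.gather, the client loop in this sketch would keep running for the
full 60 seconds after the forwarder exits, which mirrors the silent failure
described in point 1.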
Viktor Barzin 2026-02-09 22:48:57 +00:00
parent 8d52bdf99d
commit 791b5a9d55
GPG key ID: 0EB088298288D958
3 changed files with 362 additions and 19 deletions

@@ -35,8 +35,9 @@ def publish_task_progress(task_id: str, state: str, meta: dict[str, Any]) -> Non
         state: Celery state string (e.g. 'PROGRESS', 'SUCCESS').
         meta: Metadata dict (progress, phase, logs, counters, etc.).
 
-    Failures are caught and logged at DEBUG level so they never break the
-    critical task execution path.
+    Failures are caught and logged at WARNING level so they never break the
+    critical task execution path. The Redis client is reset on failure so
+    subsequent calls can reconnect.
     """
     try:
         client = _get_redis_client()
@@ -47,4 +48,6 @@ def publish_task_progress(task_id: str, state: str, meta: dict[str, Any]) -> Non
         })
         client.publish(f"task_progress:{task_id}", payload)
     except Exception:
-        logger.debug("Failed to publish task progress for %s", task_id, exc_info=True)
+        logger.warning("Failed to publish task progress for %s", task_id, exc_info=True)
+        # Reset client so next call creates a fresh connection
+        _redis_client = None
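
One caveat about the reset, sketched below under assumptions: the hunks shown
contain no `global _redis_client` declaration inside publish_task_progress. If
none exists elsewhere in the function, the assignment in the except block
binds a function-local name and the cached module-level client is never
cleared. The following self-contained sketch shows the reset working with the
declaration in place. FlakyClient and the body of _get_redis_client are
hypothetical stand-ins, not the project's actual code.

```python
import logging

logger = logging.getLogger(__name__)

_redis_client = None  # module-level cache, mirroring the name used in the diff


class FlakyClient:
    """Hypothetical stand-in for a Redis client whose publish always fails."""

    created = 0  # counts how many client instances were built

    def __init__(self):
        FlakyClient.created += 1

    def publish(self, channel, payload):
        raise ConnectionError("connection lost")


def _get_redis_client():
    """Build the cached client on first use (assumed shape of the helper)."""
    global _redis_client
    if _redis_client is None:
        _redis_client = FlakyClient()
    return _redis_client


def publish_task_progress(task_id, payload):
    # Without this declaration, the assignment in the except block would
    # create a function-local name and leave the cached client untouched.
    global _redis_client
    try:
        _get_redis_client().publish(f"task_progress:{task_id}", payload)
    except Exception:
        logger.warning("Failed to publish task progress for %s", task_id, exc_info=True)
        _redis_client = None  # next call builds a fresh client
```

Calling publish_task_progress twice in this sketch builds two FlakyClient
instances, which confirms that the second call reconnected instead of reusing
the broken client.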