## TL;DR

When running codex with `-c features.tui_app_server=true` we see corruption when streaming large amounts of data. This PR marks additional event types as _critical_ by making them _must-deliver_.

## Problem

When the TUI consumer falls behind the app-server event stream, the bounded `mpsc` channel fills up and the forwarding layer drops events via `try_send`. Previously only `TurnCompleted` was marked as must-deliver. Streamed assistant text (`AgentMessageDelta`) and the authoritative final item (`ItemCompleted`) were treated as droppable, the same as ephemeral command output deltas. Because the TUI renders markdown incrementally from these deltas, dropping any of them produces permanently corrupted or incomplete paragraphs that persist for the rest of the session.

## Mental model

The app-server event stream has two tiers of importance:

1. **Lossless (transcript + terminal):** Events that form the authoritative record of what the assistant said or that signal turn lifecycle transitions. Losing any of these corrupts the visible output or leaves surfaces waiting forever. These are: `AgentMessageDelta`, `PlanDelta`, `ReasoningSummaryTextDelta`, `ReasoningTextDelta`, `ItemCompleted`, and `TurnCompleted`.
2. **Best-effort (everything else):** Ephemeral status events like `CommandExecutionOutputDelta` and progress notifications. Dropping these under load causes cosmetic gaps but no permanent corruption.

The forwarding layer uses `try_send` for best-effort events (dropping on backpressure) and blocking `send().await` for lossless events (applying backpressure to the producer until the consumer catches up).
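
In sketch form (simplified enum and signatures, not the exact code in this PR; the real `forward_in_process_event` also emits the lag marker described below), the split looks like this:

```rust
use tokio::sync::mpsc;

// Illustrative event shape only; the real ServerNotification enum has more
// variants and richer payloads.
enum Notification {
    AgentMessageDelta(String),
    PlanDelta(String),
    ReasoningSummaryTextDelta(String),
    ReasoningTextDelta(String),
    ItemCompleted(String),
    TurnCompleted,
    CommandExecutionOutputDelta(String),
}

// Transcript and turn-lifecycle events are lossless; everything else is
// best-effort and may be dropped under load.
fn requires_delivery(event: &Notification) -> bool {
    matches!(
        event,
        Notification::AgentMessageDelta(_)
            | Notification::PlanDelta(_)
            | Notification::ReasoningSummaryTextDelta(_)
            | Notification::ReasoningTextDelta(_)
            | Notification::ItemCompleted(_)
            | Notification::TurnCompleted
    )
}

async fn forward(tx: &mpsc::Sender<Notification>, event: Notification, skipped: &mut u64) {
    if requires_delivery(&event) {
        // Lossless: block until the consumer drains the queue (backpressure).
        let _ = tx.send(event).await;
    } else if tx.try_send(event).is_err() {
        // Best-effort: queue full, drop the event and record the skip.
        *skipped += 1;
    }
}
```

Widening the `requires_delivery` set to cover the transcript deltas is the substance of the change; the try-send/drop path for best-effort events is unchanged.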

## Non-goals

- Eliminating backpressure entirely. The bounded queue is intentional; this change only widens the set of events that survive it.
- Changing the event protocol or adding new notification types.
- Addressing root causes of consumer slowness (e.g. TUI render cost).

## Tradeoffs

Blocking on transcript events means a slow consumer can now stall the producer for the duration of those events. This is acceptable because: (a) the alternative is permanently broken output, which is worse; (b) the consumer already had to keep up with `TurnCompleted` blocking sends; and (c) transcript events arrive at model-output speed, not burst speed, so sustained saturation is unlikely in practice.

## Architecture

Two parallel changes, one per transport:

- **In-process path** (`lib.rs`): The inline forwarding logic was extracted into `forward_in_process_event`, a standalone async function that encapsulates the lag-marker / must-deliver / try-send decision tree. The worker loop now delegates to it. A new `server_notification_requires_delivery` function (shared `pub(crate)`) centralizes the notification classification.
- **Remote path** (`remote.rs`): The local `event_requires_delivery` now delegates to the same shared `server_notification_requires_delivery`, keeping both transports in sync.

## Observability

No new metrics or log lines. The existing `warn!` on event drops continues to fire for best-effort events. Lossless events that block will not produce a log line (they simply wait).

## Tests

- `event_requires_delivery_marks_transcript_and_terminal_events`: unit test confirming the expanded classification covers `AgentMessageDelta`, `ItemCompleted`, `TurnCompleted`, and excludes `CommandExecutionOutputDelta` and `Lagged`.
- `forward_in_process_event_preserves_transcript_notifications_under_backpressure`: integration-style test that fills a capacity-1 channel, verifies a best-effort event is dropped (skipped count increments), then sends lossless transcript events and confirms they all arrive in order with the correct lag marker preceding them.
- `remote_backpressure_preserves_transcript_notifications`: end-to-end test over a real websocket that verifies the remote transport preserves transcript events under the same backpressure scenario.
- `event_requires_delivery_marks_transcript_and_disconnect_events` (remote): unit test confirming the remote-side classification covers transcript events and `Disconnected`.

---------

Co-authored-by: Eric Traut <etraut@openai.com>
# codex-app-server-client

Shared in-process app-server client used by conversational CLI surfaces:

- `codex-exec`
- `codex-tui`

## Purpose

This crate centralizes startup and lifecycle management for an in-process
`codex-app-server` runtime, so CLI clients do not need to duplicate:

- app-server bootstrap and initialize handshake
- in-memory request/event transport wiring
- lifecycle orchestration around caller-provided startup identity
- graceful shutdown behavior

## Startup identity

Callers pass both the app-server `SessionSource` and the `initialize`
`client_info.name` explicitly when starting the facade.
That keeps thread metadata (for example in `thread/list` and `thread/read`)
aligned with the originating runtime without baking TUI/exec-specific policy
into the shared client layer.
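
For illustration, a tiny sketch of what that explicit identity might look like on the caller side; the type and field names here are hypothetical, not the crate's actual API:

```rust
// Hypothetical shapes for illustration only; the crate's real startup API
// and type names may differ.
enum SessionSource {
    Exec,
    Tui,
}

struct StartupIdentity {
    /// Recorded on threads so thread/list and thread/read can attribute them.
    session_source: SessionSource,
    /// Sent as client_info.name in the initialize handshake.
    client_name: &'static str,
}

// Each surface declares its own identity rather than the shared layer guessing it.
fn exec_identity() -> StartupIdentity {
    StartupIdentity {
        session_source: SessionSource::Exec,
        client_name: "codex-exec",
    }
}
```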

## Transport model

The in-process path uses typed channels:
- client -> server: `ClientRequest` / `ClientNotification`
- server -> client: `InProcessServerEvent` (`ServerRequest`, `ServerNotification`, `LegacyNotification`)

JSON serialization is still used at external transport boundaries (stdio/websocket), but the in-process hot path is typed.
Typed requests still receive app-server responses through the JSON-RPC result envelope internally. That is intentional: the in-process path is meant to preserve app-server semantics while removing the process boundary, not to introduce a second response contract.
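
As a rough sketch of the wiring (illustrative stand-in types only; the real request, notification, and event enums carry many more variants and fields):

```rust
use tokio::sync::mpsc;

// Illustrative stand-ins only; the crate's real ClientRequest, ClientNotification,
// and InProcessServerEvent enums carry many more variants and richer payloads.
enum ClientRequest {
    // e.g. thread/start, thread/resume, ... (payloads omitted)
    ThreadStart,
    ThreadResume,
}

enum InProcessServerEvent {
    ServerRequest(String),
    ServerNotification(String),
    LegacyNotification(String),
    Lagged,
}

// The in-process transport is a pair of bounded typed channels; nothing on
// this hot path is serialized to JSON.
fn wire_in_process(
    capacity: usize,
) -> (
    mpsc::Sender<ClientRequest>,
    mpsc::Receiver<InProcessServerEvent>,
    mpsc::Receiver<ClientRequest>,
    mpsc::Sender<InProcessServerEvent>,
) {
    let (req_tx, req_rx) = mpsc::channel(capacity);
    let (event_tx, event_rx) = mpsc::channel(capacity);
    // (client half, then server half)
    (req_tx, event_rx, req_rx, event_tx)
}
```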

## Bootstrap behavior

The client facade starts an already-initialized in-process runtime, but thread bootstrap still follows normal app-server flow:
- caller sends `thread/start` or `thread/resume`
- app-server returns the immediate typed response
- richer session metadata may arrive later as a `SessionConfigured` legacy event

Surfaces such as TUI and exec may therefore need a short bootstrap phase where they reconcile startup response data with later events.
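
A small sketch of that reconciliation, using hypothetical response and event shapes rather than the crate's real ones:

```rust
// Hypothetical shapes; the real thread/start response and SessionConfigured
// event carry different and richer fields.
struct ThreadStarted {
    thread_id: String,
}

enum LegacyEvent {
    SessionConfigured { thread_id: String },
    Other,
}

// During the short bootstrap phase, prefer the richer SessionConfigured
// metadata when it arrives, otherwise keep the immediate typed response.
fn reconcile(started: ThreadStarted, later: Option<LegacyEvent>) -> String {
    match later {
        Some(LegacyEvent::SessionConfigured { thread_id }) => thread_id,
        _ => started.thread_id,
    }
}
```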

## Backpressure and shutdown

- Queues are bounded and use `DEFAULT_IN_PROCESS_CHANNEL_CAPACITY` by default.
- Full queues return explicit overload behavior instead of unbounded growth.
- `shutdown()` performs a bounded graceful shutdown and then aborts if the timeout is exceeded.

If the client falls behind on event consumption, the worker emits
`InProcessServerEvent::Lagged` and may reject pending server requests so
approval flows do not hang indefinitely behind a saturated queue.
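
On the consumer side, the lag marker can be handled explicitly. A minimal sketch, assuming the same simplified event stand-in as above and a hypothetical skip count on `Lagged`:

```rust
use tokio::sync::mpsc;

// Simplified stand-in; the real InProcessServerEvent has more variants, and
// whether Lagged carries a skip count is an assumption here.
enum InProcessServerEvent {
    ServerRequest(String),
    ServerNotification(String),
    LegacyNotification(String),
    Lagged { skipped: u64 },
}

async fn consume(mut rx: mpsc::Receiver<InProcessServerEvent>) {
    while let Some(event) = rx.recv().await {
        match event {
            InProcessServerEvent::ServerNotification(text) => {
                // Render streamed output as it arrives.
                println!("{text}");
            }
            InProcessServerEvent::Lagged { skipped } => {
                // Best-effort events were dropped upstream; surface the gap
                // rather than silently showing an incomplete transcript.
                eprintln!("[{skipped} events dropped under backpressure]");
            }
            // Server requests and legacy notifications would be handled elsewhere.
            _ => {}
        }
    }
}
```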