ComfyStream Input / Output

I’ve been hearing more requests from the community to expand beyond the image-to-image constraint of ComfyStream - for example, when the input is a stream of text (conversational avatar use case) or a stream of audio (live audio-to-video generation).

I’m curious to hear feedback about this expanded scope for ComfyStream: a developer toolkit for real-time, media-centric AI. It should ideally cover streaming workflows on either the input or output side. This can be generalized by the type of data in the stream and the protocol carrying it. Some examples:

  • Conversational avatar: text / audio / video stream input, audio / video stream output.
  • Live video monitoring: video stream input, data (json, text, etc) stream output.
  • “Director’s stream” for sports: multiple video streams input, single video stream output.

In all of the above examples, a core “stream management engine” seems important. It would manage the muxing of media based on rulesets. For the AI avatar example, it would loop a pre-recorded “idle video” of the avatar and quickly mux in the “speaking video” when it becomes available. For the sports example, it would switch between 6 input video streams based on real-time rules from an AI, creating a single output video stream.
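As a rough illustration of the ruleset idea, here is a minimal sketch of such a muxer for the avatar case (all names are illustrative, not an actual ComfyStream API): it prefers frames from a priority source, the “speaking video,” and falls back to looping the pre-recorded idle clip when none are available.

```python
from collections import deque

class StreamMuxer:
    """Minimal sketch of a rule-based stream muxer (hypothetical names).

    Rule: mux in the "speaking" frames as soon as they are available,
    otherwise keep looping the pre-recorded idle clip.
    """

    def __init__(self, idle_frames):
        self.idle_frames = list(idle_frames)  # pre-recorded idle loop
        self._idle_pos = 0
        self._priority = deque()              # live "speaking" frames

    def push_priority(self, frame):
        self._priority.append(frame)

    def next_frame(self):
        if self._priority:
            return self._priority.popleft()
        frame = self.idle_frames[self._idle_pos]
        self._idle_pos = (self._idle_pos + 1) % len(self.idle_frames)
        return frame
```

The sports case would be the same shape with N input queues and an AI-driven rule deciding which queue `next_frame` reads from.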

PS - this can also address the requirement to intelligently skip frames when the inference engine cannot keep up with the input FPS in the current image-to-image world.
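The frame-skipping behavior can be sketched as a drop-oldest buffer (again just an illustrative sketch, not ComfyStream code): producers push at input FPS, and the slower inference consumer always reads the most recent frame while stale ones are silently discarded.

```python
from collections import deque

class LatestFrameQueue:
    """Sketch of a drop-oldest frame buffer.

    With maxlen=1, pushing at input FPS while inference reads more slowly
    means stale frames fall off the left and are skipped automatically.
    """

    def __init__(self, maxlen: int = 1):
        self._buf = deque(maxlen=maxlen)

    def push(self, frame):
        self._buf.append(frame)  # evicts the oldest frame when full

    def pop_latest(self):
        return self._buf.pop() if self._buf else None
```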


My belief is that there is an extremely high likelihood that the various modalities of media-centric AI will converge, and that a unimodal tool will rapidly become limiting. In fact, ComfyStream has already become multimodal: the recent additions of text prompt and JSON submissions are a first step in this direction.

In terms of core terminology and strategic approach, I could imagine something like the following:

Feeds
Feeds are the core data unit in the ComfyStream system. They can be:

  • Audio
  • Video
  • Structured token data
  • Unstructured / abstract token data

They can be input or output. Input-to-output mapping is 1:m, m:1, or m:m.

When interacting with the hosted service, users could configure feeds independently from Pipelines, and use those feeds in the context of one or more Pipelines.
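To make the Feed terminology concrete, here is one possible data model, purely a sketch with hypothetical names: feeds carry a kind and a direction, and a Pipeline holds lists on both sides so that 1:m, m:1, and m:m mappings all fall out naturally.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List

class FeedKind(Enum):
    AUDIO = auto()
    VIDEO = auto()
    STRUCTURED = auto()    # e.g. JSON token data
    UNSTRUCTURED = auto()  # abstract token data

class Direction(Enum):
    INPUT = auto()
    OUTPUT = auto()

@dataclass
class Feed:
    name: str
    kind: FeedKind
    direction: Direction

@dataclass
class Pipeline:
    # Both sides are lists, so input-to-output mapping
    # can be 1:m, m:1, or m:m.
    inputs: List[Feed]
    outputs: List[Feed]
```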

Workflows
Workflows are the core primitive of ComfyUI. They interact with one or more feeds, and execute AI tasks against those feed(s) according to the graph provided.


I’ll keep this brief, as Hunter has already raised some excellent points. I fully agree with his perspective—restricting to video and audio feels too limiting for future growth. Over the past month, my team has been developing several workflows that rely on a JSON data stream, which this restriction would impede. We’re more than willing to help move this initiative forward.


It is also important to take the view of “data transformation”, which defines how a continuous data stream (video, audio, text, or other formats) will be dealt with.

For example, data is rarely processed at the macro level of audio, video, or other container formats; close to the pipeline, everything is transformed into tensors. Working with the community to learn their use cases, and identifying a universal data structure that is semantically complete enough to capture the data and its processing inter-relationships, will be the key.

For example, video is currently broken into segments for transport, then into frames (pictures), before being transformed into image tensors in a buffer for processing. Audio needs a minimum segment length, such as a 1-2 s audio track, and text needs a natural break for embedding. Some video models need 1-5 s of video to work, such as the video understanding model Qwen. How to adapt to these pipeline needs while keeping data integrity across pre-processing, processing, post-processing, and transport, not to mention programmability, will be important both architecturally and product-wise.

What is the smallest unit that most pipelines need to process, such that everything else can be a composite of that atomic unit? That will be the important question to answer. It also defines the smallest unit of the data interface, such as a datagram, and it helps make the system more reliable because it defines the atomic transaction size: an operation either succeeds or fails at the atomic-unit level.
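One way to picture the atomic-unit idea (a sketch under my own assumptions, with hypothetical names): a small timestamped unit carrying a payload, plus a helper that groups units into the fixed-length segments different pipelines need, e.g. 2 s of audio chunks or 1-5 s of video frames.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass(frozen=True)
class DataUnit:
    """Hypothetical atomic unit of the data interface.

    Larger structures (a 2 s audio segment, a 5 s video clip) are
    composites of these units, and each unit is the granularity at
    which a transaction succeeds or fails.
    """
    kind: str      # "video_frame", "audio_chunk", "text_token", ...
    pts: float     # presentation timestamp, in seconds
    payload: Any   # e.g. a tensor, close to the pipeline

def segment(units: List[DataUnit], seconds: float) -> List[List[DataUnit]]:
    """Group atomic units into fixed-length segments by timestamp."""
    segments: List[List[DataUnit]] = []
    current: List[DataUnit] = []
    start = None
    for u in units:
        if start is None:
            start = u.pts
        if u.pts - start >= seconds:
            segments.append(current)
            current, start = [], u.pts
        current.append(u)
    if current:
        segments.append(current)
    return segments
```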
