ByteDance dropped a video DepthAnything model last week. The paper mentions temporal consistency issues with DepthAnything 2 and improves on it with “an efficient spatial-temporal head”. This raises the question: is the temporal inconsistency we currently see in the diffusion output partially introduced by inconsistency in the depth map?
A potential way to examine this is to compare the results of the DepthAnythingV2-Tensorrt model with the video DepthAnything model, and see whether the video model’s more temporally consistent depth maps produce more temporally consistent diffusion output.
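Before doing full diffusion comparisons, a quick sanity check is to measure frame-to-frame stability of the depth maps themselves. The sketch below is only illustrative: `estimate_depth_v2` and `estimate_depth_video` are hypothetical placeholders for the two inference pipelines, and the naive metric doesn’t separate flicker from real scene motion (warping frame t to t+1 with optical flow before differencing would be more faithful).

```python
# Minimal sketch: compare temporal stability of depth sequences from two models.
# estimate_depth_v2 / estimate_depth_video are hypothetical placeholders for
# per-frame DepthAnythingV2-TensorRT inference and Video DepthAnything inference.
import numpy as np

def normalize(depth: np.ndarray) -> np.ndarray:
    """Scale a depth map to [0, 1] so scores are comparable across models."""
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / (d_max - d_min + 1e-8)

def temporal_inconsistency(depth_frames: list[np.ndarray]) -> float:
    """Mean absolute difference between consecutive depth frames (lower = more stable)."""
    diffs = [
        np.abs(normalize(curr) - normalize(prev)).mean()
        for prev, curr in zip(depth_frames, depth_frames[1:])
    ]
    return float(np.mean(diffs))

# Usage sketch: run both models over the same clip, compare scores, then feed
# each depth sequence to the diffusion pipeline and compare output flicker.
# frames = load_video_frames("clip.mp4")            # hypothetical loader
# v2_depths = [estimate_depth_v2(f) for f in frames]
# video_depths = estimate_depth_video(frames)       # processes the clip jointly
# print(temporal_inconsistency(v2_depths), temporal_inconsistency(video_depths))
```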
ps - the paper does mention that it currently struggles with streaming workflows, which is left for future work. However, if inference is fast enough on short segments of video, it can still be useful for one-to-many live video use cases, at the cost of 1-2 seconds of additional delay.
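For the one-to-many case, that could look roughly like buffering incoming frames into short segments and running the video model once per segment, trading ~1-2 seconds of latency for segment-level consistency. A minimal sketch, assuming a hypothetical `estimate_depth_video(frames)` that consumes a whole segment and a `publish(frame, depth)` hand-off into the downstream diffusion pipeline:

```python
# Rough sketch of segment-based processing for a one-to-many live stream.
# frame_source, estimate_depth_video, and publish are assumed callables,
# not part of any real API.
SEGMENT_FRAMES = 30  # ~1-2 s at 15-30 fps; this is the added end-to-end delay

def run_segmented(frame_source, estimate_depth_video, publish):
    buffer = []
    for frame in frame_source:                       # iterator over incoming frames
        buffer.append(frame)
        if len(buffer) == SEGMENT_FRAMES:
            depths = estimate_depth_video(buffer)    # joint depth for the whole segment
            for f, d in zip(buffer, depths):
                publish(f, d)                        # hand off to the diffusion pipeline
            buffer.clear()
    if buffer:                                       # flush a trailing partial segment
        for f, d in zip(buffer, estimate_depth_video(buffer)):
            publish(f, d)
```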