Can someone smarter than me explain what this is about?
magicalhippo 2 hours ago
Skimming through the paper, here's my take.
Someone previously found that the cross-attention layers in text-to-image diffusion models capture the correlation between the input text tokens and the corresponding image regions, so one can use them to segment the image: the pixels containing "cat", for example. However, this segmentation was rather coarse. The authors of this paper found that also using the self-attention layers leads to a much more detailed segmentation.
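Roughly, I picture the refinement like this; a toy numpy sketch with random tensors standing in for the real attention maps (my guess at the mechanism, not code from the paper):

    import numpy as np

    # Toy stand-ins: a 32x32 latent grid and 8 text tokens. In the real model
    # these maps come out of the UNet's attention layers; here they're random.
    hw, n_tokens = 32 * 32, 8
    rng = np.random.default_rng(0)
    cross_attn = rng.random((hw, n_tokens))             # pixel -> text-token relevance
    self_attn = rng.random((hw, hw))
    self_attn /= self_attn.sum(axis=-1, keepdims=True)  # row-normalized, like a softmax

    # Coarse mask: how strongly each latent pixel attends to, say, the "cat" token.
    cat_token = 3
    coarse = cross_attn[:, cat_token]

    # Refinement: push the coarse map through the self-attention affinities a few
    # times, so pixels that attend to each other end up with similar scores.
    refined = np.linalg.matrix_power(self_attn, 4) @ coarse

    mask = (refined > refined.mean()).reshape(32, 32)   # crude threshold, just for illustration

The appealing part is that both maps fall out of an ordinary forward pass of the model, no extra training involved.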
They then extend this to video by using the self-attention between two consecutive frames to determine how the segmentation changes from one frame to the next.
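So the cross-frame attention acts like a soft correspondence matrix, and the previous frame's mask can be carried forward with a matrix product. A toy sketch of that bookkeeping, again with made-up shapes and random values:

    import numpy as np

    hw = 32 * 32
    rng = np.random.default_rng(0)

    # Hypothetical cross-frame attention: for each latent pixel of frame t, a
    # distribution over the pixels of frame t-1 (queries from the current frame,
    # keys from the previous one). Random here, just to show the mechanics.
    affinity = rng.random((hw, hw))
    affinity /= affinity.sum(axis=-1, keepdims=True)

    prev_mask = np.zeros(hw)
    prev_mask[:200] = 1.0                  # stand-in for the frame t-1 segmentation

    # Each pixel of frame t inherits a soft label from the frame t-1 pixels it
    # attends to; threshold to get the new hard mask.
    soft = affinity @ prev_mask
    curr_mask = soft > soft.mean()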
Now, text-to-image diffusion models require a text input to generate an image in the first place. From what I can gather, they limit themselves to semi-supervised video object segmentation, meaning the first frame is already segmented, say by a human or some other process.
They then run an "inversion" procedure which tries to generate text that makes the text-to-image diffusion model segment the first frame as closely as possible to the provided segmentation.
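Something like textual inversion, I'd guess: gradient descent on a token embedding so that the resulting attention map matches the given mask. A toy version where a random projection stands in for the actual diffusion model (the function name and shapes are mine, not the paper's):

    import torch

    # Stand-in for "run the diffusion model and read out the cross-attention map
    # for the learned token". In reality this is a UNet forward pass; here it's a
    # fixed random projection, just so the optimization loop actually runs.
    torch.manual_seed(0)
    hw, dim = 32 * 32, 64
    readout = torch.randn(hw, dim) * 0.1

    def cross_attention_map(token_embedding):
        return torch.sigmoid(readout @ token_embedding)

    target_mask = torch.zeros(hw)
    target_mask[:200] = 1.0               # the given first-frame segmentation

    # Optimize a token embedding so the model's attention map matches the mask.
    token_embedding = torch.zeros(dim, requires_grad=True)
    opt = torch.optim.Adam([token_embedding], lr=1e-1)

    for _ in range(200):
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy(
            cross_attention_map(token_embedding), target_mask)
        loss.backward()
        opt.step()

    # token_embedding now plays the role of the "text" that selects the object.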
With the text in hand, they can then run the earlier segmentation propagation steps to track the segmented object throughout the video.
The key here is that the text-to-image diffusion model is pretrained, and not fine-tuned for this task.
That said, I'm no expert.
jacquesm 1 hour ago
For a 'not an expert' explanation you did a better job than the original paper.
Kalabint 3 hours ago
> Can someone smarter than me explain what this is about?
I think you can find the answer under point 3:
> In this work, our primary goal is to show that pretrained text-to-image diffusion models can be repurposed as object trackers without task-specific finetuning.
Meaning that you can track objects in videos without using specialised ML models trained for video object tracking.
echelon 2 hours ago
All of these emergent properties of image and video models lead me to believe that the evolution of animal intelligence around motility and visual understanding of the physical environment might be "easy" relative to other "hard steps".
The more complex an eye gets, the more the brain evolves not just to handle the physics and chemistry of optics, but also rich feature sets: predator/prey labels, tracking, movement, self-localization, distance, etc.
These might not be separate things. These things might just come "for free".
jacquesm 1 hour ago
There is a massive amount of pre-processing already done in the retina itself and in the LGN:
https://en.wikipedia.org/wiki/Lateral_geniculate_nucleus
So the brain does not necessarily receive 'raw' images to process to begin with; a lot of high-level data has already been extracted at that point, such as optical flow to detect moving objects.
DrierCycle 20 minutes ago
And the occipital cortex is built around extraordinary levels of image separation: the input is broken down into tiny areas, scattered and woven back together for details of motion, gradient, contrast, etc.
Mkengin 48 minutes ago
Interesting. So similar to the vision encoder + projector in VLMs?
fxtentacle 1 hour ago
I wouldn't call these properties "emergent".
If you train a system to memorize A-B pairs and then normally use it to find B when given A, it's not surprising that finding A when given B also works: you trained it in an almost symmetrical fashion on A-B pairs, which are, obviously, also B-A pairs.