Pix2Struct is an image encoder - text decoder model that is trained on image-text pairs for various tasks, including image captioning and visual question answering.
This model is based on a multi-stage text-to-video generation diffusion model, which inputs a description text and returns a video that matches the text description.