Our goal is to enable generative AI models to seamlessly handle combinations of text, images, videos, audio, and more.

Let’s explore our second method in this segment:

  • Use a multimodal LLM to summarize images, then pass the summaries along with text data to a text embedding model such as OpenAI’s “text-embedding-3-small”

One way to simplify multimodal retrieval and RAG is to convert all data to text first. A single text embedding model can then place everything in one shared vector space. The trade-off is that summarizing non-text data incurs extra cost, whether done manually or with an LLM.
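Here is a minimal sketch of this pipeline using the official `openai` Python SDK. The summarizer model (“gpt-4o-mini”), prompt wording, and helper names are illustrative choices, not prescribed by the method:

```python
# Sketch: summarize an image with a multimodal LLM, then embed the summary
# alongside ordinary text with "text-embedding-3-small".
# Assumes the official `openai` Python SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def summarize_image(image_url: str) -> str:
    """Ask a vision-capable LLM for a dense text summary of an image."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail for search indexing."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed text chunks and image summaries into one shared vector space."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

# Image summaries and raw text now live in the same embedding space.
summary = summarize_image("https://example.com/product.jpg")
vectors = embed_texts([summary, "Red leather handbag with gold hardware."])
```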

Below are use cases Vriba has implemented with “text-embedding-3-small” in this summarize-then-embed multimodal pipeline (a minimal retrieval sketch follows the list):

  • E-commerce Search
  • Content Management Systems
  • Visual Question Answering (VQA)
  • Social Media Content Retrieval
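In each of these use cases, a query is embedded with the same model and matched against the stored vectors. As a sketch of that retrieval step, reusing the `embed_texts` helper from the earlier snippet (the in-memory index is illustrative; production systems would typically use a vector database):

```python
# Sketch: cosine-similarity search over embeddings from the pipeline above.
# `embed_texts` is the helper defined in the previous snippet.
import numpy as np

def search(query: str, docs: list[str],
           doc_vectors: list[list[float]], top_k: int = 3):
    """Return the top_k documents most similar to the query."""
    query_vec = np.array(embed_texts([query])[0])
    matrix = np.array(doc_vectors)
    # Cosine similarity: dot product divided by the vector norms.
    sims = matrix @ query_vec / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    )
    best = np.argsort(-sims)[:top_k]
    return [(docs[i], float(sims[i])) for i in best]
```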

Vriba Review of the “text-embedding-3-small” Embedding Model:

OpenAI’s text-embedding-3-small model has surpassed text-embedding-ada-002 in performance. Transitioning to text-embedding-3-small should improve semantic search quality, and for large content volumes its roughly 5x lower price delivers substantial savings.

To explore further insights on AI multimodal embedding models and the latest AI trends, please visit and subscribe to our website: Vriba Blog
