Multimodal AI is anticipated to lead the next phase of AI development, with projections indicating its mainstream adoption by 2024. Unlike traditional AI systems limited to processing a single data type, multimodal models aim to mimic human perception by integrating various sensory inputs.
This approach acknowledges the multifaceted nature of the world, where humans interact with diverse data types simultaneously. Ultimately, the goal is to enable generative AI models to seamlessly handle combinations of text, images, videos, audio, and more.
Two methods for retrieval are explored:
- Use a multimodal embedding model to embed both text and images in a shared vector space (a sketch follows this list).
- Use a multimodal LLM to summarize images, then pass the summaries together with the text data to a text embedding model such as OpenAI's "text-embedding-3-small" (sketched at the end of this section).
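As a minimal sketch of the first method, the snippet below uses a CLIP checkpoint exposed through the sentence-transformers library to place text chunks and images in one shared vector space; the model name, example strings, and file paths are illustrative assumptions, and any multimodal embedding model with a comparable interface could be substituted.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP embeds text and images into the same vector space;
# "clip-ViT-B-32" is one publicly available checkpoint (illustrative choice).
model = SentenceTransformer("clip-ViT-B-32")

# Embed text chunks.
text_chunks = [
    "Red leather handbag with gold hardware",
    "Wireless noise-cancelling headphones",
]
text_embeddings = model.encode(text_chunks, normalize_embeddings=True)

# Embed images with the same model, so both modalities can live in one index.
image_paths = ["handbag.jpg", "headphones.jpg"]  # illustrative paths
image_embeddings = model.encode(
    [Image.open(p) for p in image_paths], normalize_embeddings=True
)

# A text query can now be matched against the image embeddings directly,
# which is what use cases like e-commerce search below rely on.
query = model.encode("brown bag for the office", normalize_embeddings=True)
scores = util.cos_sim(query, image_embeddings)[0]
print(image_paths[int(scores.argmax())], float(scores.max()))
```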
Use cases for method 1 (retrieval with a multimodal embedding model):
- E-commerce search: users can search for products using both text descriptions and images, enabling more accurate and intuitive searches.
- Content management systems: facilitates searching and categorizing multimedia content such as articles, images, and videos within a database.
- Visual question answering (VQA): enables systems to understand and respond to questions involving images by embedding both textual questions and visual content into the same vector space.
- Social media content retrieval: enhances the search experience by allowing users to find relevant posts based on text descriptions and associated images.
Multimodal Retrieval for RAG
Architecture in which image summaries and text chunks are embedded by a text embedding model (method 2).
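A minimal sketch of this second method, assuming the OpenAI Python client; the "gpt-4o" model name, prompt, and file paths are illustrative, and any multimodal LLM able to describe images could stand in for the summarization step.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def summarize_image(path: str) -> str:
    """Ask a multimodal LLM for a text summary of an image (illustrative prompt)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any multimodal LLM works for the summarization step
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image for retrieval purposes."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Summarize images, then embed the summaries together with the raw text chunks.
image_summaries = [summarize_image(p) for p in ["diagram.jpg", "photo.jpg"]]
text_chunks = ["Some text pulled from the source document."]

embeddings = client.embeddings.create(
    model="text-embedding-3-small",
    input=text_chunks + image_summaries,
)
vectors = [item.embedding for item in embeddings.data]
print(len(vectors), len(vectors[0]))  # number of vectors, embedding dimension
```

Because everything is ultimately text at embedding time, the resulting vectors can be stored in any ordinary vector index and retrieved with the same similarity search used for text-only RAG.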