Have you seen LLaVA?
The Large Language and Vision Assistant is a multimodal (image & text) #ai model.
It's an open-source approach to visual & language prompting, combining an #ml vision encoder & a large language model (#Vicuna, which is fine-tuned from #LLaMA) #llm.
It's surprisingly good!
🧵1/n