We will cover the Pixtral multimodal model released by Mistral AI last week. Hopefully we can get it loaded up in Le Chat (French ChatGPT?) and kick the tires on it a bit. From what I am hearing from some of our group, it’s definitely something to look into.
From the Mistral announcement:
- Natively multimodal, trained with interleaved image and text data
- Strong performance on multimodal tasks, excels in instruction following
- Maintains state-of-the-art performance on text-only benchmarks
- Architecture:
  - New 400M parameter vision encoder trained from scratch
  - 12B parameter multimodal decoder based on Mistral Nemo
  - Supports variable image sizes and aspect ratios
  - Supports multiple images in the long context window of 128k tokens
- Use:
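A minimal sketch of trying Pixtral through Mistral's hosted API with the `mistralai` Python client. The model name `pixtral-12b-2409`, the image-URL message format, and the placeholder image URL are assumptions based on Mistral's documentation at release; adjust to whatever the current docs say.

```python
# Sketch: send text + an image URL to Pixtral via Mistral's chat API.
# Assumes `pip install mistralai` and a MISTRAL_API_KEY env var;
# the model name and image URL below are assumptions, not verified here.
import os


def build_image_message(prompt: str, image_url: str) -> dict:
    """Build one user message mixing a text part and an image-URL part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": image_url},
        ],
    }


if os.environ.get("MISTRAL_API_KEY"):
    from mistralai import Mistral

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    resp = client.chat.complete(
        model="pixtral-12b-2409",  # assumed model identifier
        messages=[
            build_image_message(
                "Describe this image in one sentence.",
                "https://example.com/photo.jpg",  # placeholder URL
            )
        ],
    )
    print(resp.choices[0].message.content)
```

Since Pixtral supports multiple images in its 128k context, you could also append several `image_url` parts to the same `content` list and ask a question spanning all of them.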