Join us virtually this Wednesday as we crack open the magic behind modern multimodal embeddings and explain why they’re not just “text embeddings with pictures pasted on.”
Part I – Foundations
- Modality alignment 101 – How contrastive pre-training pulls text, images, and video frames into one joint space (sketched in code right after this list).
- Vector anatomy – Why pixel patches, temporal frame sequences, and word tokens all reduce to the same dot-product math.
- Evaluation metrics – What “recall,” “precision,” and “rankability” really mean once your queries aren’t just strings.
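To make the foundations concrete ahead of time, here is a minimal NumPy sketch of the ideas above: two modalities sharing one dot-product similarity matrix, a symmetric InfoNCE-style contrastive loss over it, and recall@k as the retrieval metric. The batch size, dimensionality, and random vectors are placeholders for illustration, not anything from the papers we’ll discuss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a batch of aligned pairs a real model would produce:
# text embeddings and image (or video-frame) embeddings projected to the same
# width. Both the batch size and the 512-dim width are placeholders.
text_emb = rng.normal(size=(8, 512))
image_emb = rng.normal(size=(8, 512))


def l2_normalize(x, eps=1e-12):
    """Unit-normalize rows so cosine similarity becomes a plain dot product."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


text_emb = l2_normalize(text_emb)
image_emb = l2_normalize(image_emb)

# One similarity matrix covers the modality pairing: entry (i, j) scores
# text i against image j. Patches, frames, and tokens all reduce to rows here.
sim = text_emb @ image_emb.T


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE-style loss: matching pairs sit on the diagonal and
    are pulled together; off-diagonal pairs are pushed apart."""
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    loss_t2i = -log_softmax(logits)[idx, idx].mean()
    loss_i2t = -log_softmax(logits.T)[idx, idx].mean()
    return (loss_t2i + loss_i2t) / 2


def recall_at_k(sim, k=1):
    """Fraction of queries whose true match appears in the top-k results."""
    top_k = np.argsort(-sim, axis=1)[:, :k]
    return float(np.mean([i in top_k[i] for i in range(sim.shape[0])]))


print("contrastive loss:", contrastive_loss(sim))
print("recall@1:", recall_at_k(sim, k=1))
```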
Part II – Deep Dive
We’ll explore two recent papers as our testbed for these concepts:
- JinaEmbed V4 – A single-path backbone unifying images, video, text, and code.
- On the Rankability of Visual Embeddings – Recovering numeric “more vs. less” axes with only two labeled examples.
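As a taste of the second paper’s theme, here is a hedged sketch of the general idea of recovering a “more vs. less” axis from just two labeled examples: take the direction between a “low” anchor and a “high” anchor and rank everything by its projection onto that direction. The embeddings, dimensions, and anchor choice below are made up for illustration and are not the paper’s exact recipe.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder image embeddings, e.g. from a frozen vision encoder.
embeddings = rng.normal(size=(100, 512))

# Two labeled anchors along the attribute of interest, e.g. an image showing
# "few" of something and an image showing "many" of it.
low_anchor, high_anchor = embeddings[0], embeddings[1]

# Candidate "more vs. less" axis: the direction from the low anchor to the
# high anchor, unit-normalized.
axis = high_anchor - low_anchor
axis /= np.linalg.norm(axis)

# Project every embedding onto the axis and sort by the scalar projection:
# the ordering is the recovered ranking along the attribute.
scores = embeddings @ axis
ranking = np.argsort(-scores)
print("top-10 'most' items:", ranking[:10])
```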
We’ll finish by surveying how to turn these insights into a two-stage reranking pipeline: dense k-NN retrieval for recall, followed by lightweight LoRA heads or simple linear probes for a precision uplift.
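To preview that closing discussion, here is a minimal sketch of such a two-stage pipeline, assuming a pre-trained joint space and a linear probe whose weights were fit offline on a few labeled pairs; the corpus, query, and probe weights below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder corpus and query embeddings from the same joint space.
corpus = rng.normal(size=(10_000, 512))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = rng.normal(size=512)
query /= np.linalg.norm(query)

# Stage 1 – recall: dense k-NN over the whole corpus by cosine similarity.
k = 100
first_stage_scores = corpus @ query
candidates = np.argsort(-first_stage_scores)[:k]

# Stage 2 – precision: rescore only the k candidates with a lightweight
# linear probe (a single weight vector w and bias b, assumed to have been
# trained offline; a LoRA head would slot in here as a heavier alternative).
w = rng.normal(size=512)  # placeholder for learned probe weights
b = 0.0
second_stage_scores = corpus[candidates] @ w + b
reranked = candidates[np.argsort(-second_stage_scores)]
print("final ranking (top 10):", reranked[:10])
```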