CLIP · ViT-L/14 · Foundation Model
Multimodal Foundation Model Overview
CLIP ViT-L/14 trained conceptually on LAION-5B · cross-modal alignment, zero-shot transfer & retrieval at scale.
GPU 78%
Loss 0.184
Throughput 312/s
Foundation model · CLIP · 428M params
Cross-modal understanding, at billion-pair scale.
A unified representation space where pixels and language collapse into one geometry — enabling retrieval, zero-shot classification, captioning and rapid transfer to medical, fashion, agriculture and wildlife domains.
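As a rough illustration of how a shared embedding space supports zero-shot classification, the sketch below scores one image embedding against a set of text-prompt embeddings by cosine similarity and applies a softmax. The 3-dimensional vectors, the prompt strings, the function names, and the temperature of 0.01 are all assumptions made for the example; real CLIP embeddings come from the image and text encoders and have far more dimensions.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def zero_shot_classify(image_emb, label_embs, temperature=0.01):
    """Return a probability per label from temperature-scaled cosine similarities.

    `temperature` is an assumed value for this sketch, not one taken from the page.
    """
    labels = list(label_embs)
    logits = [cosine(image_emb, label_embs[label]) / temperature for label in labels]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy embeddings (hypothetical): each prompt stands in for a class name.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a tractor": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best)
```

Because the classes are defined only by text prompts, adding a new class in this sketch means adding one more prompt embedding; no weights are updated.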
5.85B image-text pairs
428M ViT-L/14 params
76.2% ImageNet zero-shot
Live training (streaming)
Contrastive loss: 0.184 · step 184k
Throughput: 312/s
GPU util: 78%
ETA: 04:12:48
Top-1 Accuracy: 76.2% (1.40% vs last epoch)
F1 Score: 0.831 (0.60% vs last epoch)
Embeddings Indexed: 48.2M (3.10% vs last epoch)
Inference Latency: 38ms (2.40% vs last epoch)
Capability profile
ViT-L/14 vs ViT-B/32
ViT-L/14 · ViT-B/32
Quick actions
Jump into a module
Image → Text: Caption + similarity
Text → Image: Semantic retrieval
Zero-Shot: Classify without retraining
Embeddings: t-SNE & PCA explorer
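The Text → Image module can be pictured as a nearest-neighbor search over stored image embeddings. The following is a minimal brute-force sketch under stated assumptions: the image IDs, the toy 3-dimensional vectors, and the helper names `build_index` and `search` are hypothetical, and the query vector stands in for the output of a text encoder.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_index(image_embs):
    """Store unit-normalized image embeddings keyed by image ID."""
    return {image_id: l2_normalize(vec) for image_id, vec in image_embs.items()}


def search(index, text_emb, k=2):
    """Return the k image IDs most similar to the text embedding, best first."""
    query = l2_normalize(text_emb)
    scored = (
        (sum(q * v for q, v in zip(query, vec)), image_id)
        for image_id, vec in index.items()
    )
    return [(image_id, score) for score, image_id in heapq.nlargest(k, scored)]


# Toy data (hypothetical IDs and vectors).
index = build_index({
    "img_001": [1.0, 0.0, 0.0],
    "img_002": [0.7, 0.7, 0.0],
    "img_003": [0.0, 0.0, 1.0],
})
hits = search(index, text_emb=[0.9, 0.3, 0.0], k=2)
for image_id, score in hits:
    print(image_id, round(score, 3))
```

This linear scan is only practical for small collections. At tens of millions of indexed embeddings, an approximate nearest-neighbor index would be the more likely choice, though the page does not say which one is used.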
Active pipeline
Tokenize → Encode → Project → Contrastive → Index
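For the Contrastive stage, a common formulation for CLIP-style training is a symmetric cross-entropy over the image-text similarity matrix, where matching pairs sit on the diagonal. The sketch below is written in plain Python for readability and is not taken from this pipeline; the function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy where pair i (image i, text i) is the positive."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should select its own column.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    transposed = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch (hypothetical): aligned pairs versus the same texts in swapped order.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
text_batch = [[0.9, 0.1], [0.1, 0.9]]
aligned = contrastive_loss(image_batch, text_batch)
shuffled = contrastive_loss(image_batch, list(reversed(text_batch)))
print(aligned < shuffled)
```

The loss falls as each image's embedding moves closer to its own caption than to the other captions in the batch, which is why the aligned batch scores lower than the shuffled one.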