CLIP · ViT-L/14 · Foundation Model

Multimodal Foundation Model Overview

CLIP ViT-L/14 trained conceptually on LAION-5B · cross-modal alignment, zero-shot transfer & retrieval at scale.


GPU 78% · Loss 0.184 · Throughput 312/s

Foundation model · CLIP · 428M params

Cross-modal understanding, at billion-pair scale.

A unified representation space where images and text are embedded into one shared geometry, enabling retrieval, zero-shot classification, captioning, and rapid transfer to medical, fashion, agriculture, and wildlife domains.
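To make the shared-space idea concrete, here is a minimal sketch of zero-shot classification by cosine similarity. It assumes image and text embeddings have already been produced by the two encoders; the function names, the toy 3-dimensional vectors, and the temperature of 100 are illustrative assumptions, not values taken from this dashboard.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_embs, class_text_embs, temperature=100.0):
    """Return per-image softmax probabilities over the candidate classes.

    image_embs:      (n_images, dim) image embeddings
    class_text_embs: (n_classes, dim) embeddings of one text prompt per class
    temperature:     assumed logit scale; a hypothetical default
    """
    img = l2_normalize(np.asarray(image_embs, dtype=np.float64))
    txt = l2_normalize(np.asarray(class_text_embs, dtype=np.float64))
    logits = temperature * img @ txt.T            # cosine similarity, scaled
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Toy stand-ins for encoder outputs (hypothetical values).
class_text_embs = np.array([[1.0, 0.0, 0.0],   # e.g. "a photo of a cat"
                            [0.0, 1.0, 0.0]])  # e.g. "a photo of a dog"
image_embs = np.array([[0.9, 0.1, 0.0]])

probs = zero_shot_classify(image_embs, class_text_embs)
predicted = int(probs.argmax(axis=-1)[0])
```

No classifier weights are trained here: adding a class only requires embedding one more text prompt.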


5.85B image-text pairs

428M ViT-L/14 params

76.2% ImageNet zero-shot

Live training (streaming)

Contrastive loss: 0.184 · step 184k

Throughput: 312/s

GPU util: 78%

ETA: 04:12:48
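The page reports a contrastive loss value without defining it. As a minimal sketch, assuming the symmetric cross-entropy objective over an image-text similarity matrix that CLIP-style training uses, the loss for one batch could be computed as follows. The function names and the `logit_scale` default are assumptions for illustration.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_embs, text_embs, logit_scale=100.0):
    """Symmetric contrastive loss for N matched image-text pairs.

    Row i of image_embs is assumed to match row i of text_embs, so the
    correct "class" for each row and each column is the diagonal entry.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * img @ txt.T               # (N, N) similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Perfectly aligned pairs give a loss near zero; mismatched pairs do not.
aligned = clip_contrastive_loss(np.eye(4), np.eye(4))
shuffled = clip_contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
# With identical embeddings everywhere the loss equals log(N).
uniform = clip_contrastive_loss(np.ones((4, 8)), np.ones((4, 8)), logit_scale=1.0)
```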

Top-1 Accuracy: 76.2% (1.40% vs last epoch)

F1 Score: 0.831 (0.60% vs last epoch)

Embeddings Indexed: 48.2M (3.10% vs last epoch)

Inference Latency: 38ms (2.40% vs last epoch)

Capability profile

ViT-L/14 vs ViT-B/32


Quick actions

Jump into a module

Classify without retraining

Embeddings

t-SNE & PCA explorer
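As a rough illustration of what a PCA view of embeddings involves, the sketch below projects embedding vectors onto their top principal components using an SVD. The function name and the random stand-in data are assumptions; the explorer's actual implementation is not described on this page.

```python
import numpy as np

def pca_project(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project rows of `embeddings` onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Random vectors stand in for real image or text embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 16))
points_2d = pca_project(embeddings, n_components=2)
```

Each row of `points_2d` is a 2-D coordinate suitable for a scatter plot.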

Active pipeline

Tokenize → Encode → Project → Contrastive → Index
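The final Index stage of the pipeline can be sketched as a brute-force cosine-similarity index over normalized embeddings. This is a toy stand-in under stated assumptions: the `EmbeddingIndex` class, its methods, and the example file names are hypothetical, and an index holding tens of millions of vectors would normally use an approximate nearest-neighbor library instead.

```python
import numpy as np

class EmbeddingIndex:
    """Hypothetical in-memory index: stores unit vectors, searches by dot product."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.ids = []

    def add(self, ids, embeddings):
        """Normalize and append embeddings along with their identifiers."""
        embs = np.asarray(embeddings, dtype=np.float32)
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        self.vectors = np.vstack([self.vectors, embs])
        self.ids.extend(ids)

    def search(self, query, k: int = 3):
        """Return the k (id, cosine similarity) pairs closest to the query."""
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q
        top = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in top]

# Toy image embeddings, queried with a toy text embedding.
index = EmbeddingIndex(dim=3)
index.add(["cat.jpg", "dog.jpg", "car.jpg"],
          [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]])
results = index.search([1.0, 0.1, 0.0], k=2)
```

Because image and text embeddings share one space, the same `search` call serves text-to-image and image-to-image retrieval.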