Abstract
Identifying trading cards from photographs is a significant challenge: the Pokémon TCG alone spans over 17,000 unique cards across 121 sets and nearly three decades of print history. Existing commercial solutions require transmitting user images to remote servers, raising privacy concerns and introducing connectivity dependencies. This paper presents PokVault Rev (reverse image search), a fully client-side visual search system that performs end-to-end card identification entirely within the web browser. The system combines classical computer vision for card localization and perspective correction with CLIP ViT-B/16 vision embeddings and a brute-force cosine similarity search over a pre-computed database of 17,531 card embeddings stored in IndexedDB. An adaptive uint8 quantization scheme compresses the entire database to 9.2 MB while preserving ranking fidelity. No images leave the user's device. We detail the system architecture, multi-strategy corner detection, quantization-aware search, and offline build pipeline, positioning PokVault Rev against commercial and academic alternatives.
Keywords: visual search · on-device inference · CLIP · trading card recognition · browser-based machine learning · WebAssembly · privacy-preserving AI
1. Introduction
The Pokémon Trading Card Game (TCG), first released in 1996, has grown into a global collectibles market. With thousands of unique cards spanning over 120 expansion sets, identifying a specific card from a photograph is a non-trivial visual recognition task. Cards within the same set often share nearly identical borders, layouts, and color palettes, differing only in the character illustration, card number, or subtle variant indicators such as holofoil treatments. This fine-grained visual similarity makes card identification challenging for both human collectors and automated systems.
Several commercial and academic approaches have been proposed for trading card recognition. TCGPlayer's Roca Vision system[1] achieves approximately 98% identification accuracy across supported product lines but operates as a proprietary server-side pipeline, requiring users to upload card images to remote infrastructure. Ximilar's Visual AI for Collectibles[2] provides a REST API capable of identifying cards from over 15 trading card games, but similarly requires cloud-based inference at approximately one second per query. The Collectors.com engineering team[3] explored classical feature descriptors (SIFT, SURF, ORB) for card identification in their grading pipeline, but reported that search quality degraded as the database grew, and selecting meaningful keypoints proved an unexpected challenge at scale.
Academic work has addressed related problems. Pua[4] demonstrated real-time Pokémon card detection from tournament footage using object detection models. Nahar et al.[5] proposed DeepCornerNet for automated corner grading of trading cards using deep learning-based defect identification. Several open-source projects[6,7,8] have employed perceptual hashing (pHash, dHash, wHash) with Hamming distance for card matching, but these approaches are brittle under perspective distortion, varying lighting conditions, and the presence of foil or holographic treatments that fundamentally alter the card's visual appearance at capture time.
This paper presents PokVault Rev (reverse image search), a system that addresses three limitations of the existing landscape simultaneously:
- Privacy: All computation occurs on the user's device. No card images are transmitted to any server. The only network traffic is the one-time download of pre-computed embedding shards.
- Training-free retrieval: By leveraging CLIP[9], a vision-language model trained on 400 million image-text pairs, the system achieves robust visual similarity matching without any task-specific fine-tuning. While the reference database is built from known card images, the query-side model requires no domain adaptation.
- Minimal storage footprint: An adaptive uint8 quantization scheme compresses the full 17,531-card embedding database to 9.2 MB, suitable for IndexedDB storage in the browser.
In addition to addressing these limitations, the paper makes the following contributions:
- A complete, production-deployed, end-to-end visual search pipeline running entirely in the browser via WebAssembly, from image capture through card identification.
- A multi-strategy classical computer vision algorithm for card corner detection and perspective correction, encompassing contour-based oriented bounding boxes, Hough line detection, and saturation-based segmentation.
- An adaptive quantization scheme for CLIP embeddings that preserves cosine similarity ranking while achieving 7.8× compression from float64 to uint8 (516 bytes per card including integrity hash).
- A comparative analysis positioning this approach against commercial and academic alternatives.
The remainder of this paper is organized as follows. Section 2 surveys related work in card recognition, visual search, and on-device inference. Section 3 describes the system architecture and each pipeline stage in detail. Section 4 covers implementation specifics. Section 5 presents system characteristics and comparative positioning. Section 6 discusses limitations and future work. Section 7 concludes.
2. Related Work
2.1 Commercial Card Recognition
TCGPlayer Roca Vision. TCGPlayer acquired Roca Robotics in 2021 to power both physical card sorting hardware and their mobile Scan & Identify feature[1]. Roca Vision reports ~99.9% confidence at the “Good” tier and ~98.94% at “Fair,” covering approximately 98% of scans. The system is notable for its transparency about confidence levels, distinguishing between Good, Fair, Poor, and Manual (user-corrected) matches. However, the technology operates entirely server-side, and the computer vision pipeline's architecture has not been publicly documented beyond marketing materials. A known limitation is that non-English cards can only be identified if an English version with identical artwork exists, suggesting the system relies heavily on visual matching against a reference database rather than OCR-based identification.
Ximilar Visual AI. Ximilar[2] offers a commercial REST API for trading card identification across Pokémon, Magic: The Gathering, Yu-Gi-Oh!, and other TCGs. The system combines object detection for card localization with recognition for attribute extraction (name, set, number, rarity, language, foil status). Results include marketplace pricing links. The API processes requests in approximately one second, but no quantified accuracy metrics have been publicly disclosed. The architecture relies on cloud-based inference, and each API call requires transmitting the card image.
Collectors.com. The Collectors.com engineering team[3] documented their exploration of classical feature descriptors for automating card identification in their grading pipeline. They evaluated Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB). While ORB showed initial promise due to its rotation invariance and computational efficiency, the team reported that search quality degraded as the reference database grew, highlighting a fundamental scalability limitation of local feature descriptor approaches for large card catalogs.
2.2 Academic Approaches
Object Detection. Pua[4] applied fine-tuned object detection models to identify Pokémon cards in tournament footage, framing the problem as a detection task where each unique card is a class. This approach requires retraining when new cards are released and faces class imbalance challenges given the long-tailed distribution of card appearances. YOLOv8 and YOLOv9 architectures[10,11] have been applied to playing card detection more broadly, with YOLOv9 introducing Programmable Gradient Information (PGI) and Generalized Efficient Layer Aggregation Network (GELAN) for improved feature learning.
Automated Grading. Nahar et al.[5] proposed DeepCornerNet for automated corner condition assessment in trading cards, addressing one of the four primary grading criteria (corners, edges, centering, surface). A related system by crimsonthinker[6] used VGG16 with U-Net preprocessing for card content extraction, achieving approximately 0.5 grade-point average deviation from professional PSA grades. These works address card quality assessment rather than identification but demonstrate the maturity of deep learning applications in the trading card domain.
Perceptual Hashing. Multiple open-source projects[7,8] have employed perceptual hashing for card matching. The Pokemon-TCGP-Card-Scanner[7] uses a 24-bit perceptual hash with precomputed hashes stored in IndexedDB. The tcg-scanner project[8] uses difference hashing with Hamming distance. While computationally efficient, perceptual hashes are designed for near-duplicate detection and are brittle under the geometric and photometric transformations typical of real-world card photography.
2.3 CLIP and Zero-Shot Visual Search
Radford et al.[9] introduced CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs that learns a joint embedding space for images and text. CLIP's vision encoder produces embeddings where visual similarity is captured by cosine similarity in the latent space, enabling zero-shot transfer to downstream tasks without fine-tuning. The ViT-B/16 variant uses a Vision Transformer with 16x16 patch size, producing 512-dimensional embeddings.
While CLIP excels at broad semantic similarity, it has known limitations in fine-grained classification[12]. Our system leverages CLIP's strength — robust visual similarity across diverse imaging conditions — rather than requiring it to distinguish between visually near-identical variants, a task better suited to metadata refinement in post-processing.
Recent work has explored binary quantization of CLIP embeddings for retrieval efficiency[13], demonstrating that test-time binary quantization impairs performance significantly, but training-aware binarization can retain 87–93% of float32 retrieval quality. Our approach takes a middle path with uint8 quantization, retaining greater precision than binary while achieving substantial compression.
2.4 Browser-Based Machine Learning
The emergence of WebAssembly (Wasm) and frameworks such as ONNX Runtime Web[14] and Hugging Face Transformers.js[15] has enabled neural network inference directly in the browser. Transformers.js converts Hugging Face models to ONNX format and executes them through ONNX Runtime Web's Wasm backend. Benchmarks indicate that the Wasm SIMD path achieves 8–12ms inference for small embedding models on Apple M2 hardware, while WebGPU can provide 7–64× speedups for larger models[16].
The MobiSys 2024 paper by Jin et al.[17] presented just-in-time kernel optimizations for in-browser deep learning inference on edge devices, demonstrating that browser-based inference is approaching native performance for vision tasks. This technological maturation makes fully client-side visual search systems practical for the first time.
2.5 Summary and Gap
Existing commercial systems (Roca Vision, Ximilar) achieve high accuracy but require server-side infrastructure, sacrificing user privacy and mandating network connectivity. Academic and open-source approaches based on perceptual hashing run on-device but lack robustness under real-world imaging conditions. Detection-based methods (YOLOv8[10], YOLOv9[11]) require retraining for each new card release. No existing system combines neural network-based visual similarity, on-device privacy, and training-free extensibility to new card sets. PokVault Rev addresses this gap.
3. System Architecture
PokVault Rev operates as a multi-stage pipeline with two distinct phases: an offline embedding build phase executed before deployment, and a runtime inference phase executed entirely in the user's browser. Figure 1 illustrates the complete pipeline architecture.
3.1 Offline Embedding Computation
The offline build pipeline executes four scripts sequentially. First, build-catalog reads all 121 set definition files and produces a unified catalog of 17,531 cards with associated metadata. Second, download-images fetches thumbnail-resolution card images from the Pokémon TCG API. Third, compute-embeddings processes each image through the same quantized CLIP ViT-B/16 model used in the browser:
- Each card image is loaded as a RawImage and processed through the CLIP vision processor.
- The CLIPVisionModelWithProjection model outputs 512-dimensional image_embeds.
- Embeddings are L2-normalized to unit vectors using Float64 precision.
Quantization proceeds in two phases. Phase 1 computes all embeddings and determines the global maximum absolute value across all dimensions and all cards. Phase 2 quantizes each float value to uint8 using an adaptive zero-point scheme:
quantScale = floor(127 / globalMaxAbs)
byte = clamp(0, 255, round(value × quantScale + 128))
The zero-point is 128, mapping the range [-globalMaxAbs, +globalMaxAbs] to [1, 255]. For the current database, globalMaxAbs yields a quantScale of 174, providing approximately 146 effective quantization levels per dimension. Each card's embedding is stored as a 516-byte record: a 4-byte hash of the card ID (using a Java String.hashCode()-style function with multiplier 31, for integrity verification) followed by 512 uint8 values.
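The two-phase scheme can be sketched in a few lines of JavaScript (an illustrative sketch; the function names are ours, not the build script's):

```javascript
// Phase 1: find the global maximum absolute value across all embeddings.
function globalMaxAbs(embeddings) {
  let maxAbs = 0;
  for (const vec of embeddings) {
    for (const v of vec) maxAbs = Math.max(maxAbs, Math.abs(v));
  }
  return maxAbs;
}

// Phase 2: quantize one L2-normalized embedding to uint8 with zero-point 128.
function quantize(vec, quantScale) {
  const out = new Uint8Array(vec.length);
  for (let i = 0; i < vec.length; i++) {
    const byte = Math.round(vec[i] * quantScale + 128);
    out[i] = Math.min(255, Math.max(0, byte)); // clamp to [0, 255]
  }
  return out;
}

// Dequantization needs only the zero-point: (byte - 128) = value * quantScale,
// and the scale factor cancels later in the cosine similarity computation.
function dequantize(bytes) {
  const out = new Float32Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) out[i] = bytes[i] - 128;
  return out;
}
```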
Records are grouped into per-set binary shards prefixed with a uint32 card count. Alongside each shard, a JSON map file stores the ordered array of card IDs. A manifest file records per-set metadata (count, file size, checksum) and global parameters (model version, dimensions, quantization scheme, scale factor).
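A reader for this record layout can be sketched as follows. The byte order is an assumption (the text does not specify it), and the hash function is our reconstruction of the Java String.hashCode()-style scheme described above:

```javascript
// Java String.hashCode()-style hash (multiplier 31, 32-bit wraparound via
// Math.imul), used as the 4-byte per-record integrity check.
function hashCardId(id) {
  let h = 0;
  for (let i = 0; i < id.length; i++) {
    h = (Math.imul(h, 31) + id.charCodeAt(i)) | 0;
  }
  return h;
}

// Parse a shard: a uint32 card count, then count records of
// 4-byte hash + 512 uint8 embedding values (516 bytes each).
function parseShard(buffer, dims = 512) {
  const view = new DataView(buffer);
  const count = view.getUint32(0, true); // assumption: little-endian fields
  const records = [];
  let offset = 4;
  for (let i = 0; i < count; i++) {
    const hash = view.getInt32(offset, true);
    const embedding = new Uint8Array(buffer, offset + 4, dims);
    records.push({ hash, embedding });
    offset += 4 + dims;
  }
  return records;
}
```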
3.2 Card Corner Detection
The corner detection subsystem (cardScanCrop.js, ~1,800 lines) implements a cascade of four strategies executed in priority order, with the first successful detection returned.
Preprocessing. The input image is first checked for aspect-ratio match to a standard Pokémon card (width/height ratio of 0.714 ± 0.025). If matched and the image is below 2000px, the card is assumed to fill the frame, and a 2% inset rectangle is returned immediately. Otherwise, the image is downscaled to a maximum dimension of 400px and processed through: (a) grayscale conversion using ITU-R BT.601 luminance weights (0.299R + 0.587G + 0.114B), (b) 5×5 Gaussian blur with σ ≈ 1.0, and (c) Sobel edge magnitude computation with 3×3 kernels (Figure 2).
Strategy 1: Contour-Based OBB. A multi-threshold Canny edge detector evaluates four pairs of hysteresis thresholds at different percentiles of the edge magnitude distribution ([70/88], [60/82], [80/93], [50/75]). For each threshold pair, the algorithm performs morphological closing with a diamond-shaped 5×5 structuring element, then flood-fills from image borders to identify the exterior region. The largest interior connected component meeting card-shape criteria (scored by size × rectangularity² × aspect closeness to 0.714) is selected. Its boundary pixels are extracted, the convex hull is computed using Andrew's monotone chain algorithm, and the minimum-area oriented bounding box (OBB) is determined via rotating calipers.
Strategy 2: Hough Line Detection. If contour analysis fails, a Hough transform is applied to Canny edges with 180 angular bins and rho resolution spanning the image diagonal. Peak extraction with 5-pixel non-maximum suppression identifies dominant lines, which are classified as horizontal or vertical (within ±25°). Candidate rectangles are enumerated from all pairs of horizontal and vertical lines and scored by a weighted combination: aspect proximity to 0.714 (15%), enclosed area (35%), center proximity (20%), and edge pixel coverage along the perimeter (30%).
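A hypothetical scoring function using these weights might look like the following; only the weights and the target aspect ratio come from the deployed system, the sub-score formulas are illustrative:

```javascript
// Score a candidate axis-aligned rectangle {x, y, w, h} against the image
// frame. edgeCoverage is the fraction of perimeter supported by edge pixels,
// computed upstream; the other sub-scores are derived here.
function scoreRect(rect, imgW, imgH, edgeCoverage) {
  const TARGET_ASPECT = 0.714; // standard card width/height ratio
  const aspect = rect.w / rect.h;
  const aspectScore = Math.max(0, 1 - Math.abs(aspect - TARGET_ASPECT) / TARGET_ASPECT);
  const areaScore = (rect.w * rect.h) / (imgW * imgH);
  const cx = rect.x + rect.w / 2, cy = rect.y + rect.h / 2;
  const centerDist = Math.hypot(cx - imgW / 2, cy - imgH / 2);
  const centerScore = Math.max(0, 1 - centerDist / (Math.hypot(imgW, imgH) / 2));
  // Weighted combination: aspect 15%, area 35%, center 20%, coverage 30%.
  return 0.15 * aspectScore + 0.35 * areaScore + 0.20 * centerScore + 0.30 * edgeCoverage;
}
```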
Strategy 3: Saturation-Based Detection. As a fallback for complex backgrounds where edge-based methods fail, the HSV saturation channel is extracted and binarized using Otsu's automatic thresholding. After Gaussian blur and morphological closing with a circular kernel (radius 5), the largest connected component is found via BFS. The convex hull and OBB are computed as in Strategy 1.
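The Otsu thresholding step can be sketched as follows (a standard formulation over a 256-bin histogram of 8-bit saturation values; this is our illustration, not the deployed code):

```javascript
// Otsu's method: choose the threshold maximizing between-class variance.
function otsuThreshold(values) {
  const hist = new Array(256).fill(0);
  for (const v of values) hist[v]++;
  const total = values.length;
  let sumAll = 0;
  for (let i = 0; i < 256; i++) sumAll += i * hist[i];
  let sumBg = 0, weightBg = 0, best = 0, bestVar = -1;
  for (let t = 0; t < 256; t++) {
    weightBg += hist[t];
    if (weightBg === 0) continue;
    const weightFg = total - weightBg;
    if (weightFg === 0) break;
    sumBg += t * hist[t];
    const meanBg = sumBg / weightBg;
    const meanFg = (sumAll - sumBg) / weightFg;
    const between = weightBg * weightFg * (meanBg - meanFg) ** 2;
    if (between > bestVar) { bestVar = between; best = t; }
  }
  return best;
}
```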
Strategy 4: Fallback. If all strategies fail, a 15% inset rectangle is returned as a safe default for the interactive crop editor.
Corner Refinement. After initial detection, corners are refined by sampling 20 points along each quad edge, casting perpendicular rays to find maximum gradient magnitudes, fitting a linear regression to determine the true edge line, and computing refined corners as adjacent edge-line intersections. Displacement is capped at 15% of the image diagonal.
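The line-fitting and corner-intersection steps can be sketched as follows (the perpendicular ray casting that produces the edge samples is omitted; the fit uses the principal direction of the 2×2 scatter matrix):

```javascript
// Least-squares line through refined edge samples: returns a point on the
// line (the centroid) and a unit direction vector.
function fitLine(points) {
  const n = points.length;
  let mx = 0, my = 0;
  for (const [x, y] of points) { mx += x; my += y; }
  mx /= n; my /= n;
  let sxx = 0, sxy = 0, syy = 0;
  for (const [x, y] of points) {
    sxx += (x - mx) ** 2; sxy += (x - mx) * (y - my); syy += (y - my) ** 2;
  }
  // Principal eigenvector angle of [[sxx, sxy], [sxy, syy]].
  const theta = 0.5 * Math.atan2(2 * sxy, sxx - syy);
  return { px: mx, py: my, dx: Math.cos(theta), dy: Math.sin(theta) };
}

// Refined corner = intersection of two adjacent edge lines.
function intersectLines(a, b) {
  const det = a.dx * b.dy - a.dy * b.dx;
  const t = ((b.px - a.px) * b.dy - (b.py - a.py) * b.dx) / det;
  return [a.px + t * a.dx, a.py + t * a.dy];
}
```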
3.3 Perspective Correction
The user-defined (or auto-detected) quadrilateral is mapped to a 300×418 pixel rectangle corresponding to a standard Pokémon card at approximately 120 DPI. A 3×3 homography matrix H is computed by solving an 8×8 linear system via Gaussian elimination with partial pivoting, mapping destination rectangle corners to source quadrilateral corners. Because H encodes the destination-to-source direction, applying it to each output pixel (x', y') yields the corresponding source coordinates:
[wx, wy, w]^T = H · [x', y', 1]^T
source = (wx / w, wy / w)
Bilinear interpolation is applied at the source position to produce the final pixel value. The output is encoded as a JPEG blob at quality 0.9 using OffscreenCanvas.convertToBlob().
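The homography solve and per-pixel mapping can be sketched as follows (a minimal version; the bilinear sampling loop is omitted, and function names are ours):

```javascript
// Solve A h = b (A is 8x8) by Gaussian elimination with partial pivoting.
function solve8x8(A, b) {
  const n = 8;
  for (let col = 0; col < n; col++) {
    let pivot = col; // pick the row with the largest pivot magnitude
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
    }
    [A[col], A[pivot]] = [A[pivot], A[col]];
    [b[col], b[pivot]] = [b[pivot], b[col]];
    for (let r = col + 1; r < n; r++) {
      const f = A[r][col] / A[col][col];
      for (let c = col; c < n; c++) A[r][c] -= f * A[col][c];
      b[r] -= f * b[col];
    }
  }
  const h = new Array(n).fill(0);
  for (let r = n - 1; r >= 0; r--) { // back substitution
    let s = b[r];
    for (let c = r + 1; c < n; c++) s -= A[r][c] * h[c];
    h[r] = s / A[r][r];
  }
  return h;
}

// Homography mapping destination corners to source corners (h8 fixed at 1).
function computeHomography(dst, src) {
  const A = [], b = [];
  for (let i = 0; i < 4; i++) {
    const [xd, yd] = dst[i], [xs, ys] = src[i];
    A.push([xd, yd, 1, 0, 0, 0, -xs * xd, -xs * yd]); b.push(xs);
    A.push([0, 0, 0, xd, yd, 1, -ys * xd, -ys * yd]); b.push(ys);
  }
  return [...solve8x8(A, b), 1]; // row-major 3x3
}

// Map an output pixel to (sub-pixel) source coordinates.
function mapPoint(H, x, y) {
  const w = H[6] * x + H[7] * y + H[8];
  return [(H[0] * x + H[1] * y + H[2]) / w, (H[3] * x + H[4] * y + H[5]) / w];
}
```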
3.4 Feature Extraction
The perspective-corrected card image is processed through a quantized CLIP ViT-B/16 model loaded via the Hugging Face Transformers.js library[15]. The model files are fetched from the Hugging Face CDN on first use and cached in the browser's Cache API. The image is converted to a RawImage, processed through the CLIP vision processor (which handles resizing and normalization), and the 512-dimensional image_embeds output is L2-normalized to produce the query embedding.
All inference executes on the main thread via the ONNX Runtime Web Wasm backend with SIMD extensions. No Web Workers are employed.
3.5 Cosine Similarity Search
The query embedding (Float32Array) is searched against all downloaded shards stored in IndexedDB. For each card in each shard, the stored uint8 embedding is dequantized by subtracting 128, and cosine similarity is computed:
similarity = dot(query, stored - 128) / (||query|| * ||stored - 128||)
Critically, the quantization scale factor cancels in the cosine similarity computation. Because stored = original × quantScale + 128, the dequantized value (stored - 128) = original × quantScale. Since cosine similarity is scale-invariant (cos(a, kb) = cos(a, b) for positive scalar k), the float query can be compared directly against dequantized uint8 values without knowledge of the quantization scale. This property enables the search engine to operate without the quantScale parameter at query time.
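The scale cancellation can be verified in a few lines (an illustrative sketch, not the deployed search loop):

```javascript
// Cosine similarity between a float query and a stored uint8 embedding.
// Dequantization subtracts only the zero-point: (byte - 128) equals
// original * quantScale, and the scale cancels because cosine similarity
// is invariant to positive scaling of either argument.
function cosineQuantized(query, storedBytes) {
  let dot = 0, qNorm = 0, sNorm = 0;
  for (let i = 0; i < query.length; i++) {
    const s = storedBytes[i] - 128; // quantScale not needed at query time
    dot += query[i] * s;
    qNorm += query[i] * query[i];
    sNorm += s * s;
  }
  return dot / (Math.sqrt(qNorm) * Math.sqrt(sNorm));
}
```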
Results are sorted by similarity score in descending order and the top-K matches (default K=10) are returned with card IDs and set associations.
4. Implementation
PokVault Rev is implemented as a React single-page application deployed on Cloudflare Pages. The system is accessed through three user-facing entry points:
- Reverse Image Search (/rev-search): Desktop-oriented page supporting file upload, drag-and-drop, and camera capture. Includes the full corner detection and interactive crop editor workflow.
- Real-Time Scanner (/scanner): Mobile-oriented page with continuous camera feed. Bypasses the crop editor for speed, feeding captured frames directly through the embedding and search pipeline. A guide rectangle overlay (55% of viewport width, aspect ratio 2.5:3.5) helps users frame their card.
- Binder Scan Tab: Integrated into the collection management modal, providing a simplified single-card scan path (file input to embedding to search, without corner detection).
Camera capture uses navigator.mediaDevices.getUserMedia() with {facingMode: 'environment', width: {ideal: 1920}, height: {ideal: 1080}}. Frames are captured to an OffscreenCanvas at full video resolution and encoded as JPEG at quality 0.92.
The interactive crop editor renders a draggable quadrilateral overlay with four corner handles (14px radius, 30px hit area). A zoom loupe (240×240px, 1.25× magnification) appears at the active corner during drag operations, showing the source image magnified with crosshair and quadrilateral outline.
Embedding data is managed through an IndexedDB database (pokemon-card-scan) with three object stores: meta (manifest and download status), shards (binary shard data), and maps (card ID arrays). A download manager supports pause, resume, and cancel via AbortController, with progress reporting for the initial 9.2 MB download. Staleness detection compares the stored manifest's model identifier against the current EMBEDDING_MODEL_ID constant, triggering automatic re-download when the embedding model changes.
Dependencies. The system uses @huggingface/transformers (v3.8+) for CLIP inference, which internally bundles onnxruntime-web (v1.22+) as its Wasm inference backend.
5. System Characteristics
5.1 Database Coverage
The current embedding database covers 17,531 unique Pokémon TCG cards across 121 sets, spanning from the original Base Set (1999) through Scarlet & Violet era sets (2024–2025). Set sizes range from 18 cards (Detective Pikachu) to 284 cards (Sword & Shield — Brilliant Stars). The database represents English-language printings.
5.2 Storage Efficiency
| Representation | Per-Card | Total | Ratio |
|---|---|---|---|
| Float64 (raw) | 4,096 B | 71.8 MB | 1.0× |
| Float32 | 2,048 B | 35.9 MB | 2.0× |
| Uint8 (ours) | 516 B | 9.2 MB | 7.8× |
| Binary (1-bit) | 64 B | 1.1 MB | 65.2× |
The uint8 scheme achieves 7.8× compression relative to float64 while preserving cosine similarity ordering. Binary quantization[13] would reduce storage further but at the cost of significant retrieval quality degradation when applied post-training.
5.3 Quantization Fidelity
The adaptive quantization scheme assigns a global scale of 174 (computed as floor(127 / globalMaxAbs)), yielding approximately 146 effective quantization levels for the observed embedding value range. The verify-embeddings build script validates self-match accuracy by computing fresh embeddings for a sample of cards and searching against the quantized database, confirming rank-1 self-matching.
5.4 Comparative Positioning
| System | Inference Location | Privacy | Requires Internet | Storage | Open Source |
|---|---|---|---|---|---|
| TCGPlayer Roca[1] | Server | No | Yes | N/A | No |
| Ximilar API[2] | Server | No | Yes | N/A | No |
| Collectors.com[3] | Server | No | Yes | N/A | No |
| pHash projects[7,8] | Client | Yes | No | ~1 MB | Yes |
| PokVault Rev | Client | Yes | No* | 9.2 MB | Yes |

*Offline after the one-time download of the embedding shards and model files.
PokVault Rev is the only system that combines client-side inference, privacy preservation, and neural network-based visual similarity. Perceptual hashing approaches achieve smaller storage but lack robustness under perspective distortion and photometric variation. Server-based systems offer the highest accuracy but sacrifice user privacy and require persistent connectivity.
6. Discussion
6.1 Advantages of Training-Free Retrieval
A key design decision is the use of a pre-trained CLIP model without any fine-tuning on Pokémon card imagery. It is important to clarify the scope of this claim: CLIP itself is used as-is (no domain adaptation), but the reference database is built from known card images processed through the same model. The system is therefore “training-free” at both build and query time — no gradient updates occur — but it is not “zero-shot” in the strict sense, since the reference corpus is curated.
This training-free approach provides two practical advantages. First, adding new card sets requires only re-running the offline embedding script, with no model retraining. Second, the same model generalizes across all eras of card design, from the simple illustrations of the 1999 Base Set to the full-art and alternate-art treatments of modern Scarlet & Violet cards.
The limitation is that CLIP's embeddings may not capture fine-grained differences between card variants (e.g., regular vs. reverse-holo vs. full-art of the same card). A hybrid system combining CLIP embeddings for initial retrieval with a task-specific refinement model for variant discrimination represents a promising direction.
6.2 Classical CV vs. Learned Detection
The card corner detection subsystem deliberately uses classical computer vision rather than a learned detector (e.g., the YOLOv9 card detector also present in the codebase but not deployed). This choice reflects a practical trade-off: the classical pipeline requires no additional model download (saving 12 MB) and avoids Wasm inference overhead for a task where rule-based approaches perform adequately. The multi-strategy cascade (contour OBB, Hough lines, saturation segmentation) covers a broad range of imaging conditions, with each strategy compensating for the others' weaknesses.
6.3 Limitations
Main-thread inference. All neural network inference executes on the browser's main thread, blocking UI interaction during embedding extraction. Migrating to a Web Worker would improve responsiveness but adds implementation complexity around transferable objects.
Brute-force search. The current linear scan over all 17,531 embeddings is acceptable for the current database size but will not scale to significantly larger catalogs. Approximate nearest-neighbor indices (e.g., IVF, HNSW) adapted for browser environments would be necessary for databases exceeding approximately 100,000 cards.
English-only coverage. The current database covers only English-language printings. Japanese and other language printings have different visual elements that would require separate embedding computation.
No variant discrimination. CLIP embeddings capture overall visual similarity but may rank different print variants (regular, reverse-holo, full-art) of the same card highly, requiring the user to select the correct variant from top results.
6.4 Future Work
Several extensions are planned or partially implemented. A multi-card detection system using YOLOv9[11] with tiled inference and grid fitting has been implemented at the service layer and would enable binder page scanning. WebGPU acceleration could provide 7–64× inference speedups[16] over the current Wasm backend. Approximate nearest-neighbor search using quantized HNSW indices would enable scaling to larger multi-game databases.
7. Conclusion
This paper presented PokVault Rev, a fully client-side visual search system for Pokémon Trading Card identification. The system demonstrates that modern browser technologies — WebAssembly-based neural network inference, IndexedDB for structured binary storage, and OffscreenCanvas for image processing — have matured sufficiently to support complete visual search pipelines without server-side infrastructure. The combination of CLIP embeddings with adaptive uint8 quantization achieves a practical trade-off between recognition quality and storage efficiency, compressing a 17,531-card database to 9.2 MB while preserving cosine similarity ranking. The multi-strategy classical CV pipeline for card localization provides robust corner detection across diverse imaging conditions without requiring additional model downloads.
By keeping all computation on-device, PokVault Rev eliminates privacy concerns inherent in server-based alternatives, enables offline operation after initial setup, and removes per-query latency from network round trips. The system is deployed in production at pokvault.com, serving the Pokémon TCG collector community.
More broadly, the architecture presented here — pre-computed quantized embeddings stored client-side, combined with a foundation model used as a feature extractor — generalizes beyond trading cards to any domain where a bounded visual catalog must be searched privately. Sports memorabilia, stamp collections, coin identification, and retail product lookup are all candidate applications. The source code, embedding build scripts, and pre-trained model are publicly available, enabling reproducibility and adaptation to other collectible domains.
References
- [1] TCGPlayer, “How Scan & Identify Technology Works,” TCGPlayer Help Center, 2024. help.tcgplayer.com
- [2] Ximilar, “Visual AI for Collectibles,” 2024. ximilar.com
- [3] I. Kalkanci, “Automating Card Identification Using Computer Vision,” Collectors Tech Blog, May 2022. blog.collectors.com
- [4] E. Pua, “Real-Time Pokémon Card Detection from Tournament Footage,” Stanford CS231n, 2024. cs231n.stanford.edu
- [5] L. Nahar, M. S. Islam, M. Awrangjeb, G. Tuxworth, “Automated corner grading of trading cards: Defect identification and confidence calibration through deep learning,” Computers in Industry, vol. 163, 2024.
- [6] crimsonthinker, “PSA Pokemon Cards: A Pokemon card grading system using Deep Learning,” GitHub, 2023. github.com
- [7] 1vcian, “Pokemon-TCGP-Card-Scanner,” GitHub, 2024. github.com
- [8] tranhd95, “tcg-scanner: Trading Card Game cards scanner using various approaches,” GitHub, 2024. github.com
- [9] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in Proc. ICML, 2021, pp. 8748–8763.
- [10] G. Jocher, A. Chaurasia, J. Qiu, “YOLOv8 by Ultralytics,” GitHub, 2023. github.com
- [11] C.-Y. Wang, I.-H. Yeh, H.-Y. M. Liao, “YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information,” in Proc. ECCV, 2024.
- [12] A. Radford et al., “CLIP: Connecting text and images,” OpenAI Blog, Jan. 2021. openai.com
- [13] Marqo, “Learn to Binarize CLIP for Multimodal Retrieval and Ranking,” 2024. marqo.ai
- [14] Microsoft, “ONNX Runtime Web,” 2024. onnxruntime.ai
- [15] Hugging Face, “Transformers.js v3: WebGPU Support, New Models & Tasks, and More,” HF Blog, Oct. 2024. huggingface.co
- [16] SitePoint, “WebGPU vs WebASM: Browser Inference Benchmarks,” 2024. sitepoint.com
- [17] Q. Jin et al., “Empowering In-Browser Deep Learning Inference on Edge Through Just-In-Time Kernel Optimization,” in Proc. ACM MobiSys, 2024.