Embedding Model
Sharc-Embed
SHARC’s embedding model is optimized for code:

| Property | Value |
|---|---|
| Model | Sharc-Embed |
| Dimensions | 4096 |
| Max Tokens | 8192 |
Key Features
- Code-optimized: Trained on large code corpora
- High dimensionality: 4096 dims capture nuanced semantics
- Long context: the 8K-token window handles large functions
Embedding Generation
Process Overview
Batching Strategy
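One way to picture the batching strategy in this section is a bounded pool of concurrent batch requests. The sketch below is an illustration under stated assumptions, not SHARC's actual implementation: `embed_batch`, `fake_embed_batch`, the batch size, and the concurrency limit are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

Vector = List[float]
EmbedFn = Callable[[Sequence[str]], Awaitable[List[Vector]]]


def make_batches(items: Sequence[str], batch_size: int) -> List[Sequence[str]]:
    """Split items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def embed_all(
    chunks: Sequence[str],
    embed_batch: EmbedFn,
    batch_size: int = 32,      # hypothetical value
    max_concurrency: int = 4,  # hypothetical value
) -> List[Vector]:
    """Embed chunks in batches, running up to max_concurrency batches at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(batch: Sequence[str]) -> List[Vector]:
        async with semaphore:
            return await embed_batch(batch)

    # gather preserves input order, so results line up with the input chunks.
    results = await asyncio.gather(
        *(run(batch) for batch in make_batches(chunks, batch_size))
    )
    return [vector for batch in results for vector in batch]


async def fake_embed_batch(batch: Sequence[str]) -> List[Vector]:
    """Stand-in for a real embedding call; returns one tiny vector per chunk."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return [[float(len(text))] for text in batch]


vectors = asyncio.run(
    embed_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch,
              batch_size=2, max_concurrency=2)
)
```

The semaphore caps how many batches are in flight at once, which is a common way to keep throughput high without overwhelming the embedding backend.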
Embeddings are generated in optimized batches for efficiency. Multiple batches are processed concurrently, enabling fast indexing even for large codebases.
Vector Storage
Full Dimensions
Unlike some implementations that truncate embeddings, SHARC stores the full 4096 dimensions.
Normalization
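L2 normalization rescales a vector to unit length, after which the dot product of two vectors equals their cosine similarity. A minimal sketch in plain Python follows; the helper names are illustrative and not part of SHARC, and a 2-dimensional vector stands in for a 4096-dimensional one.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])  # -> [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # -> [0.8, 0.6]
similarity = dot(a, b)        # approximately 0.96
```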
Vectors are L2-normalized for cosine similarity.
Context Injection
Embeddings capture more than raw code. SHARC injects context:
For Code (AST-parsed)
- This is a method in `AuthService`
- It’s in the `auth` module
- File is `services/auth.ts`
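The context listed above could be attached to a code chunk roughly as follows. This is a sketch under assumptions: the `CodeChunk` shape, the `build_embedding_input` helper, the header format, and the sample method body are hypothetical and are not SHARC's actual format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeChunk:
    """A parsed code unit plus context taken from the AST (hypothetical shape)."""
    code: str
    file_path: str
    module: str
    symbol_kind: str               # e.g. "method" or "function"
    parent: Optional[str] = None   # enclosing class, if any


def build_embedding_input(chunk: CodeChunk) -> str:
    """Prefix the raw code with a short context header before embedding."""
    header = [f"File: {chunk.file_path}", f"Module: {chunk.module}"]
    if chunk.parent:
        header.append(f"{chunk.symbol_kind.capitalize()} in: {chunk.parent}")
    else:
        header.append(f"Kind: {chunk.symbol_kind}")
    return "\n".join(header) + "\n\n" + chunk.code


chunk = CodeChunk(
    code="async login(email: string, password: string) { /* ... */ }",
    file_path="services/auth.ts",
    module="auth",
    symbol_kind="method",
    parent="AuthService",
)
text = build_embedding_input(chunk)
```

Embedding `text` instead of `chunk.code` alone is what lets a query such as "auth service login" match on the class, module, and file path as well as the function body.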
With Decorators
For Documentation
Semantic Properties
What Embeddings Capture
| Property | Example |
|---|---|
| Function purpose | “authentication” vs “validation” |
| Code patterns | async/await, error handling |
| Data structures | arrays, objects, classes |
| Domain concepts | “user”, “payment”, “order” |
| Relationships | caller-callee, inheritance |
Similarity Examples
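As a rough illustration of how similarity scores rank results, the sketch below uses made-up 3-dimensional vectors in place of real 4096-dimensional Sharc-Embed output. The snippet names and all numbers are invented for the example and say nothing about actual SHARC scores.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(
    query: Sequence[float], index: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors: the first two point in a similar direction, the third does not.
toy_index = {
    "validateToken": [0.9, 0.1, 0.0],
    "hashPassword": [0.7, 0.3, 0.1],
    "formatCurrency": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]  # stands in for an embedded auth-related query

ranking = rank(toy_query, toy_index)
names_in_order = [name for name, _ in ranking]
```

With these toy inputs, the two auth-like snippets score close to 1 and the unrelated one scores near 0, which is the pattern semantic search relies on.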
Indexing Performance
Indexing time depends on codebase size:

| Codebase | Approximate Time |
|---|---|
| Small (~500 files) | ~15-20 seconds |
| Medium (e.g., Hono) | ~30-45 seconds |
| Large (e.g., Next.js) | ~3-4 minutes |
Troubleshooting
Slow Embedding Generation
- Re-index the codebase if search results look stale.
- Ensure your codebase indexing completed successfully.
- Retry with a more specific semantic query.