Embeddings - Sharc

SHARC uses code-optimized embeddings to capture the semantic meaning of code. This page explains the embedding model, the generation process, and the optimization techniques behind it.

Embedding Model

Sharc-Embed

SHARC’s embedding model is optimized for code:

Property	Value
Model	Sharc-Embed
Dimensions	4096
Max Tokens	8192

Key Features

Code-optimized: Trained on large code corpora
High dimensionality: 4096 dims capture nuanced semantics
Long context: an 8K-token window fits large functions in a single embedding

Embedding Generation

Process Overview

Batching Strategy

Embeddings are generated in optimized batches for efficiency. Multiple batches are processed concurrently, enabling fast indexing even for large codebases.
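A minimal sketch of concurrent batching follows. It is not SHARC's implementation: the `EmbedFn` signature, the batch size of 64, and the concurrency limit of 4 are all assumptions chosen for illustration.

```typescript
// Hypothetical signature for whatever service turns texts into vectors.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed all texts while keeping at most `concurrency` batches in flight.
// Results are stored by batch index, so output order matches input order.
async function embedInBatches(
  texts: string[],
  embed: EmbedFn,
  batchSize = 64, // assumed default, not a documented SHARC value
  concurrency = 4, // assumed default, not a documented SHARC value
): Promise<number[][]> {
  const batches = chunk(texts, batchSize);
  const results: number[][][] = new Array(batches.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed batch.
  async function worker(): Promise<void> {
    while (next < batches.length) {
      const index = next++;
      results[index] = await embed(batches[index]);
    }
  }

  const workerCount = Math.min(concurrency, batches.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  // Flatten the per-batch results into one list of vectors.
  return ([] as number[][]).concat(...results);
}
```

Bounding the number of in-flight batches keeps throughput high without flooding the embedding service with one request per file.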

Vector Storage

Full Dimensions

Unlike some implementations that truncate embeddings, SHARC stores the full 4096 dimensions:

MRL Truncation (other tools):
  4096 → 1024 dims (75% of dimensions discarded)

SHARC:
  4096 → 4096 dims (no loss)
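The trade-off can be made concrete with a little arithmetic. The sketch below assumes vectors are stored as 32-bit floats, which this page does not state; `bytesPerVector` and `truncate` are hypothetical helpers.

```typescript
const BYTES_PER_FLOAT32 = 4; // assumption: components are stored as 32-bit floats

// Bytes needed to store one vector of the given dimensionality.
function bytesPerVector(dims: number): number {
  return dims * BYTES_PER_FLOAT32;
}

// MRL-style truncation: keep only the first `dims` components.
function truncate(vector: number[], dims: number): number[] {
  return vector.slice(0, dims);
}

const full = new Array<number>(4096).fill(0.01); // placeholder values
const truncated = truncate(full, 1024);

console.log(bytesPerVector(full.length)); // 16384
console.log(bytesPerVector(truncated.length)); // 4096
console.log(1 - truncated.length / full.length); // 0.75
```

Under that assumption, keeping all 4096 dimensions costs four times the storage of a 1024-dimension vector in exchange for retaining every component.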

Normalization

Vectors are L2-normalized for cosine similarity:

// Before normalization
[0.5, 0.3, 0.8, ...]

// After normalization (unit length)
[0.47, 0.28, 0.75, ...] // ||v|| = 1

Because every stored vector has unit length, cosine similarity reduces to a plain dot product, which is cheaper to compute.
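The relationship between normalization and the dot product can be checked with a short sketch. The function names here are illustrative and not part of any SHARC API.

```typescript
// Scale a vector to unit length (L2 norm of 1).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return v.slice(); // a zero vector has no direction; return a copy
  return v.map((x) => x / norm);
}

// Sum of element-wise products.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Reference definition: cos(a, b) = (a . b) / (|a| * |b|)
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const a = [3, 4];
const b = [4, 3];

console.log(l2Normalize(a)); // [ 0.6, 0.8 ]

// Both values are approximately 0.96: once the vectors are normalized,
// the dot product alone gives the cosine similarity.
console.log(cosineSimilarity(a, b));
console.log(dot(l2Normalize(a), l2Normalize(b)));
```

Normalizing once at index time means each query-time comparison skips the two square roots and the division.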

Context Injection

Embeddings capture more than raw code. SHARC injects context:

For Code (AST-parsed)

// Original function:
async authenticate(user: string): Promise<boolean> { ... }

// With injected context:
// Context: class AuthService > module auth (services/auth.ts)
async authenticate(user: string): Promise<boolean> { ... }

The embedding now “knows”:

This is a method in AuthService
It’s in the auth module
File is services/auth.ts
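One way such a context line could be assembled is sketched below. The `ChunkContext` shape and the `injectContext` helper are hypothetical; only the output format is taken from the example above.

```typescript
// Hypothetical shape for the metadata an AST pass might attach to a chunk.
interface ChunkContext {
  className?: string;
  moduleName?: string;
  filePath: string;
}

// Prepend a one-line context comment in the format shown above.
function injectContext(code: string, ctx: ChunkContext): string {
  const scope: string[] = [];
  if (ctx.className) scope.push(`class ${ctx.className}`);
  if (ctx.moduleName) scope.push(`module ${ctx.moduleName}`);
  const prefix = scope.length > 0 ? `${scope.join(" > ")} ` : "";
  return `// Context: ${prefix}(${ctx.filePath})\n${code}`;
}

const enriched = injectContext(
  "async authenticate(user: string): Promise<boolean> { ... }",
  { className: "AuthService", moduleName: "auth", filePath: "services/auth.ts" },
);

console.log(enriched);
// // Context: class AuthService > module auth (services/auth.ts)
// async authenticate(user: string): Promise<boolean> { ... }
```

The enriched string, rather than the bare function, is what would be sent to the embedding model, so two methods with identical bodies in different classes produce different vectors.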

With Decorators

// TypeScript/JavaScript with decorators:
// Context: class UserController @Controller("/users") (controllers/user.ts)
// @Get("/:id") @Auth
async getUser(id: string): Promise<User> { ... }

For Documentation

// Original:
## Authentication
Users must provide valid credentials...

// With file context:
// File: docs/security/authentication.md
## Authentication
Users must provide valid credentials...

Semantic Properties

What Embeddings Capture

Property	Example
Function purpose	“authentication” vs “validation”
Code patterns	async/await, error handling
Data structures	arrays, objects, classes
Domain concepts	“user”, “payment”, “order”
Relationships	caller-callee, inheritance

Similarity Examples

Query: "user authentication"

High similarity (0.9+):
- async function authenticateUser(credentials) { ... }
- class AuthenticationService { verify() { ... } }

Medium similarity (0.7-0.9):
- function validateUserInput(input) { ... }
- const userSession = { authenticated: true }

Low similarity (< 0.5):
- const styles = { color: 'red' }
- function calculateTax(amount) { ... }
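Ranking by similarity can be sketched as a dot product against each stored vector followed by a sort. The two-dimensional unit vectors below are hand-picked toy values, not output from Sharc-Embed, and `topK` is a hypothetical helper.

```typescript
interface IndexedChunk {
  label: string;
  vector: number[]; // assumed to be L2-normalized already
}

// Sum of element-wise products; equals cosine similarity for unit vectors.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Return the k chunks most similar to the query, highest score first.
function topK(
  query: number[],
  chunks: IndexedChunk[],
  k: number,
): { label: string; score: number }[] {
  return chunks
    .map((c) => ({ label: c.label, score: dot(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy data: each vector has unit length (e.g. 0.96^2 + 0.28^2 = 1).
const query = [1, 0];
const chunks: IndexedChunk[] = [
  { label: "calculateTax", vector: [0.28, 0.96] },
  { label: "authenticateUser", vector: [0.96, 0.28] },
  { label: "validateUserInput", vector: [0.8, 0.6] },
];

console.log(topK(query, chunks, 2));
// [ { label: 'authenticateUser', score: 0.96 },
//   { label: 'validateUserInput', score: 0.8 } ]
```

A production index would typically use an approximate nearest-neighbor structure instead of scoring every vector, but the scoring rule is the same.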

Indexing Performance

Indexing time depends on codebase size:

Codebase	Approximate Time
Small (~500 files)	~15-20 seconds
Medium (e.g., Hono)	~30-45 seconds
Large (e.g., Next.js)	~3-4 minutes

Subsequent incremental syncs use Merkle diffs and complete near-instantly for unchanged codebases.
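The idea behind a Merkle diff can be sketched as follows: each directory's hash covers its children's hashes, so an unchanged subtree is skipped after a single comparison. This is an illustration under assumptions, not SHARC's code. The types and function names are hypothetical, FNV-1a stands in for whatever hash is actually used, and deletions are not handled.

```typescript
// A file holds content; a directory holds named children.
type TreeNode =
  | { kind: "file"; content: string }
  | { kind: "dir"; children: Record<string, TreeNode> };

interface HashedNode {
  hash: string;
  children?: Record<string, HashedNode>; // present only for directories
}

// FNV-1a (32-bit): a small non-cryptographic hash that keeps the sketch
// dependency-free. A real index would likely use a cryptographic hash.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Hash bottom-up: a directory's hash covers its children's names and hashes,
// so any change below a directory changes that directory's hash as well.
function buildMerkle(node: TreeNode): HashedNode {
  if (node.kind === "file") return { hash: fnv1a(node.content) };
  const children: Record<string, HashedNode> = {};
  const parts: string[] = [];
  for (const name of Object.keys(node.children).sort()) {
    children[name] = buildMerkle(node.children[name]);
    parts.push(`${name}:${children[name].hash}`);
  }
  return { hash: fnv1a(parts.join("|")), children };
}

// Collect paths of added or modified files, skipping unchanged subtrees.
function changedPaths(prev: HashedNode | undefined, next: HashedNode, path = ""): string[] {
  if (prev && prev.hash === next.hash) return []; // identical subtree: nothing to re-embed
  if (!next.children) return [path]; // a new or modified file
  const out: string[] = [];
  for (const name of Object.keys(next.children)) {
    const childPath = path ? `${path}/${name}` : name;
    out.push(...changedPaths(prev?.children?.[name], next.children[name], childPath));
  }
  return out;
}

const before: TreeNode = {
  kind: "dir",
  children: {
    services: { kind: "dir", children: { "auth.ts": { kind: "file", content: "v1" } } },
    "README.md": { kind: "file", content: "docs" },
  },
};
const after: TreeNode = {
  kind: "dir",
  children: {
    services: { kind: "dir", children: { "auth.ts": { kind: "file", content: "v2" } } },
    "README.md": { kind: "file", content: "docs" },
  },
};

console.log(changedPaths(buildMerkle(before), buildMerkle(before))); // []
console.log(changedPaths(buildMerkle(before), buildMerkle(after))); // [ 'services/auth.ts' ]
```

For an unchanged codebase the root hashes match and the diff returns after one comparison, which is consistent with the near-instant syncs described above.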

Troubleshooting

Slow Embedding Generation

Re-index the codebase if search results look stale.
Ensure your codebase indexing completed successfully.
Retry with a more specific semantic query.
