Skip to content

fix(core): L2-normalize FastEmbed vectors to satisfy unit-vector contract#843

Open
tk-pkm111 wants to merge 1 commit into
basicmachines-co:mainfrom
tk-pkm111:fix/fastembed-l2-normalization
Open

fix(core): L2-normalize FastEmbed vectors to satisfy unit-vector contract#843
tk-pkm111 wants to merge 1 commit into
basicmachines-co:mainfrom
tk-pkm111:fix/fastembed-l2-normalization

Conversation

@tk-pkm111
Copy link
Copy Markdown

@tk-pkm111 tk-pkm111 commented May 21, 2026

Summary

  • sqlite_search_repository.py converts vector distance to similarity with similarity = 1 - distance²/2, a formula that assumes unit-normalized vectors (the requirement is explicitly stated in comments on lines 65-67 of that file).
  • Some embedding models — notably multilingual ones such as paraphrase-multilingual-MiniLM-L12-v2 — return vectors with norm ≈ 2.9 rather than 1.0. This causes every similarity score to collapse near 0.0, making vector/hybrid search effectively non-functional for those models.
  • This PR applies L2 normalization inside _embed_batch() after converting raw model output to Python floats. Zero vectors are returned as-is to prevent division errors.

Root cause

FastEmbed's default English model (bge-small-en-v1.5) happens to return unit-normalized vectors, so the bug is invisible with the default config. Switching to any model that does not self-normalize (e.g., the multilingual MiniLM family) exposed the contract violation.

Changes

File Change
src/basic_memory/repository/fastembed_provider.py Add import math; L2-normalize each vector in _embed_batch()
tests/repository/test_fastembed_provider.py Two new tests: norm==1.0 assertion for unnormalized stub model; zero-vector no-raise assertion

Test plan

  • All 8 existing + new unit tests pass: uv run --python 3.12 pytest tests/repository/test_fastembed_provider.py -v
  • New test test_fastembed_provider_l2_normalizes_output_vectors confirms output norm is 1.0 ± 1e-6 for a stub model that returns norm ≈ 2.9
  • New test test_fastembed_provider_zero_vector_does_not_raise confirms no ZeroDivisionError on zero vector

🤖 Generated with Claude Code

…ract

sqlite_search_repository uses the formula `similarity = 1 - distance²/2`
which is only correct for unit-normalized vectors (see comment on lines 65-67
of sqlite_search_repository.py). Models such as
paraphrase-multilingual-MiniLM-L12-v2 return vectors with norm ≈ 2.9, causing
cosine similarity scores to collapse near 0 for every query.

Apply L2 normalization in `_embed_batch()` after converting to Python floats,
so the contract holds regardless of which model is configured. Zero vectors
are returned as-is to avoid division errors.

Adds two unit tests: one that asserts the output norm is 1.0 for an
unnormalized stub model, and one that verifies no exception is raised for
zero vectors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: tk-pkm111 <133480534+tk-pkm111@users.noreply.github.com>
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 21, 2026

CLA assistant check
All committers have signed the CLA.

@phernandez
Copy link
Copy Markdown
Member

Hi @tk-pkm111 — thanks for the careful fix and the clear reproduction. The change looks solid: the unit-norm contract on sqlite_search_repository.py lines 65-67 is real, multilingual MiniLM models do break it, and the patch addresses it at the right boundary. Tests look meaningful too.

The one blocker before we can merge is the CLA — the bot still shows it as unsigned for this PR. Could you sign it via the link in CLAassistant's comment above? Once that's in we can move toward landing this.

phernandez added a commit to RheagalFire/basic-memory that referenced this pull request May 26, 2026
Bring the LiteLLM provider in line with the unit-norm contract from
sqlite_search_repository.py (lines 65-67): the cosine-similarity formula
`1 - L²/2` is correct only for unit-normalized vectors. LiteLLM routes to
many backends (Cohere, Vertex, Bedrock, etc.) that do not return normalized
embeddings, so normalize at the provider boundary — same fix shape as the
parallel FastEmbed change in basicmachines-co#843.

Also align the response handling with OpenAIEmbeddingProvider:
- attribute access on response items (item.index / item.embedding)
- explicit duplicate-index guard

Tests cover the three behaviors directly (unit norm, zero-vector pass-through,
duplicate-index error) and the existing ordering test now reconstructs the
expected normalized vectors so a normalization regression would be caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@tk-pkm111
Copy link
Copy Markdown
Author

CLA is now signed — could you merge this when you get a chance?

Copy link
Copy Markdown
Member

@phernandez phernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the FastEmbed normalization change against the sqlite_search_repository unit-vector contract. No blockers found.\n\nLocal verification on PR head 4ef062b:\n- uv run pytest tests/repository/test_fastembed_provider.py -> 8 passed\n- uv run ruff check src/basic_memory/repository/fastembed_provider.py tests/repository/test_fastembed_provider.py -> passed\n\nNote: the repo's full Tests workflow is push-only and did not run for this external PR branch, so this review is based on focused local verification rather than the full CI matrix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants