
Semantic Search with Python and Open-Source Models

NeuralPulse|June 13, 2026|5 min read

What separates a search that understands meaning from one that just finds keywords? In 2026, the answer has a name: embeddings.

Behind recommendation systems, chatbots, and modern search engines lies a technique that transforms text into numbers. And not just any numbers—into vectors that capture context and meaning.

The good news? You don't need a big tech budget to implement this. Models like BGE-M3 (from BAAI) and GTE-Qwen2 (from Alibaba) lead the MTEB ranking with scores above 60 (source: Hugging Face MTEB Leaderboard, June 2026). And they run on modest hardware.

In this tutorial, you'll learn how to use embeddings for semantic search in Python—with code that works, real metrics, and no reliance on expensive APIs.

What embeddings are and why they truly matter

Embeddings are numerical representations of text. Each sentence, paragraph, or document becomes a vector—a list of hundreds or thousands of numbers.

The magic lies in geometry. Similar texts generate nearby vectors in multidimensional space. "Cat eats food" and "The feline feeds" end up close together. "Broken car" stays far away.
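To make the geometry concrete, here is a sketch with hand-picked 3-dimensional vectors. The numbers are invented for illustration (BGE-M3 actually outputs 1024 dimensions). The point is only that the two cat sentences score as close and the car sentence as distant:

```python
import numpy as np

# Hypothetical 3-d "embeddings", invented for illustration only.
# Real models output hundreds or thousands of dimensions.
toy_vectors = {
    "Cat eats food": np.array([0.9, 0.8, 0.1]),
    "The feline feeds": np.array([0.8, 0.7, 0.3]),
    "Broken car": np.array([0.1, 0.05, 0.95]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

anchor = toy_vectors["Cat eats food"]
for text, vec in toy_vectors.items():
    # Higher score = closer in meaning (under these made-up vectors)
    print(f"{cosine(anchor, vec):.2f}  {text}")
```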

Semantic search doesn't find words—it finds intentions. Embeddings transform language into geometry, and this changes everything in information retrieval.

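This technique goes beyond exact keyword matching. Instead of requiring the literal word "dog," the system recognizes that "canine pet" refers to the same thing. This sharply reduces false negatives: relevant documents that are missed only because they use different words.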
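A tiny illustration of the false-negative problem, using a hypothetical keyword_search helper (not from any library) that returns documents sharing at least one exact word with the query:

```python
def keyword_search(query, documents):
    """Naive keyword matcher: a document matches only if it shares a word with the query."""
    query_words = set(query.lower().split())
    return [doc for doc in documents if query_words & set(doc.lower().split())]

docs = [
    "My dog loves long walks in the park.",
    "Stock prices fell sharply on Monday.",
]

print(keyword_search("dog", docs))         # finds the first document
print(keyword_search("canine pet", docs))  # finds nothing: a false negative
```

An embedding model avoids this because it compares meanings rather than exact tokens.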

The MTEB (Massive Text Embedding Benchmark) evaluates models on tasks like classification, clustering, and retrieval. As of June 2026, open-source models dominate the top. BGE-M3, for example, achieves performance competitive with proprietary solutions (source: sbert.net).

Step-by-step: implementing semantic search with Python

Let's build a semantic search system from scratch. We'll use the sentence-transformers library, which offers pre-trained models and a clean API.

1. Installation and setup

First, install the dependencies:

pip install sentence-transformers numpy scikit-learn

sentence-transformers abstracts away the complexity of the models: you don't need to handle tokenization, pooling, or GPU placement by hand.

2. Loading the model

We'll use BGE-M3, one of the MTEB leaders. It's efficient and runs on CPU:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-m3')

The model is downloaded on the first run and cached locally. The weights total around 2 GB.

3. Creating embeddings for a document set

Suppose you have a base of technical articles:

documents = [
    "Python is a high-level programming language.",
    "JavaScript is primarily used for web development.",
    "Machine learning uses algorithms to find patterns.",
    "Neural networks are inspired by the human brain.",
    "Pandas is a Python library for data analysis.",
]

doc_embeddings = model.encode(documents)
print(doc_embeddings.shape)  # (5, 1024) for BGE-M3

Each document becomes a 1024-dimensional vector. Similarity between them is calculated using cosine similarity.
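For reference, the cosine similarity of two embedding vectors a and b with d components (d = 1024 for BGE-M3) is their dot product divided by the product of their lengths. It ranges from -1 to 1, and values near 1 mean the vectors point in nearly the same direction:

```latex
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{d} a_i b_i}{\sqrt{\sum_{i=1}^{d} a_i^2} \, \sqrt{\sum_{i=1}^{d} b_i^2}}
```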

4. Semantic search function

Now, the search itself:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def search(query, documents, doc_embeddings, top_k=3):
    # Embed the query with the same model used for the documents
    query_embedding = model.encode([query])
    # One cosine score per document; [0] selects the single query row
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    # Indices of the top_k highest scores, best first
    indices = np.argsort(similarities)[::-1][:top_k]
    results = []
    for i in indices:
        results.append({
            "document": documents[i],
            "score": similarities[i],
        })
    return results

# Testing
query = "How to analyze data in Python?"
results = search(query, documents, doc_embeddings)
for r in results:
    print(f"{r['score']:.2f} - {r['document']}")

The output should look something like this (the scores are illustrative; exact values depend on the model version):

0.89 - Pandas is a Python library for data analysis.
0.72 - Python is a high-level programming language.
0.45 - Machine learning uses algorithms to find patterns.

Notice: the search understood that "analyze data" relates to pandas, even though the word "pandas" wasn't in the query.
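The search function sorts every score with np.argsort, which is fine for five documents. For a larger in-memory corpus, a common alternative is np.argpartition, which selects the top k in linear time and sorts only those k. A minimal sketch with made-up scores; top_k_indices is a hypothetical helper name:

```python
import numpy as np

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    k = min(k, len(scores))
    if k <= 0:
        return np.array([], dtype=int)
    # Partition so the k largest scores occupy the last k slots (unordered)
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, highest score first
    return candidates[np.argsort(scores[candidates])[::-1]]

scores = np.array([0.45, 0.89, 0.12, 0.72, 0.30])
print(top_k_indices(scores, 3))  # [1 3 0]
```

Inside search, it would replace the argsort line: indices = top_k_indices(similarities, top_k).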

Local embeddings vs API: cost, latency, and performance

Is it worth paying for an API like OpenAI's? It depends on your scenario.

Feature	Local embedding (BGE-M3)	OpenAI API (text-embedding-3-small)
Cost per 1M tokens	Zero (after hardware)	~$0.02 (80% lower than 2025, source: OpenAI Pricing, June 2026)
Latency per query	10-50ms (depends on GPU)	200-500ms (network + processing)
Data privacy	Total	Data goes to external server
Internet dependency	No	Yes
Performance (MTEB)	64.2 (BGE-M3, source: MTEB Leaderboard, June 2026)	62.3 (text-embedding-3-small, source: MTEB Leaderboard, June 2026)
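To put the API column in perspective, here is a back-of-the-envelope sketch of a one-off indexing bill at the table's ~$0.02 per 1M tokens. The corpus size and average document length are hypothetical, and the price should be checked against OpenAI's current pricing:

```python
# Price taken from the comparison table (~$0.02 per 1M tokens); adjust as needed.
PRICE_PER_MILLION_TOKENS = 0.02

def api_embedding_cost(num_documents, avg_tokens_per_document,
                       price_per_million=PRICE_PER_MILLION_TOKENS):
    """Estimated one-off cost, in dollars, to embed a corpus through a paid API."""
    total_tokens = num_documents * avg_tokens_per_document
    return total_tokens / 1_000_000 * price_per_million

# Hypothetical corpus: 100,000 documents of ~500 tokens each (50M tokens)
print(f"${api_embedding_cost(100_000, 500):.2f}")  # $1.00
```

Queries are embedded through the API too, so ongoing query traffic adds to this one-off indexing cost.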

The cheapest option is an open-source model running on your own machine. Zero API cost. Zero network latency. Total data privacy.

For applications requiring low latency (like real-time chatbots), local models win. For sporadic projects with few documents, the API might be more practical.

A crucial point: the OpenAI API has reduced costs by 80% since 2025, but it still depends on a connection. If your system needs to work offline (industry, healthcare, sensitive data), a local model is the only way.

How to choose the right model for your application

The MTEB ranking (accessible at huggingface.co/spaces/mteb/leaderboard) is the best starting point. As of June 2026, these are the open-source highlights:

BGE-M3 (BAAI): Excellent for multilingual search. Supports more than 100 languages, including Chinese, English, and Portuguese.
GTE-Qwen2 (Alibaba): Superior performance in information retrieval tasks. Considerably larger than BGE-M3 (about 568M parameters), with 1.5B and 7B parameter variants.
E5-mistral-7b-instruct (Microsoft): Larger (7B parameters), but with better understanding of complex instructions.

For most cases, BGE-M3 offers the best balance of size, speed, and accuracy.
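Leaderboard scores are averages over many tasks, so a safer way to choose between these models is to measure them on a few queries from your own domain. A minimal sketch of recall@k (the share of queries whose known relevant document lands in the top k). toy_encode is a hypothetical stand-in so the example runs offline; in practice you would pass something like lambda texts: model.encode(texts, normalize_embeddings=True) for each candidate model:

```python
import numpy as np

def recall_at_k(encode, queries, relevant_doc_ids, documents, k=3):
    """Fraction of queries whose relevant document appears in the top-k results.

    `encode` maps a list of strings to a 2-D array of unit-length embeddings.
    """
    doc_vecs = encode(documents)
    query_vecs = encode(queries)
    scores = query_vecs @ doc_vecs.T  # cosine similarity, since rows are unit length
    # Best k document indices per query (stable sort makes ties deterministic)
    top_k = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    hits = [relevant in row for relevant, row in zip(relevant_doc_ids, top_k)]
    return float(np.mean(hits))

def toy_encode(texts):
    """Hypothetical encoder: bag of words over a tiny fixed vocabulary, L2-normalized."""
    vocab = ["python", "data", "web", "brain", "patterns"]
    vecs = np.array([[t.lower().count(w) for w in vocab] for t in texts], dtype=float)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero

documents = [
    "Python is a high-level programming language.",
    "JavaScript is primarily used for web development.",
    "Machine learning uses algorithms to find patterns.",
    "Neural networks are inspired by the human brain.",
    "Pandas is a Python library for data analysis.",
]
queries = ["data analysis in Python", "building for the web", "learning from examples"]
relevant = [4, 1, 2]  # index of the correct document for each query

print(recall_at_k(toy_encode, queries, relevant, documents, k=1))
```

The toy encoder misses the third query because it only knows five words; that gap is what a real embedding model is meant to close, and running each candidate through the same function shows which one closes it best on your data.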

Practical tip: normalization and batch size

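Always normalize embeddings to unit length before comparing them. Cosine similarity itself is unaffected, but on unit vectors it reduces to a plain dot product, which is cheaper to compute. Note that sentence-transformers does not normalize by default (normalize_embeddings is False unless you set it), although some models include a normalization step in their pipeline, so it's good to pass the flag explicitly: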

doc_embeddings = model.encode(documents, normalize_embeddings=True)
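Why normalization makes the dot product interchangeable with cosine similarity: a quick numerical check on random stand-in vectors (not real embeddings). l2_normalize is a hypothetical helper that applies the same unit-length scaling the normalize_embeddings flag requests:

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 8))  # 4 random stand-in vectors, not real embeddings

lengths = np.linalg.norm(raw, axis=1)
cosine_matrix = (raw @ raw.T) / np.outer(lengths, lengths)  # textbook cosine

unit = l2_normalize(raw)
dot_matrix = unit @ unit.T  # plain dot product on unit vectors

print(np.allclose(cosine_matrix, dot_matrix))  # True
```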

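For large volumes (thousands of documents), batch_size controls how many texts are encoded per forward pass. The default is 32; lower it if you run out of memory, or raise it for higher throughput on a large GPU: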

doc_embeddings = model.encode(documents, batch_size=32, show_progress_bar=True)
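sentence-transformers does the batching for you; the sketch below only illustrates the idea behind batch_size, with a hypothetical fake_encode standing in for the model (the library's own loop does more than this). At most batch_size texts go through the encoder per call, so peak memory depends on the batch rather than on the corpus size:

```python
import numpy as np

def encode_in_batches(encode_batch, texts, batch_size=32):
    """Encode texts one slice at a time and stack the per-batch results."""
    if not texts:
        raise ValueError("texts must not be empty")
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]  # at most batch_size texts
        chunks.append(encode_batch(batch))
    return np.vstack(chunks)

def fake_encode(batch):
    """Hypothetical encoder: a 2-d vector (character count, word count) per text."""
    return np.array([[len(t), len(t.split())] for t in batch], dtype=float)

texts = [f"document number {i}" for i in range(100)]
embeddings = encode_in_batches(fake_encode, texts, batch_size=32)
print(embeddings.shape)  # (100, 2)
```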

Conclusion: embeddings are the foundation of intelligent search

Embeddings have transformed information retrieval. In 2026, with open-source models achieving competitive scores on MTEB, there's no excuse to rely on expensive APIs or closed solutions.

In this tutorial, you saw how to implement semantic search in less than 30 lines of Python. BGE-M3 and GTE-Qwen2 offer top-tier performance with zero API cost.

The next step? Integrate embeddings with vector databases like ChromaDB or Qdrant to scale to millions of documents. But that's a story for another tutorial.

Start with the code above. Test it with your own data. The difference between a keyword search and a semantic search is the ability to understand the context and intention behind the query—something embeddings make possible in an accessible and efficient way.

