Issue #1 · Tuesday, May 5, 2026

Embeddings for People Who Understand Hashing

Embeddings — explained through hash functions

The Concept

Embeddings are fixed-length numeric vectors that represent the meaning of text, images, or other data. Virtually every neural model that works with language uses them under the hood. When someone says "vector search" or "semantic similarity," they're talking about comparing embeddings.

You'll see them everywhere: recommendation systems, search, RAG pipelines, clustering. If you're building anything with an LLM, you'll use embeddings directly or you'll use something that depends on them.

If You Already Know Hash Functions, You Already Know Most of This

A hash function takes arbitrary input and produces a fixed-length output. MD5 gives you 128 bits. SHA-256 gives you 256 bits. The input can be anything — a string, a file, a stream of bytes. The output is always the same size.

An embedding function does the same thing. It takes text (or an image, or audio) and produces a fixed-length vector. OpenAI's text-embedding-3-small gives you 1536 floats. Cohere's embed-v3 gives you 1024. The input can be a word, a sentence, or a full document. The output is always the same dimensionality.

Here's where it maps cleanly:

Hash function                                 Embedding function
---------------------------------------       ---------------------------------------
Arbitrary input → fixed-length output         Arbitrary input → fixed-length output
Deterministic (same input → same output)      Deterministic (same input → same output)
Output is compact                             Output is compact
Compare outputs to check for equality         Compare outputs to check for similarity

That last row is where it gets interesting.

What's Actually New

Hash functions are designed so that similar inputs produce completely different outputs. Change one bit in the input and the hash is unrecognizable. That's the point — it's a security property.
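You can see the avalanche effect directly with Python's standard hashlib: change one character and the digest shares essentially no structure with the original.

```python
import hashlib

a = hashlib.sha256(b"The server crashed at 3am").hexdigest()
b = hashlib.sha256(b"The server crashed at 4am").hexdigest()

print(a)
print(b)

# Count hex positions where the two digests agree. For a good hash
# this hovers around chance: 1/16 of 64 positions, i.e. roughly 4.
matches = sum(1 for x, y in zip(a, b) if x == y)
print(matches)
```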

Embedding functions are designed for the opposite. Similar inputs produce similar outputs. "The server crashed at 3am" and "Our backend went down overnight" should produce vectors that are close together in vector space. That's the whole value proposition.

The distance between two embedding vectors tells you how semantically similar the inputs are. Cosine similarity is the standard measure:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two similar sentences will have similarity close to 1.0.
# Unrelated sentences score much lower, though with real embedding
# models they rarely land at exactly 0.0.
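To sanity-check the math with toy vectors (stand-ins for real embeddings, so the snippet is self-contained): identical directions score 1.0, orthogonal directions score 0.0.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional vectors standing in for real embedding vectors.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
print(cosine_similarity([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]))  # ~0.707
```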

This means you can do things that are impossible with hashes:

  • Find documents that are about the same thing, even if they use different words
  • Rank results by relevance, not just match/no-match
  • Cluster data by meaning without defining the categories upfront

Under the Hood

An embedding model is a neural network (usually a transformer) that's been trained on massive text corpora. During training, it learns to place semantically similar text close together in a high-dimensional space.

The practical workflow:

from openai import OpenAI

client = OpenAI()

# Generate an embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Kubernetes pod scheduling"
)

vector = response.data[0].embedding  # List of 1536 floats

That vector is now a point in 1536-dimensional space. Store it. When a query comes in, embed the query the same way and find the nearest stored vectors.
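The "find the nearest stored vectors" step can be sketched as a brute-force cosine search over an in-memory matrix, which is fine up to tens of thousands of vectors. The stored vectors below are random stand-ins for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are 1536-dim embeddings of 1,000 stored chunks.
stored = rng.normal(size=(1000, 1536))
stored /= np.linalg.norm(stored, axis=1, keepdims=True)  # normalize once

def top_k(query_vec, k=5):
    q = query_vec / np.linalg.norm(query_vec)
    scores = stored @ q                     # cosine similarity via dot product
    idx = np.argsort(scores)[::-1][:k]      # indices of the k best scores
    return list(zip(idx.tolist(), scores[idx].tolist()))

query = rng.normal(size=1536)
for i, score in top_k(query):
    print(i, round(score, 3))
```

Normalizing the stored matrix once up front turns every query into a single matrix-vector product, which is why the brute-force version stays fast far longer than you'd expect.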

The key insight for systems engineers: embedding generation is a pure function with no side effects. Same model + same input = same vector. You can cache aggressively. You can batch. You can run it offline and store the results.

The cost profile: embedding is cheap (roughly 100x cheaper than an LLM completion call) and fast (milliseconds of model time for short text, though network round trips dominate hosted API calls). The expensive part is storing and searching millions of vectors efficiently, which is where vector databases come in. That's next week.

Decision Framework

Use embeddings when:

  • You need semantic search (not just keyword matching)
  • You're building a RAG pipeline and need to find relevant context
  • You want to cluster or classify text without predefined rules
  • Your search queries won't match the exact words in the documents

Don't use embeddings when:

  • Exact match is sufficient (use a hash or a database index)
  • Your data is structured and queryable (use SQL)
  • You need explanations for why results matched (embeddings are opaque)
  • Your corpus is under 1,000 documents and keyword search works fine

Model recommendations:

  • Under 1M documents, budget-sensitive: text-embedding-3-small (1536 dims, cheap)
  • Production semantic search: text-embedding-3-large (3072 dims, better quality)
  • On-premise / privacy requirements: sentence-transformers all-MiniLM-L6-v2 (runs locally, 384 dims)

What Your Manager Thinks It Does vs. What It Actually Does

What your manager thinks: "Embeddings are AI that understands our data."

What it actually does: Converts text to numbers in a way that preserves semantic relationships. It doesn't understand anything. It maps language to geometry. Similar meaning = nearby points. That's it.

The useful reframe for your next meeting: "Embeddings let us do fuzzy matching on meaning instead of exact matching on keywords. Same concept as a search index, but it catches paraphrases and related concepts."

Ship This Weekend

Build a semantic search engine over your team's documentation in under 4 hours.

  1. Export your docs to plain text (Confluence API, Notion export, or just copy-paste the top 50 pages)
  2. Chunk each doc into ~500-token paragraphs
  3. Embed each chunk with text-embedding-3-small
  4. Store vectors in a local ChromaDB instance (pip install chromadb, zero config)
  5. Build a CLI that takes a question, embeds it, and returns the top 5 matching chunks
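Step 2 doesn't need a real tokenizer for a weekend project: ~500 tokens is roughly 375 English words, so a whitespace-based chunker is close enough. A sketch under that assumption:

```python
def chunk_words(text: str, max_words: int = 375) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

chunks = chunk_words("word " * 1000)
print(len(chunks))             # 3 chunks: 375 + 375 + 250 words
print(len(chunks[0].split()))  # 375
```

Splitting on paragraph boundaries instead of a flat word count gives better retrieval, but the flat version is enough to get the demo working.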

Total cost: under $0.10 in API calls for 50 docs. Zero infrastructure.

import chromadb

client = chromadb.Client()
collection = client.create_collection("team-docs")

# Note: by default Chroma embeds with a local model (all-MiniLM-L6-v2),
# not text-embedding-3-small. Pass an embedding_function when creating
# the collection if you want OpenAI embeddings for both add and query.

# Add your chunks
collection.add(
    documents=["chunk text here", "another chunk"],
    ids=["doc1-chunk1", "doc1-chunk2"]
)

# Query
results = collection.query(
    query_texts=["How do we deploy to staging?"],
    n_results=5
)

When you demo this on Monday, your team will get it immediately. "Oh, it's like search but it actually finds what I meant." That's the moment embeddings click.

Get Upshift every Tuesday

One AI concept per week, explained through systems you already know.