
Projection Modes

The Engine supports four Locality Sensitive Hashing (LSH) projection strategies for mapping continuous embeddings to discrete prime integers. Each mode has different trade-offs.
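As a mental model, the pipeline can be sketched in a few lines: each hyperplane contributes one bit, each set bit multiplies in one prime, and subsumption reduces to divisibility. This is an illustrative sketch under assumed conventions, not the engine's actual code; the helper names are hypothetical.

```python
# Illustrative sketch: embedding -> hyperplane sign bits -> product of primes.
# One prime per hyperplane (here n_bits=8). Names are hypothetical.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def signature(embedding, hyperplanes):
    """Map an embedding to a prime product via hyperplane sign bits."""
    sig = 1
    for plane, prime in zip(hyperplanes, PRIMES):
        if dot(embedding, plane) > 0:  # positive side of the plane -> bit set
            sig *= prime
    return sig

def subsumes(sig_parent, sig_child):
    """A subsumes B when every prime factor of A also divides B."""
    return sig_child % sig_parent == 0
```

With this encoding, a parent concept whose bits are a subset of a child's has a signature that divides the child's, which is what makes the prime representation convenient for hierarchy queries.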

Overview

| Mode | Deterministic | Requires Labels | Best For |
|------|---------------|-----------------|----------|
| random | No (seed-dependent) | No | Baseline, exploration |
| pca | Yes | No | Production, reproducibility |
| consensus | Yes | No | Noise filtering, stability analysis |
| contrastive | Yes | Yes (hypernym pairs) | Maximum accuracy (100% TP at k=6) |

Random

Classic LSH with random hyperplanes. Fast to set up but results depend on the random seed.

```python
mapper = DiscreteMapper(n_bits=8, projection="random", seed=42)
prime_map = mapper.fit_transform(concepts, embeddings)
```

When to use: Initial exploration, quick experiments, baselines.

Warning

Changing the seed changes the projection. Two runs with different seeds may produce different subsumption relationships. Use PCA or consensus for reproducible results.
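To make the seed-dependence concrete, here is a minimal sketch assuming Gaussian random hyperplanes (a common LSH construction; an illustration, not the engine's implementation):

```python
# Sketch of the "random" strategy: hyperplanes drawn from a seeded Gaussian,
# so the same seed reproduces the same projection and a different seed does not.
import random

def random_hyperplanes(n_bits, dim, seed):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def project_bits(embedding, hyperplanes):
    """Sign bit per hyperplane: which side of the plane the embedding falls on."""
    return tuple(
        1 if sum(a * b for a, b in zip(embedding, h)) > 0 else 0
        for h in hyperplanes
    )
```

Rerunning with the same seed yields identical hyperplanes; changing the seed yields a different projection, which is exactly why two runs can disagree on subsumption.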

PCA

Uses the top `n_bits` principal components of the embedding corpus as hyperplane normals. Fully deterministic given the same corpus.

```python
mapper = DiscreteMapper(n_bits=8, projection="pca")
prime_map = mapper.fit_transform(concepts, embeddings)
```

When to use: Production pipelines, reproducible research, any setting where determinism matters.

Tip

PCA is the recommended default for most use cases. It captures the directions of maximum variance in your data, producing semantically meaningful prime factors.
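For intuition, the top principal component can be found deterministically with power iteration, no RNG involved. This sketch shows the top component only and is purely illustrative of why PCA hyperplanes are reproducible:

```python
# Sketch of the "pca" strategy: hyperplane normals are principal components
# of the mean-centered embedding corpus. Top component via power iteration.
def top_principal_component(vectors, iters=100):
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    w = [1.0] * dim  # deterministic starting vector
    for _ in range(iters):
        # Multiply w by the covariance: sum over samples of (x . w) * x.
        nxt = [0.0] * dim
        for x in centered:
            proj = sum(a * b for a, b in zip(x, w))
            for i in range(dim):
                nxt[i] += proj * x[i]
        norm = sum(c * c for c in nxt) ** 0.5 or 1.0
        w = [c / norm for c in nxt]
    return w
```

Because the result depends only on the corpus, rerunning the fit reproduces the same hyperplanes bit for bit, unlike the seeded random mode.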

Consensus

Runs multiple random projections and keeps only the hyperplanes that agree across seeds. Filters out noise to produce stable projections.

```python
mapper = DiscreteMapper(
    n_bits=8,
    projection="consensus",
    consensus_seeds=20
)
prime_map = mapper.fit_transform(concepts, embeddings)
```

When to use: When you need to distinguish genuine structure from noise, stability analysis across corpora.
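One plausible way to realize the consensus idea (an assumption for illustration, not the engine's code): generate hyperplanes under many seeds, record the bipartition each one induces on the corpus, and keep only splits that recur across enough seeds.

```python
# Illustrative consensus filter: a split of the corpus is "stable" if random
# hyperplanes from many different seeds keep reproducing it.
import random
from collections import Counter

def bipartition(hyperplane, corpus):
    bits = tuple(
        1 if sum(a * b for a, b in zip(v, hyperplane)) > 0 else 0
        for v in corpus
    )
    # Normalize sign so a flipped hyperplane counts as the same split.
    return bits if bits[0] == 1 else tuple(1 - b for b in bits)

def consensus_partitions(corpus, n_bits, seeds, min_votes):
    votes = Counter()
    for seed in seeds:
        rng = random.Random(seed)
        for _ in range(n_bits):
            h = [rng.gauss(0.0, 1.0) for _ in range(len(corpus[0]))]
            votes[bipartition(h, corpus)] += 1
    return [p for p, n in votes.items() if n >= min_votes]
```

Splits driven by genuine structure recur under many seeds and survive the vote; splits driven by noise appear once and are discarded, which is the sense in which consensus filters noise.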

Contrastive

Supervised mode that learns hyperplanes from labeled hypernym pairs (e.g., "Animal" is a hypernym of "Dog"). Achieves the highest accuracy for hierarchical relationships.

```python
mapper = DiscreteMapper(
    n_bits=8,
    projection="contrastive",
    hypernym_pairs=[
        ("Animal", "Dog"),
        ("Animal", "Cat"),
        ("Vehicle", "Car"),
        ("Vehicle", "Truck"),
    ]
)
prime_map = mapper.fit_transform(concepts, embeddings)
```

When to use: When you have known taxonomic relationships and need maximum subsumption accuracy.

Benchmark result

Contrastive projection achieves 100% true positive rate for hypernym detection at k=6 on the evaluation benchmark.
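One simple construction in the contrastive spirit (an assumption for illustration; the engine's actual training procedure may differ): fit a hyperplane normal from the mean child-minus-parent difference, so children project further along it than their hypernyms.

```python
# Hypothetical contrastive fit: average the (child - parent) embedding
# differences from the labeled hypernym pairs into one hyperplane normal.
def contrastive_hyperplane(pairs, embeddings):
    """pairs: list of (parent, child) names; embeddings: name -> vector."""
    dim = len(next(iter(embeddings.values())))
    normal = [0.0] * dim
    for parent, child in pairs:
        p, c = embeddings[parent], embeddings[child]
        for i in range(dim):
            normal[i] += c[i] - p[i]
    n = len(pairs)
    return [x / n for x in normal]
```

Thresholding along such a direction tends to set the bit for children but not parents, which is the bit pattern the divisibility-based subsumption check needs.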

Choosing k (Number of Bits)

The `n_bits` parameter controls how many hyperplanes (and thus prime factors) are used. The practical range is 6–12.

| k | Prime space | Trade-off |
|---|-------------|-----------|
| 6 | Small, fast | May conflate distinct concepts |
| 8 | Recommended | Good balance of precision and coverage |
| 12 | Large, precise | Slower, sparser; some concepts may share no factors |

Known limitation

There is no principled method for selecting the optimal k. The useful range is empirical (6–12). Values outside this range produce either too many collisions (low k) or too-sparse signatures (high k).
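The collision side of this trade-off follows from simple counting: k hyperplanes yield at most 2^k distinct bit patterns, so any corpus larger than that is guaranteed collisions by the pigeonhole principle. Illustrative arithmetic only:

```python
# k hyperplanes -> at most 2**k distinct signatures. A corpus larger than
# this bound must map at least two concepts to the same signature.
def max_signatures(k):
    return 2 ** k

def collisions_guaranteed(corpus_size, k):
    return corpus_size > max_signatures(k)
```

At k=6 there are only 64 possible signatures, so any corpus of 65+ concepts must conflate some of them; at k=12 there are 4096, which avoids collisions but makes shared prime factors correspondingly rarer.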