# Projection Modes
The Engine supports four Locality Sensitive Hashing (LSH) projection strategies for mapping continuous embeddings to discrete prime integers. Each mode has different trade-offs.
## Overview
| Mode | Deterministic | Requires Labels | Best For |
|---|---|---|---|
| `random` | No (seed-dependent) | No | Baseline, exploration |
| `pca` | Yes | No | Production, reproducibility |
| `consensus` | Yes | No | Noise filtering, stability analysis |
| `contrastive` | Yes | Yes (hypernym pairs) | Maximum accuracy (100% TP at k=6) |
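All four modes share the same back end: each hyperplane contributes one bit, and each set bit contributes one prime factor to a concept's integer signature. The sketch below is an illustrative reconstruction of that pipeline — the sign test, the prime table, and the divisibility check are assumptions for exposition, not the Engine's internals:

```python
import numpy as np

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]  # one prime per hyperplane bit

def signature(embedding, hyperplanes):
    # Each hyperplane yields one bit (which half-space the embedding
    # falls in); each set bit multiplies in the corresponding prime.
    bits = (hyperplanes @ embedding) > 0
    sig = 1
    for bit, p in zip(bits, PRIMES):
        if bit:
            sig *= p
    return sig

def subsumes(hyper_sig, hypo_sig):
    # Illustrative subsumption test: every prime factor of the
    # hypernym's signature must also divide the hyponym's.
    return hypo_sig % hyper_sig == 0

rng = np.random.default_rng(0)
H = rng.standard_normal((8, 16))  # 8 hyperplanes in a toy 16-d space
print(signature(rng.standard_normal(16), H))
```

Under this reading, a hypernym whose signature divides a hyponym's signature "contains" it — which is also why overly sparse signatures (high k) can break subsumption: two related concepts may end up sharing no factors at all.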
## Random
Classic LSH with random hyperplanes. Fast to set up but results depend on the random seed.
```python
mapper = DiscreteMapper(n_bits=8, projection="random", seed=42)
prime_map = mapper.fit_transform(concepts, embeddings)
```
When to use: Initial exploration, quick experiments, baselines.
!!! warning
    Changing the seed changes the projection. Two runs with different seeds may produce different subsumption relationships. Use PCA or consensus for reproducible results.
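The seed dependence is easy to see in a minimal NumPy sketch of sign-random-projection LSH (illustrative only, not the Engine's internals):

```python
import numpy as np

def lsh_bits(embeddings, n_bits, seed):
    # Classic sign-random-projection LSH: draw n_bits Gaussian
    # hyperplanes and take the sign of each projection as one bit.
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((n_bits, embeddings.shape[1]))
    return (embeddings @ hyperplanes.T) > 0

emb = np.random.default_rng(0).standard_normal((5, 16))

# The same seed always reproduces the same signature bits...
assert (lsh_bits(emb, 8, seed=42) == lsh_bits(emb, 8, seed=42)).all()

# ...but a different seed draws different hyperplanes, so the bit
# patterns (and hence the prime signatures) generally differ.
print((lsh_bits(emb, 8, seed=42) != lsh_bits(emb, 8, seed=7)).any())
```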
## PCA
Uses the top-k principal components of the embedding corpus as hyperplanes. Fully deterministic given the same corpus.
```python
mapper = DiscreteMapper(n_bits=8, projection="pca")
prime_map = mapper.fit_transform(concepts, embeddings)
```
When to use: Production pipelines, reproducible research, any setting where determinism matters.
!!! tip
    PCA is the recommended default for most use cases. It captures the directions of maximum variance in your data, producing semantically meaningful prime factors.
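A sketch of the idea, assuming the hyperplane normals are simply the top principal components of the centered corpus (illustrative, not the library's code):

```python
import numpy as np

def pca_hyperplanes(embeddings, n_bits):
    # Center the corpus, then use the top n_bits right singular
    # vectors (principal components) as hyperplane normals.
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_bits]

emb = np.random.default_rng(0).standard_normal((100, 16))

# No random seed involved: the same corpus always yields the same
# hyperplanes, hence the same prime signatures.
assert np.allclose(pca_hyperplanes(emb, 8), pca_hyperplanes(emb, 8))
```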
## Consensus
Runs multiple random projections and keeps only the hyperplanes that agree across seeds. Filters out noise to produce stable projections.
```python
mapper = DiscreteMapper(
    n_bits=8,
    projection="consensus",
    consensus_seeds=20,
)
prime_map = mapper.fit_transform(concepts, embeddings)
```
When to use: When you need to distinguish genuine structure from noise, stability analysis across corpora.
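The "agree across seeds" idea can be approximated with a pair-level voting sketch. This is a simplification for intuition — the Engine filters at the hyperplane level, and the helper below is hypothetical:

```python
import numpy as np

def consensus_same_signature(embeddings, n_bits=8, n_seeds=20,
                             threshold=0.8):
    # Vote across seeds: for each pair of points, count how often
    # independent random projections give them identical bit
    # signatures, and keep only pairs that agree in at least
    # `threshold` of the runs.
    n, dim = embeddings.shape
    votes = np.zeros((n, n))
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        hyperplanes = rng.standard_normal((n_bits, dim))
        bits = (embeddings @ hyperplanes.T) > 0
        votes += (bits[:, None, :] == bits[None, :, :]).all(axis=2)
    return votes / n_seeds >= threshold
```

Pairs that land in the same bucket under nearly every seed reflect genuine proximity in the embedding space; pairs that agree only occasionally are projection noise.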
## Contrastive
Supervised mode that learns hyperplanes from labeled hypernym pairs (e.g., "Animal" is a hypernym of "Dog"). Achieves the highest accuracy for hierarchical relationships.
```python
mapper = DiscreteMapper(
    n_bits=8,
    projection="contrastive",
    hypernym_pairs=[
        ("Animal", "Dog"),
        ("Animal", "Cat"),
        ("Vehicle", "Car"),
        ("Vehicle", "Truck"),
    ],
)
prime_map = mapper.fit_transform(concepts, embeddings)
```
When to use: When you have known taxonomic relationships and need maximum subsumption accuracy.
!!! note "Benchmark result"
    Contrastive projection achieves a 100% true positive rate for hypernym detection at k=6 on the evaluation benchmark.
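One simple recipe for intuition about how labeled pairs could steer a hyperplane (a hypothetical construction, not the library's actual training procedure): take the average hypernym-to-hyponym offset as a hyperplane normal, so hyponyms tend to pick up a prime factor that their hypernym lacks.

```python
import numpy as np

def contrastive_direction(hypernym_pairs, embedding_of):
    # Average offset from hypernym to hyponym, normalized to unit
    # length, used as one learned hyperplane normal.
    diffs = [embedding_of[hypo] - embedding_of[hyper]
             for hyper, hypo in hypernym_pairs]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

# Toy embeddings where hyponyms sit "beyond" their hypernym.
emb = {
    "Animal": np.array([0.0, 0.0]),
    "Dog":    np.array([1.0, 0.2]),
    "Cat":    np.array([1.0, -0.2]),
}
d = contrastive_direction([("Animal", "Dog"), ("Animal", "Cat")], emb)

# Hyponyms project further along d than the hypernym, so a
# hyperplane with normal d can separate them cleanly.
assert d @ emb["Dog"] > d @ emb["Animal"]
```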
## Choosing k (Number of Bits)
The `n_bits` parameter controls how many hyperplanes (and thus prime factors) are used. The practical range is 6–12.
| k | Prime space | Trade-off |
|---|---|---|
| 6 | Small, fast | May conflate distinct concepts |
| 8 | Recommended | Good balance of precision and coverage |
| 12 | Large, precise | Slower, sparser — some concepts may share no factors |
!!! warning "Known limitation"
    There is no principled method for selecting the optimal k. The useful range is empirical (6–12). Values outside this range produce either too many collisions (low k) or too-sparse signatures (high k).
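For a rough feel of the low-k collision risk, a birthday-problem estimate shows how quickly the 2^k signature space saturates. This is a back-of-the-envelope model that assumes uniformly distributed signatures — real embeddings cluster, so actual collision rates are higher:

```python
def collision_prob(n_concepts, k):
    # Probability that at least two of n_concepts uniformly random
    # k-bit signatures collide (classic birthday calculation).
    buckets = 2 ** k
    if n_concepts > buckets:
        return 1.0  # pigeonhole: a collision is guaranteed
    p_distinct = 1.0
    for i in range(n_concepts):
        p_distinct *= (buckets - i) / buckets
    return 1.0 - p_distinct

for k in (6, 8, 12):
    print(f"k={k}: collision probability for 100 concepts = "
          f"{collision_prob(100, k):.3f}")
```

This only models exact-signature collisions; it says nothing about the opposite failure mode at high k, where signatures become so sparse that related concepts share no prime factors.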