# Projection Modes
The Engine supports four Locality Sensitive Hashing (LSH) projection strategies for mapping continuous embeddings to discrete prime integers. Each mode has different trade-offs.
## Overview
| Mode | Deterministic | Requires Labels | Best For |
|---|---|---|---|
| `random` | No (seed-dependent) | No | Baseline, exploration |
| `pca` | Yes | No | Production, reproducibility |
| `consensus` | Yes | No | Noise filtering, stability analysis |
| `contrastive` | Yes | Yes (hypernym pairs) | Maximum accuracy (100% TP at k=6) |
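All four modes share the same back end: each hyperplane contributes one bit, and each set bit contributes one prime factor to a concept's integer signature. The sketch below is an illustrative reconstruction of that pipeline — the sign test, the prime table, and the divisibility check are assumptions for exposition, not the Engine's internals:

```python
import numpy as np

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]  # one prime per hyperplane bit

def signature(embedding, hyperplanes):
    # Each hyperplane yields one bit (which half-space the embedding
    # falls in); each set bit multiplies in the corresponding prime.
    bits = (hyperplanes @ embedding) > 0
    sig = 1
    for bit, p in zip(bits, PRIMES):
        if bit:
            sig *= p
    return sig

def subsumes(hyper_sig, hypo_sig):
    # Illustrative subsumption test: every prime factor of the
    # hypernym's signature must also divide the hyponym's.
    return hypo_sig % hyper_sig == 0

rng = np.random.default_rng(0)
H = rng.standard_normal((8, 16))  # 8 hyperplanes in a toy 16-d space
print(signature(rng.standard_normal(16), H))
```

Under this reading, a hypernym whose signature divides a hyponym's signature "contains" it — which is also why overly sparse signatures (high k) can break subsumption: two related concepts may end up sharing no factors at all.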
## Random
Classic LSH with random hyperplanes. Fast to set up but results depend on the random seed.
```python
mapper = DiscreteMapper(n_bits=8, projection="random", seed=42)
prime_map = mapper.fit_transform(concepts, embeddings)
```
When to use: Initial exploration, quick experiments, baselines.
!!! warning
    Changing the seed changes the projection. Two runs with different seeds may produce different subsumption relationships. Use PCA or consensus for reproducible results.
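The seed dependence is easy to see in a minimal NumPy sketch of sign-random-projection LSH (illustrative only, not the Engine's internals):

```python
import numpy as np

def lsh_bits(embeddings, n_bits, seed):
    # Classic sign-random-projection LSH: draw n_bits Gaussian
    # hyperplanes and take the sign of each projection as one bit.
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((n_bits, embeddings.shape[1]))
    return (embeddings @ hyperplanes.T) > 0

emb = np.random.default_rng(0).standard_normal((5, 16))

# The same seed always reproduces the same signature bits...
assert (lsh_bits(emb, 8, seed=42) == lsh_bits(emb, 8, seed=42)).all()

# ...but a different seed draws different hyperplanes, so the bit
# patterns (and hence the prime signatures) generally differ.
print((lsh_bits(emb, 8, seed=42) != lsh_bits(emb, 8, seed=7)).any())
```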
## PCA
Uses the top-k principal components of the embedding corpus as hyperplanes. Fully deterministic given the same corpus.
```python
mapper = DiscreteMapper(n_bits=8, projection="pca")
prime_map = mapper.fit_transform(concepts, embeddings)
```
When to use: Production pipelines, reproducible research, any setting where determinism matters.
!!! tip
    PCA is the recommended default for most use cases. It captures the directions of maximum variance in your data, producing semantically meaningful prime factors.
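A sketch of the idea, assuming the hyperplane normals are simply the top principal components of the centered corpus (illustrative, not the library's code):

```python
import numpy as np

def pca_hyperplanes(embeddings, n_bits):
    # Center the corpus, then use the top n_bits right singular
    # vectors (principal components) as hyperplane normals.
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_bits]

emb = np.random.default_rng(0).standard_normal((100, 16))

# No random seed involved: the same corpus always yields the same
# hyperplanes, hence the same prime signatures.
assert np.allclose(pca_hyperplanes(emb, 8), pca_hyperplanes(emb, 8))
```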
## Consensus
Runs multiple random projections and keeps only the hyperplanes that agree across seeds. Filters out noise to produce stable projections.
```python
mapper = DiscreteMapper(
    n_bits=8,
    projection="consensus",
    consensus_seeds=20,
)
prime_map = mapper.fit_transform(concepts, embeddings)
```
When to use: When you need to distinguish genuine structure from noise, stability analysis across corpora.
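The "agree across seeds" idea can be approximated with a pair-level voting sketch. This is a simplification for intuition — the Engine filters at the hyperplane level, and the helper below is hypothetical:

```python
import numpy as np

def consensus_same_signature(embeddings, n_bits=8, n_seeds=20,
                             threshold=0.8):
    # Vote across seeds: for each pair of points, count how often
    # independent random projections give them identical bit
    # signatures, and keep only pairs that agree in at least
    # `threshold` of the runs.
    n, dim = embeddings.shape
    votes = np.zeros((n, n))
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        hyperplanes = rng.standard_normal((n_bits, dim))
        bits = (embeddings @ hyperplanes.T) > 0
        votes += (bits[:, None, :] == bits[None, :, :]).all(axis=2)
    return votes / n_seeds >= threshold
```

Pairs that land in the same bucket under nearly every seed reflect genuine proximity in the embedding space; pairs that agree only occasionally are projection noise.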
## Contrastive
Supervised mode that learns hyperplanes from labeled hypernym pairs (e.g., "Animal" is a hypernym of "Dog"). Achieves the highest accuracy for hierarchical relationships.
```python
mapper = DiscreteMapper(
    n_bits=8,
    projection="contrastive",
    hypernym_pairs=[
        ("Animal", "Dog"),
        ("Animal", "Cat"),
        ("Vehicle", "Car"),
        ("Vehicle", "Truck"),
    ],
)
prime_map = mapper.fit_transform(concepts, embeddings)
```
When to use: When you have known taxonomic relationships and need maximum subsumption accuracy.
!!! note "Benchmark result"
    Contrastive projection achieves a 100% true positive rate for hypernym detection at k=6 on the evaluation benchmark.
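One simple recipe for intuition about how labeled pairs could steer a hyperplane (a hypothetical construction, not the library's actual training procedure): take the average hypernym-to-hyponym offset as a hyperplane normal, so hyponyms tend to pick up a prime factor that their hypernym lacks.

```python
import numpy as np

def contrastive_direction(hypernym_pairs, embedding_of):
    # Average offset from hypernym to hyponym, normalized to unit
    # length, used as one learned hyperplane normal.
    diffs = [embedding_of[hypo] - embedding_of[hyper]
             for hyper, hypo in hypernym_pairs]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

# Toy embeddings where hyponyms sit "beyond" their hypernym.
emb = {
    "Animal": np.array([0.0, 0.0]),
    "Dog":    np.array([1.0, 0.2]),
    "Cat":    np.array([1.0, -0.2]),
}
d = contrastive_direction([("Animal", "Dog"), ("Animal", "Cat")], emb)

# Hyponyms project further along d than the hypernym, so a
# hyperplane with normal d can separate them cleanly.
assert d @ emb["Dog"] > d @ emb["Animal"]
```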
## Choosing k (Number of Bits)
The `n_bits` parameter controls how many hyperplanes (and thus prime factors) are used. The practical range is 6–12.
| k | Prime space | Trade-off |
|---|---|---|
| 6 | Small, fast | May conflate distinct concepts |
| 8 | Recommended | Good balance of precision and coverage |
| 12 | Large, precise | Slower, sparser — some concepts may share no factors |
!!! warning "Known limitation"
    There is no principled method for selecting the optimal k. The useful range is empirical (6–12). Values outside this range produce either too many collisions (low k) or too-sparse signatures (high k).
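For a rough feel of the low-k collision risk, a birthday-problem estimate shows how quickly the 2^k signature space saturates. This is a back-of-the-envelope model that assumes uniformly distributed signatures — real embeddings cluster, so actual collision rates are higher:

```python
def collision_prob(n_concepts, k):
    # Probability that at least two of n_concepts uniformly random
    # k-bit signatures collide (classic birthday calculation).
    buckets = 2 ** k
    if n_concepts > buckets:
        return 1.0  # pigeonhole: a collision is guaranteed
    p_distinct = 1.0
    for i in range(n_concepts):
        p_distinct *= (buckets - i) / buckets
    return 1.0 - p_distinct

for k in (6, 8, 12):
    print(f"k={k}: collision probability for 100 concepts = "
          f"{collision_prob(100, k):.3f}")
```

This only models exact-signature collisions; it says nothing about the opposite failure mode at high k, where signatures become so sparse that related concepts share no prime factors.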