
refactor: split levenshtein_distance.js into cohesive single-responsibility modules #776

Open
Wolfvin wants to merge 1 commit into NaturalNode:master from Wolfvin:refactor/split-distance-module

Wolfvin commented Jun 13, 2026


What was refactored and why

lib/natural/distance/levenshtein_distance.js was a 242-line file exporting 4 functions (Levenshtein + Damerau-Levenshtein variants) from a single module. This violated several design principles:

  • DECOMPOSITION: One file contained two distinct algorithms (Levenshtein and Damerau-Levenshtein)
  • COHESION: The Damerau variant was coupled to Levenshtein internals
  • SINGLE RESPONSIBILITY: One file served two unrelated use cases
  • NAMING: Internal functions like _getMatchStart() and levenshteinDistance() were unclear

What changed

| Before | After | Purpose |
| --- | --- | --- |
| levenshtein_distance.js (242 lines, 4 exports) | levenshtein_distance.js (68 lines, 2 exports) | Pure Levenshtein distance |
| | damerau_levenshtein_distance.js (70 lines, 2 exports) | Pure Damerau-Levenshtein distance |
| | edit_distance_utils.js (163 lines) | Shared DP utilities |
| index.js imports from one file | index.js imports from two files | Re-exports unchanged |
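
As a rough illustration of the layout in the table, the split can be pictured as one dynamic-programming core with two thin wrappers. This is a minimal sketch, not the code in this PR: `computeEditDistance` is the name used in the PR, but its signature here, the `allowTransposition` flag, the unit costs, and the wrapper names `levenshtein` and `damerauLevenshtein` are all assumptions made for the example. The transposition rule shown is the restricted "optimal string alignment" form, which may differ from what the library implements.

```javascript
// Shared DP core (hypothetical signature). Fills the classic edit-distance
// table with unit costs; when allowTransposition is true, swapping two
// adjacent characters also costs 1 (optimal string alignment variant).
function computeEditDistance(source, target, allowTransposition = false) {
  const rows = source.length + 1;
  const cols = target.length + 1;
  // table[i][j] = distance between source[0..i) and target[0..j)
  const table = Array.from({ length: rows }, () => new Array(cols).fill(0));
  for (let i = 0; i < rows; i++) table[i][0] = i; // delete i characters
  for (let j = 0; j < cols; j++) table[0][j] = j; // insert j characters

  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const substitutionCost = source[i - 1] === target[j - 1] ? 0 : 1;
      let best = Math.min(
        table[i - 1][j] + 1,                    // deletion
        table[i][j - 1] + 1,                    // insertion
        table[i - 1][j - 1] + substitutionCost  // match or substitution
      );
      if (
        allowTransposition && i > 1 && j > 1 &&
        source[i - 1] === target[j - 2] &&
        source[i - 2] === target[j - 1]
      ) {
        best = Math.min(best, table[i - 2][j - 2] + 1); // adjacent swap
      }
      table[i][j] = best;
    }
  }
  return table[rows - 1][cols - 1];
}

// Thin, variant-specific wrappers (hypothetical names). Neither wrapper
// refers to the other; both delegate to the shared core.
const levenshtein = (a, b) => computeEditDistance(a, b, false);
const damerauLevenshtein = (a, b) => computeEditDistance(a, b, true);

console.log(levenshtein("kitten", "sitting")); // 3
console.log(damerauLevenshtein("az", "za"));   // 1
console.log(levenshtein("az", "za"));          // 2
```

In this sketch the only difference between the two variants is a single flag passed to the core, which is one way to keep each wrapper file small while the table-filling logic lives in one place.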

Naming improvements

  • _getMatchStart() → traceBackMatchStart(): describes what it does
  • getMinCostSubstring() → findMinCostSubstring(): more precise verb
  • levenshteinDistance() → computeEditDistance(): generic name reflecting shared nature

Behavioral verification

This refactor was verified using Regrets (fingerprint-based regression testing) with a 4-verification pattern:

| Verification | Result |
| --- | --- |
| V1: All 19 Regrets clusters GREEN | ✅ PASS |
| V2: Direct output matches KEBENARAN 1 ("Truth 1", the pre-refactor raw output) | ✅ IDENTICAL |
| V3: New fingerprints match KEBENARAN 2 ("Truth 2", the pre-refactor fingerprints) | ✅ IDENTICAL |
| V4: Chain hashes match pre-refactor captures | ✅ ALL MATCH |

KEBENARAN 1 vs Final Output

All 19 clusters produce identical output after refactoring:

  • LevenshteinDistance("kitten", "sitting") → 3 (unchanged)
  • DamerauLevenshteinDistance("az", "za") → 1 (unchanged)
  • All other 17 clusters: identical output

Before/After Fingerprints

| Cluster | Before | After |
| --- | --- | --- |
| levenshtein-distance | tu16lpe | tu16lpe |
| damerau-levenshtein-distance | 1c1vrd4 | 1c1vrd4 |
| All other 17 clusters | unchanged | unchanged |
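
For readers unfamiliar with fingerprint-based regression checks, the general idea behind the before/after comparison can be sketched as follows. This is an illustration only and makes no claim about how Regrets works internally: the hash (an FNV-1a-style 32-bit hash rendered in base 36), the `fingerprint` and `compareClusters` helpers, the sample outputs, and the `RED` status label are all assumptions for the example.

```javascript
// FNV-1a-style 32-bit hash over the JSON form of a value, rendered in
// base 36. An arbitrary choice for this sketch, not the tool's algorithm.
function fingerprint(value) {
  const text = String(JSON.stringify(value));
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(36);
}

// Compare fingerprints captured before the refactor with fingerprints of
// outputs produced after it. A cluster is GREEN when the two are equal.
function compareClusters(baseline, currentOutputs) {
  return Object.keys(baseline).map((name) => {
    const after = fingerprint(currentOutputs[name]);
    const before = baseline[name];
    return { name, before, after, status: before === after ? "GREEN" : "RED" };
  });
}

// Hypothetical captured outputs, keyed by the cluster names from the table.
const preRefactorOutputs = {
  "levenshtein-distance": 3,
  "damerau-levenshtein-distance": 1
};
const baseline = Object.fromEntries(
  Object.entries(preRefactorOutputs).map(([name, out]) => [name, fingerprint(out)])
);

const postRefactorOutputs = {
  "levenshtein-distance": 3,
  "damerau-levenshtein-distance": 1
};
const report = compareClusters(baseline, postRefactorOutputs);
console.log(report.every((row) => row.status === "GREEN")); // true
```

Comparing short fingerprints instead of raw output keeps the report compact, at the cost of a small collision risk, which is presumably why the verification above also checks the raw output directly.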

Before/After Chain Hashes

Chain Before After
tokenize-stem-distance 35jlpb2 35jlpb2
phonetic-comparison 1lujd98 1lujd98
normalize-transliterate 2xc2o2w 2xc2o2w

Non-breaking

The distance/index.js re-exports are completely unchanged. Any code that imports from natural/distance will continue to work identically. The only change is internal file organization.
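
The non-breaking claim amounts to the set of public names staying the same while their source files change. A simplified sketch of that check, using plain objects as stand-ins for modules: the four export names come from this PR, while the placeholder bodies, the object names, and the `publicNames` helper are hypothetical, and the real index.js may re-export other functions not shown here.

```javascript
// Placeholder body; only the exported names matter for this illustration.
const placeholder = () => 0;

// Before: a single module exported all four functions.
const beforeSingleFile = {
  LevenshteinDistance: placeholder,
  LevenshteinDistanceSearch: placeholder,
  DamerauLevenshteinDistance: placeholder,
  DamerauLevenshteinDistanceSearch: placeholder
};

// After: two modules export two functions each.
const afterLevenshteinFile = {
  LevenshteinDistance: placeholder,
  LevenshteinDistanceSearch: placeholder
};
const afterDamerauFile = {
  DamerauLevenshteinDistance: placeholder,
  DamerauLevenshteinDistanceSearch: placeholder
};

// index.js-style aggregation: callers only ever see the merged object.
const indexBefore = { ...beforeSingleFile };
const indexAfter = { ...afterLevenshteinFile, ...afterDamerauFile };

const publicNames = (moduleExports) => Object.keys(moduleExports).sort();
console.log(
  JSON.stringify(publicNames(indexBefore)) === JSON.stringify(publicNames(indexAfter))
); // true
```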

…bility modules

DECOMPOSITION: Split levenshtein_distance.js (242 lines, 4 exports) into:
- levenshtein_distance.js: Pure Levenshtein distance only
  (LevenshteinDistance + LevenshteinDistanceSearch)
- damerau_levenshtein_distance.js: Damerau-Levenshtein distance only
  (DamerauLevenshteinDistance + DamerauLevenshteinDistanceSearch)
- edit_distance_utils.js: Shared DP utilities
  (computeEditDistance, findMinCostSubstring)

NAMING: Renamed internal functions for clarity:
- _getMatchStart() → traceBackMatchStart()
- getMinCostSubstring() → findMinCostSubstring()
- levenshteinDistance() → computeEditDistance()

COHESION: Each file now has a single responsibility:
- edit_distance_utils.js: Core DP algorithm (shared by both variants)
- levenshtein_distance.js: Levenshtein-specific wrapper functions
- damerau_levenshtein_distance.js: Damerau-specific wrapper functions

SINGLE RESPONSIBILITY: One file = one distance algorithm variant

REDUCE COUPLING: Levenshtein and Damerau variants no longer need
to know about each other. They both delegate to the shared DP core
in edit_distance_utils.js.

All existing tests pass. The index.js re-exports are unchanged,
so this is a non-breaking internal refactor.

Verified by Regrets fingerprint-based regression testing:
- 19 clusters: ALL GREEN
- 3 chains: ALL GREEN
- KEBENARAN 1 (raw output): IDENTICAL
- KEBENARAN 2 (fingerprints): IDENTICAL