Our Daily Method #15: Kernel#levenshtein

Geplaatst door Michiel de Mare ma, 25 feb 2008 08:00:00 GMT

Not in any way invented by us, but still extremely useful and not a part of the standard library: the Levenshtein Distance

What does it calculate? The number of characters to add, delete or switch to turn the first string into the second. Ideal for detecting typos.


# example: 
levenshtein('levenstine', 'levenshtein') # => 3
Or how about instead of a LIKE-clause? Not quite as fast (cough) but much more accurate.

# In the old days:
User.find(:all, 
    :conditions => ['name like ?', params[:name] + '%']])

# Now:
User.find(:all).
    sort_by {|u| levenshtein(params[:name], u.name)}.
    first

def levenshtein(s1, s2)
  d = {}
  (0..s1.size).each { |row| d[[row, 0]] = row }
  (0..s2.size).each { |col| d[[0, col]] = col }
  (1..s1.size).each do |i|
    (1..s2.size).each do |j|
      i1,j1 = i-1,j-1
      cost = s1[i1] != s2[j1] ? 1 : 0
      d[[i, j]] = [d[[i1, j1]] + cost,
                   d[[i1, j]] + 1,
                   d[[i, j1]] + 1].min
      if i1*j1 > 0 and s1[i1] == s2[j-2] and s1[i-2] == s2[j1]
        d[[i, j]] = [d[[i,j]] , d[[i-2, j-2]] + cost].min
      end
    end
  end
  d[[s1.size, s2.size]]
end

Geplaatst in , ,  | geen reacties

Reacties

(Laat url/e-mail achter »)

   Voorvertoning