December 7, 2011

Did You Mean

Peter Norvig’s Did You Mean ?

Almost one year and a half ago, I did found something interesting on the web. An article of Peter Norvig explaining How to Write a Spelling Corrector. At this time I found it brilliant and did implement it in OCaml (but lost it). I tried to shorten the provided Python version, and also the Java version made by Rael Cunha. Eventually, I did come with a shortest version in both Java and Python.

A week ago I did get lost on gmail, and read the mail sent to both of them for I don’t know what reason, and I thought I could post them here.

Python 3

It uses lambdas, string.ascii_lowercase, collections.Counter and dict comprehension to reduce the number of lines to only 15, which is the lowest we can find on Peter Norvig’s page, equals to Tiago “PacMan” Peczenyj ‐ Awk and Dejan Jelovic ‐ F# implementation.

import sys, time, re, string, collections

load = lambda txt: collections.Counter(re.findall('[a-z]+', txt.lower()))
NWORDS = load(open('big.txt').read())

def edits1(word):
   splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   return set([a + b[1:] for a, b in splits if b] + # deletes
              [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1] + # transposes
              [a + c + b[1:] for a, b in splits for c in string.ascii_lowercase if b] + # replaces
              [a + c + b for a, b in splits for c in string.ascii_lowercase]) # inserts

edits2 = lambda word: {e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS}

known = lambda words: {w for w in words if w in NWORDS}

def correct(word):
   candidates = known([word]) or known(edits1(word)) or edits2(word) or [word]
   return max(candidates, key=NWORDS.get)

for arg in sys.argv[1:]: print('%s -> Did you mean %s ?' % (arg, correct(arg)))

Test and performance:

$ time python3 DidYouMean.py speling sujar roopm
speling -> Did you mean spelling ?
sujar -> Did you mean sugar ?
roopm -> Did you mean room ?

real  0m2.030s
user  0m1.906s
sys   0m0.115s

Java

The Java version is almost equivalent to the one of Rael Cunha. But it uses a trick to read all the file in one line. This way, the program takes only 32 lines.

import java.io.*;
import java.util.*;
import java.util.regex.*;

class DidYouMean {
    private final Map<String, Integer> nWords = new HashMap<String, Integer>();

    public DidYouMean(String file) throws IOException {
        String contents = new Scanner(new File(file)).useDelimiter("\\Z").next();
        Matcher m = Pattern.compile("[a-z]+").matcher(contents.toLowerCase());
        while (m.find()) nWords.put(m.group(), nWords.containsKey(m.group()) ? nWords.get(m.group()) + 1 : 1);
    }

    private final List<String> edits(String w) {
        List<String> result = new ArrayList<String>();
        for(int i=; i < w.length(); ++i) result.add(w.substring(0, i) + w.substring(i+1));
        for(int i=; i < w.length()-1; ++i) result.add(w.substring(0, i) + w.substring(i+1, i+2) + w.substring(i, i+1) + w.substring(i+2));
        for(int i=; i < w.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(w.substring(0, i) + String.valueOf(c) + w.substring(i+1));
        for(int i=; i <= w.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(w.substring(0, i) + String.valueOf(c) + w.substring(i));
        return result;
    }

    public final String correct(String word) {
        if (nWords.containsKey(word)) return word;
        List<String> list = edits(word);
        Map<Integer, String> candidates = new HashMap<Integer, String>();
        for(String s : list) if(nWords.containsKey(s)) candidates.put(nWords.get(s),s);
        if(candidates.size() > ) return candidates.get(Collections.max(candidates.keySet()));
        for(String s : list) for(String w : edits(s)) if(nWords.containsKey(w)) candidates.put(nWords.get(w),w);
        return candidates.size() >  ? candidates.get(Collections.max(candidates.keySet())) : word;
    }

    public static void main(String args[]) throws IOException {
        DidYouMean sp = new DidYouMean("big.txt");
        for (String arg: args) System.out.println(arg + " -> Did you mean " + sp.correct(arg) + " ?");
    }
}

Trying the Java version too:

$ time java DidYouMean speling sujar roopm
speling -> Did you mean spelling ?
sujar -> Did you mean sugar ?
roopm -> Did you mean room ?

real  0m1.425s
user  0m1.695s
sys   0m0.154s

It can probably be shortened using third parties libraries, but the initial goal of Rael Cunha was to use only standard Java.

OCaml

Finally I wanted to start again on my previous OCaml version, but this time using Batteries Included. Here is what I finally come with, a 16 lines program, which implements the did you mean:

(* Open Awesomeness *)
open Batteries;;
(* Read the content of the given file *)
let read_file filename = File.with_file_in filename IO.read_all;;
(* Extract all words *)
let words = Str.split (Str.regexp "[^a-zA-Z]+") |- List.enum |- map String.lowercase;;
(* Count word occurences *)
let train tbl = iter (fun w -> let prev = Hashtbl.find_default tbl w  in Hashtbl.replace tbl w (prev + 1));;
(* Construct the alphabet *)
let alphabet = [? List:String.of_char(Char.chr i) | i <- 97 -- 122 ?];;
(* Convenience functions *)
let (+..), slice, sub, length = Enum.append, String.slice, String.sub, String.length;;
(* the edits function *)
let edits w = let splits w = [? (slice ~last:i w, slice ~first:i w) | i <-  -- (length w) ?] in
   [? a^(slice ~first:1 b) | a,b <- splits w; b != "" ?] +..
   [? a^(sub b 1 1)^(sub b  1)^(slice ~first:2 b) | a,b <- splits w; length b > 1 ?] +..
   Enum.flatten [? [? a^c^(slice ~first:1 b) | a,b <- splits w; b != "" ?] | c <- (alphabet |> List.enum) ?] +..
   Enum.flatten [? [? a^c^b | a,b <- splits w; b != "" ?] | c <- (alphabet |> List.enum) ?] |> List.of_enum;;
(* load the words *)
let known = let model = Hashtbl.create 4096 in "big.txt" |> read_file |> (words |- train model);
   model |> Hashtbl.enum |> List.of_enum |> List.fast_sort (fun (_, c1) (_, c2) -> compare c2 c1);;
(* correct a word*)
let correct word = let k = List.map fst known in (List.map (fun w -> List.index_of w k) (edits word)
   |> List.remove_all)(None) |> List.min |> Option.get |> List.at known |> fst;;
(* try it *)
# List.map correct ["sward"; "sujar"; "speling"; "roopm"];;
- : string list = ["word"; "sugar"; "spelling"; "room"]

It’s a bit verbose, even with the convenience functions for Enum appending, slicing and string operations. Without them we can reach the 15 lines, but then it becomes a mess to read.

It think it can be made faster by using a Hashtbl instead of a List in known and correct, because List.index_of takes some time to execute.

Enjoy :)

Alexandre Grison - //grison.me - @algrison