Peter Norvig’s Did You Mean ?
Almost one year and a half ago, I did found something interesting on the web. An article of Peter Norvig explaining How to Write a Spelling Corrector. At this time I found it brilliant and did implement it in OCaml (but lost it). I tried to shorten the provided Python version, and also the Java version made by Rael Cunha. Eventually, I did come with a shortest version in both Java and Python.
A week ago I did get lost on gmail, and read the mail sent to both of them for I don’t know what reason, and I thought I could post them here.
Python 3
It uses lambdas
, string.ascii_lowercase
, collections.Counter
and dict comprehension to reduce the number of lines to only 15, which is the lowest we can find on Peter Norvig’s page, equals to Tiago “PacMan” Peczenyj ‐ Awk and Dejan Jelovic ‐ F# implementation.
import sys, time, re, string, collections
load = lambda txt: collections.Counter(re.findall('[a-z]+', txt.lower()))
NWORDS = load(open('big.txt').read())
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
return set([a + b[1:] for a, b in splits if b] + # deletes
[a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1] + # transposes
[a + c + b[1:] for a, b in splits for c in string.ascii_lowercase if b] + # replaces
[a + c + b for a, b in splits for c in string.ascii_lowercase]) # inserts
edits2 = lambda word: {e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS}
known = lambda words: {w for w in words if w in NWORDS}
def correct(word):
candidates = known([word]) or known(edits1(word)) or edits2(word) or [word]
return max(candidates, key=NWORDS.get)
for arg in sys.argv[1:]: print('%s -> Did you mean %s ?' % (arg, correct(arg)))
Test and performance:
$ time python3 DidYouMean.py speling sujar roopm
speling -> Did you mean spelling ?
sujar -> Did you mean sugar ?
roopm -> Did you mean room ?
real 0m2.030s
user 0m1.906s
sys 0m0.115s
Java
The Java version is almost equivalent to the one of Rael Cunha. But it uses a trick to read all the file in one line. This way, the program takes only 32 lines.
import java.io.*;
import java.util.*;
import java.util.regex.*;
class DidYouMean {
private final Map<String, Integer> nWords = new HashMap<String, Integer>();
public DidYouMean(String file) throws IOException {
String contents = new Scanner(new File(file)).useDelimiter("\\Z").next();
Matcher m = Pattern.compile("[a-z]+").matcher(contents.toLowerCase());
while (m.find()) nWords.put(m.group(), nWords.containsKey(m.group()) ? nWords.get(m.group()) + 1 : 1);
}
private final List<String> edits(String w) {
List<String> result = new ArrayList<String>();
for(int i=; i < w.length(); ++i) result.add(w.substring(0, i) + w.substring(i+1));
for(int i=; i < w.length()-1; ++i) result.add(w.substring(0, i) + w.substring(i+1, i+2) + w.substring(i, i+1) + w.substring(i+2));
for(int i=; i < w.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(w.substring(0, i) + String.valueOf(c) + w.substring(i+1));
for(int i=; i <= w.length(); ++i) for(char c='a'; c <= 'z'; ++c) result.add(w.substring(0, i) + String.valueOf(c) + w.substring(i));
return result;
}
public final String correct(String word) {
if (nWords.containsKey(word)) return word;
List<String> list = edits(word);
Map<Integer, String> candidates = new HashMap<Integer, String>();
for(String s : list) if(nWords.containsKey(s)) candidates.put(nWords.get(s),s);
if(candidates.size() > ) return candidates.get(Collections.max(candidates.keySet()));
for(String s : list) for(String w : edits(s)) if(nWords.containsKey(w)) candidates.put(nWords.get(w),w);
return candidates.size() > ? candidates.get(Collections.max(candidates.keySet())) : word;
}
public static void main(String args[]) throws IOException {
DidYouMean sp = new DidYouMean("big.txt");
for (String arg: args) System.out.println(arg + " -> Did you mean " + sp.correct(arg) + " ?");
}
}
Trying the Java version too:
$ time java DidYouMean speling sujar roopm
speling -> Did you mean spelling ?
sujar -> Did you mean sugar ?
roopm -> Did you mean room ?
real 0m1.425s
user 0m1.695s
sys 0m0.154s
It can probably be shortened using third parties libraries, but the initial goal of Rael Cunha was to use only standard Java.
OCaml
Finally I wanted to start again on my previous OCaml version, but this time using Batteries Included
. Here is what I finally come with, a 16 lines program, which implements the did you mean:
(* Open Awesomeness *)
open Batteries;;
(* Read the content of the given file *)
let read_file filename = File.with_file_in filename IO.read_all;;
(* Extract all words *)
let words = Str.split (Str.regexp "[^a-zA-Z]+") |- List.enum |- map String.lowercase;;
(* Count word occurences *)
let train tbl = iter (fun w -> let prev = Hashtbl.find_default tbl w in Hashtbl.replace tbl w (prev + 1));;
(* Construct the alphabet *)
let alphabet = [? List:String.of_char(Char.chr i) | i <- 97 -- 122 ?];;
(* Convenience functions *)
let (+..), slice, sub, length = Enum.append, String.slice, String.sub, String.length;;
(* the edits function *)
let edits w = let splits w = [? (slice ~last:i w, slice ~first:i w) | i <- -- (length w) ?] in
[? a^(slice ~first:1 b) | a,b <- splits w; b != "" ?] +..
[? a^(sub b 1 1)^(sub b 1)^(slice ~first:2 b) | a,b <- splits w; length b > 1 ?] +..
Enum.flatten [? [? a^c^(slice ~first:1 b) | a,b <- splits w; b != "" ?] | c <- (alphabet |> List.enum) ?] +..
Enum.flatten [? [? a^c^b | a,b <- splits w; b != "" ?] | c <- (alphabet |> List.enum) ?] |> List.of_enum;;
(* load the words *)
let known = let model = Hashtbl.create 4096 in "big.txt" |> read_file |> (words |- train model);
model |> Hashtbl.enum |> List.of_enum |> List.fast_sort (fun (_, c1) (_, c2) -> compare c2 c1);;
(* correct a word*)
let correct word = let k = List.map fst known in (List.map (fun w -> List.index_of w k) (edits word)
|> List.remove_all)(None) |> List.min |> Option.get |> List.at known |> fst;;
(* try it *)
# List.map correct ["sward"; "sujar"; "speling"; "roopm"];;
- : string list = ["word"; "sugar"; "spelling"; "room"]
It’s a bit verbose, even with the convenience functions for Enum appending, slicing and string operations. Without them we can reach the 15 lines, but then it becomes a mess to read.
It think it can be made faster by using a Hashtbl instead of a List in known
and correct
, because List.index_of
takes some time to execute.
Enjoy :)