October 19, 2016

Did you mean Elixir?

So I was procrastinating on HN and I saw the following post « Elm and Phoenix/Elixir in Production for France TV », something like 3 days ago. You can find the HN comments here.

It’s been some time I read about Elm and Elixir on HN, but I did not have the chance, or took the time to try these two piece of technology.

So here we are, I’ll try to implement the Norvig’s Did you mean in Elixir, and give you my feeling about it in the same time. It’s probably not the best example, however that’s a very small program that you can easily implement in a couple of hours, even if you don’t know the language.

Keep in mind that I have absolutely zero experience with Elixir nor Erlang, that’s why as always, feel free to show me the correct way :)

You can also refer to my old post of 2011 on Did You Mean, implemented in Java, Python 3 and OCaml.

Installation

First cool thing? The docker install right on the installation page! Thank you for making it. I really don’t want to install it on my mac if I don’t have to.

Let’s run that docker container, and directly download the big.txt file since we’re gonna need it in some minutes. Finally, we run iex.

$ docker run -it --rm elixir bash
root@f0a608c9a46f:~# wget http://norvig.com/big.txt
root@f0a608c9a46f:~# echo "the quick brown fox jumps over the lazy dog" > foo.txt
root@f0a608c9a46f:~# iex
Erlang/OTP 19 [erts-8.1] [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Interactive Elixir (1.3.4)
iex(1)>

Ready to rumble?

Loading the file

We need a function named load that takes a file name, and will:

  • load the content as a string (File.read!)
  • lowercase the whole thing (String.downcase)
  • scan for words (Regex.scan(~r/\w+/, ...))
  • produce a map where keys are words, and values are the occurrences of this word in the original string.
  • return that map
alias String, :as S
def load(f) do
  Enum.reduce((Regex.scan(~r/\w+/, S.downcase (File.read! f)) |> List.flatten), %{}, fn word, counter ->
    Map.update(counter, word, 1, &(&1 + 1))
  end)
end

First, in Elixir, it seems that a function have to be declared in a module, so we’ll wrap all the functions in a defmodule DidYouMean later on.

I aliased String as S since we’re gonna need some functions in the String module, and it helps in the reading process.

Reading the file and lowercasing it is really simple:

iex(1)> String.downcase (File.read! "foo.txt")
"the quick brown fox jumps over the lazy dog\n"
# it can also be written like that:
iex(2)> txt = File.read! "foo.txt" |> String.downcase
"the quick brown fox jumps over the lazy dog\n"

Searching for words is also easy using Regex.scan:

iex(3)> Regex.scan(~r/\w+/, txt)
[["the"], ["quick"], ["brown"], ["fox"], ["jumps"], ["over"], ["the"], ["lazy"], ["dog"]]
# flatten it please!
iex(4)> Regex.scan(~r/\w+/, txt) |> List.flatten
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

There’s one thing bothering me, the |> operator let’s you feed arguments to the next function. But you can’t place arguments where you want, for instance for Regex.scan you can’t feed the string to it using |>. It would have been cool to have a specific operator or some wildcard token to place the arguments where needed. Something like "foo" |>2 Regex.scan ~r/../.

Ok now we have our list of words, and we want count the occurrences and put the whole thing in a Map.

I first tried this:

counter = Map.new
for word<-words, do: Map.update(counter, word, 1, &(&1 + 1))
counter

Well it’s not working, due to function naming and scoping issues. The for makes you enter in a new specific scope where counter is a new variable, so you can’t update the variable which was defined in the upper scope. What you need to do is relying on Enum.reduce for such purposes

iex(5)> words = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
iex(6)> Enum.reduce(words, %{}, fn word, counter ->
...(6)>     Map.update(counter, word, 1, &(&1 + 1))
...(6)> end)
%{"brown" => 1, "dog" => 1, "fox" => 1, "jumps" => 1, "lazy" => 1, "over" => 1,
  "quick" => 1, "the" => 2}

The &(&1 + 1) thing represents an anonymous function which returns its first argument added to one.

The edits1 function

Here’s the beast:

def edits1(w) do
  alphabet = S.codepoints("abcdefghijklmnopqrstuvwxyz")
  splits = for i<-0..(S.length w)+1, do: {S.slice(w,0,i), S.slice(w,i,S.length w)}
  (for {l,r}<-splits, do:
    [l <> S.slice(r,1,S.length r), # deletes
     l <> S.slice(r,1,1) <> S.slice(r,0,1) <> S.slice(r,2,S.length r), # transposes
     (for c<-alphabet, do: l <> c <> S.slice(r,1,S.length r)), # replaces
     (for c<-alphabet, do: l <> c <> r)]) |> List.flatten |> MapSet.new # inserts > to set
end

String concatenation is made with the <> operator, ranges can be created using .. so that 0..10 is an iterable from 0 to 10 (included).

We need all letters of the alphabet, which we can get using String.codepoints("...."), it makes an iterable of all the letters. At first I tried to be smart and all and create a range with ?a..?z which works! The ? before a character returns its codepoint. But I’ve found no easy way to do the opposite (going from 97 to a). We could win one more line.

There’s nothing really hard in this par of the code, this is just standard String slicing. I would have liked the syntax to be easier, I miss the slicing like in Python, it makes for easier reading of the source code.

The edits2 function

Really simple there’s nothing specific going on

def edits2(w, nwords) do
  (for e1<-edits1(w), e2<-edits1(e1), Map.has_key?(nwords, e1) do e2 end) |> MapSet.new
end

The for iterates on two streams (think imbricated for) and you can filter using a predicate, here Map.has_key?, in case the word satisfies the predicate it is returned.

The known function

def known(words, nwords) do
  kn = (for w<-words, Map.has_key?(nwords, w) do w end) |> MapSet.new
  cond do MapSet.size(kn) == 0 -> nil
          true -> kn end
end

The known function iterates on the words paremeter to verify that this word is known (in the nwords set). If the set is empty we return nil (so that we can use the || operator later on), otherwise we return the set.

The correct function

def correct(word, nwords) do
  matches = known([word], nwords) || known(edits1(word), nwords) || edits2(word, nwords) || [word]
  Enum.sort_by(matches, &(Map.get(nwords, &1, 0)), &>=/2) |> List.first
end

In the correct method we try first if the word is known, in which case there’s no need to correct it. Otherwise we search in the words created by edits1, then in edits2. If nothing is found, we return the word.

Finally we order these matches by the occurrences found while loading the big.txt file. And we return the first match.

The whole program

defmodule DidYouMean do
  alias String, as: S
  def load(f) do
    Enum.reduce((Regex.scan(~r/\w+/, S.downcase (File.read! f)) |> List.flatten), %{}, fn word, counter ->
      Map.update(counter, word, 1, &(&1 + 1))
    end)
  end
  def edits1(w) do
    alphabet = S.codepoints("abcdefghijklmnopqrstuvwxyz")
    splits = for i<-0..(S.length w)+1, do: {S.slice(w,0,i), S.slice(w,i,S.length w)}
    (for {l,r}<-splits, do:
      [l <> S.slice(r,1,S.length r), # deletes
       l <> S.slice(r,1,1) <> S.slice(r,0,1) <> S.slice(r,2,S.length r), # transposes
       (for c<-alphabet, do: l <> c <> S.slice(r,1,S.length r)), # replaces
       (for c<-alphabet, do: l <> c <> r)]) |> List.flatten |> MapSet.new # inserts > to set
  end
  def edits2(w, nwords) do
    (for e1<-edits1(w), e2<-edits1(e1), Map.has_key?(nwords, e1) do e2 end) |> MapSet.new
  end
  def known(words, nwords) do
    kn = (for w<-words, Map.has_key?(nwords, w) do w end) |> MapSet.new
    cond do MapSet.size(kn) == 0 -> nil
            true -> kn end
  end
  def correct(word, nwords) do
    matches = known([word], nwords) || known(edits1(word), nwords) || edits2(word, nwords) || [word]
    Enum.sort_by(matches, &(Map.get(nwords, &1, 0)), &>=/2) |> List.first
  end
end

Caveat

There’s only one thing I was not able to do, and I don’t know if it’s possible or not. I wanted to store the result of the load directly in the module so that we can directly call DidYouMean.correct and avoid passing the nwords variable as a parameter of the correct, edits2 and known functions, but when doing so:

defmodule Foo do
  def foo, do: 42
  @f foo
end

It says:

** (CompileError) iex:23: undefined function foo/0
    (elixir) expanding macro: Kernel.@/1
             iex:23: Foo (module)

Testing the corrector

iex(17)> nwords = DidYouMean.load "big.txt"
iex(18)> DidYouMean.correct "speling", nwords
"spelling"
iex(19)> DidYouMean.correct "sujar", nwords
"sugar"
iex(20)> DidYouMean.correct "roopm", nwords
"room"

Feelings

Elixir seems cool, I’ve read much more than needed to write such a simple program, and it is definitely not the type of example where Elixir shines since it involves no concurrency nor supervising and advanced stuff.

That’s why I have a feeling of not gaining anything in using it for this problem, because I think there’s not really something to gain. However, the language feels concise and very pragmatic.

So this solves the challenge in 28 lines. Not bad, but I’m sure an experienced Elixir developer will show me how to shorten that :)

Until next time!

Alexandre Grison - //grison.me - @algrison