Elixir: Troubleshooting Mnesia
Published on 2020-11-11
Context
Mnesia is a powerful DBMS implemented in Erlang which you can use in your Elixir application.
Why would you want to do that?
- It can span over multiple nodes in your cluster, providing redundancy and recovery mechanisms.
- It prevents you from adding an external dependency like Redis, PostgreSQL, or whatever.
- It provides all the features you need from a solid DBMS like transactions, locks, indices, dumps to disk, and a consistent data structure.
After evaluating alternatives while working on a project at Pandascore, I finally decided to give Mnesia a shot for storing some internal state. It took me a few hours to set it up correctly, and I ran into several issues that could drive anyone nearing the end of a sprint crazy, so here are a few tips :-)
One disclaimer though: this project is in production (\o/), but due to other priorities, I'm not really monitoring it or improving the design, so I'm not learning much more about Mnesia.
So this is pretty much beginner-to-beginner feedback. Please take it with a pinch of salt.
Some useful links
Since I'm pretty much writing an addendum, you'll find more exhaustive information in the following articles and resources. You can read them afterwards if you prefer, and come back to this article if you encounter any issue.
First, this excellent article from Welcome to the Jungle gave me a good overview of how to set up Mnesia in a cluster.
Two more links are mentioned at the end of this article, but I'll paste them here too. Elixir School has a good walkthrough, mostly from beginning to end, and of course, the Erlang documentation is a gold mine, though it can be a bit rough to read if you're well-versed in Elixir but not Erlang.
You can also check my article on deploying an Elixir cluster on Kubernetes since I give a few details on how to set up a simple local cluster. I'll do it more quickly here anyway.
Let's create a sample project
Let's create a simple project to try out Mnesia locally. We'll need to run a small cluster, so we'll throw in one specific library.
$ mix new clustertest
$ cd clustertest
And let's head over to mix.exs to add libcluster:
# In mix.exs
defp deps do
  [
    {:libcluster, "~> 3.2.1"}
  ]
end
And let's define a very simple Supervisor. This will get our ClusterSupervisor started, with a simple configuration for our local experiments!
# In mix.exs:
def application do
  [
    mod: {Clustertest.Application, []},
    extra_applications: [:logger]
  ]
end

# In lib/clustertest/application.ex
defmodule Clustertest.Application do
  use Application

  def start(_type, _args) do
    topologies = [
      epmd_example: [
        strategy: Cluster.Strategy.Epmd,
        config: [
          hosts: [:"a@127.0.0.1", :"b@127.0.0.1"]
        ]
      ]
    ]

    [
      {Cluster.Supervisor, [topologies, [name: Clustertest.ClusterSupervisor]]}
    ]
    |> Supervisor.start_link(strategy: :one_for_one)
  end
end
This will get you started with a small cluster of two nodes. The Epmd strategy is a perfect fit for our example here, since we just have to specify a few hosts.
Now, let's open two shells and start two instances:
# In one shell
iex --name a@127.0.0.1 -S mix
# In another one
iex --name b@127.0.0.1 -S mix
Now, running Node.list() in each REPL should give you exactly one atom:
> [:"b@127.0.0.1"]
> [:"a@127.0.0.1"]
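If Node.list() comes back empty instead, a quick manual check can help tell a libcluster configuration problem from a plain distribution problem. This is a minimal sketch using the node names from this example; Node.ping/1 returns :pong when the remote node is reachable and :pang otherwise.

```elixir
# Run from the REPL of node a; the target is the other node of this article.
case Node.ping(:"b@127.0.0.1") do
  :pong ->
    IO.puts("b is reachable, connected nodes: #{inspect(Node.list())}")

  :pang ->
    IO.puts("b is not reachable: check the node name, the cookie, and that b is started")
end
```

If the ping succeeds but the nodes never connect on their own, the topology passed to Cluster.Supervisor is the first thing I would look at.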
Initializing Mnesia
Now, let's start actually using Mnesia. We'll pretend we're running a small wildlife protection office taking care of local racoons.
Also, for the sake of simplicity, we'll add it directly to the children of our app. There's no need for complexity for hello world code :)
# In lib/clustertest/application.ex
defmodule Clustertest.Application do
  use Application

  def start(_type, _args) do
    topologies = [...]

    [
      {Cluster.Supervisor, [topologies, [name: Clustertest.ClusterSupervisor]]},
      {Clustertest.Store.Racoon, []}
    ]
    |> Supervisor.start_link(strategy: :one_for_one)
  end
end
# In lib/clustertest/store/racoon.ex
defmodule Clustertest.Store.Racoon do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, opts)
  end

  def init(state) do
    setup_store()
    {:ok, state}
  end

  defp setup_store() do
    IO.puts("Setting up store...")
    :ok = ensure_schema_exists()
    :ok = :mnesia.start()
    :ok = ensure_table_exists()
    IO.puts("...Store set up!")
  end

  defp ensure_schema_exists() do
    case :mnesia.create_schema([node()]) do
      {:error, {_node, {:already_exists, __node}}} ->
        :ok

      :ok ->
        :ok
    end
  end

  defp ensure_table_exists() do
    :mnesia.create_table(
      Racoon,
      [
        attributes: [
          :id,
          :name,
          :caretaker_id
        ]
      ]
    )
    |> case do
      {:atomic, :ok} ->
        :ok

      {:aborted, {:already_exists, Racoon}} ->
        :ok
    end

    :ok = :mnesia.wait_for_tables([Racoon], 5000)
  end
end
There are a few important things to note here.
First, you need to create a schema before starting Mnesia. This is really important. Try switching the two lines to see what happens.
Then, you're free to create your table. Both schema and table can be already created when you run your app, since Mnesia keeps RAM and disk copies, depending on how you configure it. By the way, you should have noticed there are now two new folders in your project directory:
$ ls
Mnesia.a@127.0.0.1
Mnesia.b@127.0.0.1
Hmm, let's see if Mnesia is properly configured. Type this in one of your REPLs:
iex> :mnesia.info()
This will be your best friend for debugging Mnesia :)
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
Elixir.Racoon : with 0 records occupying 305 words of mem
schema : with 2 records occupying 535 words of mem
===> System info in version "4.17", debug level = none <===
opt_disc. Directory "/home/makks/code/mbuffa/clustertest/Mnesia.a@127.0.0.1" is used.
use fallback at restart = false
running db nodes = ['a@127.0.0.1']
stopped db nodes = []
master node tables = []
remote = []
ram_copies = ['Elixir.Racoon']
disc_copies = [schema]
disc_only_copies = []
[{'a@127.0.0.1',disc_copies}] = [schema]
[{'a@127.0.0.1',ram_copies}] = ['Elixir.Racoon']
3 transactions committed, 0 aborted, 0 restarted, 2 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
:ok
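When the full report is more than you need, the same facts can be queried one by one with :mnesia.system_info/1. The helper below is a hypothetical convenience (MnesiaStatus is not part of the sample project) and is only a sketch: it assumes Mnesia is already started on the node where you call it.

```elixir
defmodule MnesiaStatus do
  # Hypothetical helper, not part of the sample project.
  # Gathers a few :mnesia.system_info/1 values into a map.
  def summary do
    %{
      is_running: :mnesia.system_info(:is_running),
      use_dir: :mnesia.system_info(:use_dir),
      db_nodes: :mnesia.system_info(:db_nodes),
      running_db_nodes: :mnesia.system_info(:running_db_nodes),
      tables: :mnesia.system_info(:tables)
    }
  end
end

# Usage, from a REPL where Mnesia is running:
#   MnesiaStatus.summary()
```

The running_db_nodes entry is the one to watch in the rest of this article: as long as it only contains the local node, the cluster is not formed from Mnesia's point of view.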
Hmm, looking at running db nodes, each node only lists itself: we're running two independent Mnesia nodes. We want to connect them, and we have to do it ourselves whenever a new node joins the cluster.
Let's go back to our Store and add a bit of code:
- We want to be notified when new nodes connect...
defmodule Clustertest.Store.Racoon do
  use GenServer

  [...]

  def init(state) do
    # Get notified when new nodes are connected.
    :ok = :net_kernel.monitor_nodes(true)
    setup_store()
    {:ok, state}
  end

  [...]
end
- ...And we want to configure Mnesia to use extra nodes, create a table copy on the other node, and remove the other node when connection is lost.
defmodule Clustertest.Store.Racoon do
  [...]

  def handle_info({:nodeup, node}, state) do
    IO.puts("Node connected: #{inspect node}")
    :ok = connect_mnesia_to_cluster()
    {:noreply, state}
  end

  def handle_info({:nodedown, node}, state) do
    IO.puts("Node disconnected: #{inspect node}")
    update_mnesia_nodes()
    {:noreply, state}
  end

  defp connect_mnesia_to_cluster() do
    :ok = :mnesia.start()
    {:ok, [_|_] = nodes} = :mnesia.change_config(:extra_db_nodes, Node.list())
    IO.puts("Extra db nodes: #{ inspect nodes }")
    :ok = ensure_table_exists()
    :mnesia.change_table_copy_type(:schema, node(), :disc_copies)
    :ok = ensure_table_copy_exists()
    IO.puts("Successfully connected Mnesia to the cluster!")
    :ok
  end

  defp update_mnesia_nodes do
    nodes = Node.list()
    IO.puts("Updating Mnesia nodes with #{inspect nodes}")
    :mnesia.change_config(:extra_db_nodes, nodes)
  end

  defp ensure_schema_exists() do
    [...]
  end

  defp ensure_table_exists() do
    [...]
  end

  defp ensure_table_copy_exists() do
    case :mnesia.add_table_copy(Racoon, node(), :disc_copies) do
      {:atomic, :ok} -> :ok
      {:aborted, {:already_exists, Racoon, _node}} -> :ok
    end
  end
end
Now, restarting our two nodes should raise an error:
17:20:34.541 [error] GenServer #PID<0.213.0> terminating
** (MatchError) no match of right hand side value: {:ok, []}
(clustertest 0.1.0) lib/clustertest/store/racoon.ex:46: Clustertest.Store.Racoon.connect_mnesia_to_cluster/0
(clustertest 0.1.0) lib/clustertest/store/racoon.ex:30: Clustertest.Store.Racoon.handle_info/2
(stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
(stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
(stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {:nodeup, :"b@127.0.0.1"}
State: %{}
Why is :mnesia.change_config(:extra_db_nodes, Node.list()) returning :ok with an empty list?
Well, you can't really guess, and silent errors like this one are why Mnesia can be difficult to work with the first time.
In fact, Mnesia requires that you create an identical schema on each of your nodes, sharing the same cookie. By calling setup_store(), and through it :mnesia.create_schema(), on each node, you're creating two conflicting schemas that Mnesia can't resolve.
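Since an empty list is the only hint Mnesia gives here, it can be worth turning it into an explicit error instead of letting a MatchError surface inside a callback. The module below is a hypothetical helper (ExtraDbNodes is a name I made up), shown as a sketch: it only interprets the value returned by :mnesia.change_config/2, which is either {:ok, nodes} or {:error, reason}.

```elixir
defmodule ExtraDbNodes do
  # Hypothetical helper: makes the "no node joined" case explicit.
  # Expects the value returned by
  # :mnesia.change_config(:extra_db_nodes, nodes).
  def check({:ok, []}), do: {:error, :no_node_joined}
  def check({:ok, [_ | _] = nodes}), do: {:ok, nodes}
  def check({:error, reason}), do: {:error, reason}
end

# Usage inside the store could look like:
#   case :mnesia.change_config(:extra_db_nodes, Node.list()) |> ExtraDbNodes.check() do
#     {:ok, nodes} -> IO.puts("Extra db nodes: #{inspect(nodes)}")
#     {:error, reason} -> IO.puts("Could not join Mnesia cluster: #{inspect(reason)}")
#   end
```

Whether to crash or to log and retry on {:error, :no_node_joined} is a design choice; crashing loudly, as the article's code does, is at least hard to miss.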
Let's remove setup_store() completely (and ensure_schema_exists() too). If we need to be able to deploy single nodes (locally for example), we can still define an environment variable to decide what to do. But this is off-topic.
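For the record, such a switch could be as small as the sketch below. Everything in it is an assumption on my side: the StoreMode module and the MNESIA_STANDALONE variable name are invented for illustration, and the article's repository does not contain them.

```elixir
defmodule StoreMode do
  # Hypothetical helper. MNESIA_STANDALONE is an invented variable name.
  # Returns true when the node should create its own local schema.
  def standalone?(value \\ System.get_env("MNESIA_STANDALONE")) do
    value in ["1", "true"]
  end
end

# Usage in the store's init/1 could look like:
#   if StoreMode.standalone?(), do: setup_store()
```

Passing the value as an argument keeps the function easy to exercise without touching the real environment.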
Also, remember those two folders that popped up in your directory? Remove those folders. Those may contain conflicting schemas.
Now, let's restart our two REPLs.
17:35:48.049 [error] GenServer #PID<0.213.0> terminating
** (MatchError) no match of right hand side value: {:ok, []}
(clustertest 0.1.0) lib/clustertest/store/racoon.ex:34: Clustertest.Store.Racoon.connect_mnesia_to_cluster/0
(clustertest 0.1.0) lib/clustertest/store/racoon.ex:18: Clustertest.Store.Racoon.handle_info/2
(stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
(stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
(stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {:nodeup, :"b@127.0.0.1"}
State: %{}
Well, you can't guess this one either, but Mnesia must be started as an application. Surprisingly enough, calling :mnesia functions without it doesn't raise any error.
So let's head to our manifest and add :mnesia in a familiar place:
# In mix.exs
def application do
  [
    mod: {Clustertest.Application, []},
    extra_applications: [:logger, :mnesia]
  ]
end
Now let's check...
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
schema : with 2 records occupying 554 words of mem
===> System info in version "4.17", debug level = none <===
opt_disc. Directory "/home/makks/code/mbuffa/clustertest/Mnesia.a@127.0.0.1" is NOT used.
use fallback at restart = false
running db nodes = ['b@127.0.0.1','a@127.0.0.1']
stopped db nodes = []
master node tables = []
remote = ['Elixir.Racoon']
ram_copies = [schema]
disc_copies = []
disc_only_copies = []
[{'a@127.0.0.1',ram_copies},{'b@127.0.0.1',disc_copies}] = [schema]
[{'b@127.0.0.1',ram_copies}] = ['Elixir.Racoon']
2 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
See how opt_disc and running db nodes changed. But we have no data yet.
Inserting data
We'll add some code and make some changes so that we'll use a struct defined in Types.Racoon. The naming in my example isn't great, but basically, we're just adding serialization/deserialization functions to manipulate structs in our codebase, while Mnesia stores tuples.
So we'll add two functions, list() and create(), and make a few changes to the table name.
defmodule Clustertest.Store.Racoon do
  [...]

  defmodule Types.Racoon do
    defstruct [
      :id,
      :name,
      caretaker_id: nil
    ]

    def decode({__MODULE__, id, name, caretaker_id}) do
      %__MODULE__{
        id: id,
        name: name,
        caretaker_id: caretaker_id
      }
    end

    def encode(%__MODULE__{
          id: id,
          name: name,
          caretaker_id: caretaker_id
        }) do
      {__MODULE__, id, name, caretaker_id}
    end
  end

  [...]

  def list() do
    {:atomic, list} = :mnesia.transaction(fn ->
      :mnesia.match_object({Types.Racoon, :_, :_, :_})
    end)

    list |> Enum.map(fn x -> Types.Racoon.decode(x) end)
  end

  def create(%Types.Racoon{ id: id } = state) when is_integer(id) do
    IO.puts("Inserting #{inspect state}")

    {:atomic, reason} = :mnesia.transaction(fn ->
      case :mnesia.read(Types.Racoon, id, :write) do
        [] ->
          Types.Racoon.encode(state) |> :mnesia.write()

        _ ->
          :record_exists
      end
    end)

    reason
  end

  [...]

  defp ensure_table_exists() do
    :mnesia.create_table(
      Types.Racoon,
      [
        attributes: [
          :id,
          :name,
          :caretaker_id
        ]
      ]
    )
    |> case do
      {:atomic, :ok} ->
        :ok

      {:aborted, {:already_exists, Types.Racoon}} ->
        :ok
    end

    :ok = :mnesia.wait_for_tables([Types.Racoon], 5000)
  end

  defp ensure_table_copy_exists() do
    case :mnesia.add_table_copy(Types.Racoon, node(), :disc_copies) do
      {:atomic, :ok} -> :ok
      {:aborted, {:already_exists, Types.Racoon, _node}} -> :ok
    end
  end
end
Now let's test a few examples, after a REPL reset.
# In your first shell:
iex(a@127.0.0.1)3> Clustertest.Store.Racoon.list()
[]
iex(a@127.0.0.1)4> Clustertest.Store.Racoon.create(%Clustertest.Store.Racoon.Types.Racoon{ id: 1, name: "Ricky", caretaker_id: nil })
Inserting %Clustertest.Store.Racoon.Types.Racoon{caretaker_id: nil, id: 1, name: "Ricky"}
:ok
# In your second shell:
iex(b@127.0.0.1)4> Clustertest.Store.Racoon.list()
[
%Clustertest.Store.Racoon.Types.Racoon{
caretaker_id: nil,
id: 1,
name: "Ricky"
}
]
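To make the "structs in our code, tuples in Mnesia" idea more tangible, here is a standalone sketch of the same encode/decode pair that runs without Mnesia at all. Demo.Racoon is a stand-in name I chose so the snippet doesn't clash with the project's module. The first element of the tuple is the record name (here the module, which doubles as the table name), followed by the attributes in the order declared for the table.

```elixir
defmodule Demo.Racoon do
  # Stand-in for Types.Racoon, so that the sketch is self-contained.
  defstruct [:id, :name, caretaker_id: nil]

  # Struct -> tuple, in the shape Mnesia stores.
  def encode(%__MODULE__{id: id, name: name, caretaker_id: caretaker_id}) do
    {__MODULE__, id, name, caretaker_id}
  end

  # Tuple -> struct, for the rest of the codebase.
  def decode({__MODULE__, id, name, caretaker_id}) do
    %__MODULE__{id: id, name: name, caretaker_id: caretaker_id}
  end
end

ricky = struct!(Demo.Racoon, id: 1, name: "Ricky")
record = Demo.Racoon.encode(ricky)
IO.inspect(record)
# prints {Demo.Racoon, 1, "Ricky", nil}
```

Decoding the tuple gives back the original struct, which is a handy property to check in a unit test whenever the attribute list changes.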
The update, read, and delete functions are quite straightforward. You can implement them yourself, but I'm adding them here as a reference.
defmodule Clustertest.Store.Racoon do
  [...]

  def update(%Types.Racoon{ id: id } = new_state) when is_integer(id) do
    IO.puts("Updating #{inspect new_state}")

    {:atomic, reason} = :mnesia.transaction(fn ->
      # Crashes (and aborts the transaction) if the record doesn't exist yet.
      [{Types.Racoon, ^id, _, _}] = :mnesia.read(Types.Racoon, id, :write)
      Types.Racoon.encode(new_state) |> :mnesia.write()
    end)

    reason
  end

  def read(id) when is_integer(id) do
    IO.puts("Returning #{id}")

    {:atomic, result} = :mnesia.transaction(fn ->
      :mnesia.read(Types.Racoon, id, :read)
    end)

    case result do
      [] -> nil
      list -> list |> List.first() |> Types.Racoon.decode()
    end
  end

  def delete(id) when is_integer(id) do
    IO.puts("Deleting #{id}")

    {:atomic, :ok} = :mnesia.transaction(fn ->
      :ok = :mnesia.delete(Types.Racoon, id, :write)
    end)

    :ok
  end

  [...]
end
You can play around with this :) Note that creating data on one node, then creating identical data on another node, doesn't raise any error, and doesn't create duplicates either.
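The absence of duplicates comes from the table type: unless told otherwise, Mnesia creates :set tables, where a key maps to at most one record, so writing the same key again replaces the previous record. The sketch below shows this on a throwaway RAM-only table; :demo_racoon is an invented table name, and the snippet is meant to run on a single node, outside of the sample project.

```elixir
# Self-contained sketch: a RAM-only table on a single, non-clustered node.
:ok = :mnesia.start()
{:atomic, :ok} = :mnesia.create_table(:demo_racoon, attributes: [:id, :name])

# Two writes with the same key (1): the second one replaces the first.
{:atomic, :ok} = :mnesia.transaction(fn -> :mnesia.write({:demo_racoon, 1, "Ricky"}) end)
{:atomic, :ok} = :mnesia.transaction(fn -> :mnesia.write({:demo_racoon, 1, "Rocky"}) end)

{:atomic, records} = :mnesia.transaction(fn -> :mnesia.read(:demo_racoon, 1) end)
IO.inspect(records)
# prints [{:demo_racoon, 1, "Rocky"}]
```

This is also why create() in the article reads the key first: a bare write would silently overwrite an existing racoon.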
There's one issue remaining though: closing both REPLs clears the table. This is because we forgot to specify one option when calling :mnesia.create_table!
:mnesia.create_table(
  Types.Racoon,
  [
    attributes: [
      :id,
      :name,
      :caretaker_id
    ],
    disc_copies: [Node.self()]
  ]
)
Now we're good, and :mnesia.info() doesn't show an empty disc_copies anymore.
A few important notes for releases...
...and solving the "bad cookie" issue.
Keep in mind that we've been using iex all along, and that a compiled application behaves a bit differently.
Remember the snippet I used for adding :mnesia to our running application?
# In mix.exs
def application do
  [
    mod: {Clustertest.Application, []},
    extra_applications: [:logger, :mnesia]
  ]
end
Well, it might cause you some trouble once you compile your release with mix release and your node starts.
The thing is, applications listed in extra_applications are started automatically before your own application, so we have to specify that we only want to include :mnesia in our release, to avoid :mnesia creating a schema and starting automatically.
# In mix.exs
def application do
  [
    mod: {Clustertest.Application, []},
    extra_applications: [:logger],
    included_applications: [:mnesia]
  ]
end
The thing is, I also had to revert my changes to create_table and remove the disc_copies option.
:mnesia.create_table(
  Types.Racoon,
  [
    attributes: [
      :id,
      :name,
      :caretaker_id
    ]
  ]
)
Now, :mnesia.info() will properly display a populated disc_copies option.
I do not know why those differences between iex and compiled code exist. I may be doing something wrong, so please feel free to open an issue on the repository if you find out why!
Conclusion
I hope this made using Mnesia in your project look neither too tedious nor too frightening. I thought the "crash course" format (i.e. amending snippets step by step) would be interesting in this case, because it helps to have beaten that path when things go wrong.
I didn't mention the issue of network partitioning and a possible way to solve it, but this is more related to your cluster configuration. Essentially, network failures may happen inside your cluster, and reconnecting nodes wouldn't know how to handle this, since we're not using a master-replica strategy. I haven't read much about this, but one possible solution would be to pass the cluster size as an environment variable to all nodes, and check the Node.list() result when a :nodedown message is received. Afterwards, a simple calculation should be enough to determine if your node is isolated or in a dominant group, allowing you to push the self-destruct red button with, for example, a "liveness" GenServer exposed to your orchestrator, returning HTTP 200 OK response codes until isolation is detected.
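The "simple calculation" could look like the sketch below. It is only an illustration of the idea, not something I run in production: the Quorum module is a made-up name, and the cluster size is assumed to come from a variable of your choosing (CLUSTER_SIZE in the usage comment is invented too).

```elixir
defmodule Quorum do
  # Hypothetical helper. cluster_size is the expected total number of nodes;
  # connected is what Node.list() returns, which excludes the local node.
  # Returns true when this node sees a strict majority of the cluster.
  def majority?(cluster_size, connected)
      when is_integer(cluster_size) and cluster_size > 0 and is_list(connected) do
    visible = length(connected) + 1
    visible * 2 > cluster_size
  end
end

# Usage in a :nodedown handler could look like:
#   size = System.get_env("CLUSTER_SIZE", "1") |> String.to_integer()
#   unless Quorum.majority?(size, Node.list()), do: IO.puts("Isolated, failing liveness checks")
```

Note the strict comparison: with an even cluster size, a perfect split leaves both halves without a majority, so both would report themselves as isolated. That is the safe outcome, but it is one more reason to prefer an odd number of nodes.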
Note that there's also the Mnesiac library, which is an Elixir layer on top of Mnesia. I prefer using low-level libraries directly, at least for learning, but it might be a good fit for production.
Last, but not least, I created a repository with a small and clear commit history, in case you want to tinker with it.
Et voilà.