The data model design is key in a Lager application. The good news is: changing the design and shape of the data-model in a value-oriented world tends to be easy. However, for this to be the case, it is important that we follow some principles strictly.
In the previous section we emphasised that the data-model must be a value type. But what is a value type really?
In a purely functional programming language, like Haskell, everything you name is a value. This is, all entities are like mathematical objects: they are immutable and have some definition of equivalence.
In a multi-paradigm language like C++, you have both value types and reference types. Intances of value types define its meaning through the values that they represent, that is why we say, they have value semantics. Each instance of a value type has its own copy of a representation of a value, on which we can operate independently. Most of the time, we do not care about the identity of an instance, but about the content, the value, which can be equal between instances. Reference types, however, reference other mutable objects. The identity of these objects is crucial. We can observe changes to the referred objects through the references. This means that we have to be particularly careful when dealing with reference types.
The canonical reference types in C++ are references (&
) and
pointers (*
). There are many other reference types. For example,
iterators of a container are references that point into the
container. If we change the container, we can observe the changes
through the iterator. Sometimes even, the iterator may become invalid.
In C++, instances of value types are not necessarily immutable, only
the value they represent is. For example, I can mutate a
std::vector through its push_back()
method. Or I can modify a
struct
by assigning to its members. Instead, they are value
types because each instance of the type is independent and is a
attributed meaning through the values it represent. This principle
of independence is highlighted by this code snippet:
auto a = std::vector<int>{};
auto b = a;
foo(b);
assert(a.empty());
Here we have an instance of an empty std::vector
named a
and
copy it into another instance b
. Then we call foo()
and
finally make an assertion about the state of a
. We do not need to
know anything about what foo()
does to reason about the state of
a
. It could have taken b
by reference and changed it, maybe
even from another thread—it does not matter. In any case, we can be
sure that our assertion about a
holds, because b
is an
independent copy of it, so we can not observe any change in b
through a
. This frees our mind, allowing us to reason locally
about the state of the program. This is fundamental to write correct
concurrent programs: we can make sure each thread works on their local
copy of the data to ensure the absence of race conditions.
Spotting reference types
In order to avoid them, it is important to be able to clearly
identify reference types. They are sneaky. For example,
std::vector<T>
is a value type only as long as T
is also a
value type. Same applies to struct
, it is not a value as soon
as a member is not.
Also, there are some C++ types that behave a bit like values, they seem to be copyable at least—but their copies reference external entities that mutate, and these changes can be observed through all these shallow copies! Here are some examples of such types that shall be avoided in our data-model:
int
,
but they are not, they are a reference to a mutable entity.&
-captures.Are pointers always bad?
Not always. For instance, the implementation of std::vector
uses pointers internally. But it defines operator=
to ensure
the independence principle. Also, Sean Parent shows in his famous
talk “Inheritance is the base class of evil” that given a value
type T
, a forest of std::shared_ptr<const T>
is
effectively a value type. However, the immer::box
type from
the Immer library serves the same purpose with safer and more
convenient interface. In the performance section we will
cover the power immutability in more detail.
In general, the application level data model must never use
pointers. It can however use generic container types that
encapsulate the pointers and properly deal with the intricacies of
defining operator=
as to ensure the independence principle.
Further reading
Value-semantic thinking is a vast topic. Here are some resources that can help you adopt this design mindset:
In the functional programming realm these ideas are taken much further:
When writing the model as value types, we soon encounter the problem of dealing with identity. Consider an interactive application shows a moving person. This person changes, it moves around. Our model would be a snapshot of the state of this person. But clearly, the state of the person is different than the person itself:
!=
.==
.In Object Oriented programming, we normally identify a language object with the entity it represents, in this case, the person. The identity of the thing is the memory location of the storage for its state. This means that we need to use mutation to deal with the state, that there is only one state per entity at time, that time only progresses in one direction, that change is an implicit construction, that entities can not be dealt with concurrently. Identity becomes an implicit and flaky construction.
However, in real life, we deal with identity in an explicit way. That is why people have names or passport numbers. These are special values, identity values, that help us identify people. Identity as such serves a double purpose, solving the forementioned state/identity problems:
Considering this duality, when your program deals with changing entities, you will have to think about the domain of entities as a whole, and give each of those entities a different, explicit, identity value. Universally Unique Identifiers are a powerful tool to identify entities not only in the running programm, but also across files and machines. Often though, context will allow us to have more lightweight identity values. In some cases, even its index in a vector might suffice.
References in a value world
Consider this data-structure which implements agenda of people with friends in a referential way:
struct person
{
std::string name;
std::string phone_number;
std::vector<std::weak_ptr<person>> friends;
};
struct agenda
{
std::unordered_set<std::shared_ptr<person>> people;
};
This is not a valid model to use in Lager, because person
and
agenda
are reference semantic types. Not only is the
identification of memory objects with entities problematic from a
conceptual programming sense: there is some extra friction, like
having to allocate each person in a separate memory block (instead
of having a flat std::vector
), and then dealing with the
lifetime of those blocks with shared_ptr
, weak_ptr
, and so
on.
How do we do references with out pointers then? We use explicit identity values:
using person_id = std::string;
struct person
{
std::string name;
std::string phone_number;
std::vector<person_id> friends;
};
struct agenda
{
std::unordered_map<person_id, person> people;
};
Now we have decoupled the identity (person_id
) from the
state (person
). Whenever we want to know the state for a
given person, we can access it through the people
map in the
agenda. Whenever we want to refer to a person, like in the list of
friends, we use a person_id
. We can now have multiple copies
of the whole agenda, and compare how a particular person changes as
we manipulate it. People are not tied to their representation in
memory anymore, so we can be more playful with the data-structures
and apply data oriented design to
reach better cache locality and and overall performance!
After applying the principle of explicit identity to your program, you might realise this insight: the data-model of the application starts to look like a data-base!
And you are correct: the model of our application is an in-memory data-base, and the Lager store, combined with reducers and actions, provide a reproducible, logic aware, event sourced data-store. This means that you can start applying data-base design wisdom to your application, which has been accrued over decades by academics and practitioners. In particular, you may find interesting the notion of database normalization, both the Redux documentation and the Data-Oriented Design book do indeed talk about it:
One strong concern when applying value-semantics for the data-model of big applications is performance. We encourage passing by value around freely, and storing copies of the model values as needed without much concern. This often rises skepticism from experienced developers with an eye for optimization. For big data-models, isn’t that gonna be slow, and even explode the memory usage? Not necessarily.
In C++, we normally associate value-semantics, and in particular the
independence principle, with deep
copying. For standard containers like std::vector
, it is the case
that whenever we pass by value, a new memory object is created where
the whole representation of the value is copied into. This is however
not a consequence of value-semantics, but a consequence of mutability!
If the object that stores the value can mutate arbitrarily, when
passed by value, all of its contents must be copied to ensure that the
new object does not change when the original changes. However, if the
original object is in some way immutable, the immutable parts of
the representation can be internally shared accross all the “copies”
of the value. This property is called structural sharing and makes
value-semantics very efficient!
In the field of persistent data-structures we can find many examples of containers designed with the structural sharing in mind. Today, we have good implementations of some of these data-structures in C++:
Also, in modern C++ one may often avoid copies altogether by leveraging copy ellision and move semantics. It is important to familiarize oneself with these concepts.
In practice, when combining value-oriented design with immutable data-structures, you will find that not only is performance not a problem, but that your programs are faster! This is due to the fact that our data-model becomes more compact, with less pointer chasing and better cache locality, and with flatter call stacks and no traversal of forests of listeners, signals and slots.