System Design

A domain driven and privacy by design system built on a microservice architecture.

Overview

Codr is designed using a strategy called domain driven design, which complements the microservice architecture being used as the foundation for the platform.

Domain driven design allows for better communication between engineers and the business and, as a result, leads to a much simpler implementation. For example, when a non-technical business partner asks an engineer about users, details such as preferences and groups should not matter, so the logic for those entities shouldn't be part of the "user domain."

To implement this domain driven design strategy, Codr uses microservices. These are extremely small services that each implement a specific subset of logic, which means they need far fewer dependencies and resources than a monolith. Each entity within a domain should be its own microservice. Because of this separation of logic, heavily used services such as the user or annotation services can be scaled independently and run far more replicas than, say, the message template domain, which is not used nearly as often.

Finally, privacy by design is a concept described in the GDPR. For the sake of brevity, Article 5 of the GDPR contains the core principles used in designing these microservices and database collections: lawfulness, fairness and transparency; purpose limitation; data minimization; accuracy; storage limitation; integrity and confidentiality; and accountability. By sticking to these principles we can ensure more privacy-preserving practices within the codebases.

Purpose

As stated in the overview, the purpose of these design choices is to make the implementation simpler and more controlled, and to keep the platform highly available. Here are the main takeaways for this system design:

Simpler. Simpler implementation is important for easier communication, code readability, and debugging.
Private. Privacy by design is ingrained into the design process, ensuring that personal data is shared and processed only when needed.
High availability. High availability and throughput are crucial. Within research, we are typically dealing with large datasets of hundreds of thousands of records. Processing this data as fast as possible is important so that research can start as soon as possible.

Domains

The user domain consists of two data products/entities: user and profile. In terms of business logic, users and profiles are one and the same. However, this doesn't mean we merge the two entities. To control privacy for users, it is easier to manage two separate entities. When performing research, annotators should be anonymous in most cases: they need an account, but they do not need a profile. This is an example of data minimization, purpose limitation, and confidentiality, because only the information needed to process the data is collected.
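The split between the two entities can be sketched as follows. This is a minimal illustration, not Codr's actual schema: the field names (`role`, `display_name`) and the `create_account` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import uuid


@dataclass(frozen=True)
class User:
    """Account record: only what is needed to sign in and be authorized."""
    id: str
    role: str  # hypothetical field, e.g. "annotator" or "researcher"


@dataclass(frozen=True)
class Profile:
    """Optional personal details, kept separate and linked by user id."""
    user_id: str
    display_name: str  # hypothetical field


def create_account(role: str, display_name: Optional[str] = None) -> Tuple[User, Optional[Profile]]:
    """Create a user, and a profile only when personal details are supplied."""
    user = User(id=str(uuid.uuid4()), role=role)
    profile = Profile(user_id=user.id, display_name=display_name) if display_name else None
    return user, profile


# An anonymous annotator gets an account but no profile.
annotator, annotator_profile = create_account("annotator")
# A researcher who chooses to share details gets both.
researcher, researcher_profile = create_account("researcher", "Dr. Example")
```

Keeping the profile optional means an anonymous annotator never has personal details stored in the first place, rather than having them stored and hidden.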

System Communication

There are several ways the microservices that make up Codr interact with each other. For instances where the services need a response, we will be using RESTful APIs in the backend business logic. If an immediate response is not needed, or we need real-time data processing, we will be using Kafka.

RESTful APIs

APIs are very common; they are in almost every application. Outside of server-client communication, we use RESTful APIs to avoid data duplication and to keep logic separated. An easy example is adding a user to a project. When sending this request to the project domain, we may want to add a user by email address. The project domain does not store user records, as that is not its purpose. To find the user, the project domain sends an authenticated machine-to-machine internal REST call to the user domain to fetch a limited subset of the user's data. This data shall contain the user id so that the project domain can append the user to the project member list.
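The flow above can be sketched as follows. The lookup is passed in as a function so the example runs without a network; in a real deployment it would be the authenticated internal REST call. All names here (`Project`, `add_member_by_email`, `fake_user_domain`) are hypothetical and are not taken from the Codr codebase.

```python
from typing import Callable, Dict, List, Optional

# Limited subset of user data returned by the (hypothetical) user domain lookup.
UserSummary = Dict[str, str]
UserLookup = Callable[[str], Optional[UserSummary]]


class Project:
    def __init__(self, name: str) -> None:
        self.name = name
        self.member_ids: List[str] = []  # the project domain stores ids only


def add_member_by_email(project: Project, email: str, lookup_user: UserLookup) -> str:
    """Resolve an email to a user id via the user domain, then store only the id."""
    summary = lookup_user(email)  # assumption: wraps the internal REST call
    if summary is None:
        raise LookupError(f"no user found for {email}")
    user_id = summary["id"]
    if user_id not in project.member_ids:
        project.member_ids.append(user_id)
    return user_id


# Stand-in for the user domain so the sketch is self-contained.
_FAKE_USERS = {"ada@example.com": {"id": "user-123"}}


def fake_user_domain(email: str) -> Optional[UserSummary]:
    return _FAKE_USERS.get(email)


project = Project("demo")
add_member_by_email(project, "ada@example.com", fake_user_domain)
```

The project domain never keeps the email address or any other user field; it only keeps the id it needs for the member list.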

Kafka

"Apache Kafka is an open-source distributed event streaming platform." -- Apache Kafka

What does this mean? Think of Kafka as a publish-subscribe architecture where a "producer" publishes a message to a stream and a "consumer" subscribes to the stream. Kafka is widely used by companies that rely on real-time data processing in microservice architectures. There are several use cases for Kafka, such as system auditing, real-time notifications, and dataset uploading.
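The publish-subscribe idea can be illustrated with a toy in-memory broker. This is only a sketch of the pattern: `InMemoryBroker` is a made-up class, not the Kafka client API, and the topic name and message fields are assumptions.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy stand-in for a message broker; not the Kafka client API."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Handlers run synchronously here; a real broker stores the message
        # and lets each consumer read it at its own pace.
        for handler in self._subscribers[topic]:
            handler(message)


broker = InMemoryBroker()
audit_log: List[dict] = []
notifications: List[str] = []

# Two independent consumers of the same stream: auditing and notifications.
broker.subscribe("user.events", audit_log.append)
broker.subscribe("user.events", lambda m: notifications.append(f"{m['type']} for {m['user_id']}"))

# The producer publishes once and does not know who consumes the message.
broker.publish("user.events", {"type": "project.joined", "user_id": "user-123"})
```

The producer and the consumers only share a topic name, which is what lets auditing or notifications be added later without changing the service that publishes.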

More information can be found under Kafka design.

Use Cases

Dataset processing is a specific problem for this project because datasets can consist of hundreds of thousands of records. With Kafka, we can publish each sample of a dataset independently as its own Kafka message and then, depending on the topic configuration and number of replicas, scale this task horizontally to process the upload upwards of 12x faster (hundreds of samples per second) compared to a single replica.
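To see where horizontal scaling comes from, the sketch below spreads samples across partitions by key and estimates the best-case speedup when each partition is handled by its own replica. The partition count of 12 is an assumption chosen to match the figure above, and CRC32 is a stand-in for illustration, not Kafka's own partitioner.

```python
import zlib
from typing import Dict, List

NUM_PARTITIONS = 12  # assumption for illustration; not a documented Codr setting


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition (CRC32 stand-in, not Kafka's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_samples(sample_ids: List[str]) -> Dict[int, List[str]]:
    """Group sample ids by the partition their key maps to."""
    partitions: Dict[int, List[str]] = {p: [] for p in range(NUM_PARTITIONS)}
    for sample_id in sample_ids:
        partitions[partition_for(sample_id)].append(sample_id)
    return partitions


samples = [f"sample-{i}" for i in range(12_000)]
partitions = assign_samples(samples)

# With one replica per partition working in parallel, the slowest (largest)
# partition bounds the total time, so this ratio is the best-case speedup.
largest = max(len(batch) for batch in partitions.values())
ideal_speedup = len(samples) / largest
```

The estimate ignores network and broker overhead, so a measured speedup would be lower than `ideal_speedup`; it can approach the partition count only when samples are spread evenly.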