Create a Global COVID-19 Vaccinations Progress GraphDB – Introduction to ArangoDB

It’s always a challenge to decide which database technology to choose as part of your architecture. The dilemma architects usually face is whether to work with relational or non-relational databases. However, when it comes to extracting insights from highly connected data, graph databases hold a significant advantage. Graph databases are composed of two kinds of entities: nodes and relationships. They are purpose-built to handle highly connected data, and they have the flexibility, agility and performance to handle massive numbers of data points and the relationships between them.

ArangoDB is a free and open-source native multi-model database system developed by ArangoDB GmbH. It uses collections to store both nodes and edges and a graph to represent the connections between these nodes. In this blog, we will explore how ArangoDB can be used to show the progress of the COVID-19 vaccination process for each country as a graph.

We will work with ArangoDB Oasis, the SaaS edition of ArangoDB. However, ArangoDB is an open-source project and can be deployed in many ways (visit https://www.arangodb.com/ for more information).

[NOTE: This blog assumes you have a working knowledge of Go, AWS Lambda and databases.]

Common Use-Cases For GraphDB

Graph databases are particularly effective for:

  1. Managing social network friend mappings (for example, a map of mutual friends)
  2. Mapping chains of infection
  3. Customer 360-degree analysis

Vaccination status is highly dynamic data that changes constantly. Even so, we will customize our data to keep the vaccination progress graphs interesting yet as simple as possible.

Gathering the data

Our World in Data is a scientific online publication that focuses on large global problems such as poverty, disease, hunger, climate change, war, existential risks, and inequality. OWID has a COVID-19 data GitHub repository from which we can get the vaccination data for each country.

In order to automate the process and keep our data up to date, we can create a Lambda function that runs every 24 hours and fetches the total number of vaccinated people in each country. For each invocation we will have a general run node connected to a node for each country it examines. In addition, each country’s nodes will be chained together to express the progress over time. But before we jump to building the graph database, our Lambda should gather the latest vaccination status.

Our GetVaccinationStatus function gathers the latest vaccination status for a set of countries. For this blog, we will work with the following configuration:

apiMetadata = &api.ApiMetadata{
	URL:             "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.json",
	CountryCodeKey:  "iso_code",
	CountriesFilter: []string{"ISR", "USA", "ESP"},
}
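
To make the fetching step more concrete, here is a minimal sketch of how the latest status per country could be extracted once the JSON has been downloaded. It is not the actual GetVaccinationStatus implementation. The names latestByCountry, countryRecord, dailyRecord and latestStatus are hypothetical, and the JSON shape (a list of countries, each with an iso_code and a data list of daily entries containing date and people_vaccinated) is an assumption about the OWID file.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// countryRecord mirrors the assumed shape of one entry in vaccinations.json.
type countryRecord struct {
	ISOCode string        `json:"iso_code"`
	Data    []dailyRecord `json:"data"`
}

// dailyRecord holds the only fields this sketch needs from a daily entry.
// The count is a pointer so that a missing field can be told apart from zero.
type dailyRecord struct {
	Date             string   `json:"date"`
	PeopleVaccinated *float64 `json:"people_vaccinated"`
}

// latestStatus is the most recent reported figure for one country.
type latestStatus struct {
	Date             string
	PeopleVaccinated int64
}

// latestByCountry parses the raw JSON and returns, for each country code in
// filter, the most recent daily entry that reports people_vaccinated.
func latestByCountry(raw []byte, filter []string) (map[string]latestStatus, error) {
	var records []countryRecord
	if err := json.Unmarshal(raw, &records); err != nil {
		return nil, fmt.Errorf("decoding vaccinations JSON: %w", err)
	}
	wanted := make(map[string]bool, len(filter))
	for _, code := range filter {
		wanted[code] = true
	}
	result := make(map[string]latestStatus)
	for _, rec := range records {
		if !wanted[rec.ISOCode] {
			continue
		}
		// Assumption: daily entries are sorted by ascending date,
		// so the latest usable entry is found by walking backwards.
		for i := len(rec.Data) - 1; i >= 0; i-- {
			day := rec.Data[i]
			if day.PeopleVaccinated == nil {
				continue
			}
			result[rec.ISOCode] = latestStatus{
				Date:             day.Date,
				PeopleVaccinated: int64(*day.PeopleVaccinated),
			}
			break
		}
	}
	return result, nil
}

func main() {
	// Mock payload; a real run would read the body of an HTTP GET on the URL.
	raw := []byte(`[{"iso_code":"ISR","data":[
		{"date":"2021-08-26","people_vaccinated":100},
		{"date":"2021-08-27","people_vaccinated":120}]}]`)
	statuses, err := latestByCountry(raw, []string{"ISR", "USA", "ESP"})
	if err != nil {
		panic(err)
	}
	fmt.Println(statuses["ISR"].Date, statuses["ISR"].PeopleVaccinated)
}
```

Keeping the parsing separate from the HTTP call makes it easy to feed the function mock data, which is exactly what we do in the later runs.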

Now that we have a reliable way to fetch the required data, we can move to setting up our database.

Designing our graph

As previously mentioned, ArangoDB manages data in 2 different types of collections: document collections for nodes, and edge collections for relationships. We’re going to design our data in the following way:

  1. Create a document collection, Runs, for all the Lambda invocations
  2. Create a document collection, Vaccinations, for the vaccination nodes of each country
  3. Create an edge collection, VaccinationEdges, for the relationships between the latest run and its vaccination nodes
  4. Create a relationship between each country’s latest vaccination status and its previous vaccination status
  5. Create a graph, runs-graph, to show the relationships between the nodes
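
As a rough illustration of this design, the documents could be modeled in Go as below. This is a sketch under assumptions: the struct names, the attribute names other than _key, _from and _to, the key formats and the edge direction (from the run to the vaccination node) are all hypothetical. What ArangoDB does require is that an edge’s _from and _to hold full document IDs of the form collection/key.

```go
package main

import "fmt"

const (
	runsCollection         = "Runs"
	vaccinationsCollection = "Vaccinations"
)

// runDoc represents one Lambda invocation in the Runs collection.
type runDoc struct {
	Key string `json:"_key"` // assumed format: the run date, e.g. "2021-08-27"
}

// vaccinationDoc represents one country's status in the Vaccinations collection.
type vaccinationDoc struct {
	Key              string `json:"_key"` // assumed format: "<iso_code>-<date>"
	Country          string `json:"country"`
	PeopleVaccinated int64  `json:"people_vaccinated"`
}

// edgeDoc is stored in the edge collection and links two documents by ID.
type edgeDoc struct {
	From string `json:"_from"`
	To   string `json:"_to"`
}

// documentID builds the "collection/key" ID that edges must reference.
func documentID(collection, key string) string {
	return fmt.Sprintf("%s/%s", collection, key)
}

// runEdge links a run to one of its vaccination nodes (direction is assumed).
func runEdge(run runDoc, v vaccinationDoc) edgeDoc {
	return edgeDoc{
		From: documentID(runsCollection, run.Key),
		To:   documentID(vaccinationsCollection, v.Key),
	}
}

func main() {
	run := runDoc{Key: "2021-08-27"}
	isr := vaccinationDoc{Key: "ISR-2021-08-27", Country: "ISR", PeopleVaccinated: 120}
	fmt.Printf("%+v\n", runEdge(run, isr))
}
```

Note that the keys here use dashes rather than slashes, since a slash separates the collection name from the key in a document ID.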

Graphing 101

First of all, we have to understand that between different Lambda executions, a specific country’s vaccination status can end up in one of 3 states:

  1. New: the previous run didn’t have a node for the specific country.
  2. Changed: the previous run had a node for the specific country but the total number of vaccinated people changed.
  3. Old: the previous run had a node for the specific country with the same total number of vaccinated people.

Each case is handled differently. For New nodes, we create a Vaccination document and an edge connecting it to the current run. The same applies to Changed nodes, which also get an additional edge between the country’s current node and its previous one, carrying the difference.

Old nodes, on the other hand, only get an edge between the current run and the country’s previous Vaccination node. Because the total number of vaccinated people didn’t change, there is no need to create a redundant node. A graph database lets us simply relate the previous run’s node to the current run.
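
The rules for the three cases can be summarized in a few lines of Go. This is only a sketch of the decision logic described above, not code from the project; the status, writePlan and planFor names are made up for illustration.

```go
package main

import "fmt"

type status int

const (
	statusNew status = iota
	statusChanged
	statusOld
)

// writePlan lists what one country contributes to the graph in a single run.
type writePlan struct {
	CreateNode   bool   // insert a new Vaccination document
	RunEdgeTo    string // node the current run points at: "current" or "previous"
	PreviousEdge bool   // link the new node to the country's previous node
	Difference   int64  // change in vaccinated people, stored on that link
}

// String renders the plan for logging.
func (p writePlan) String() string {
	return fmt.Sprintf("createNode=%t runEdgeTo=%s previousEdge=%t difference=%d",
		p.CreateNode, p.RunEdgeTo, p.PreviousEdge, p.Difference)
}

// planFor maps a status to the documents and edges that have to be written.
func planFor(s status, prevTotal, currTotal int64) writePlan {
	switch s {
	case statusNew:
		return writePlan{CreateNode: true, RunEdgeTo: "current"}
	case statusChanged:
		return writePlan{
			CreateNode:   true,
			RunEdgeTo:    "current",
			PreviousEdge: true,
			Difference:   currTotal - prevTotal,
		}
	default: // statusOld: reuse the previous node, no new document
		return writePlan{RunEdgeTo: "previous"}
	}
}

func main() {
	fmt.Println(planFor(statusNew, 0, 50))
	fmt.Println(planFor(statusChanged, 100, 120))
	fmt.Println(planFor(statusOld, 200, 200))
}
```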

This kind of architecture might be intimidating. However, with ArangoDB it can be easily implemented. Let’s proceed with handling the vaccination data within our Lambda. We will watch the graph develop as we simulate all 3 cases.

The Art of Graphing

After gathering data for 3 countries (for simplicity), our first run produces the following graph:

Each country has its own vaccination node according to the date

We have a node for the run with edges to the vaccination status node of each country. We don’t have other relationships since all the nodes were New in our first run.

We created this graph in two steps. First, we use CreateNewRunNode to create a node for the specific run. Our run identifier is the current date (in the example above, 27/08/2021). Afterwards, we use HandleNewCountries to create nodes for our New entities, along with edges to the current run (using HandleNewEdges).

Before the second run, we want to change the configuration by adding one more country. The other 3 will be marked as Changed because we increased their number of vaccinated people in our mock data.

apiMetadata = &api.ApiMetadata{
	URL:             "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.json",
	CountryCodeKey:  "iso_code",
	CountriesFilter: []string{"ISR", "USA", "ESP", "FRA"},
}

For the second execution (28/08/2021), we need to compare the results with the previous run (27/08/2021). To get all the nodes related to the previous run, we execute an AQL query:

FOR v IN 1..1 ANY @runDate GRAPH 'runs-graph' FILTER v.collection == 'Vaccinations' RETURN v

We used a graph traversal to get all the nodes of type Vaccinations at depth 1 from our run. This query returns all the vaccination statuses related to the given run. Now we can compare both runs to see how many nodes were added (new countries), modified (total number of vaccinations changed) or left unchanged (total number of vaccinations remained the same).
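
One detail worth spelling out is the @runDate bind parameter: an AQL traversal starts from a vertex, so the value bound to it has to be a document ID (collection/key) rather than a bare date. Below is a minimal sketch of preparing the query and its bind variables in Go. The helper name and the assumption that runs live in a collection called Runs, keyed by date, are ours; with the ArangoDB Go driver, the two values would typically be handed to the database’s Query method.

```go
package main

import "fmt"

// relatedVaccinationsQuery is the traversal from the text, split over lines.
const relatedVaccinationsQuery = `
FOR v IN 1..1 ANY @runDate GRAPH 'runs-graph'
  FILTER v.collection == 'Vaccinations'
  RETURN v`

// relatedVaccinationsBindVars builds the bind variables for the query.
// The key of the map omits the leading "@" used inside the query text.
// Assumption: run documents are stored in "Runs" and keyed by run date.
func relatedVaccinationsBindVars(runKey string) map[string]interface{} {
	return map[string]interface{}{
		"runDate": fmt.Sprintf("Runs/%s", runKey),
	}
}

func main() {
	bindVars := relatedVaccinationsBindVars("2021-08-27")
	fmt.Println(relatedVaccinationsQuery)
	fmt.Println(bindVars)
}
```

Using a bind parameter instead of concatenating the date into the query string keeps the query text constant between runs and avoids injection problems.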

The compareToPrevRunNodes method runs through the current drift and looks for the previous node of each country. Once found, it checks whether the number of vaccinated people changed, which determines whether the node is Changed or Old. If the country wasn’t part of the previous run, it is marked as New.
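
The comparison itself boils down to a lookup per country. Here is a simplified sketch of that logic, assuming both runs have already been reduced to a map from country code to total vaccinated people. The classify function and its types are hypothetical and stand in for the real compareToPrevRunNodes, which works on the documents returned by the query.

```go
package main

import "fmt"

type status string

const (
	statusNew     status = "New"
	statusChanged status = "Changed"
	statusOld     status = "Old"
)

// classify compares the current totals against the previous run's totals.
// Both maps are keyed by country code.
func classify(prev, curr map[string]int64) map[string]status {
	result := make(map[string]status, len(curr))
	for code, total := range curr {
		prevTotal, seen := prev[code]
		switch {
		case !seen:
			result[code] = statusNew // country was not part of the previous run
		case prevTotal != total:
			result[code] = statusChanged // total moved since the previous run
		default:
			result[code] = statusOld // same total, previous node can be reused
		}
	}
	return result
}

func main() {
	// Mock totals resembling the second run: France is added, the rest grow.
	prev := map[string]int64{"ISR": 100, "USA": 200, "ESP": 300}
	curr := map[string]int64{"ISR": 110, "USA": 210, "ESP": 320, "FRA": 50}
	for code, s := range classify(prev, curr) {
		fmt.Println(code, s)
	}
}
```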

In our case, France is a new country and the others were changed. To handle Changed nodes, we need to create a new Vaccination document, plus edges between the new node and its previous node and between the new node and the current run.

func (graph *ArangoDB) HandleChangedCountries(
	ctx context.Context,
	logger *zerolog.Logger,
	date string,
	changedNodes []interface{},
) error {
	// Maps the key of each new document to the ID of the node it follows.
	var newPrevMap = make(map[string]interface{}, len(changedNodes))
	var nodes []interface{}
	for _, node := range changedNodes {
		nodeMap := node.(map[string]interface{})
		// prevAssetId is only bookkeeping, so remove it before saving the document.
		if prevId, ok := nodeMap["prevAssetId"]; ok {
			delete(nodeMap, "prevAssetId")
			nodes = append(nodes, nodeMap)
			newPrevMap[nodeMap["_key"].(string)] = prevId
		}
	}

	// Batch 1: create the new Vaccinations documents.
	docs, err := graph.createNewVaccinationsDocuments(ctx, logger, nodes)
	if err != nil {
		logger.Err(err).Int("vaccinations", len(docs)).
			Msg("An error occurred while trying to save in collection")
		return err
	}

	// Collect new and previous IDs in matching order for the edge batches.
	var newIds []interface{}
	var prevIds []interface{}

	for _, node := range docs {
		prevId := newPrevMap[node.Key]
		newIds = append(newIds, node.ID.String())
		prevIds = append(prevIds, prevId)
	}

	// Batch 2: connect the new nodes to the current run.
	err = graph.HandleNewEdges(
		ctx,
		logger,
		date,
		newIds)
	if err != nil {
		logger.Err(err).Int("count", len(newIds)).Msg("An error occurred while trying to save in collection")
		return err
	}

	// Batch 3: connect each new node to its previous node.
	err = graph.createEdgeBetweenOldAndNewNodes(
		ctx,
		logger,
		prevIds,
		newIds,
	)
	if err != nil {
		logger.Err(err).Interface("new_nodes", newIds).Interface("prev_nodes", prevIds).
			Msg("Failed creating edge between old and new nodes")
		return err
	}
	return nil
}

When we created the Changed nodes, we added prevAssetId so that our function can also create the edges for all the changed nodes, using 3 batch requests:

  1. Creating the Vaccinations documents.
  2. Creating VaccinationsEdges between them and their run.
  3. Creating VaccinationsEdges between them and their previous data.
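
To show why the new and previous IDs are collected in matching order, here is a small, database-free sketch of how the two edge batches could be assembled. The changedNode and edge types, the buildEdgeBatches name, the edge direction and the difference attribute are assumptions for illustration, not the project’s actual structures.

```go
package main

import "fmt"

// changedNode is a typed stand-in for a Changed node after its document
// has been created.
type changedNode struct {
	NewID     string // ID of the freshly created Vaccinations document
	PrevID    string // ID of the node it follows (the prevAssetId value)
	Total     int64  // current number of vaccinated people
	PrevTotal int64  // number of vaccinated people in the previous node
}

// edge is a simplified edge document.
type edge struct {
	From       string
	To         string
	Difference int64
}

// buildEdgeBatches returns one slice per batch request: the edges from the
// run to the new nodes, and the edges from each previous node to its successor.
func buildEdgeBatches(runID string, nodes []changedNode) (runEdges, prevEdges []edge) {
	for _, n := range nodes {
		runEdges = append(runEdges, edge{From: runID, To: n.NewID})
		prevEdges = append(prevEdges, edge{
			From:       n.PrevID,
			To:         n.NewID,
			Difference: n.Total - n.PrevTotal,
		})
	}
	return runEdges, prevEdges
}

func main() {
	nodes := []changedNode{
		{NewID: "Vaccinations/ISR-2021-08-28", PrevID: "Vaccinations/ISR-2021-08-27", Total: 110, PrevTotal: 100},
		{NewID: "Vaccinations/ESP-2021-08-28", PrevID: "Vaccinations/ESP-2021-08-27", Total: 320, PrevTotal: 300},
	}
	runEdges, prevEdges := buildEdgeBatches("Runs/2021-08-28", nodes)
	fmt.Println(len(runEdges), "run edges,", len(prevEdges), "previous-node edges")
}
```

Sending each slice in a single request keeps the number of round trips constant, no matter how many countries changed.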

As a result, we now have the following graph:

All countries besides France have 2 nodes for 2 days.

We can see that France is the only node with one edge because it wasn’t part of our previous run. In addition, each changed node has 2 edges: one to the latest run and one to its previous node.

Before the third run (29/08/2021), we want to mock the data so that the USA is the only country whose number didn’t change. The others will be marked as Changed because we increased their number of vaccinated people in our mock data.

To create the graph, we need to handle Old nodes, which is very simple: we create an edge between the current run and the previous node. So simple yet so effective. We don’t create an unnecessary new node because the number of people didn’t change. Creating edges between nodes and a run is already implemented for us in HandleNewEdges, and our compareToPrevRunNodes method already marks Old nodes. We just need to call HandleNewEdges once again with the previous nodes’ IDs to get this final graph:

The USA doesn’t have a new node for the third day; the third run points to the previous run’s node instead.

As we can see, the USA node from our previous run is now related to our last run too! For the other countries, we created new nodes with edges to their previous nodes and to the current run (the results of the first run were purposely omitted from the picture for simplicity).

Voilà

Summary

In this blog we used the COVID-19 vaccination status to create a graph database using the powerful ArangoDB. We worked on a limited amount of data to reduce complexity and focus on the logic. However, graph databases are capable of handling much more data and many more relationships. The purpose of the example is to show the potential of such a database. ArangoDB provides a layer that reduces complexity and a powerful API for working with this technology.

Source code: https://github.com/liavyona/vaccinations-graph
