ArangoDB 3.7 open source multi-model database released

If open source is the new normal in enterprise software, then that certainly holds for databases, too. In that line of thinking, GitHub is where it all happens. So to have been favorited 10,000 times on GitHub must say something about a project. Open source ArangoDB, which also offers an Enterprise version, hit that milestone recently.

On Aug. 27, ArangoDB announced its new release 3.7, which comes with interesting new graph features. We took the opportunity to discuss the database market, graph, and beyond, with CEO and co-founder Claudius Weinberger and Head of Engineering and Machine Learning Jörg Schad.

CLOUD AND MACHINE LEARNING READY

ArangoDB was founded in Cologne in 2014 by OnVista veterans Claudius Weinberger and Frank Celler. The team made headlines in 2019 with its $10 million Series A funding round led by Bow Capital. As Weinberger noted, he and his co-founder have been working together for 20 years, and the decision to pursue their vision was not a spur-of-the-moment idea:

“The main idea for ArangoDB, what is still valid today, is what we call the native multi-model approach. That means that we found a way that we can combine the JSON document data model, the graph model, and the key-value model in one database core with one query language.”
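To make the multi-model idea concrete, here is a toy sketch in Python. It is not ArangoDB code, and the names `store`, `get`, `find`, and `neighbors` are ours, made up for illustration. It only shows how one set of JSON-like documents can be read by key, filtered as documents, and walked as a graph, assuming edges are stored as documents carrying `_from` and `_to` fields.

```python
# Toy illustration only: one store, three ways of reading it.
# All names here are hypothetical and not part of any ArangoDB API.
store = {
    "users/alice": {"name": "Alice", "city": "Cologne"},
    "users/bob": {"name": "Bob", "city": "Berlin"},
    # An edge is just another document that points from one key to another.
    "knows/1": {"_from": "users/alice", "_to": "users/bob", "since": 2014},
}


def get(key):
    """Key-value access: fetch one document by its key."""
    return store[key]


def find(collection, **filters):
    """Document access: filter a collection by attribute values."""
    prefix = collection + "/"
    return [
        doc
        for key, doc in store.items()
        if key.startswith(prefix)
        and all(doc.get(field) == value for field, value in filters.items())
    ]


def neighbors(key, edge_collection):
    """Graph access: follow outgoing edges from a vertex."""
    prefix = edge_collection + "/"
    return [
        doc["_to"]
        for edge_key, doc in store.items()
        if edge_key.startswith(prefix) and doc["_from"] == key
    ]
```

In ArangoDB itself, all three access paths go through AQL; the sketch only mirrors the idea of one core serving several models.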

Today ArangoDB is a US company with a German subsidiary. It has a new chief revenue officer, Matt Ekstrom, and a new head of engineering, Schad. Schad joined ArangoDB last year but has been working with ArangoDB for the past four years. With a PhD in database systems, distributed data analytics, and large-scale infrastructure container systems, Schad has been switching between databases.

Two key factors made him join the ArangoDB team: Distribution in a cloud setting and machine learning (ML). ArangoDB has been an early adopter of both Apache Mesos / DC/OS and Kubernetes. Eventually, Kubernetes prevailed, and ArangoDB 3.7 comes with the general availability of its Kubernetes operator, which has been developed over the last three years.

ArangoDB’s Kubernetes operator is also the foundation for its managed service Oasis, available in AWS, Azure, and GCP. The new release includes a number of improvements for faster replacement and movement of servers, improved monitoring and cluster health analysis, an advanced inspection of pod failure causes, and overall reduced resource usage. Cluster scalability improvements for on-premise deployment apply too.

ArangoDB is touted as a solution to unify metadata across machine learning pipelines

ArangoDB has been promoting ArangoML: Using ArangoDB as the infrastructure for teams using ML. The idea is that beyond training data, which is a prerequisite for training ML models, metadata is also important, and using ArangoDB is a good match for that. We have long argued for the importance of metadata. But why ArangoDB, and not any other data management system?

Although ArangoDB has its own sui generis approach, we noticed that in the last year or so its messaging has shifted a bit from the multi-model aspect to emphasize graph. Its people confirmed that, mentioning they’re seeing a lot of demand for graph. Many users are coming with a graph use case and expand upon multi-model use cases later on.

The ArangoDB team believes, however, that more data models are needed to support efficient and successful graph use cases: graph and beyond, with graph as a central use case. Up until recently, the hype was all around graph, too. But those who have been into graph before it was cool knew that hypes come and go, and were expecting the hype to subside at some point.

The first sign came last week, with Gartner’s hype cycle for emerging technology in 2020 moving “graphs and ontologies” to the trough of disillusionment. Apart from the fact that conflating graphs and ontologies does not make much sense to us, we see this as a normal phase in the evolution of new, or in this case, not so new but still hyped, technology.

Schad noted that while graph use cases are on the rise, there’s still a lot of trial and error. Although use cases become more mature, some disillusionment in terms of scalability limits does exist. For Weinberger, it’s a good sign that the overall graph story is moving on, but expecting to do everything faster than other databases should not be the main reason people look at graphs.

GRAPH AND BEYOND

ArangoDB 3.7 comes with a number of improvements around graph capabilities. Disjoint SmartGraphs shard large, hierarchical graphs across a cluster, placing each branch of the graph precisely so that queries on it can execute locally. SmartGraphs apply a smart sharding mechanism: depending on how data is set up, ArangoDB tries to shard it so that the number of hops between nodes is minimal.
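The effect of sharding by a shared attribute rather than by vertex key can be shown with a small Python sketch. This is a toy model under our own assumptions, not ArangoDB's sharding code: `shard_of`, `cross_shard_edges`, the `region` values, and the sample data are all hypothetical.

```python
import zlib


def shard_of(value, num_shards):
    """Deterministic shard assignment from a string (toy hash placement)."""
    return zlib.crc32(value.encode("utf-8")) % num_shards


def cross_shard_edges(edges, placement):
    """Count edges whose two endpoints live on different shards."""
    return sum(1 for src, dst in edges if placement[src] != placement[dst])


# Hypothetical data: each vertex carries a region attribute.
vertices = {
    "alice": "eu", "bob": "eu", "carol": "eu",
    "dave": "us", "erin": "us", "frank": "us",
}
edges = [
    ("alice", "bob"), ("bob", "carol"),   # inside "eu"
    ("dave", "erin"), ("erin", "frank"),  # inside "us"
    ("carol", "dave"),                    # the only edge between regions
]
NUM_SHARDS = 2

# Placement by vertex key ignores how vertices are connected.
by_key = {key: shard_of(key, NUM_SHARDS) for key in vertices}
# Placement by the shared attribute keeps each region on one shard.
by_attribute = {key: shard_of(region, NUM_SHARDS) for key, region in vertices.items()}

print("cross-shard edges, by key:      ", cross_shard_edges(edges, by_key))
print("cross-shard edges, by attribute:", cross_shard_edges(edges, by_attribute))
```

With attribute-based placement, at most the one edge between regions can cross a shard boundary, so a traversal that stays within a region never leaves its shard.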

With Disjoint SmartGraphs, if the resulting sub-graphs are partitioned so that they are disjoint, a number of optimizations in the query optimizer can then push a lot more computation down to the servers. SatelliteGraphs go in a similar direction: replicating graphs to each cluster node for local query execution of multi-model queries, using an automatic approach to replicate metadata across the different nodes.
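Why disjointness matters can be sketched in a few lines of Python. This is an illustration under our own assumptions, with hypothetical names (`is_disjoint`, `traverse`, `partitions_touched`), not the optimizer's actual logic: if no edge crosses a partition boundary, any traversal stays inside the partition it starts in, so the whole traversal can run on the server holding that partition.

```python
from collections import defaultdict, deque


def is_disjoint(edges, partition_of):
    """True if no edge connects vertices from two different partitions."""
    return all(partition_of[src] == partition_of[dst] for src, dst in edges)


def traverse(start, edges):
    """Breadth-first traversal over outgoing edges; returns the visited set."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency[vertex]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def partitions_touched(start, edges, partition_of):
    """Partitions a traversal from `start` would have to visit."""
    return {partition_of[vertex] for vertex in traverse(start, edges)}


# Hypothetical sample data: two branches with no edge between them.
partition_of = {"alice": "eu", "bob": "eu", "carol": "eu", "dave": "us", "erin": "us"}
disjoint_edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
# Adding one edge between the branches breaks disjointness.
crossing_edges = disjoint_edges + [("carol", "dave")]
```

In the disjoint case, `partitions_touched("alice", disjoint_edges, partition_of)` contains only the starting partition, which is the property that would let a query be pushed down to a single server.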

Parallel traversals are slightly different. This feature enables starting a number of graph traversals in parallel, for cases where identifying certain patterns across a large graph is needed. Schad said this currently requires user direction, while automatic parallelization will be introduced in the future.
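As a rough picture of what starting several traversals in parallel means, here is a minimal Python sketch using a thread pool. It is a generic illustration, not how ArangoDB implements the feature; `GRAPH`, `reachable_within`, and `parallel_traversals` are hypothetical names, and the worker count stands in for the user direction Schad mentioned.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical adjacency list standing in for a large graph.
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
    "x": ["y"],
    "y": [],
}


def reachable_within(start, max_depth):
    """Depth-limited breadth-first traversal from one start vertex."""
    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for vertex in frontier:
            for neighbor in GRAPH.get(vertex, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen


def parallel_traversals(starts, max_depth, workers=4):
    """Run one traversal per start vertex on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda start: reachable_within(start, max_depth), starts)
        # map() yields results in the same order as the inputs.
        return dict(zip(starts, results))


if __name__ == "__main__":
    for start, found in parallel_traversals(["a", "x"], max_depth=2).items():
        print(start, sorted(found))
```

Each traversal is independent, so the caller decides how many to run at once; the traversals share the read-only graph and nothing else.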

It’s clear that the focus of these features, as well as the overall approach for ArangoDB, is on graph queries and analytics. This is even more evident considering that some form of schema has only just been introduced. In a recent article, ArangoDB expressed the position that a multi-model approach may be beneficial for knowledge graphs.

While the main argument, i.e. that having multi-model capabilities helps with data transformation, is true, it’s hard for us to conceive how it is possible to talk about knowledge graphs without a schema. Furthermore, the layered cake ArangoDB introduced implies that it can be a substrate for knowledge graphs, but at this point we don’t see that supported by at least some interoperability layer with graph standards.

ArangoDB touts multi-model as a good approach to tackle some of the issues with knowledge graphs. That sounds good, but there are pieces missing from that cake. Image: ArangoDB

When discussing this with ArangoDB’s team, they mentioned that AQL, ArangoDB’s query language, is an integral part of its multi-model capabilities. While SPARQL does not work for ArangoDB, which makes sense considering ArangoDB’s model supports property graphs, ArangoDB participates in the GQL query language standardization effort for property graphs.

Understandably, this may take a while. Equally understandably, ArangoDB’s team expressed the conviction that AQL will still be the preferred way to access data in ArangoDB. They also said that being clear about not being SQL-compatible comes with the territory. What is not understandable to us, however, is the lack of support for interoperability on the graph data import/export level.

Support for RDF import/export, for example, which other graph databases offer, would be an obvious benefit. ArangoDB’s team noted there is community work going on in that area, but it’s not yet open-sourced or included in ArangoDB’s distribution. In terms of graph capabilities, we see ArangoDB as a typical product in the property graph category: more suitable for analytics, less so for data integration/knowledge management.

Overall, ArangoDB’s multi-model capabilities and distributed-first approach make it an interesting offering for a number of use cases. If you are willing to dive into its sui generis approach and have use cases that match it, it’s certainly worth considering.

This article has been published from the source link without modifications to the text. Only the headline has been changed.
