If open source is the new normal in enterprise software, then that certainly holds for databases, too. In that line of thinking, GitHub is where it all happens. So to have been favorited 10,000 times on GitHub must say something about a project. Open source ArangoDB, which also offers an Enterprise version, hit that milestone recently.
On Aug. 27, ArangoDB announces its new release 3.7, which comes with interesting new features around graph. We take the opportunity to discuss the database market, graph, and beyond, with CEO and co-founder Claudius Weinberger and Head of Engineering and Machine Learning Jörg Schad.
CLOUD AND MACHINE LEARNING READY
ArangoDB was founded in Cologne in 2014 by OnVista veterans Claudius Weinberger and Frank Celler. The team made the headlines in 2019 with their $10 million in Series A funding led by Bow Capital. As Weinberger noted, he and his co-founder have been working together for 20 years, and the decision to pursue their vision was not a spur of the moment idea:
“The main idea for ArangoDB, what is still valid today, is what we call the native multi-model approach. That means that we found a way that we can combine the JSON document data model, the graph model, and the key-value model in one database core with one query language.”
Today ArangoDB is a US company with a German subsidiary. It has a new chief revenue officer, Matt Ekstrom, and a new head of engineering, Schad. Schad joined ArangoDB last year but has been working with ArangoDB for the past four years. With a PhD in database systems, distributed data analytics, and large-scale infrastructure container systems, Schad has been switching between databases.
Two key factors made him join the ArangoDB team: Distribution in a cloud setting and machine learning (ML). ArangoDB has been an early adopter of both Apache Mesos / DC/OS and Kubernetes. Eventually, Kubernetes prevailed, and ArangoDB 3.7 comes with the general availability of its Kubernetes operator, which has been developed over the last three years.
ArangoDB’s Kubernetes operator is also the foundation for its managed service Oasis, available in AWS, Azure, and GCP. The new release includes a number of improvements: faster replacement and movement of servers, improved monitoring and cluster health analysis, advanced inspection of pod failure causes, and overall reduced resource usage. The cluster scalability improvements apply to on-premises deployments, too.
ArangoDB is touted as a solution to unify metadata across machine learning pipelines
ArangoDB has been promoting ArangoML: Using ArangoDB as the infrastructure for teams using ML. The idea is that beyond training data, which is a prerequisite for training ML models, metadata is also important, and using ArangoDB is a good match for that. We have long argued for the importance of metadata. But why ArangoDB, and not any other data management system?
Although ArangoDB has its own sui generis approach, we noticed that in the last year or so its messaging has shifted a bit from the multi-model aspect to emphasize graph. Its people confirmed that, mentioning they’re seeing a lot of demand for graph. Many users come with a graph use case and expand to multi-model use cases later on.
The ArangoDB team believes, however, that more data models are needed to support efficient and successful graph use cases: graph and beyond, with graph as a central use case. Up until recently, the hype was all around graph, too. But those who have been into graph before it was cool knew that hypes come and go, and were expecting the hype to subside at some point.
The first sign came last week, with Gartner’s hype cycle for emerging technology in 2020 moving “graphs and ontologies” to the trough of disillusionment. Apart from the fact that conflating graphs and ontologies does not make much sense to us, we see this as a normal phase in the evolution of new, or in this case, not so new but still hyped, technology.
Schad noted that while graph use cases are on the rise, there’s still a lot of trial and error. Although use cases become more mature, some disillusionment in terms of scalability limits does exist. For Weinberger, it’s a good sign that the overall graph story is moving on, but expecting to do everything faster than other databases should not be the main reason people look at graphs.
GRAPH AND BEYOND
ArangoDB 3.7 comes with a number of improvements around graph capabilities. Disjoint SmartGraphs shard large, hierarchical graphs to a cluster and precisely shard each branch of the graph for local query execution. SmartGraphs applies a smart sharding mechanism, where depending on how data is set up, ArangoDB tries to shard it in a way that the number of hops is minimal between nodes.
With Disjoint SmartGraphs, if the graph is partitioned so that the resulting sub-graphs are disjoint, the query optimizer can apply a number of optimizations that push a lot more computation down to the servers. SatelliteGraphs goes in a similar direction: Replicating graphs to each cluster node for local query execution of multi-model queries, using an automatic approach to replicate metadata across the different nodes.
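To make the sharding idea above more concrete, here is a minimal conceptual sketch in Python. It is not ArangoDB's implementation. It assumes that every vertex carries a partitioning attribute (here a hypothetical region value) and that vertices with the same value are placed on the same shard. All function names are made up for illustration.

```python
from zlib import crc32


def shard_of(partition_value: str, num_shards: int) -> int:
    """Deterministically map a partitioning attribute value to a shard number."""
    return crc32(partition_value.encode("utf-8")) % num_shards


def place_vertices(vertices: dict, num_shards: int) -> dict:
    """vertices maps vertex id -> partitioning attribute value (an assumption)."""
    return {v: shard_of(value, num_shards) for v, value in vertices.items()}


def is_disjoint(edges: list, vertices: dict) -> bool:
    """In this toy model, a graph is disjoint if no edge links two
    vertices that carry different partitioning values."""
    return all(vertices[a] == vertices[b] for a, b in edges)


def cross_shard_edges(edges: list, placement: dict) -> list:
    """Edges whose endpoints sit on different shards, i.e. network hops."""
    return [(a, b) for a, b in edges if placement[a] != placement[b]]


# Hypothetical data: two branches of a hierarchy, keyed by region.
vertices = {"eu/1": "eu", "eu/2": "eu", "us/1": "us", "us/2": "us"}
edges = [("eu/1", "eu/2"), ("us/1", "us/2")]
placement = place_vertices(vertices, num_shards=3)
```

In this model, whenever `is_disjoint` holds, `cross_shard_edges` is empty, so a traversal that starts in one branch never has to leave its shard. That is the property that would allow a whole query to run locally on one server.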
Parallel traversals are slightly different. This feature enables starting a number of graph traversals in parallel, for cases where identifying certain patterns across a large graph is needed. Schad said this currently requires user direction, while automatic parallelization will be introduced in the future.
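The general idea of running several traversals side by side can be sketched in plain Python. This is an illustration only and says nothing about how ArangoDB implements the feature. The graph is a hypothetical in-memory adjacency list, and the caller picks the number of workers explicitly, mirroring the user direction Schad mentioned.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def bfs(graph: dict, start: str, max_depth: int) -> set:
    """Breadth-first traversal returning all vertices within max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


def parallel_traversals(graph: dict, starts: list, max_depth: int, workers: int = 4) -> dict:
    """Run one independent traversal per start vertex on a thread pool.

    The caller chooses `workers`; nothing here is parallelized automatically.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda s: bfs(graph, s, max_depth), starts))
    return dict(zip(starts, results))


# Hypothetical adjacency list with two unconnected components.
graph = {"a": ["b"], "b": ["c"], "x": ["y"]}
reachable = parallel_traversals(graph, ["a", "x"], max_depth=2, workers=2)
```

Python threads do not give real CPU parallelism for this kind of work, so the sketch only shows the structure: the traversals are independent of each other, which is what makes it possible to run them concurrently.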
It’s clear that the focus of these features, as well as the overall approach for ArangoDB, is on graph queries and analytics. This is even more evident considering that some form of schema has only just been introduced. In a recent article, ArangoDB expressed the position that a multi-model approach may be beneficial for knowledge graphs.
While the main argument, i.e. that having multi-model capabilities helps with data transformation, is true, it’s hard for us to conceive how it is possible to talk about knowledge graphs without a schema. Furthermore, the layered cake introduced in that article implies that ArangoDB can be a substrate for knowledge graphs, but at this point we don’t see that claim supported by at least some interoperability layer with graph standards.