
Adding Version Control To Data Lake


Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits come additional complexities, including how to safely integrate new data sources or test changes to existing pipelines. To address these challenges, the team at Treeverse created LakeFS, which introduces version control capabilities to your storage layer. In this episode, Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for using versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform.


  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what LakeFS is and why you built it?
    • There are a number of tools and platforms that support data virtualization and data versioning. How does LakeFS compare to the available options (e.g. Alluxio, Denodo, Pachyderm, DVC)?
  • What are the primary use cases that LakeFS enables?
  • For someone who wants to use LakeFS what is involved in getting it set up?
  • How is LakeFS implemented?
    • How has the design of the system changed or evolved since you began working on it?
    • What assumptions did you have going into it which have since been invalidated or modified?
  • How does the workflow for an engineer or analyst change from working directly against S3 to running against the LakeFS interface?
  • How do you handle merge conflicts and resolution?
    • What are some of the potential edge cases or foot guns that users should be aware of when multiple people are using the same repository?
  • How do you approach management of the data lifecycle or garbage collection to avoid ballooning the cost of storage for a dataset that is tracking a high number of branches with diverging commits?
  • Given that S3 and GCS are eventually consistent storage layers, how do you handle snapshots/transactionality of the data you are working with?
  • What are the axes for scaling an installation of LakeFS?
    • What are the limitations in terms of size or geographic distribution of the datasets?
  • What are some of the most interesting, unexpected, or innovative ways that you have seen LakeFS being used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while building LakeFS?
  • When is LakeFS the wrong choice?
  • What do you have planned for the future of the project?

This article has been published from a wire agency feed without modifications to the text. Only the headline has been changed.
