Aug 16 · 5 min read
The industry demand for Data Engineers is constantly on the rise, and with it, more and more software engineers and recent graduates are trying to enter the field. The biggest hurdle for newcomers lies in understanding the Data Engineering landscape and getting hands-on experience with the relevant frameworks.
We at Insight offer a seven-week, tuition-free Fellowship for transitioning into Data Engineering and have worked with hundreds of Fellows who had to overcome this exact hurdle. We asked them which resources were particularly helpful in making the transition, and the results are in: see below for the top 10 blog posts and resources!
Data Engineering Landscape
Before jumping into a project or choosing frameworks to work with, it is important to take a step back and look at the big picture. What is Data Engineering, and what is the role of a Data Engineer? What are the most important concepts and frameworks to understand? Here is a collection of great articles that shed light on these questions and have helped our Fellows in the past:
1. Getting Started with Data Engineering
This blog post by Richard Taylor starts with a discussion of what big data and data engineering really mean before delving into an overview of the current landscape. The author strikes a great balance between being concise yet deep enough to cover topics such as the CAP theorem or resource managers.
2. Want to Become a Data Engineer?
This article by Pranav Dar walks through the technical skills a data engineer should acquire and lists useful resources for getting started. It begins with the basics, linking introductory articles on Python, but also includes courses covering the Hadoop ecosystem, including Spark and Hive.
3. Distributed Architecture Concepts I Learned While Building a Large Payments System
You can think of this blog post by Gergely Orosz as the notebook the author built up while making the transition themselves. Rather than jumping into the details of specific frameworks, it focuses on recurring fundamental concepts that are helpful with any tech stack.
4. Data Engineering Cookbook
This cookbook by Andreas Kretz is not yet complete but has already gathered a huge following. This is not surprising given the amount of high-quality content it already contains, from a definition of Data Engineering (the author likes to refer to it as the plumbing of Data Science) to agile development methodologies to in-depth discussions of Hadoop and Docker. It is definitely worth bookmarking, as it is rapidly evolving into one of the most comprehensive resources for Data Engineers.
Data Engineering Projects
Now that you have gotten an overview of the most relevant concepts and frameworks, you might feel overwhelmed — you are not alone! Our Fellows found that it is extremely helpful to look at some case studies/projects from leading Data Engineering teams to understand how the different pieces can fit together to build a cohesive pipeline/platform.
5. Detecting Image Similarity Using Spark, LSH and TensorFlow
In this article by Andrey Gusev, an engineer on Pinterest's Content Quality team, we get an in-depth tour of how Pinterest detects duplicate images using Spark and TensorFlow. In particular, it covers Locality Sensitive Hashing (LSH) in Spark in great detail, making it a great example of a more advanced yet still accessible first Spark project!
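The core trick behind LSH is easy to demo outside of Spark: hash each vector into a short bit signature so that similar vectors are likely to collide in the same bucket. Here is a minimal random-hyperplane sketch in plain Python (the vectors, dimensions, and plane count are all illustrative, not taken from Pinterest's pipeline):

```python
import random

def lsh_signature(vec, planes):
    # One bit per random hyperplane: which side of the plane the
    # vector falls on. Vectors with a small angle between them tend
    # to land on the same side of most planes.
    return tuple(
        int(sum(p * v for p, v in zip(plane, vec)) >= 0)
        for plane in planes
    )

random.seed(42)
dim, n_planes = 4, 8
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

a = [1.0, 0.9, 0.0, 0.2]
b = [1.0, 0.9, 0.0, 0.2]  # an exact duplicate of a
sig_a = lsh_signature(a, planes)
sig_b = lsh_signature(b, planes)
# Exact duplicates always produce the same signature, so they hash
# to the same bucket; near-duplicates do so with high probability.
```

In a Spark job, such signatures would typically become grouping keys, so only images that share a bucket need an expensive pairwise comparison.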
6. The Lyft Data Platform: Now and in the Future
This one is technically not a blog post but a slide deck, but we still wanted to include it, as it is an excellent overview of Lyft's data platform and how it has evolved. It does not just illustrate how technologies such as Spark, Presto, Kafka, Hive, Druid, and Flink can work together; it also highlights how to build a platform that hosts both streaming and batch applications.
Data Engineering Frameworks
You have an understanding of the big picture and you have seen examples of cutting-edge projects; now it is time to build your very own pipeline! Once you are working on a Data Engineering project, you will inevitably look for resources to better understand the frameworks of your choice. Maybe you want to get started with Spark, or deploy your first Elasticsearch cluster? Here are some articles that go into more depth on various popular technologies:
7. Cassandra Schemas for Beginners (Like Me)
When Fellows start working with their first NoSQL database, they often need some time to get accustomed to the very different way of defining tables and running queries. This article by Joe Chasinga is a great resource for anyone dabbling in Cassandra and in need of a gentle introduction to its basic concepts.
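To make that mental-model shift concrete: a Cassandra table is designed query-first, with a partition key that determines where rows live and a clustering key that orders rows within a partition. As a rough, heavily simplified plain-Python model of that layout (the table and key names are made up for illustration; real Cassandra also hashes the partition key to place partitions on nodes):

```python
from collections import defaultdict

# Model: partition key -> list of (clustering key, value) rows,
# kept sorted by clustering key. In CQL terms, roughly:
#   PRIMARY KEY ((user_id), event_time)
table = defaultdict(list)

def insert(user_id, event_time, payload):
    table[user_id].append((event_time, payload))
    # The clustering key keeps rows ordered *within* one partition.
    table[user_id].sort()

insert("alice", 2, "click")
insert("alice", 1, "login")
insert("bob", 1, "login")

# Queries must name the partition key; range scans are only cheap
# inside a single partition, in clustering-key order.
alice_events = [payload for _, payload in table["alice"]]
```

This is why Cassandra schemas look denormalized compared to SQL: you model one table per query pattern, so every read hits a single, pre-sorted partition.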
8. Concurrency, MySQL and Node.js: A Journey of Discovery
Once you start scaling your platform, writing to your database can become a huge bottleneck. Karl Düüna's article is a very approachable journey into concurrency in MySQL and how to achieve high write and read throughput without running into a nightmare of locks.
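One recurring lesson from articles like this is that committing every row individually is far slower than batching writes into a single transaction. A minimal sketch of the idea, using Python's built-in sqlite3 in place of MySQL (the table and column names are invented for the example):

```python
import sqlite3

# An in-memory database stands in for MySQL; the batching principle
# is the same for any transactional store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

rows = [(f"event-{i}",) for i in range(1000)]

# One transaction, one commit: the write lock is taken once for all
# 1000 inserts instead of once per row, avoiding per-row commit
# overhead and lock churn.
with conn:
    conn.executemany("INSERT INTO events (payload) VALUES (?)", rows)

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

On a real MySQL server the same pattern (multi-row inserts or explicit transactions) is usually the first fix for write-throughput problems, before reaching for connection pools or queues.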
9. Apache Spark @ Scale
Anyone who has ever written a Spark job knows the feeling: your code runs just fine with your test data, but the moment you unleash the full data set onto your Spark cluster, things start to break! Sital Kedia, Shuojie Wang, and Avery Ching from Facebook's Core Data/Data Infrastructure team walk through a comprehensive case study on how to optimize a Spark job.
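One classic way a job breaks only at full scale is data skew: a handful of hot keys overload a single partition while the rest of the cluster sits idle. A common remedy (sketched here in plain Python rather than Spark, with invented key names and counts) is to "salt" keys with a random suffix so a hot key's records fan out across several partitions:

```python
import random
from collections import Counter

random.seed(0)
# 900 of 1000 records share one hot key; in a groupBy, all of them
# would land on the same partition.
records = [("hot", i) for i in range(900)] + [("cold", i) for i in range(100)]

plain = Counter(key for key, _ in records)
max_plain = max(plain.values())  # one partition does 90% of the work

# Salting: append a random suffix so the hot key spreads over N
# sub-keys; the N partial results are merged in a cheap second pass.
N = 10
salted = Counter(f"{key}#{random.randrange(N)}" for key, _ in records)
max_salted = max(salted.values())  # largest partition is now much smaller
```

The Facebook article goes much further (memory tuning, shuffle configuration, and scheduler changes), but uneven partition sizes are usually the first thing to check when a job that passed on test data dies in production.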
10. How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka
Apache Kafka is one of the leading message-ingestion frameworks for real-time streaming applications and a great tool for serving ML models in production to potentially millions of users. If you are just getting started with serving ML models, look no further than this article to learn best practices from Kai Waehner of Confluent (the company founded by the team that built Kafka).