It has lately been said that ‘data scientist’ is the sexiest job of the 21st century. Now, however, data engineering is poised to give data science tough competition: data engineering jobs are becoming even more popular than data science jobs.
So once you’ve decided data engineering is the field for you, understand that becoming a great data engineer is a journey, not a destination. Everyone talks about success stories and what to do; almost nobody talks about the nuances of what not to do and where not to waste time.
It does not come easy. Industry experts keep pointing out that there is a large gap between a self-taught data engineer’s skills and the real-world work expected in the field.
In this article, I will discuss the common mistakes data engineers make on their learning path (I have made some of them myself). I have also provided tips wherever applicable, with the aim of helping you avoid these pitfalls on your data engineering journey.
Mistake #1: Not building strong data fundamentals
Mistake #2: Learning outdated/legacy skills and technologies
Mistake #3: Missing the required depth/breadth of topics
Mistake #4: Not doing ample hands-on practice
Mistake #5: Being unable to visualize and understand the end-to-end picture
The first and foremost mistake data engineers make is not building a strong enough foundation. A data engineer is expected to be reasonably good at coding/scripting as well as SQL. If a data engineer jumps straight into writing a complex data pipeline without being able to write simple programs first, the result is bound to be a mess of code.
A data engineer should also be conversant with the basics of databases and relational database management systems. Not understanding the difference between a primary key and a surrogate key will create problems even when defining a simple data model.
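To make that distinction concrete, here is a minimal sketch using Python’s built-in sqlite3 module (the table and column names are hypothetical): the natural/business key identifies the row in the real world, while the surrogate key is a meaningless, system-generated identifier that the model actually joins on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# customer_sk is a surrogate key: a system-generated integer with no
# business meaning, used purely for joins inside the warehouse.
# email is the natural (business) key: it identifies the customer in
# the real world, so uniqueness is enforced on it separately.
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,
        email       TEXT NOT NULL UNIQUE,
        full_name   TEXT
    )
""")

conn.execute(
    "INSERT INTO dim_customer (email, full_name) VALUES (?, ?)",
    ("jane@example.com", "Jane Doe"),
)

# Fact tables reference the surrogate key, not the natural key, so the
# model survives changes to business identifiers (e.g. an email update).
for row in conn.execute("SELECT customer_sk, email FROM dim_customer"):
    print(row)  # (1, 'jane@example.com')
```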
The second common mistake data engineers make is learning outdated technologies in too much depth: for example, going deep into MapReduce, or into Kimball/Inmon data warehousing concepts, or into DWBI (Data Warehousing and Business Intelligence) tools that are no longer widely used in the industry. Time is precious, and learners can’t afford to lose focus on their priorities. It is better to scan job descriptions and pick the most commonly requested skills, like Spark, Kafka, NoSQL, and Flink, rather than spending time and effort on outdated tools and techniques. But do learn how to create data models on NoSQL and data lake systems; the sketch below gives a flavor of how this differs from relational modeling.
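As a rough illustration of NoSQL-style data modeling, here is how the customer/orders relationship might be denormalized into a single MongoDB-style document. This is just a sketch using plain Python dicts (no database connection needed), and all field names are hypothetical.

```python
# Relational modeling splits customers and orders into separate tables
# joined by keys; document (NoSQL) modeling often embeds the data that
# is read together into one document instead.
customer_doc = {
    "_id": "jane@example.com",   # natural key doubles as the document id
    "full_name": "Jane Doe",
    "orders": [                  # embedded, denormalized order history
        {"order_id": 1001, "total": 49.90, "items": ["sku-1", "sku-7"]},
        {"order_id": 1002, "total": 15.00, "items": ["sku-3"]},
    ],
}

# A single read now returns the customer *and* their orders, at the cost
# of duplicating data that a relational design would normalize away.
print(customer_doc["full_name"], len(customer_doc["orders"]), "orders")
```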
I agree there are a lot of topics to be studied. There are Spark and Hive; then Kafka, and NoSQL databases like HBase or MongoDB. In stream analytics, we have Spark Streaming and Flink. On the cloud side, we have AWS, Azure, and GCP. So is it mandatory to be thorough in all of these tools and technologies? Absolutely not.
What you need is proficiency in the fundamental concepts behind these data processing tools: how Spark works internally, how Kafka’s pub-sub mechanism works, and how NoSQL differs from SQL and when to use which. Preferably, pick one option in each category rather than trying to cover everything.
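For example, one such fundamental is that Spark transformations are lazy and only an action triggers actual execution. Here is a minimal PySpark sketch (assuming pyspark is installed and a local session is acceptable):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lazy-demo").master("local[*]").getOrCreate()
)

df = spark.range(1_000_000)          # no job runs yet: range() is lazy
evens = df.filter(df.id % 2 == 0)    # still lazy: this only builds the plan

# Only an action (count, collect, write, ...) triggers execution. Spark
# optimizes the whole plan before running it on the executors.
print(evens.count())                 # prints 500000

spark.stop()
```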
My personal recommendation is to learn just one programming language (Scala or Python), plus Kafka, Spark, MongoDB or HBase, and finally AWS for the cloud. That said, when you don’t have a choice, it is often better to go with the tools used in your current project.
This is of paramount importance. Everyone completes the theory by reading documentation and watching some videos, but very few do the hard work of actually writing an end-to-end pipeline themselves. This not only leads to surprises and hiccups on real projects, but also exposes shallow knowledge the moment an interviewer starts grilling you on the hands-on part of a project.
My recommendation is to start with a public dataset and a real-time API (e.g. Twitter). Ingest the dataset into storage such as HDFS or Kafka, process it using Spark SQL/Datasets and Spark Streaming (for the real-time API data), and finally present the insights in visual form, e.g. in Tableau, as icing on the cake.
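As a sketch of the processing leg of such a pipeline, here is a minimal PySpark Structured Streaming job that reads from a Kafka topic and writes running counts to the console. The topic name and broker address are placeholders, and it assumes the Spark–Kafka connector package is available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("tweet-counts").getOrCreate()

# Read the raw event stream from Kafka; "tweets" and localhost:9092 are
# placeholder topic/broker values for this sketch.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "tweets")
    .load()
)

# Kafka delivers bytes; cast the message value to a string and count
# occurrences as a toy stand-in for real business logic.
counts = (
    events.select(col("value").cast("string").alias("text"))
    .groupBy("text")
    .count()
)

# Stream the running counts to the console for inspection.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```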
Optimizing the performance of the pipeline you first build can further increase your chances of cracking interviews.
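Two common, low-risk optimizations you could try on such a pipeline are tuning the shuffle partition count and broadcasting the small side of a join. A hedged sketch (the datasets here are synthetic placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# The default is 200 shuffle partitions; for small datasets that is
# mostly scheduling overhead, so dial it down.
spark.conf.set("spark.sql.shuffle.partitions", "8")

big = spark.range(10_000_000).withColumnRenamed("id", "user_id")
small = spark.createDataFrame([(0, "free"), (1, "pro")], ["user_id", "plan"])

# Broadcasting the small table avoids a full shuffle of the big one:
# each executor gets its own copy of `small` for a map-side join.
joined = big.join(broadcast(small), "user_id", "left")
joined.explain()  # the plan should show a BroadcastHashJoin
```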
Finally, without knowing the end-to-end pipeline, a data engineer who focuses only on ingestion, storage, or processing will never really understand what is going on with his or her own work. Beyond knowing the business impact, data engineers should also understand the technical architecture and system design of their data pipelines and the supporting frameworks.
Things like DevOps, platform infrastructure, and networking are often completely ignored by data engineers. These supporting frameworks are critical to understanding the end-to-end picture: a basic overview of them is definitely important, even if in-depth knowledge is not.
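Even a very basic level of networking awareness pays off, for example being able to verify that your pipeline’s components are reachable at all. Here is a tiny sketch using only Python’s standard library; the host/port pairs are placeholders for wherever your Kafka broker, HDFS namenode, etc. actually run.

```python
import socket

# Placeholder endpoints for the pipeline components in this sketch.
endpoints = {
    "kafka broker": ("localhost", 9092),
    "hdfs namenode": ("localhost", 9870),
}

for name, (host, port) in endpoints.items():
    try:
        # Attempt a plain TCP connection with a short timeout.
        with socket.create_connection((host, port), timeout=2):
            print(f"{name}: reachable at {host}:{port}")
    except OSError as err:
        print(f"{name}: NOT reachable at {host}:{port} ({err})")
```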
I hope you enjoyed reading about the 5 common mistakes data engineers make. Do share your experiences and any questions on the above.