Photo by Ovan from StockSnap

Data Engineering Cheat Sheet

Shahin Rahbariasl
9 min read · Feb 5, 2021


Data engineering is all about designing, developing, testing, and delivering offerings using leading-edge and/or proven technologies. Most companies nowadays work Agile, which makes it possible to have a collaborative environment for understanding stakeholder requirements and developing valuable features. Data engineering also includes designing, coding, and testing innovative component-level software solutions. You always need to ensure that the implemented solutions are unit tested and ready to be integrated into the product.

As a data engineer, you’re in charge of gathering and collecting the data, storing it, running batch or real-time processing on it, and serving it via an API to data scientists who can easily query it. Data engineers may work closely with data architects (to determine what data management systems are appropriate) and data scientists (to determine which data are needed for analysis). They often wrestle with problems associated with database integration and messy, unstructured data sets.

Here’s a list of the minimum topics you need deep knowledge of as a data engineer. Keep in mind that the intention of this article is not to explain the subjects in detail, but to list important points about them. As an #IBM engineer, I catch myself using most of these tips pretty often. Hopefully, they’ll be useful for you as well.

Data Structures

Array

A container object that holds a fixed number of values of a single type. Arrays are good for indexing, but not efficient for searching, inserting, and deleting.

Linear arrays: One-dimensional arrays declared with a fixed size.

Dynamic arrays: Hold reserved space for expansion (if the array gets full, it copies itself to a larger array).

Multi-dimensional arrays: Nested arrays (arrays of arrays).
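
A rough Python illustration (Python’s list behaves like a dynamic array, while the array module gives a typed, single-type container; all names here are just for the example):

from array import array

# Typed array: holds a fixed element type (here signed ints)
linear = array("i", [1, 2, 3, 4])
print(linear[2])         # O(1) indexing -> 3

# Python lists act as dynamic arrays: they reserve extra space and
# copy themselves to a larger block when that space runs out.
dynamic = []
for i in range(5):
    dynamic.append(i)    # amortized O(1)

# Multi-dimensional (nested) array of arrays
matrix = [[1, 2], [3, 4]]
print(matrix[1][0])      # -> 3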

Linked List

A linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element points to the next. Linked lists are fast for insertion and deletion, but slow at indexing and searching.

Each node contains a value and a pointer to the next node. Doubly linked list nodes have pointers to the previous node as well. In a circularly linked list, the tail (the last node) references the head (the first node).

A stack (LIFO) is usually implemented with a linked list (the head being the only place for insertion and removal).

A queue (FIFO) can be implemented with a linked list as well (a doubly linked list that only removes from the head and adds to the tail).

Time Complexity:

  • Indexing: Linked Lists: O(n)
  • Search: Linked Lists: O(n)
  • Optimized Search: Linked Lists: O(n)
  • Insertion: Linked Lists: O(1)
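
A minimal linked-list sketch in Python to illustrate O(1) insertion at the head and O(n) search (the class and method names are illustrative):

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None                 # pointer to the next node

class LinkedList:
    def __init__(self):
        self.head = None

    def push(self, value):               # O(1): insert at the head
        node = Node(value)
        node.next = self.head
        self.head = node

    def search(self, value):             # O(n): walk node by node
        current = self.head
        while current:
            if current.value == value:
                return True
            current = current.next
        return False

lst = LinkedList()
lst.push(3); lst.push(2); lst.push(1)    # list is now 1 -> 2 -> 3
print(lst.search(2))                     # True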

Hash Table or Hash Map

A data structure that implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found.

Hashing

Hashing uses a hash function that accepts a key and returns an output (ideally) unique to that specific key. The input data is passed to the hash function, and a much smaller value is generated, which can serve as a unique address in memory (a bucket index) for that data.

  • Hashing is not encryption
  • Hashing functions are typically one-way
  • Information can be lost when hashing
  • Two objects that are equal should return the same hash
  • The same hash may also result from different objects (Hash Collision)
  • Hash tables are very fast!

Hash collisions occur when a hash function returns the same output for two distinct inputs. Linked lists can be used to store the list of objects that share the same hash (chaining).

Time Complexity:

  • Indexing: Hash Tables: O(1) (average case)
  • Search: Hash Tables: O(1) (average case)
  • Insertion: Hash Tables: O(1) (average case; O(n) in the worst case when many keys collide)
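
A quick Python illustration: dict is the built-in hash map, and the toy table below shows how chaining (a list per bucket) can absorb collisions; the helper functions are made up for the example:

# Python's dict is a hash map: keys are hashed to find their bucket.
ages = {"alice": 34, "bob": 29}
print(ages["alice"])                      # O(1) average lookup -> 34

# Toy hash table using chaining to handle collisions.
buckets = [[] for _ in range(8)]

def put(key, value):
    buckets[hash(key) % len(buckets)].append((key, value))

def get(key):
    for k, v in buckets[hash(key) % len(buckets)]:
        if k == key:
            return v

put("alice", 34)
print(get("alice"))                       # 34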

Binary Tree

A tree data structure in which each node has at most two children, which are referred to as the left child and the right child.

Binary Search Tree:

  • A binary tree that uses comparable keys to decide which direction a child goes.
  • The left child has a key smaller than its parent node.
  • The right child has a key greater than its parent node.
  • There can be no duplicate nodes.

Time Complexity:

  • Indexing: Binary Search Tree: O(log n) (average case)
  • Search: Binary Search Tree: O(log n) (average case)
  • Insertion: Binary Search Tree: O(log n) (average case; O(n) in the worst case for a degenerate, unbalanced tree)
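
A minimal binary search tree sketch in Python (names are illustrative), showing how key comparisons decide the direction of insertion and search:

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:                    # smaller keys go to the left subtree
        root.left = insert(root.left, key)
    elif key > root.key:                  # greater keys go to the right subtree
        root.right = insert(root.right, key)
    return root                           # equal keys are ignored (no duplicates)

def search(root, key):
    if root is None or root.key == key:
        return root
    return search(root.left, key) if key < root.key else search(root.right, key)

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
print(search(root, 6) is not None)        # True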

Sorting

Heap Sort

Heap sort is a comparison-based sorting technique built on the binary heap data structure. It is similar to selection sort: we first find the maximum element and place it at the end, then repeat the same process for the remaining elements.

Heap sort algorithm:

  1. Build a max heap from the input array.
  2. At this point, the largest item is stored at the root of the heap. Replace it with the last item of the heap and reduce the size of heap by 1. Finally, heapify the root of the tree.
  3. Repeat step 2 while size of heap is greater than 1.
Source: GeeksforGeeks

Time Complexity: Heapify is O(log n), building the max heap is O(n), and the overall time complexity of heap sort is O(n log n).
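
A minimal Python sketch of the steps above (illustrative, not tuned for production):

def heapify(arr, n, i):
    # Sift the value at index i down so the subtree rooted at i is a max heap.
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and arr[left] > arr[largest]:
        largest = left
    if right < n and arr[right] > arr[largest]:
        largest = right
    if largest != i:
        arr[i], arr[largest] = arr[largest], arr[i]
        heapify(arr, n, largest)

def heap_sort(arr):
    n = len(arr)
    # Step 1: build a max heap (from the last parent down to the root).
    for i in range(n // 2 - 1, -1, -1):
        heapify(arr, n, i)
    # Steps 2-3: move the current max to the end, shrink the heap, re-heapify.
    for end in range(n - 1, 0, -1):
        arr[0], arr[end] = arr[end], arr[0]
        heapify(arr, end, 0)

data = [12, 11, 13, 5, 6, 7]
heap_sort(data)
print(data)                               # [5, 6, 7, 11, 12, 13]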

Quick Sort

QuickSort is a divide-and-conquer algorithm. It picks an element as the pivot and partitions the given array around that pivot. The key process in quickSort is partition(): given an array and an element x of the array as the pivot, put x at its correct position in the sorted array, put all elements smaller than x before x, and put all elements greater than x after x.

Source: GeeksforGeeks

Worst case: O(n²), when the partition process always picks the greatest or smallest element as the pivot.
Best case: O(n log n), when the partition process always picks the middle element as the pivot.
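
A minimal Python sketch using the Lomuto partition scheme (one common way to implement partition(); here the pivot is simply the last element):

def partition(arr, low, high):
    pivot = arr[high]                     # pick the last element as the pivot
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:               # smaller elements move before the pivot
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1                          # pivot is now at its sorted position

def quick_sort(arr, low=0, high=None):
    if high is None:
        high = len(arr) - 1
    if low < high:
        p = partition(arr, low, high)
        quick_sort(arr, low, p - 1)       # recurse on the two partitions
        quick_sort(arr, p + 1, high)

data = [10, 7, 8, 9, 1, 5]
quick_sort(data)
print(data)                               # [1, 5, 7, 8, 9, 10]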

Merge Sort

Merge sort divides the input array into two halves, calls itself for the two halves, and then merges the two sorted halves.

Source: GeeksforGeeks

Time complexity of merge sort is Θ(n log n) in all 3 cases (worst, average, and best), as merge sort always divides the array into two halves and takes linear time to merge them.
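
A minimal Python sketch of the divide-and-merge idea (illustrative only):

def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])          # sort the two halves recursively
    right = merge_sort(arr[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # merge in linear time
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([38, 27, 43, 3, 9, 82, 10]))   # [3, 9, 10, 27, 38, 43, 82]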

Bubble Sort

Bubble sort repeatedly swaps adjacent elements if they are in the wrong order.

Source: GeeksforGeeks

Worst and Average Case: O(n²). The worst case occurs when the array is reverse sorted.
Best Case: O(n). The best case occurs when the array is already sorted (and the implementation stops early once a pass makes no swaps).
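
A minimal Python sketch with the early-exit flag that gives the O(n) best case:

def bubble_sort(arr):
    n = len(arr)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if arr[j] > arr[j + 1]:       # adjacent elements in the wrong order
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:                   # no swaps: already sorted, stop early
            break

data = [5, 1, 4, 2, 8]
bubble_sort(data)
print(data)                               # [1, 2, 4, 5, 8]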

Searching

Breadth First Search

An algorithm for traversing or searching tree or graph data structures. It starts at the tree root and explores all of the neighbor nodes at the present depth prior to moving on to the nodes at the next depth level.

It uses the opposite strategy of depth-first search, which instead explores the node branch as far as possible before being forced to backtrack and expand other nodes.

  • It finds every node on the same level, most often moving left to right.
  • While doing this it tracks the children nodes of the nodes on the current level.
  • When finished examining a level it moves to the left most node on the next level.
  • The bottom-right most node is evaluated last (the node that is deepest and is farthest right of its level).
  • Optimal for searching a tree that is wider than it is deep.
  • Uses a queue to store information about the tree while it traverses a tree.
  • Because it uses a queue it is more memory intensive than depth first search.
  • The queue uses more memory because it needs to store pointers to every node that has been discovered but not yet visited.

Time Complexity:

  • Search: Breadth First Search: O(V + E)
  • E is number of edges
  • V is number of vertices
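
A minimal BFS sketch in Python over an adjacency-list graph (the graph and node names are just examples):

from collections import deque

def bfs(graph, start):
    visited, order = {start}, []
    queue = deque([start])                # FIFO queue drives the level-by-level order
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph[node]:     # track the children of the current level
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(graph, "A"))                    # ['A', 'B', 'C', 'D']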

Depth First Search

An algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node and explores as far as possible along each branch before backtracking.

  • It traverses left down a tree until it cannot go further.
  • Once it reaches the end of a branch it traverses back up trying the right child of nodes on that branch, and if possible left from the right children.
  • When finished examining a branch it moves to the node right of the root then tries to go left on all its children until it reaches the bottom.
  • The right most node is evaluated last (the node that is right of all its ancestors).
  • Optimal for searching a tree that is deeper than it is wide.
  • Uses a stack to push nodes onto.
  • Because a stack is LIFO it does not need to keep track of the nodes pointers and is therefore less memory intensive than breadth first search.
  • Once it cannot go further left it begins evaluating the stack.

Time Complexity:

  • Search: Depth First Search: O(|E| + |V|)
  • E is number of edges
  • V is number of vertices
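
A minimal iterative DFS sketch in Python using an explicit stack (same toy graph as the BFS example):

def dfs(graph, start):
    visited, order = set(), []
    stack = [start]                       # LIFO stack drives the depth-first order
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push children in reverse so the left-most child is explored first.
        for neighbour in reversed(graph[node]):
            stack.append(neighbour)
    return order

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(dfs(graph, "A"))                    # ['A', 'B', 'D', 'C']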

REST APIs

Representational state transfer (REST) is a de-facto standard for a software architecture for interactive applications that typically use multiple Web services.

  • Uniform stateless operations
  • No session information is retained by the server
  • A request should not rely on another request
  • HTTP methods used in RESTful APIs are GET, PUT (replace resource), POST, PATCH (update resource), DELETE

Advantages:

  • Client-server architecture (separation of concern)
  • Statelessness
  • Cache-ability
  • Layered system (e.g. intermediaries such as proxies can be added without preventing communication or requiring changes to client/server code)
  • Code on demand (optional)
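
As a rough illustration of stateless resource endpoints, here is a minimal sketch using Flask; the framework choice, the /items route, and the in-memory store are assumptions for the example, not something prescribed by REST itself:

from flask import Flask, jsonify, request

app = Flask(__name__)
items = {}                                # in-memory store, for illustration only

@app.route("/items/<int:item_id>", methods=["GET"])
def get_item(item_id):
    # Stateless: everything needed to answer is in the request itself.
    return jsonify(items.get(item_id, {})), 200

@app.route("/items/<int:item_id>", methods=["PUT"])
def put_item(item_id):
    items[item_id] = request.get_json()   # PUT replaces the whole resource
    return jsonify(items[item_id]), 200

if __name__ == "__main__":
    app.run()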

Important Design Patterns

1. Singleton

  • Limits creation of a class to only one object
  • Private/protected constructor
  • e.g. used in caches, thread pools, registries
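
A minimal Python sketch of the Singleton idea (Python has no private constructors, so overriding __new__ is one common way to get the same effect; the class name is illustrative):

class Config:
    _instance = None

    def __new__(cls):
        # Only one instance is ever created; later calls return the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a = Config()
b = Config()
print(a is b)                             # True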

2. Factory method

  • Produces class objects
  • Decouples the code from the implementation class

3. Strategy

  • Instead of directly implementing a single algorithm, the code receives runtime instructions that decide which of a group of algorithms to run

4. Observer

  • There are observers (interested in updates) and subjects (generate updates)

5. Builder

  • Builds more complex objects
  • Step-by-step setting of properties, vs. the factory method, which creates objects in one go

6. Adapter

  • Acts as an adapter between two interfaces, like a translator
  • e.g. converting XML to JSON

7. State

  • Allows an object to alter its behaviour when its internal state changes
  • e.g. a context object delegates behaviour to its current state

System Design

  1. Outline use cases, constraints, and assumptions
    - Who, how, what, how much data, how many requests
  2. High level design
    - Main components and connections
  3. Design core components
  4. Scale the design
    - Load balancer, horizontal/vertical scaling, sharding, caching, etc.

Microservices Architecture

A variant of the service-oriented architecture (SOA) structural style that arranges an application as a collection of loosely coupled services. In a microservices architecture, services are fine-grained and the protocols are lightweight.

Strengths

  • Independent components
  • Easier understanding
  • Better scalability
  • Flexibility in choosing technology
  • Higher level of agility

Weaknesses

  • Extra complexity
  • System distribution
  • Cross-cutting concerns
  • Testing

When to choose Microservices architecture?

  • Expertise availability (DevOps)
  • A complex and scalable app
  • Enough engineering skills

Monolithic system

In monolithic systems, functionally distinguishable aspects (for example data input and output, data processing, error handling, and the user interface) are all interwoven, rather than containing architecturally separate components.

Strengths

  • Fewer cross-cutting concerns
  • Easier debugging and testing
  • Simpler to deploy
  • Simpler to develop

Weaknesses

  • Gets complicated over time
  • Making changes is hard (tightly coupled)
  • Hard/expensive to scale
  • New technology barriers

When to choose monolithic?

  • Smaller team
  • Simpler application
  • Quicker launching
  • No microservices expertise

Query Optimization Tips

SQL

  • Define business requirements and select only the required fields instead of SELECT *
  • Select more fields to avoid SELECT DISTINCT, which is expensive
  • Use inner joins instead of outer joins if possible
  • WHERE is more efficient than HAVING (it filters rows before aggregation)
  • Use wildcards (e.g. LIKE 'abc%') at the end of the phrase only
  • Run analytical queries during off-peak hours
  • Order SELECT fields based on granularity
  • Index tables based on integer IDs

NoSQL

  • Query rewrites based on heuristics (remove unnecessary stuff)
  • Index selection (avoid searching whole tables)
  • Join reordering
  • Join types
  • Add indexes on collections (e.g. ascending vs. descending)

Good Habits

  • Log everything in system logs
  • Always have unit tests
  • Validate data before inserting it into the DB
  • Automate everything you do again and again
  • Don’t delete code, use git
  • Instead of deleting records in the DB, hide them from the user
  • Using too many globals is usually not a good idea
  • Skim the whole manual instead of only the parts you need
  • WTFM: Write The Fucking Manual!
  • If you can’t verify that something is working correctly, it’s probably not working

Interview Coding Canvas

Separate the constraints, limitations, ideas, code, and test cases. For idea generation, try to:

  • Simplify the task
  • Try examples to find patterns
  • Think of suitable data structures
