I built a lot of things at gm. Here are just a few.
## denali
Denali is a project I started working on in late 2020 and continued until 2023. It was the largest effort I led: it started with just me, then me plus one other dev, and at one point grew to ~7 devs working concurrently in the codebase. It is a suite of applications that handle various data pipeline tasks in the GM data lake, e.g. data ingestion, normalization, compaction, and SQL workflows for ETL. It is implemented in Scala, using the Spark and Delta Lake frameworks, and executes on a 700-node on-prem Kubernetes cluster. Denali leverages a microservices architecture to manage the jobs, their metadata, and their execution at scale, with gRPC communication between services.
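For a sense of what running Spark on that cluster looks like, here is a minimal sketch of a Spark session pointed at a Kubernetes master. The API server URL, container image, and sizing are placeholders, and the real jobs get this configuration from Denali's services rather than hard-coding it:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the API server URL, container image, and sizing are placeholders.
// Denali's services generate this configuration per job rather than hard-coding it.
val spark = SparkSession.builder()
  .appName("denali-job")
  .master("k8s://https://k8s-apiserver.example.com:6443")                          // Kubernetes master
  .config("spark.kubernetes.container.image", "registry.example.com/denali/spark") // executor/driver image
  .config("spark.executor.instances", "10")                                        // executor pod count
  .getOrCreate()
```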
It grew out of a larger effort to migrate all of our existing data workloads from a YARN Hadoop cluster over to a brand new Kubernetes cluster. We had a large legacy app, written in Java and using MapReduce, that handled the vast majority of the jobs running on the YARN cluster. Denali was a from-scratch rewrite of that legacy app.
### high-level job metrics
Denali includes a job scheduler named Archer (which was later augmented with a better scheduler, Arbalist, written in Rust). Archer receives launch requests from each of the other app types, listed below, determines what resources to allocate to each job, and schedules it to run on the cluster (possibly hitting a queue on the way).
app | job executions | description |
---|---|---|
traveler | 5,797,975 | base ingestion into silver data layer using delta (merge, append, or overwrite) |
exodus | 4,437,517 | data compaction, vacuums, misc data egress |
wanderer | 2,805,175 | “integrated core” ingestion - our legacy merge pattern |
alchemist | 457,359 | allows SQL devs to easily create their own job DAG workflows to execute in Spark |
druid | 336,741 | sources and normalizes bronze data for processing by traveler |
cleric | 136,905 | parquet snapshot generator for tools compatibility (hive) |
genesis | 134 | utilities for migrating DDLs and data from the legacy system |
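The core scheduling decision Archer makes for each of these launch requests is easy to sketch: size the job from its input, then either launch it or queue it until the cluster has room. The sketch below is purely illustrative; the case classes, thresholds, and helper functions are assumptions, not Archer's actual code.

```scala
// Illustrative only: these types, thresholds, and helpers are assumptions,
// not Archer's real interfaces.
final case class LaunchRequest(app: String, table: String, inputBytes: Long)
final case class Allocation(executors: Int, executorMemoryGb: Int)

// Size the Spark job from the amount of input data it has to process.
def allocate(req: LaunchRequest): Allocation = {
  val gib = req.inputBytes / (1024L * 1024 * 1024)
  if (gib > 512) Allocation(executors = 32, executorMemoryGb = 16)
  else if (gib > 64) Allocation(executors = 8, executorMemoryGb = 8)
  else Allocation(executors = 2, executorMemoryGb = 4)
}

// Launch immediately if the cluster has capacity, otherwise put the request on a queue.
def schedule(
    req: LaunchRequest,
    hasCapacity: Allocation => Boolean,
    launch: (LaunchRequest, Allocation) => Unit,
    enqueue: LaunchRequest => Unit
): Unit = {
  val alloc = allocate(req)
  if (hasCapacity(alloc)) launch(req, alloc) else enqueue(req)
}
```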
### base ingestion metrics - traveler
Traveler is one of the most fundamental services within Denali. It is responsible for ingesting data incrementally, typically with a merge, into the base layer of GM’s data lake. This app is one of the first in the chain of dependent jobs that gets data where it needs to be: for reporting, for ML workloads, for ad hoc analysis, you name it. It processes hundreds of terabytes of data every single day. As of June 19th, 2023, it has processed 1.3 quadrillion records of GM’s data.
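The heart of that incremental load is a Delta Lake merge. A minimal sketch in Scala, assuming a placeholder table path and a placeholder merge key (`id`); the real jobs drive all of this from metadata:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

// Minimal sketch of a Traveler-style incremental merge into a silver Delta table.
// The target path and the "id" merge key are placeholders.
def mergeIntoSilver(spark: SparkSession, updates: DataFrame, targetPath: String): Unit = {
  val target = DeltaTable.forPath(spark, targetPath)
  target.as("t")
    .merge(updates.as("s"), "t.id = s.id")
    .whenMatched().updateAll()    // update rows that already exist in the target
    .whenNotMatched().insertAll() // append rows that are new
    .execute()
}
```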
#### traveler metrics
Data processed per day | Records processed per day | Records processed all time | Tables merged all time |
---|---|---|---|
254 TB | 3,318,499,448,217 | 1,308,725,640,614,114 | 5,170,960 |
### base ingestion metrics - wanderer
Wanderer is another pattern, similar to Traveler and sharing a lot of code, but implemented slightly differently in the core ingestion and data loading logic. It was really a carry-over from our legacy system, a “deprecated” pattern that we got up and running on our shiny new K8s cluster to support the old workloads. It processes about 10% of the data volume that Traveler does.
#### wanderer metrics
TB processed per day | Records processed per day |
---|---|
25 | 363,596,920,902 |
### sql workflow metrics - alchemist
Alchemist is where data goes from the base (silver) layer into our aggregated gold datasets. This service is super important in that it allows developers to be productive even without knowing the inner workings (and associated black magic) of the Spark framework. Workflows (proper DAGs) can be written as a set of plain SQL scripts that have their dependency relations (the edges in the graph) represented in metadata. This lets Alchemist achieve really high runtime performance by running SQLs in parallel while still providing a simple interface to the devs.
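To make that parallelism concrete, here is a rough sketch, not Alchemist's actual code, of how SQL steps with metadata-declared dependencies can be run so that independent branches of the DAG execute concurrently. The `Step` type and the wiring are assumptions.

```scala
import scala.collection.mutable
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

import org.apache.spark.sql.SparkSession

// Rough sketch only: run each SQL step once all of its upstream steps have finished,
// so independent branches of the DAG execute in parallel.
final case class Step(name: String, sql: String, dependsOn: Set[String])

def runWorkflow(spark: SparkSession, steps: Seq[Step])(implicit ec: ExecutionContext): Unit = {
  val byName  = steps.map(s => s.name -> s).toMap
  val futures = mutable.Map.empty[String, Future[Unit]]

  def run(step: Step): Future[Unit] =
    futures.getOrElseUpdate(step.name, {
      val upstream = step.dependsOn.toSeq.map(d => run(byName(d)))
      Future.sequence(upstream).map { _ =>
        spark.sql(step.sql) // executes only after every dependency has completed
        ()
      }
    })

  Await.result(Future.sequence(steps.map(run)), Duration.Inf)
}
```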
#### alchemist metrics
job executions | sql step executions | first execution | last execution |
---|---|---|---|
457,359 | 17,035,890 | 2021-03-22 18:39:43.000000 | 2023-06-19 13:10:10.000000 |
## nomad
Nomad was one of the first apps I built at GM. I started on it in late 2017. It was my first exposure to microservices.
We needed a better way to copy data from our production cluster to our nonprod clusters for testing and development of new and existing data pipelines. The existing patterns were tedious, error-prone, and not particularly user-friendly. (Due to some technical details of the networking setup, a tool like distcp was not an option for this use case.)
I paired up with one of my peers and good friend Steevy Mathew, whom I met on my first day at GM, and we got to work building the app out.
We chose to build it as a webapp to make it as accessible as possible, even to the less technical users at the company. After a few iterations, we landed on a design involving a few Spring Boot microservices for the backend, accessed through an Angular frontend. We would transfer data between clusters via the SSH protocol. The app quickly gained traction with users on the data teams and eventually spread throughout IT. Six years later, it’s still in production and being used daily at the company.
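For flavor, here is a minimal Scala sketch of the kind of SSH/SFTP copy the app performs under the hood, using the JSch library. The host, credentials, and paths are placeholders; the real services handle configuration, auth, and error handling properly.

```scala
import com.jcraft.jsch.{ChannelSftp, JSch}

// Minimal sketch of a single file copy over SSH/SFTP. Host, credentials, and paths
// are placeholders; this is not Nomad's actual transfer code.
def copyFile(host: String, user: String, password: String,
             localPath: String, remotePath: String): Unit = {
  val session = new JSch().getSession(user, host, 22)
  session.setPassword(password)
  session.setConfig("StrictHostKeyChecking", "no") // illustrative only; don't do this in production
  session.connect()
  try {
    val sftp = session.openChannel("sftp").asInstanceOf[ChannelSftp]
    sftp.connect()
    try sftp.put(localPath, remotePath) // upload the local file to the remote cluster
    finally sftp.disconnect()
  } finally session.disconnect()
}
```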
### nomad metrics (since 2017)
jobs run | files copied | GB copied | users |
---|---|---|---|
47,097 | 11,956,264 | 2,080,178 | 1,895 |
## aero
Aero is an application that ingests data directly from relational databases like Postgres, Oracle, SQL Server, etc., into GM’s data lake. Another brainchild of mine, it taught me a lot about databases and how to quickly pull data off them.
At GM, it replaced our usage of the sqoop (MapReduce) tool. This turned out to be quite nice when we migrated off our legacy YARN/MapReduce cluster and onto a Kubernetes cluster where we ran Spark exclusively. Aero uses Spark for the data processing layer, with a Spring Boot microservice that manages the job lifecycle (creation, configuration, customization, execution, tracking, reporting).
This app serves a dual purpose: regular ingestion for certain (typically smaller) tables, with a full refresh of the data at least once a day, and a handy tool for getting base tables back in sync in the event of data quality issues.
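A minimal sketch of the kind of parallel JDBC pull Aero does in Spark in place of sqoop; the connection details, table, partition bounds, and output path below are all placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: connection URL, table, bounds, and output path are placeholders.
def pullTable(spark: SparkSession): Unit = {
  val df = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", sys.env("DB_PASSWORD"))
    .option("partitionColumn", "order_id") // numeric column to split the reads on
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "16")         // 16 parallel JDBC connections
    .load()

  // Full refresh of the target table in the data lake.
  df.write.format("delta").mode("overwrite").save("/lake/silver/orders")
}
```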
job executions | first execution | last execution |
---|---|---|
255,229 | 2020-03-02 18:15:34.925000 | 2023-06-19 16:41:26.544045 |
## builder - workspaces
Another app I built is called Builder. It provided the backend for a concept called a “Workspace”: a sandbox on the production system where data scientists, data analysts, data engineers, and anyone else interested in working with our data could collaborate. Builder facilitated this by creating a custom Hive database for the Workspace and securing it with Apache Ranger policies, so that only users added by the Workspace owner could see what was going on.
Through the Workspaces webapp, users could add production silver & gold tables, or any other table available in the data lake catalog, and it would appear to the user that the table was actually in their workspace (in reality, it was a view over the underlying table). Users could then immediately get to work in tools like Jupyter, create their own tables, upload their own datasets, and add (or remove) other users from the workspace. This was a big win for collaboration & productivity and proved to be a popular service over the years.
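Under the hood, the two core operations boil down to a `CREATE DATABASE` for the workspace and a `CREATE VIEW` per added table. A rough sketch with placeholder names, omitting the Apache Ranger policy calls the real service also makes:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of the two core Builder operations. Names are placeholders,
// and the Ranger policy setup is omitted.
def createWorkspace(spark: SparkSession, workspace: String): Unit =
  spark.sql(s"CREATE DATABASE IF NOT EXISTS $workspace")

// Expose a production table inside the workspace as a view over the underlying table.
def addTable(spark: SparkSession, workspace: String, prodTable: String): Unit =
  spark.sql(s"CREATE OR REPLACE VIEW $workspace.${prodTable.split('.').last} AS SELECT * FROM $prodTable")
```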
### builder metrics
request type | count fulfilled |
---|---|
ADD_USER | 26,569 |
ADD_TABLE | 23,373 |
CREATE_SCHEMA | 9,183 |
DROP_SCHEMA | 3,922 |
DELETE_USER | 2,204 |
DROP_TABLE | 820 |
CHANGE_OWNER | 222 |