Planning, Building & Using The Analytics Pipeline

![Analytics Pipeline architecture diagram](Analytics%20Pipeline.drawio.png)

When I first started at my job, there was a need for an analytics data warehouse where analysts could pull clean & trusted data without the dreadful task of searching, asking around and validating for nearly every data request. There were a few different projects working separately, all generating data in various locations, and it was difficult at times to pull things together. A previous analyst named Alex had started this initiative & coined the name “Analytics Pipeline”, and thus it was named.

This is a proprietary project, so I’m going to withhold things like schema names and actual code. I’ll try to capture the project at a high level, as it took place over an entire year.

Data Build Tool, or DBT, is really the magic sauce here. It’s open source & very powerful. I’ve used or tested just about every feature in DBT’s offering, and I love the tool. I’m not going to list every feature in this project, only the core ones that I use regularly.

<aside> ℹ️ DBT pipelines are best suited for an Extract → Load → Transform (ELT) workflow: extract data from the source systems, load it into the warehouse as-is, and do the transformation work inside the warehouse.

</aside>

Extracting & Loading


Airflow, GCP & Stitch Data

Airflow is another open source tool, used for [workflow orchestration](https://airflow.apache.org/): scheduling and running the jobs that move data around. There are several legacy processes that run here and are leveraged by the Analytics Pipeline. No additional work was needed, just worth mentioning.

Stitch Data is a paid service that also came with a few legacy processes. Stitch is a great tool, given the range of sources it supports and its ease of use. Fun fact however: Stitch Data is essentially a wrapper around https://www.singer.io/, an open source framework for extracting data from source systems and loading it into a database.
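
Since I can’t share the real code, here’s a rough sketch of the kind of query you end up writing on top of Stitch-loaded tables. When Stitch replicates in append-only mode it tags every row with `_sdc_*` metadata columns, so the latest version of each record has to be picked out at query time. All table and column names below are made up:

```sql
-- Hypothetical dedup of a Stitch-replicated table in BigQuery.
-- Stitch appends new row versions rather than updating in place,
-- so we keep only the newest version of each record.
select * except (row_num)
from (
  select
    *,
    row_number() over (
      partition by id                 -- the source table's primary key
      order by _sdc_sequence desc,    -- Stitch's own ordering metadata
               _sdc_batched_at desc
    ) as row_num
  from raw_stitch.orders              -- hypothetical dataset.table
)
where row_num = 1
```

This kind of repetitive pattern is exactly what DBT is good at templating, which comes up later.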

BigQuery

BigQuery is part of the Google Cloud Platform environment, and it’s the system we already use, so we’ll continue to do so. It’s cheap, fast & integrates pretty much automatically, as we’re “Google” people.
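
One concrete reason it’s cheap: on-demand BigQuery bills by bytes scanned, so trimming columns and filtering on a partition column directly cuts cost. A toy example with made-up names (assuming the raw table is partitioned on the load timestamp):

```sql
-- BigQuery on-demand pricing charges by bytes scanned, so select only
-- the columns you need and filter on the partition column.
-- All names here are hypothetical.
select
  order_id,
  customer_id,
  order_total
from raw_stitch.orders
where date(_sdc_batched_at) >= date_sub(current_date(), interval 7 day)
```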

Data Transformation With DBT


I love DBT, love that it’s open source & love what I can accomplish with it. This tool is ideal for people who write lots of SQL. With DBT you can automate away much of the boring, tedious, hard-to-maintain code behind data warehouses. It also improves the analytics workflow if you’re into building well-defined KPIs, metrics or SQL models. Every table, column, description, relationship, metric, report, dashboard and notebook lives in one place, version controlled (meaning you can “go back” to old versions, and see whether a change will break anything before committing to it). Ultimately, if you’re someone working in a data warehouse, responsible for writing SQL & in need of a better system, look into using DBT to organize, build & test your data warehouse.
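
To make that concrete without exposing our actual code, here’s what a DBT model looks like in general; every name here is invented. A model is just a SELECT statement saved as a version-controlled .sql file, and DBT wraps it in the CREATE/DROP boilerplate for you:

```sql
-- models/staging/stg_orders.sql (hypothetical)
-- DBT turns this SELECT into a view (or table) in the warehouse;
-- you never hand-write the CREATE OR REPLACE statements yourself.
{{ config(materialized='view') }}

select
  order_id,
  customer_id,
  order_total,
  created_at
from raw_stitch.orders  -- hard-coded for now; sources fix this below
```

Run `dbt run` and DBT compiles the Jinja, wraps the SELECT in the right DDL, and builds the object in BigQuery.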

I’ve used DBT extensively in this project, covering pretty much everything DBT has to offer. I’m going to focus on the core stuff here, so I’ll be skipping things like Docs, Tests, Semantic Models, Metrics and Jinja Macros. The core of DBT is sources, models, YAML files (and how to use them), and lineage.
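
Roughly, those pieces fit together like this (again, every name below is invented): raw tables get declared as sources in a YAML file, models select from them with `{{ source() }}`, and downstream models chain together with `{{ ref() }}`. DBT resolves those references at compile time, and that’s also how it knows the lineage, i.e. which models depend on which:

```sql
-- models/staging/stg_customers.sql (hypothetical)
-- The 'stitch' source below would first be declared in a YAML file,
-- roughly:
--   sources:
--     - name: stitch
--       tables:
--         - name: customers
select
  customer_id,
  customer_name
from {{ source('stitch', 'customers') }}
```

```sql
-- models/marts/fct_orders.sql (hypothetical)
-- ref() points at another model; DBT builds the dependency (lineage)
-- graph from these references and runs models in the right order.
select
  o.order_id,
  o.order_total,
  c.customer_name
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_customers') }} as c
  on o.customer_id = c.customer_id
```

That dependency graph is the lineage: change `stg_orders` and DBT can tell you (and rebuild) everything downstream of it.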