The Great "Data Model" Debate
How dbt came in and turned the legacy definition on its head
According to Princeton, “A data model organizes data elements and standardizes how the data elements relate to one another”.
According to erwin, a leader in data modeling software, “Data models are visual representations of an enterprise’s data elements and the connections between them”.
According to Wikipedia, “A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities.”
A lot of different companies and engineers view data models differently. And that’s ok.
Some see it as an organizational tool, others as a visual representation, and some as a set of standards.
I was first introduced to the concept of data modeling during my internship at UPS in college. I was a “Data Architecture Intern”, mainly working with erwin to design models for databases that needed to be created.
Now looking back, UPS was behind other technology companies in a lot of ways. They depended on on-prem data storage rather than the cloud and used a lot of legacy coding languages that you rarely see anymore.
This isn’t necessarily a bad thing, but it is definitely a slower way of doing things. Don’t get me wrong, UPS was very advanced in its thinking, but not so much in the technologies themselves.
In my first few weeks, when I had no idea what a data model was, I was introduced to erwin as a data modeling tool.
In this role, I learned that a data model was a plan for how the database system was to be created. Each table needed to be planned with the correct column names and data types. Primary and foreign keys had to be defined. Then, relationships like one-to-one or many-to-one needed to be mapped to other related tables.
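Concretely, that kind of plan maps almost directly onto DDL. Here's a minimal sketch of the sort of design we would lay out, with hypothetical table and column names (nothing here is from an actual UPS system):

-- Hypothetical example: a parent table and a child table with a
-- many-to-one relationship, the kind of structure we planned in erwin.
create table customers (
    customer_id    integer      not null,
    customer_name  varchar(100) not null,
    created_at     timestamp,
    primary key (customer_id)
);

create table packages (
    package_id   integer       not null,
    customer_id  integer       not null,  -- many packages relate to one customer
    weight_lbs   decimal(8, 2),
    shipped_at   timestamp,
    primary key (package_id),
    foreign key (customer_id) references customers (customer_id)
);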
A diagram ended up looking something like this:

[Entity-relationship diagram: tables with their columns and data types, primary and foreign keys, and the relationship lines connecting them]
In the years after this internship, I never created another data model again. The other companies I worked for definitely didn’t use software like erwin, let alone have time to even think about designing such an intricate database system on top of what already existed.
I didn’t hear about data models ever again until my second year working as a data engineer at Capital One. I was introduced to a data transformation tool called dbt. dbt was built for modular data modeling. But not the data modeling I had previously learned.
A New Kind of Data Modeling
dbt describes data models as lines of SQL code that transform and standardize your data. With dbt, data modeling is more of a verb than a noun. It’s something you perform rather than a diagram or plan that exists on its own.
As a data engineer, I resonate more with dbt’s definition of a data model. I think it’s used much more frequently than the data modeling diagrams I created back at UPS. To me, those diagrams are more database designs than “data models”.
I get it, some people want the legacy definition of a data model to carry on, rather than evolve. But, isn’t the world of technology meant to constantly evolve and change? Look at where we were ten years ago, or even five years ago, in terms of our data capabilities. Now we are dealing with terabytes of data, increased processing speeds, and data that is always available.
As technology evolves, so does our terminology. When we use words like data model in the right context, other engineers should be able to understand what we are saying, whether we are referring to dbt’s idea of a data model or erwin’s.
Adam Schneider had a good way of framing this on one of my LinkedIn posts:
To me, the term “data model” is intuitive. You shouldn’t have to look up a definition to understand what it means. And this is probably where the confusion and conflicting ideas come from.
We know a model is a representation of something or a system that should be followed. To model something means to give shape or form to it.
When creating data models we are forming data from one thing into another thing.
And if you look back at the definitions I included in the beginning, dbt’s transformations do organize and standardize data elements, just as those legacy definitions point out. They just do so in a different way. A way that is different from how it was done ten or twenty years ago. And that is OK.
Base Models
Base models (or staging models, as dbt now calls them) only include basic transformations and data cleaning but are the layer where standardization of your data happens. Base models are the code that changes an integer from one unit to another, or casts a timestamp from one timezone to another.
Your data is standardized here, so isn’t that, by definition, a data model?
Sure, they include very basic code changes, but who are we to say that these aren’t important enough to be considered data modeling? Data is being changed from one form to another.
Example code:
select
    ad_id as facebook_ad_id,
    account_id,
    ad_name as ad_name_1,
    adset_name,
    month(date) as month_created_at,
    date::timestamp_ntz as created_at,
    spend
from {{ source('facebook', 'basic_ad') }}
In this base model, we are renaming columns, deriving a new column (month_created_at) from a raw date column, and casting that date column to a timestamp.
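The same kind of standardization applies to the unit and timezone conversions mentioned above. A hypothetical staging model might look like this; the source, table, and column names are made up for illustration, and the timezone function assumes a Snowflake warehouse, like the timestamp_ntz cast above:

select
    order_id,
    weight_grams / 1000.0 as weight_kg,  -- convert grams to kilograms
    convert_timezone('UTC', 'America/New_York', ordered_at::timestamp_ntz)
        as ordered_at_eastern  -- shift the timestamp to Eastern time
from {{ source('shop', 'raw_orders') }}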
Intermediate and Core Models
Intermediate and core models are made up of more complex SQL code. They contain joins and sometimes aggregations, and they make it possible to turn your dataset from one thing into something completely different.
These models produce the datasets that are then used by business stakeholders. They’re the pretty ones, containing all of the metrics at the right granularity that business leaders need.
dbt transformed the way data models are written because of its ability to organize code modularly. What was previously thousands of lines of SQL is now multiple “chunks” (or models), each a digestible amount of code. These chunks can be referenced in other data models so that the code doesn’t need to be rewritten or run more than once in production.
Example code for an intermediate model:
with

adwords_spend_summed as (

    select
        created_at as spend_date,
        sum(cost) as spend
    from {{ ref('stg_google_campaigns') }}
    where cost != 0
    group by created_at

)

select * from adwords_spend_summed
Here we are aggregating our spend by day so that it can then be used in a downstream model that will combine spend from all different marketing sources.
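To make that concrete, here is a sketch of what such a downstream model might look like. The model names referenced are hypothetical; the assumption is that each marketing source has its own summed intermediate model like the one above:

with

google_spend as (
    select spend_date, spend, 'google' as source
    from {{ ref('int_google_spend_summed') }}
),

facebook_spend as (
    select spend_date, spend, 'facebook' as source
    from {{ ref('int_facebook_spend_summed') }}
)

-- stack the per-source daily spend into one combined dataset
select * from google_spend
union all
select * from facebook_spend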
Conclusion
I know we all won’t agree on this topic. Those who have been in the data industry for 20 years will most likely hold views that conflict with those of newbies like me. But I do think it’s important that we respect one another’s experiences and views.
There is never just one set way of doing things. We all have the power to join an industry and disrupt it when there’s room for change and that change is what’s needed. Imagine if we continued to do everything in the data world the way it was done 20 years ago. Where would we be right now?
Healthy debate is good. It’s the sharing of different ideas that helps us learn and progress in ways we didn’t even know were possible. I challenge you to leave a comment disagreeing with something I, or anyone, has said.
But do so in a respectful way. Shooting someone down and calling them stupid is not a way to get others to listen to what you have to say. It’s rude and quite honestly devalues EVERYTHING you are saying.
Unfortunately, this is something I’ve had to deal with often as a woman in tech. The reason I started writing in the first place was to make this industry more inviting for anyone who has an interest in working in it. We should all feel like our opinions are valued.
Let’s work to expand ideas by trying to understand where one another is coming from, rather than dragging each other down.
Until next week!
Madison Mae
Thanks Madison! As a data modeler with 20 years of experience, it's wonderful to hear from someone with such a fresh perspective. Recently there has been an explosion of fresh interest in data modeling, especially with the rise of cloud data warehouses replacing Hadoop and dbt replacing legacy transformation/ETL tools.
One important thing I wish more people understood is that engineering is a design discipline. Design isn't just something "designers" or "architects" do, especially in analytics and software. dbt has been a gigantic step forward in transformation modeling and physical modeling of analytics platforms. Sometimes, though, there's a need for a modeling tool like my current favorite, sqlDBM, a cloud-first tool that integrates beautifully with dbt and cloud data platforms. Visual modeling tools are useful for extending dbt for non-trivial UX design of consumption models, when analysts and decision makers need an intuitive interface to the data, often a galaxy of many star schemas. When the required ease of use for self-service justifies the extra effort to design a model from the business perspective, a visual tool can be used to perform the conceptual and logical modeling and then generate a physical model in DDL, which the physical modeler (engineer) can modify and fill, preferably with a well-designed modular model using dbt.
One more note: I love the advancement in thinking about how star schemas should relate to each other. We can decouple them with data mesh architecture using business keys, instead of using master conformed dimensions with surrogate keys, which dbt thankfully doesn't even support. Then each data product in the data mesh can be designed in any of many different ways, whichever makes sense for its consumers, people or machines. The focus becomes which model provides the best interface, not turf wars and guild gatekeeping over terminology and the supposed "right" way of doing it. I love how you put that! Thanks again!