Testing for Table-Specific Data Issues
How to use singular tests in dbt to test tables at the source
Trying to make sense of database tables that have been around since the start of a company is a daunting task. Often times these are tables created without scale in mind, engineers just trying to put together a scrappy solution so the business can get off its feet.
But, years later, when we realize we have all this data that we can barely make sense of, it becomes a huge problem. The business is ready to scale. We want to understand where values are coming from, how rows of data are being generated, and where their place is within the business.
A few weeks ago I embarked on the project of documenting all the tables within a database and mapping their relationship to one another. Documenting the tables used all the time was the easy part. The hard part was getting to the purpose of the tables I never even knew existed.
With investigation comes problems bubbling to the surface. Whoever says their data is perfect is in the dark about something.
I found an issue in the mappings of two of the tables. For every one record in table a, there were two records from table b. One of those records in table b shouldn’t have been there. But, low and behold, it’s there.
Of course, these are two records that can only be updated manually in order to solve the issue. Even data people can’t automate everything!
Discovering which record is correct and which one should be deleted is a conversation between the data team and the corresponding stakeholder. Many times stakeholders aren’t aware of these issues and, quite frankly, have no idea what you’re talking about. It’s amazing how certain parts of a business can function without anyone’s knowledge.
Testing as a best practice
Every time I find a unique issue in my data, that I don’t want to happen going forward, I write a test. Tests are the best way to alert you of problems at the source, before they trickle downstream into a stakeholder’s path. Tests alert you when something has gone wrong. They are a key part of high-quality data.
If you’re a dbt user, then you’re probably familiar with dbt tests. dbt has a bunch of pre-written tests like not_null
, unique
, and accepted_values
that you can take advantage of. But, did you know you can write your own custom tests?
Singular tests in dbt
The tests you are familiar with are called generic tests. They are “generic”, meaning they can be applied to whatever asset you specify. It is a parameterized query that accepts arguments. You simply apply them to models, columns, seeds, or sources in your src.yml
file, and the parameters are passed.
You can think of singular tests as your one-off tests that can only be applied to certain resources. You can’t apply them to a bunch of different models because they aren’t parameterized.
Singular tests are stored as SQL files within your tests folder in your dbt project. They are simply a SQL query. When you run dbt test
this SQL code will also get run. If it returns any data, the test will fail.
This means you want to write a SQL query that will fail. If it fails, you’re alerted. If it doesn’t, then your data passes the data quality check you put in place.
How to use singular tests
I implement singular tests whenever I want to check for some obscurity in my data. In the situation I mentioned earlier, I want to look for duplicates when joining two tables. If I join table a with table b, there should only be one record for each unique id in table a.
Here, I want to write a query that will fail this check. I need to join the tables on their common field, group them by the id that should be distinct, and then check how many of the other table’s values exist for that unique id.
select
a.id
from {{ ref(‘table_a’) }} a
left join {{ ref(‘table_b’) }} b
on a.b_id = b.id
group by a.id
having count(b.id)>1
I am grouping by id
so I can then count the number of duplicates created from the join. This tells me that something is wrong with the data. I then add a having
clause to filter out the non-dups.
This is a SQL hack for writing all of your code in one query, rather than having to use a CTE. When you use group by
and having
in one query, you are doing the same thing as using group by
and count
and then filtering by count
><= some value in a CTE.
This query will produce the id
from table a that has multiple values of the id
from table b. Because it is returning data, the test will fail.
Always remember, you are writing singular tests in dbt to fail. When they fail, you will be alerted.
Why use singular tests
Writing tests like these is the key to confidence in your data. You don’t want to worry about whether or not a stakeholder really changed how they manually input their data if their process is what caused it in the first place. Or if an issue you already put lots of time into is once again going about unresolved.
Singular tests allow you to test one-off issues within data tables. They help make tests personal to your specific data landscape. Sometimes a generic test doesn’t fit your needs. And that’s ok. That’s why we have both!
If you have any advice on testing different situations in your data, leave a comment below. We would all love to learn!
In the meantime, check out my two latest articles on data quality. Both discuss the importance of documentation and the power of a data catalog.
📚 What is a data catalog and why is it important?
✏️ Data Documentation Best Practices
Have a great week!
Madison Mae