Hi,
first of all: thank you so much for your feedback 🙏, I really appreciate it when people share their views and experience. Knowledge and experience sharing is very important and helps the data community in general!
Let me share some thoughts from my side on this:
You are absolutely right! Simply aiming for de-normalization can be a dangerously oversimplified recommendation. It is always good to evaluate the data model with proper testing. De-normalization is not always the solution, but it should at least be considered when designing a warehouse model. As so often, the answer to how to model the data in a cost- and performance-efficient way is, unfortunately, "it depends".
Especially with small tables on the right side of the JOIN, BigQuery utilizes Broadcast Joins, sending the full dataset of the small table to each slot that processes the larger table. That way, normalization has no negative impact on performance. Actually, the opposite can be the case thanks to the reduced data redundancy.
When BigQuery is not using the Broadcast Join, it uses the Hash Join approach as you mentioned. In this case, BigQuery uses hash and shuffle operations so that matching keys are processed in the same slot in order to perform a local join. However, compared to a Broadcast Join, this can be an expensive operation as data needs to be moved.
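To make this concrete, here is a minimal sketch (project, table and column names are made up for illustration) of a normalized join where the dimension side is small. As long as that side stays small, BigQuery can broadcast it to every slot scanning the fact table; if both sides are large, it falls back to the hash/shuffle strategy described above:

```sql
-- Hypothetical normalized model: a large fact table joined to a small dimension table.
-- With a small right-hand side, BigQuery can use a Broadcast Join;
-- if both tables are large, it falls back to a Hash Join with shuffling.
SELECT
  o.order_id,
  o.order_amount,
  c.country_name
FROM `my_project.sales.orders` AS o      -- large fact table
JOIN `my_project.sales.countries` AS c   -- small dimension table
  ON o.country_id = c.country_id;
```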
If you find yourself in a situation where Hash Joins are being used, there are still ways to potentially improve performance. At a minimum, aim to define the join columns as cluster columns. This colocates related data in the same columnar files, reducing the impact of shuffling.
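As a small sketch of that idea (again with hypothetical names), the join key can be declared as a cluster column when the table is created:

```sql
-- Sketch: cluster the fact table on the join key so related rows are colocated,
-- reducing the data movement needed for Hash Joins on country_id.
CREATE TABLE `my_project.sales.orders`
(
  order_id     STRING,
  country_id   STRING,
  order_amount NUMERIC,
  order_date   DATE
)
CLUSTER BY country_id;
```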
In the end, it depends on the data model and how big the normalized tables are. If redundancy can be reduced with a normalized structure while keeping the JOIN tables small enough for Broadcast Joins to be used, this is a better solution than enforcing a de-normalized approach. For tables bigger than 10 GB, however, this should be evaluated with concrete benchmarks, which is always recommended anyway: never build up the data model on pure theory; benchmark it for your use case to find the best solution while considering all options, including de-normalization, normalization and techniques like nested, repeated columns.
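For completeness, here is a rough sketch of what the nested, repeated column approach could look like (schema and names are purely illustrative); it avoids the JOIN entirely at the cost of some redundancy, which is exactly the trade-off a benchmark should settle:

```sql
-- Hypothetical de-normalized variant using nested (STRUCT) and repeated (ARRAY) columns,
-- so order, customer and line item data live in one table and no JOIN is needed.
CREATE TABLE `my_project.sales.orders_denormalized`
(
  order_id   STRING,
  order_date DATE,
  customer   STRUCT<customer_id STRING, country_name STRING>,
  line_items ARRAY<STRUCT<product_id STRING, quantity INT64, price NUMERIC>>
);
```

When benchmarking the two variants, comparing slot time and bytes shuffled in the query execution details for your real workload is a practical way to decide.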
To make this clearer, I extended the de-normalization section of the article. Thank you so much again for contributing and making this an even better source for knowledge sharing!
Enjoy ✌️.