November 30th,2023
I want to talk about this interesting small SQL Optimisation I have read in PlanetScale newsletter.
Let's say there are 2 tables. Users and posts and you want to get all the users who have written popular(Posts. viewCount > 90000) posts. Both of them are linked together by user_id.
You would normally write it like this
SELECT distinct user_id FROM Users
LEFT JOIN Posts
ON Users.user_id = Posts.user_id
WHERE Posts.viewCount > 9000
Here, we are joining the two tables which will produce duplicate entries if a user has written multiple popular posts, filtering them and finally we are deduplicating them. That's a bit too much Instead, we can use SubQueries to improve performance and reduce Cognitive Overhead(IMO)
SELECT user_id FROM Users WHERE user_id IN (
SELECT user_id FROM Posts WHERE viewCount > 9000)
Here, we do not need to JOIN and de-duplicate. If we do not need the Posts Table data, it is kind of pointless to do a join and discard all of it.
TLDR: If you need to filter a table based on a related Table and you do not need any of the data in a related table, just do a subquery instead of a join