This tutorial is part of the SQL Made Easy series of posts.
In this tutorial we will discuss:
- Why to use the DISTINCT command
- The syntax involved
- Working with counts
- Working with multiple columns
Sometimes when working with large tables in SQL you will come across columns that contain duplicate values. Most of the time these duplicates aren’t an issue as they aren’t meant to be a unique identifier or primary key, but if you are looking to query the table in a way that only uses these columns it is often best to make things as simple as possible – this is where the SQL SELECT DISTINCT command comes in.
SELECT vs SELECT DISTINCT
For this tutorial we are going to look at a dataset of top selling video games that is available on Kaggle.
This file contains information on the top selling video games globally by name, platform, year, genre and publisher.
If we were to write a query to select the top 10 ranking games we could go with the following:
SELECT Rank, Name, Platform, Global_Sales FROM vgsales LIMIT 10;
This would result in the below table:
|2||Super Mario Bros.||NES||40.24|
|3||Mario Kart Wii||Wii||35.82|
|4||Wii Sports Resort||Wii||33|
|5||Pokemon Red/Pokemon Blue||GB||31.37|
|7||New Super Mario Bros.||DS||30.01|
|9||New Super Mario Bros. Wii||Wii||28.62|
If we wanted to see a list of which publishers for example appear within the top 100 sales the result with a simple select query wouldn’t look as readable:
SELECT Publisher FROM vgsales WHERE Rank <= 100
This would result in the below output:
|16||Microsoft Game Studios|
This is where the DISTINCT command makes life far easier. In the above example we don’t have a huge amount of unique publisher names if we wanted to manually count and make note of them one by one, but datasets rarely work with such small volumes.
The DISTINCT command is used at the very start of the SELECT query before the column name we want to work with. For example:
SELECT DISTINCT Publisher FROM vgsales WHERE Rank <= 100
This gives us a far smaller output of each individual publisher name:
|Microsoft Game Studios|
|Sony Computer Entertainment|
Working with aggregated results
The DISTINCT command can also be used when working with aggregate functions such as counts. Let’s say you simply want to know how many unique publishers appear within the top 100 sales but don’t really need to know the names of all of them. In this case you would simply include DISTINCT publisher within the count as below:
SELECT count(DISTINCT Publisher) FROM vgsales WHERE Rank >= 100
How to work with querying multiple columns
When working with multiple columns, the DISTINCT command will output a unique combination of results in the order we choose. Let’s say for example we wanted to query a list of platforms and any years they had a game sell more than 10 million units. In a standard SELECT query this would result in duplicates of the same lines if that platform had multiple games sell well that year.
When working with multiple columns, especially in this context working with dates, it is worth using the ORDER BY function as well to ensure the results display in a way that is easy to read. In this case a simple order by of platform alphabetically, followed by year in chronological order is the best approach. See below examples.
SELECT platform, year FROM vgsales WHERE global_sales >= 10 ORDER BY platform, year
SELECT DISTINCT platform, year FROM vgsales WHERE global_sales >= 10 ORDER BY platform, year
The second result gives a far better snapshot of what we want.
A similar result can be achieved using the GROUP BY command, which as we start working with more complex queries is typically a better approach as it is more powerful in the way it displays results.
For more tutorials in the SQL Made Easy series check out the below: