|

SQL SELECT DISTINCT Explained

This tutorial is part of the SQL Made Easy series of posts.

In this tutorial we will discuss:

  • Why to use the DISTINCT command
  • The syntax involved
  • Examples
  • Working with counts
  • Working with multiple columns

Sometimes when working with large tables in SQL you will come across columns that contain duplicate values. Most of the time these duplicates aren’t an issue as they aren’t meant to be a unique identifier or primary key, but if you are looking to query the table in a way that only uses these columns it is often best to make things as simple as possible – this is where the SQL SELECT DISTINCT command comes in.

SELECT vs SELECT DISTINCT

For this tutorial we are going to look at a dataset of top selling video games that is available on Kaggle.

This file contains information on the top selling video games globally by name, platform, year, genre and publisher.

If we were to write a query to select the top 10 ranking games we could go with the following:

SELECT Rank, Name, Platform, Global_Sales
FROM vgsales
LIMIT 10;

This would result in the below table:

RankNamePlatformGlobal_Sales
1Wii SportsWii82.74
2Super Mario Bros.NES40.24
3Mario Kart WiiWii35.82
4Wii Sports ResortWii33
5Pokemon Red/Pokemon BlueGB31.37
6TetrisGB30.26
7New Super Mario Bros.DS30.01
8Wii PlayWii29.02
9New Super Mario Bros. WiiWii28.62
10Duck HuntNES28.31

If we wanted to see a list of which publishers for example appear within the top 100 sales the result with a simple select query wouldn’t look as readable:

SELECT Publisher
FROM vgsales
WHERE Rank <= 100

This would result in the below output:

RankPublisher
1Nintendo
2Nintendo
3Nintendo
4Nintendo
5Nintendo
6Nintendo
7Nintendo
8Nintendo
9Nintendo
10Nintendo
11Nintendo
12Nintendo
13Nintendo
14Nintendo
15Nintendo
16Microsoft Game Studios
17Take-Two Interactive
18Take-Two Interactive
19Nintendo
20Nintendo

This is where the DISTINCT command makes life far easier. In the above example we don’t have a huge amount of unique publisher names if we wanted to manually count and make note of them one by one, but datasets rarely work with such small volumes.

The DISTINCT command is used at the very start of the SELECT query before the column name we want to work with. For example:

SELECT DISTINCT Publisher
FROM vgsales
WHERE Rank <= 100

This gives us a far smaller output of each individual publisher name:

Publisher
Activision
Atari
Bethesda Softworks
Electronic Arts
Microsoft Game Studios
Nintendo
Sega
Sony Computer Entertainment
SquareSoft
Take-Two Interactive
Ubisoft

Working with aggregated results

The DISTINCT command can also be used when working with aggregate functions such as counts. Let’s say you simply want to know how many unique publishers appear within the top 100 sales but don’t really need to know the names of all of them. In this case you would simply include DISTINCT publisher within the count as below:

SELECT count(DISTINCT Publisher)
FROM vgsales
WHERE Rank >= 100

Output:

Count Publisher
11

How to work with querying multiple columns

When working with multiple columns, the DISTINCT command will output a unique combination of results in the order we choose. Let’s say for example we wanted to query a list of platforms and any years they had a game sell more than 10 million units. In a standard SELECT query this would result in duplicates of the same lines if that platform had multiple games sell well that year.

When working with multiple columns, especially in this context working with dates, it is worth using the ORDER BY function as well to ensure the results display in a way that is easy to read. In this case a simple order by of platform alphabetically, followed by year in chronological order is the best approach. See below examples.

SELECT only

SELECT platform, year
FROM vgsales
WHERE global_sales >= 10
ORDER BY platform, year

Result:

PlatformYear
3DS2011
3DS2011
3DS2013
3DS2014
DS2004
DS2005
DS2005
DS2005
DS2005
DS2005

SELECT DISTINCT

SELECT DISTINCT platform, year
FROM vgsales
WHERE global_sales >= 10
ORDER BY platform, year

Result:

PlatformYear
3DS2011
3DS2013
3DS2014
DS2004
DS2005
DS2006
DS2009
DS2010
GB1989
GB1992

The second result gives a far better snapshot of what we want.

A similar result can be achieved using the GROUP BY command, which as we start working with more complex queries is typically a better approach as it is more powerful in the way it displays results.

For more tutorials in the SQL Made Easy series check out the below:

SQL Made Easy.SQL SELECT DISTINCT

Similar Posts