For many years my go to platform of choice for anything related to data was Microsoft Excel. Hours spent downloading various sources of data into a spreadsheet and running sums, counts, vlookups, countifs, pivot tables and lots and lots of formatting.
As requirements got more complicated and datasets grew larger, Excel started to display some limitations. Files would crash (losing hours of work when inevitably forgetting to save), and some formulas would get so convoluted it was near impossible to troubleshoot if any numbers looked wrong.
SQL (Structured Query Language – pronounced ‘sequel’ ) is used by a wide range of industries for various functions, but one of it’s key uses is in data storage and transformation.
SQL has been around for a very long time and likely won’t be going anywhere anytime soon. Because of this it’s always a great time to learn as it can make for an invaluable skill. Other coding languages such as Python and R have similar data transformation abilities (more on that in another post), but if you only want to learn one coding language for data and analytics learn SQL.
Working with databases and SQL rather than individual spreadsheets has a whole range of benefits from time saving, automation, accurate outputs, scalable results, stronger team collaboration, and on top of all this actually works well alongside Excel!
With this in mind, here is why you should stop relying on Excel and learn SQL:
SQL is really easy to pick up
One of the nice things about Excel is that it’s quite simple. Most people can pick up the basic functions pretty easily, while the idea of writing code can be intimidating.
The great thing about SQL is compared to other coding languages is that it is really easy to understand, largely due to how literal a lot of the statements are.
For example. Lets say you have a set of data containing some figures on all of the US states and wanted to see the top 10 capital cities ranked by highest population. All you would need to do is type the below query:
SELECT State, Capital, Population FROM UnitedStatesDataSet ORDER BY Population LIMIT 10;
As you can see – very literal! First you select the columns you want to display, where the data is, how to order it, and how many records you want. In this case if you don’t choose to limit you would just get a table of all 52 states. This is handy when working with larger data sets and want to quickly see samples.
The resulting data would look something like this:
It’s so much faster
So often in Excel I eventually get to the point where the file becomes completely unusable. There are multiple tabs, each with thousands of lines of data, countless extra columns added with formulas, lots of formatting that has been applied etc. and it never ends well.
When running queries with SQL on a database your computers power isn’t nearly as important largely due to where the data itself sits. In some businesses it will be stored on site on a server, or more and more often these days will be saved on a cloud database (the big ones being Google, Amazon and Microsoft’s platforms).
Depending on where the database is hosted SQL isn’t always going to be lightning fast, but it is infinitely more stable and efficient than Excel when working with complicated queries – for example joins on joins (think multiple series of vlookups referencing various tabs!)
This leads nicely into the next point…
You can work with more data
This advantage will totally depend on what kind of data you are working with and how granular it goes, but one of the biggest limitations of Excel is that it does not allow you to work with more than 1,048,576 rows. CSV files themselves are unlimited in size due to the nature of the file itself, but Excel won’t read anything beyond that number.
With SQL you don’t have this limit, largely because you don’t have each row and cell ‘active’ as you are working. You type the query into the interface of whatever SQL platform you are using and the data itself only appears when you need to display the final result. You also have no real ability to interact with that table result outside of altering the code and running it again, or exporting back to a CSV so it is far less resource intensive.
No original data is ever modified
Alright this one isn’t always true, but in an ideal world it would be. Nothing is really stopping you from importing a dataset and modifying it in its original state, however if you stick to best practice techniques (such as not modifying raw data) then any mistakes you may make along the way will never impact the original data. It will always be there ready to run a new set of queries against for whatever purpose you want.
Every single action is logged and repeatable
This one isn’t as clear of a benefit initially if you are used the process of manually editing fields in Excel, but once you understand why it is so important and beneficial it becomes very clear why SQL (and other languages like Python as this applies there too) is a good idea even with smaller data sets.
Depending on your project, sometimes you may need to take over other peoples work, or hand over your own for support. Running data processing in SQL makes collaboration really easy.
If you were working in Excel, unless you had a notepad doc open as you worked logging every single step – every time you overwrote a value in a cell, added a new column, wrote a formula etc. then it would be quite challenging to really hand over a piece of work to someone else if you were unavailable and the work needed to be completed or reproduced from scratch.
In SQL these steps are logged by default simply due to the nature of being able to read the code that was written. You can see exactly what has been done from start to finish (assuming you write the code in order – surprisingly not as common as it sounds).
What makes SQL even better is the ability to add comments. SQL (and other languages) have a built in comment function to make notes about what is going on in each step and something that is invaluable the more complex a piece of work gets.
What if excel is required in a business?
All of this being said, Excel, and spreadsheets in general, absolutely still have their place in the world. Sometimes with very small data sets its honestly easier to just jump into a spreadsheet and run a few quick formulas and make a pivot table.
In some cases Excel is going to be required. Maybe a report needs to be sent out in Excel format as its what your business uses. Luckily, SQL and Excel can also compliment each other nicely!
There are a couple of ways you can still leverage SQL when you are required to work with Excel:
Copy data from a query result to a CSV
- If the final output of a SQL query is a nice clean table it is very easy to copy this over to a spreadsheet file and make some quick pivot tables and charts with it as long as the total number of rows stays under one million.
Excel has the ability to connect directly to SQL!
- A lot of the time when working with larger data sets in Excel the best approach to generating a clean and easy to read output is to use pivot tables.
- Excel has a feature that lets you connect directly to a SQL database and even run a basic query right in the software itself that will pull in a set of data to make pivots out of.
- This option is great as it gives you the best of both worlds – powerful and efficient data manipulation with SQL, and easy to generate reporting with Excel!
Keep an eye out for an upcoming post with instructions on how to utilise the combination of SQL and Excel