Today I want to talk about an important and not fully understood part of SQL Server, collation. Collation effects everything from ordering to query performance.
What is it?
Collation is a reference to a certain set of rules that ultimately determine how all data is sorted and compared. Character data is sorted by the correct sequence which is defined by the rules of the collation and comes with options such as case sensitivity, accent sensitivity, Japanese kana character types, and character width.
More simply put, collation can be thought of as a sort order. In English, collation can be a fairly simplistic as the ordering is done via ASCII code. Using other languages such as Spanish can become more difficult as accented characters and character groupings are now added into the rules (example: “á” or “ll”) and need to be sorted as if they were the same letter.
How does it work?
Since collation is a set of rules for how strings of character data are sorted and compared it’s safe to say that we utilize this mostly at run-time when the ORDER BY clause is utilized. For example, from the results of the ORDER BY clause, an English speaker (utilizing SQL_Latin1_General_CP1_CI_AS) would see “Chiapas” before “Colima” in ascending order. A Spanish speaker (utilizing Traditional_Spanish_CI_AS) on the other hand would expect words beginning in”Ch” to appear after words beginning in “C” thus resulting in “Colima” sorting before “Chiapas”.
Collation works a little differently between Unicode strings and non-Unicode strings. When a collation is specified for non-Unicode character data (CHAR, VARCHAR, & TEXT) there is a particular code page that is associated with the collation, which can also be shared between multiple collations.
Unicode character data (NCHAR, NVARCHAR, & NTEXT) do not have associated code pages because Unicode data handles most universal characters.
What does each part of the collation description mean?
There are multitudes of collations that exist within SQL Server and picking the collation that’s right for your business can be tricky. For most American companies SQL_Latin1_General_CP1_CI_AS is the default. Luckily for us Microsoft has provided us with a system function to display the collations and their descriptions (sys.fn_helpcollations()).
SELECT name, description
FROM sys.fn_helpcollations()
You can see that each collation has a general description, a case sensitivity setting, an accent sensitivity setting, a kanatype sensitivity setting, a width sensitivity setting for Unicode data, and a sort order and code page for non-Unicode data.
For SQL_Latin1_General_CP1_CI_AS the description is as follows:
- Latin1-General
- Case-insensitive
- Accent-sensitive
- Kanatype-insensitive
- Width-insensitive for Unicode data
- Sort Order 52 on code page 1252 for non-Unicode data
How can we use it?
SQL Server allows us to not only set a default collation for each database but we are able to alter collation on the fly! Thankfully Microsoft built in a function known as COLLATE.
DECLARE @String VARCHAR(10)
SET @String = ‘abcdefg’
IF @String = ‘ABCDEFG’
PRINT ‘Match’
ELSE
PRINT ‘No Match’
If your SQL Server default collation is set to SQL_Latin1_General_CP1_CI_AS then this query will result in a “Match”. But what if we use collate to set the collation on the fly?
DECLARE @String VARCHAR(10)
SET @String = ‘abcdefg’
IF @String = ‘ABCDEFG’ COLLATE SQL_Latin1_General_CP1_CS_AS
PRINT ‘Match’
ELSE
PRINT ‘No Match’
This now results in a “No Match” due to the collation changing from case-insensitive to case-sensitive at run-time. If we change the value of @String to all uppercase then we will receive a match again using this collation change. If any part of the string does not match the case of the argument then it will return as “No Match”.