SQL Collation

Today I want to talk about an important and not fully understood part of SQL Server, collation. Collation effects everything from ordering to query performance.


 

What is it?

Collation is a reference to a certain set of rules that ultimately determine how all data is sorted and compared. Character data is sorted  by the correct sequence which is defined by the rules of the collation and comes with options such as case sensitivity, accent sensitivity, Japanese kana character types, and character width.

More simply put, collation can be thought of as a sort order. In English, collation can be a fairly simplistic as the ordering is done via ASCII code. Using other languages such as Spanish can become more difficult as accented characters and character groupings are now added into the rules (example: “á” or “ll”) and need to be sorted as if they were the same letter.


 

How does it work?

Since collation is a set of rules for how strings of character data are sorted and compared it’s safe to say that we utilize this mostly at run-time when the ORDER BY clause is utilized. For example, from the results of the ORDER BY clause, an English speaker (utilizing SQL_Latin1_General_CP1_CI_AS) would see “Chiapas” before “Colima” in ascending order. A Spanish speaker (utilizing Traditional_Spanish_CI_AS) on the other hand would expect words beginning in”Ch” to appear after words beginning in “C” thus resulting in “Colima” sorting before “Chiapas”.

Collation works a little differently between Unicode strings and non-Unicode strings. When a collation is specified for non-Unicode character data (CHAR, VARCHAR, & TEXT) there is a particular code page that is associated with the collation, which can also be shared between multiple collations.

Unicode character data (NCHAR, NVARCHAR, & NTEXT) do not have associated code pages because Unicode data handles most universal characters.


 

What does each part of the collation description mean?

There are multitudes of collations that exist within SQL Server and picking the collation that’s right for your business can be tricky. For most American companies SQL_Latin1_General_CP1_CI_AS is the default. Luckily for us Microsoft has provided us with a system function to display the collations and their descriptions (sys.fn_helpcollations()).

SELECT name, description
FROM sys.fn_helpcollations()

Collation

You can see that each collation has a general description, a case sensitivity setting, an accent sensitivity setting, a kanatype sensitivity setting, a width sensitivity setting for Unicode data, and a sort order and code page for non-Unicode data.

For SQL_Latin1_General_CP1_CI_AS the description is as follows:

  • Latin1-General
  • Case-insensitive
  • Accent-sensitive
  • Kanatype-insensitive
  • Width-insensitive for Unicode data
  • Sort Order 52 on code page 1252 for non-Unicode data

 

How can we use it?

SQL Server allows us to not only set a default collation for each database but we are able to alter collation on the fly! Thankfully Microsoft built in a function known as COLLATE.

DECLARE @String VARCHAR(10)
SET @String = ‘abcdefg’

IF @String = ‘ABCDEFG’
PRINT ‘Match’
ELSE
PRINT ‘No Match’

If your SQL Server default collation is set to SQL_Latin1_General_CP1_CI_AS then this query will result in a “Match”. But what if we use collate to set the collation on the fly?

DECLARE @String VARCHAR(10)
SET @String = ‘abcdefg’

IF @String = ‘ABCDEFG’ COLLATE SQL_Latin1_General_CP1_CS_AS
PRINT ‘Match’
ELSE
PRINT ‘No Match’

This now results in a “No Match” due to the collation changing from case-insensitive to case-sensitive at run-time. If we change the value of @String to all uppercase then we will receive a match again using this collation change. If any part of the string does not match the case of the argument then it will return as “No Match”.

SQL Collation

Visibility with multivalued parameters

Every SQL Server Reporting Services developer knows of the magical “hidden” property that lies within each object placed in your report. Using this property will allow you to hide certain objects on command and can be used for many things. You can choose to hide columns in a tablix, hide your charts based off of a parameter, etc. In one of my latest reports I had to get SSRS to show/hide details based off of a multi-valued parameter. Hiding off of a single parameter using an expression is easy and everyone knows it, but showing/hiding based off of how many selections were made in a single parameter? I didn’t know how and I needed to find out. After some research I came across article after article explaining various methods to achieve this but none of them seemed to work except for 2. These methods are as follows:


Method #1: Using COUNTROWS()

This method worked great up until I ran into the caveat which was an all or none scenario. Either I choose all and hide the details or I choose any other number and the details are shown. This isn’t entirely what I was hoping for as I wanted to show all details when only ONE value was chosen and no details when 2 or more were chosen. This method would show all details as long as the count of selected values was less than the count of rows in the dataset. This could be useful in some situations but not in mine. The following expression is as follows:

=IIF(COUNTROWS("dataset").Equals(Parameters!parameter.Count),true,false)


Method #1: Using a boolean expression

This method was much more simplistic and met my requirements easily. In the code from method #1 you will notice the “Parameters!parameter.Count” section. This function counts the number of selections made within a parameter. Since my results are based off of a boolean value anyway why don’t we just create a boolean expression and let the results determine the outcome? This theory worked perfectly, code for example:

(Parameters!parameter.Count) > 1)

Such a simple answer and I didn’t even know it. But now we know and knowing is half the battle.

psa gi joe fensler body massage

Visibility with multivalued parameters

CTEs and Recursive CTEs

When programming in T-SQL there’s always a use for temp tables and table variables but SQL Server has also introduced another tool that works similarly to the previously mentioned table types, the Common Table Expression (CTE). I really like to think of a CTE as an inline temp table that’s created on the fly and is very useful for data manipulation on an initial dataset to produce your final dataset. The example of a CTE below uses the AdventureWorks2014 database.

;
WITH CTE
AS ( SELECT MaritalStatus ,
YEAR(BirthDate) AS YearBorn ,
SUM(VacationHours) AS VacationTotal ,
SUM(SickLeaveHours) AS SickLeaveTotal
FROM HumanResources.Employee
GROUP BY YEAR(BirthDate) ,
MaritalStatus
)
SELECT *
FROM CTE;

CTEs are split up into two main groups; the inner query and the outer query. The inner query generates the dataset in which you wish to alter for your final results and the outer query does the alterations for your final results. CTEs are also must be designated as the start of a batch so make sure to properly mark your batches done with a semicolon so you don’t throw an error.

One of my recent tasks included creating a rolling 12 month analysis of sales on a SQL Server 2005 instance (yes some people are still in the stone ages) which means that I don’t get to have the pleasure of using all of today’s fancy functions, I have to create the logic the old school way in my queries. But sure enough CTEs are able to be used and luckily CTEs are also cool enough to reference themselves inside itself. Utilizing the concept of recursion you are able to create a base query (known as the “anchor” query) and reference that query inside of another query (the recursive query) inside of the CTE…RECURSION!

Let’s use this tactic to generate a list of dates and a Month-Year concatenated name for the last 12 months including the current month.

;
WITH Last12Months ( CurrDate, MonthLabel )
AS (
–Anchor Query
SELECT CurrDate = GETDATE() ,
MonthLabel = DATENAME(MONTH, GETDATE()) + ‘ ‘
+ CONVERT(VARCHAR(4), YEAR(GETDATE()))
UNION ALL
–Recursive Query
SELECT CurrDate = DATEADD(MONTH, -1, cte.CurrDate) ,
MonthLabel = DATENAME(MONTH,
DATEADD(MONTH, -1, cte.CurrDate))
+ ‘ ‘ + CONVERT(VARCHAR(4), YEAR(DATEADD(MONTH, -1,
cte.CurrDate)))
FROM Last12Months cte
WHERE cte.CurrDate > DATEADD(MONTH, -11, GETDATE())
)
SELECT *
FROM Last12Months;

Recursion

And now, the proper Inception meme to go with today’s post.

CTEs and Recursive CTEs

ASCII() Function

Hey look at that! A function that returns the number code of a specified character! Doesn’t this sound familiar? This function works just as the CHAR() function does except the complete opposite. ASCII stands for American Standard Code for Information Interchange. When going down to the most basic levels of computing, computers run off of the good old binary code (a series of 1s or 0s, yes or no) which also means that computers only understand numbers. ASCII was developed to communicate text to computers.

SELECT  ASCII(‘R’) AS R, ASCII(‘r’) AS r

Rr

They are broken down into 3 categories:

  1. ASCII control characters (code 0-32)
    • These characters are control codes used to control peripherals
  2. ASCII printable characters (code 33-127)
    • These represent letters, digits, punctuation marks, and some other symbols. All of these will be found on your common keyboard.
  3. ASCII extended printable characters (128-255)
    • The extended codes will be based off of whatever collation your database is set to. The following table is based off of the ISO Latin-1 collation.

Let’s have a little fun with this knowledge! The first part will be using a WHILE loop to grab each character in a string and show the ASCII and CHAR codes for each character.

DECLARE @String VARCHAR(50) = ‘SQLBLOG’;
DECLARE @Location INT = 0;

WHILE @Location < LEN(@String)
BEGIN

SELECT ASCII(SUBSTRING(@String, @Location + 1, @Location + 2)) AS [ASCII Code] ,
CHAR(ASCII(SUBSTRING(@String, @Location + 1, @Location + 2))) AS [CHAR Code];

SET @Location = @Location + 1;

END;

ASCII_SQLBlog

This gives us individual results containing the ASCII number and the character it refers to in the designated string (“SQLBLOG”). Since there are seven characters in this string there will be seven results returned. This is a good practice exercise but it’s not very practical for every day use. But what if we wanted to use this same logic to determine what characters cannot go into a table? Well besides using a complicated check constraint let’s use a trigger that’s activated on insert or update of the table.

CREATE TRIGGER [dbo].[InvalidCharacterInName] ON [dbo].[Employees]
INSTEAD OF INSERT, UPDATE
AS
BEGIN
DECLARE @Location INT = 0;
DECLARE @String VARCHAR(100) = ( SELECT [Inserted].[EmployeeName]
FROM [Inserted]
);

WHILE @Location < LEN(@String)
BEGIN
IF ( ASCII(SUBSTRING(@String, @Location + 1, @Location + 2)) > 124
AND ASCII(SUBSTRING(@String, @Location + 1, @Location + 2)) <= 255
)
BEGIN
RAISERROR(‘There is an invalid character in the string you are entering. Please remove invalid characters.’,10,1);
RETURN;
END;

ELSE
BEGIN

SET @Location = @Location + 1;

END;
END;

INSERT INTO dbo.Employees
( EmployeeName )
VALUES ( @String );

END;

Now when we run the following code we should receive an error stating that there is an invalid character and it needs to be fixed.

INSERT INTO dbo.Employees
( EmployeeName )
VALUES ( CONCAT(‘Green’,CHAR(212),’Lantern’) );

TriggerError

Obviously the invalid character is the CHAR(212) or Ô. Removing this will allow the transaction to complete and add the name to the table.

Before:

TableBefore

Code:

INSERT INTO dbo.Employees
( EmployeeName )
VALUES ( ‘Green Lantern’ );

After:

TableAfter

ASCII() Function

Function to convert text to title case

A cool little function that will convert any text you feed it into a title case (I.E. “this is a title” into “This Is A Title”). Very useful for on the fly formatting and to make your strings look great!

CREATE FUNCTION udf_ConvertStringToTitleCase
(
@InputString VARCHAR(4000)
)
RETURNS VARCHAR(4000)
AS
BEGIN
DECLARE @Index INT;
DECLARE @Char CHAR(1);
DECLARE @OutputString VARCHAR(255);
SET @OutputString = LOWER(@InputString);
SET @Index = 2;
SET @OutputString = STUFF(@OutputString, 1, 1,
UPPER(SUBSTRING(@InputString, 1, 1)));
WHILE @Index <= LEN(@InputString)
BEGIN
SET @Char = SUBSTRING(@InputString, @Index, 1);
IF @Char IN ( ‘ ‘, ‘;’, ‘:’, ‘!’, ‘?’, ‘,’, ‘.’, ‘_’, ‘-‘, ‘/’,
‘&’, ””, ‘(‘ )
IF @Index + 1 <= LEN(@InputString)
BEGIN
IF @Char != ””
OR UPPER(SUBSTRING(@InputString, @Index + 1, 1)) != ‘S’
SET @OutputString = STUFF(@OutputString,
@Index + 1, 1,
UPPER(SUBSTRING(@InputString,
@Index + 1, 1)));
END;
SET @Index = @Index + 1;
END;
RETURN ISNULL(@OutputString,”);
END;
GO

Since this is a function we are able to reference it within a SELECT statement.

SELECT  dbo.udf_ConvertStringToTitleCase(‘this string is all lower case’) AS [ConvertedString];

ConvertedString

Function to convert text to title case

Different Forms of DBMS

The database management system (DBMS) is the heart of a today’s analytical and operational systems. Every good organization will collect their data for some form of use in the future. Data is the life blood of each organization and the DBMS is the central location in which data is stored, managed, secured, and delivered a multitude applications and users. There are many variations of DBMS that exist in today’s world, each containing unique strengths and weaknesses.

Database1

The three major types are relational databases (RDBMS), NoSQL technologies, and in-memory databases. RDBMS types have been the norm in IT for more than 30 years now. This was due to server technology decreasing in price and becoming more affordable for even the smallest of companies. In today’s world the RDBMS will always have its place but the new and upcoming NoSQL technologies (“Big Data”) such as Apache Hadoop or MongoDB will attempt to process unstructured data or data sets that have a highly variable structure. There are also applications that will require considerably high speed transactions to occur with almost instantaneous response rates or applications that constantly perform complex analytics on data in real time or near real time. The speed and functionality for these must be at the absolute peak without creating problems and in-memory databases can handle this feat flawlessly.

It is quite common for an organization to utilize multiple of these database technologies depending on their processing needs. Choosing the proper DBMS technology can make a huge impact the success/failure of an organization’s IT projects and systems. Understanding the different types and categories of DBMS combined with the “when” and “why” to use them is crucial to making your decisions. Let’s take a look at some of these categories and models.


RDBMS

Database2

Until around 2009 then RDBMS was the the only real category of DBMS worth considering. The “Big Data” trend has brought new competition for this category as some of the software released can actually compete well with today’s RDBMS in certain cases.

Regardless of the new competition the RDBMS is, by far, the undisputed leader when it comes to revenue, processing, and reliability. RDBMS are based off of strong, set theory and very sound logic. The relational databases provide data storage, access, and protection with great performance for operational and analytical applications. The three giants of the RDBMS world are Microsoft SQL Server, Oracle, and DB2 (IBM). My personal favorite is Microsoft SQL Server. This concept is extremely adaptable and flexible and is utilized for years in both fortune 500 companies and smaller organizations as well. The only downfall is that RDBMS products are NOT cheap.

To run an RDBMS system you need at least one server to house your data and be the front door to your applications. Since business processes can heavily burden the server you will more likely need to build it based on your business needs and current/future processes. The main bottleneck here is that you’ll hit physical limitations per server (Each server can only hold X amount of RAM, the server build can only handle X number of cores, etc). After that you will need licensing for each product and machines to access the data that is being stored. You will also need to construct your database to fit your business needs so having a database administrator or architect is highly recommended if you don’t fully understand how to construct it yourself. Fully understanding the layout of the relational database and it’s possibly limitations and implications before putting anything in place is important to retaining ACID compliance.

The ACID properties of an RDBMS is an important and compelling feature. RDBMS will ensure transactional Atomicity, Consistency, Isolation, and Durability to your database providing a very high level of data integrity. This means that the RDBMS guarantees that all transactions are completed correctly and if a transaction fails to complete, the database is returned to its previous state with no change from the transaction.

The RDBMS method utilizes real-time processing, more commonly known as Online Analytical Processing (OLAP). Changes to the database are instantaneous and returning data is relatively quick due to database engine execution plans that vary based off of your structured query language (SQL). Many RDBMS platforms offer forms of in-memory processing and caching to help bridge the gap between standard RDBMS and in-memory databases which also increases return and processing speed.


NoSQL DBMS

Database3

If you have a rapidly changing and dynamic data environment where you don’t care as much about quick processing and consistency then NoSQL may be advantageous for you. You must think carefully though as the RDBMS and NoSQL are two completely different beasts. Unlike RDBMS, he NoSQL database permits a flexible schema in which every data element doesn’t necessarily need to exist for every entity. NoSQL can also be a more practical solution for loosely defined data structures that can/will evolve over time.

Another large difference is how data consistency is managed. The RDBMS has always been able to ensure that the data that is being stored is consistent and most NoSQL DBMS products offer more of an eventual approach. The RDBMS can also use certain levels of locking and isolation to implement this but no RDBMS user would really utilize this in a common scenario. NoSQL is attempting to keep up by releasing options to support full ACID rules.

NoSQL also allowed working with large amounts of sparse data easier. Data is usually considered as “sparse” when there is a lot of empty space between actual values and not many elements are populated. Regardless of certain data types benefiting from using NoSQL it still comes with the price of eliminating transactional integrity, ease of querying, and indexing.

NoSQL platforms will utilize at least one of four primary DBMS offerings:

  1. Key-value
  2. Document
  3. Column-Store
  4. Graph

Every one of these categories utilizes a different data model all with unique strengths and weaknesses and should be used in specific cases. Hadoop, for example, utilizes the Key-value method more so than the rest.


In-Memory Databases

The final major category is the in-memory database (IMDBMS). An IMDBMS relies mainly on storing data in memory as opposed to disk. This offers large strength in speed due to greatly reduced I/O latency but, because it’s stored in memory, is more volatile. There is no mechanical disk movement, seek times, etc because the data is immediately accessed from this memory. This will also lower overhead because the algorithms used to retrieve the data are much simpler and don’t require nearly as much CPU power.

As you move forward you will encounter more issues that require consideration. You should always think about whether or not the DBMS you’re thinking of utilizing works on your platform as not all of them are able to universally run on each platform. Vendor support is another one. In the NoSQL world many of the DBMS are completely open source which means that, even though initial costs will be lower, there is always a significant lack of support. Adding in any additional support could be costly in the long run.

The final thing you could potentially do it move all of your data to the cloud, which is an increasingly popular concept. Instead of implementing a DBMS at your organization you can simply pay a cloud-based service provider to handle all of your data for you, like Microsoft Azure.

Database4

Different Forms of DBMS