
A few days ago, one of my customers complained that searching for “Muller” does not also return “Mueller” and “Müller”.
This is typically an issue caused by the collation of the SQL Server database, but how can it be resolved?
The collation of the database is Latin1_General_CI_AS, that is, a case-insensitive and accent-sensitive collation.
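To reproduce the tests, a minimal setup could look like the sketch below. Only the table name Person, the column name, the key PK_Person and the collation come from the queries in this post; the id column, the data type and the three sample rows are assumptions for illustration.

```sql
-- Hypothetical test table; id and NVARCHAR(100) are assumptions.
-- The collation is set explicitly on the column so the sketch behaves
-- the same even if the database default differs from Latin1_General_CI_AS.
CREATE TABLE Person (
    id   INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_Person PRIMARY KEY,
    name NVARCHAR(100) COLLATE Latin1_General_CI_AS NOT NULL
);
GO

-- Assumed sample data: the three spellings discussed in this post.
INSERT INTO Person (name)
VALUES (N'Muller'), (N'Müller'), (N'Mueller');
GO
```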
If I run the following query:
Select name from Person where name like 'muller'
I get only “Muller”, which is expected: with an accent-sensitive collation, u is not the same as ü.
Next, I execute the same query with an accent-insensitive collation:
Select name from Person where name like 'muller' collate Latin1_General_CI_AI
This time, both “Muller” and “Müller” are returned, because the collation is now fully insensitive. With a Latin1_General AI (Accent Insensitive) collation, u = ü, o = ö, a = ä, and so on.
But I still do not get “Mueller”, the alternative German spelling of “Müller” that avoids the ü.
So I decided to try a German collation to see whether it could solve my issue by returning all three forms of “Muller”. In the German phonebook collation, ü is sorted like ue, ä like ae, and so on.
Select name from Person where name like 'muller' collate German_PhoneBook_CI_AI
As expected, I received just “Muller”, which is normal because in German “Muller” is not the same as “Müller”.
Let’s try with:
Select name from Person where name like 'müller' collate German_PhoneBook_CI_AI
This result is consistent with German usage, where “Mueller” and “Müller” are equivalent. But I still cannot get all three forms of “Muller”.
Getting the result expected by my customer seems to be impossible just by changing the column collation.
Another possibility is to use the SOUNDEX string function. This function converts an alphanumeric string to a four-character code based on how the string sounds when spoken. So, let’s try these queries:
select * from Person where soundex(Name) like soundex('muller')
select soundex('muller'), soundex('müller'), soundex('mueller')
This string function was able to retrieve all forms of “muller” without any collation change: all versions of “Muller” are converted to the same SOUNDEX code. The only problem is that index usage is not guaranteed, because the function is applied to the column in the WHERE clause.
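One possible way to make the SOUNDEX lookup index-friendly is to store the code in a persisted computed column and index that column. This is only a sketch that I have not measured on real data; the names NameSoundex and IX_Person_NameSoundex are hypothetical.

```sql
-- Hypothetical column name: stores the SOUNDEX code of each name.
ALTER TABLE Person ADD NameSoundex AS SOUNDEX(name) PERSISTED;
GO

-- Hypothetical index name: lets the optimizer seek on the stored code.
CREATE INDEX IX_Person_NameSoundex ON Person (NameSoundex);
GO

-- Compare the stored code with the code of the search term,
-- instead of calling SOUNDEX on the column in the WHERE clause.
SELECT name FROM Person WHERE NameSoundex = SOUNDEX('muller');
```

Whether the optimizer actually uses the index still depends on the data and the execution plan, so this should be checked case by case.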
Finally, I took a look at the full-text catalog feature, which can be made accent insensitive, combined with a full-text index using the German language:
CREATE FULLTEXT CATALOG FTSCatalog WITH ACCENT_SENSITIVITY=OFF AS DEFAULT
CREATE FULLTEXT INDEX ON Person(name LANGUAGE German) KEY INDEX PK_Person
GO
Afterwards, I ran the following queries, based on the CONTAINS predicate and the FORMSOF term with the INFLECTIONAL option, for my different forms of Muller:
Select name from Person where contains(name,'FORMSOF(INFLECTIONAL,"muller")')
Select name from Person where contains(name,'FORMSOF(INFLECTIONAL,"müller")')
Select name from Person where contains(name,'FORMSOF(INFLECTIONAL,"mueller")')
As expected, the result was consistent with the previous ones: searching for “muller” does not return all forms. In contrast, searching for “müller” or “mueller” gives me all the results.
In conclusion, the full-text capabilities of SQL Server are certainly the best solution: they will also be faster with a huge number of rows, and they make it possible to leave the collation unchanged, which can sometimes be a real nightmare. However, we have to search for “Müller” instead of “muller” to retrieve all the expected results.