When you work with databases sooner or later you will need to get in touch witch collations. It might be that you never needed to think about it as initdb picked a value that just worked for you and nobody in your database ever tried to use a collation that is not available in your installation. Once someone is trying that or there is the requirement to use a specific collation from the beginning you need to know at least the basics and this is the topic of this post. In contrast to Oracle, PostgreSQL relies on the operating system for collation support. As always, a demo says more than thousands words, so lets go.
I am running CentOS 8 (8.1.1911 (Core)), minimal installation plus the packages required for installing PostgreSQL from source code. To get a default environment lets create brand new user:
postgres@centos8pg:/home/postgres/ [pgdev] sudo useradd pg postgres@centos8pg:/home/postgres/ [pgdev] sudo su - pg
Once we switch to that user and check the LANG and LC_* variable this is what we get:
[pg@centos8pg ~]$ env | grep LC [pg@centos8pg ~]$ env | grep LANG LANG=en_US.UTF-8
When we run initdb to create a new cluster this is what gets picked up and will become the default for the databases:
[pg@centos8pg ~]$ /u01/app/postgres/product/12/db_0/bin/initdb -D /var/tmp/pg [pg@centos8pg ~]$ export PGPORT=9999 [pg@centos8pg ~]$ /u01/app/postgres/product/12/db_0/bin/pg_ctl -D /var/tmp/pg start [pg@centos8pg ~]$ /u01/app/postgres/product/12/db_0/bin/psql -p 9999 postgres psql (12.0) Type "help" for help. postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges -----------+-------+----------+-------------+-------------+------------------- postgres | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | template0 | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/pg + | | | | | pg=CTc/pg template1 | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/pg + | | | | | pg=CTc/pg (3 rows)
With PostgreSQL you can define the default collation for a database at the time you create a database so creating a new database with the same collation as the ones already existing is not an issue at all:
[pg@centos8pg ~]$ /u01/app/postgres/product/12/db_0/bin/psql -p 9999 postgres postgres=# create database tmp with LC_COLLATE = "en_US.UTF-8"; CREATE DATABASE
The trouble starts when you want to use a collation that is not currently available from the operating system:
postgres=# create database tmp2 with LC_COLLATE = "de_DE.UTF-8"; ERROR: invalid locale name: "de_DE.UTF-8"
At this point PostgreSQL can not create the database for you as it does not know anything about the German locale specified in the create database command. Apparently my CentOS installation does not have support for this collation. How then can you check what the operating system currently provides? The command to use on Linux flavors that use systemd is localectl:
[pg@centos8pg ~]$ localectl list-locales C.utf8 en_AG en_AU en_AU.utf8 en_BW en_BW.utf8 en_CA en_CA.utf8 en_DK en_DK.utf8 en_GB en_GB.iso885915 en_GB.utf8 en_HK en_HK.utf8 en_IE en_IE.utf8 en_IE@euro en_IL en_IN en_NG en_NZ en_NZ.utf8 en_PH en_PH.utf8 en_SC.utf8 en_SG en_SG.utf8 en_US en_US.iso885915 en_US.utf8 en_ZA en_ZA.utf8 en_ZM en_ZW en_ZW.utf8
Given that output I can only use the various “en_*” and the “C” collations but nothing else. Let’s try to use a collation from the list above:
postgres=# create database tmpX with lc_collate='en_NZ.utf8'; ERROR: new collation (en_NZ.utf8) is incompatible with the collation of the template database (en_US.UTF-8) HINT: Use the same collation as in the template database, or use template0 as template.
That fails because there already could be some data in template1 which is not compatible with the locale specific in the create database command. As template0 is read only (more on the template databases here) this one should be used when you want to create database using a locale which is different from template1:
postgres=# create database tmpX with lc_collate='en_NZ.utf8' template=template0; CREATE DATABASE postgres=# create database tmpXX with lc_collate='C' template=template0; CREATE DATABASE postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges -----------+-------+----------+-------------+-------------+------------------- postgres | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | template0 | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/pg + | | | | | pg=CTc/pg template1 | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/pg + | | | | | pg=CTc/pg tmp | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | tmpx | pg | UTF8 | en_NZ.utf8 | en_US.UTF-8 | tmpxx | pg | UTF8 | C | en_US.UTF-8 | (7 rows)
Basically we can use every locale that the operating provides but as there is nothing for German we can not use it. The question now is how we can install support for other locales on the operating system. For Red Hat 8 or CentOS 8 based systems this is quite easy as dnf can be used to search for, and then install the language packs you need:
[pg@centos8pg ~]$ dnf search locale | grep de Last metadata expiration check: 1:23:33 ago on Wed 04 Mar 2020 01:53:43 AM CET. perl-Encode-Locale.noarch : Determine the locale encoding perl-Locale-Codes.noarch : Distribution of modules to handle locale codes perl-Locale-US.noarch : Two letter codes for state identification in the United States and vice versa rubygem-locale.noarch : Pure ruby library which provides basic APIs for localization glibc-langpack-de.x86_64 : Locale data for de glibc-langpack-de.i686 : Locale data for de langtable.noarch : Guessing reasonable defaults for locale, keyboard layout, territory, and language.
Once you’ve identified the correct package just install it:
[pg@centos8pg ~]$ sudo dnf install -y glibc-langpack-de
If we ask the operating system supported for supported locales once more, the German ones now show up:
[pg@centos8pg ~]$ localectl list-locales | grep de de_AT de_AT.utf8 de_AT@euro de_BE de_BE.utf8 de_BE@euro de_CH de_CH.utf8 de_DE de_DE.utf8 de_DE@euro de_IT de_IT.utf8 de_LI.utf8 de_LU de_LU.utf8 de_LU@euro
This should give PostgreSQL all it needs to create the database with the German locale. Before we proceed with creating that database have a look at the pg_collation catalog table:
postgres=# select * from pg_collation ; oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion -------+-----------------+---------------+-----------+--------------+---------------------+--------------+-----------------+-----------------+------------- 100 | default | 11 | 10 | d | t | -1 | | | 950 | C | 11 | 10 | c | t | -1 | C | C | 951 | POSIX | 11 | 10 | c | t | -1 | POSIX | POSIX | 12326 | ucs_basic | 11 | 10 | c | t | 6 | C | C | 12327 | C.utf8 | 11 | 10 | c | t | 6 | C.utf8 | C.utf8 | 12328 | en_AG | 11 | 10 | c | t | 6 | en_AG | en_AG | 12329 | en_AU | 11 | 10 | c | t | 8 | en_AU | en_AU | 12330 | en_AU.utf8 | 11 | 10 | c | t | 6 | en_AU.utf8 | en_AU.utf8 | 12331 | en_BW | 11 | 10 | c | t | 8 | en_BW | en_BW | 12332 | en_BW.utf8 | 11 | 10 | c | t | 6 | en_BW.utf8 | en_BW.utf8 | 12333 | en_CA | 11 | 10 | c | t | 8 | en_CA | en_CA | 12334 | en_CA.utf8 | 11 | 10 | c | t | 6 | en_CA.utf8 | en_CA.utf8 | 12335 | en_DK | 11 | 10 | c | t | 8 | en_DK | en_DK | 12336 | en_DK.utf8 | 11 | 10 | c | t | 6 | en_DK.utf8 | en_DK.utf8 | 12337 | en_GB | 11 | 10 | c | t | 8 | en_GB | en_GB | 12338 | en_GB.iso885915 | 11 | 10 | c | t | 16 | en_GB.iso885915 | en_GB.iso885915 | 12339 | en_GB.utf8 | 11 | 10 | c | t | 6 | en_GB.utf8 | en_GB.utf8 | 12340 | en_HK | 11 | 10 | c | t | 8 | en_HK | en_HK | 12341 | en_HK.utf8 | 11 | 10 | c | t | 6 | en_HK.utf8 | en_HK.utf8 | 12342 | en_IE | 11 | 10 | c | t | 8 | en_IE | en_IE | 12343 | en_IE@euro | 11 | 10 | c | t | 16 | en_IE@euro | en_IE@euro | 12344 | en_IE.utf8 | 11 | 10 | c | t | 6 | en_IE.utf8 | en_IE.utf8 | 12345 | en_IL | 11 | 10 | c | t | 6 | en_IL | en_IL | 12346 | en_IN | 11 | 10 | c | t | 6 | en_IN | en_IN | 12347 | en_NG | 11 | 10 | c | t | 6 | en_NG | en_NG | 12348 | en_NZ | 11 | 10 | c | t | 8 | en_NZ | en_NZ | 12349 | en_NZ.utf8 | 11 | 10 | c | t | 6 | en_NZ.utf8 | en_NZ.utf8 | 12350 | en_PH | 11 | 10 | c | t | 8 | en_PH | en_PH | 12351 | en_PH.utf8 | 11 | 10 | c | t | 6 | en_PH.utf8 | en_PH.utf8 | 12352 | en_SC.utf8 | 11 | 10 | c | t | 6 | en_SC.utf8 | en_SC.utf8 | 12353 | en_SG | 11 | 10 | c | t | 8 | en_SG | en_SG | 12354 | en_SG.utf8 | 11 | 10 | c | t | 6 | en_SG.utf8 | en_SG.utf8 | 12355 | en_US | 11 | 10 | c | t | 8 | en_US | en_US | 12356 | en_US.iso885915 | 11 | 10 | c | t | 16 | en_US.iso885915 | en_US.iso885915 | 12357 | en_US.utf8 | 11 | 10 | c | t | 6 | en_US.utf8 | en_US.utf8 | 12358 | en_ZA | 11 | 10 | c | t | 8 | en_ZA | en_ZA | 12359 | en_ZA.utf8 | 11 | 10 | c | t | 6 | en_ZA.utf8 | en_ZA.utf8 | 12360 | en_ZM | 11 | 10 | c | t | 6 | en_ZM | en_ZM | 12361 | en_ZW | 11 | 10 | c | t | 8 | en_ZW | en_ZW | 12362 | en_ZW.utf8 | 11 | 10 | c | t | 6 | en_ZW.utf8 | en_ZW.utf8 | 12363 | en_AU | 11 | 10 | c | t | 6 | en_AU.utf8 | en_AU.utf8 | 12364 | en_BW | 11 | 10 | c | t | 6 | en_BW.utf8 | en_BW.utf8 | 12365 | en_CA | 11 | 10 | c | t | 6 | en_CA.utf8 | en_CA.utf8 | 12366 | en_DK | 11 | 10 | c | t | 6 | en_DK.utf8 | en_DK.utf8 | 12367 | en_GB | 11 | 10 | c | t | 16 | en_GB.iso885915 | en_GB.iso885915 | 12368 | en_GB | 11 | 10 | c | t | 6 | en_GB.utf8 | en_GB.utf8 | 12369 | en_HK | 11 | 10 | c | t | 6 | en_HK.utf8 | en_HK.utf8 | 12370 | en_IE | 11 | 10 | c | t | 6 | en_IE.utf8 | en_IE.utf8 | 12371 | en_NZ | 11 | 10 | c | t | 6 | en_NZ.utf8 | en_NZ.utf8 | 12372 | en_PH | 11 | 10 | c | t | 6 | en_PH.utf8 | en_PH.utf8 | 12373 | en_SC | 11 | 10 | c | t | 6 | en_SC.utf8 | en_SC.utf8 | 12374 | en_SG | 11 | 10 | c | t | 6 | en_SG.utf8 | en_SG.utf8 | 12375 | en_US | 11 | 10 | c | t | 16 | en_US.iso885915 | en_US.iso885915 | 12376 | en_US | 11 | 10 | c | t | 6 | en_US.utf8 | en_US.utf8 | 12377 | en_ZA | 11 | 10 | c | t | 6 | en_ZA.utf8 | en_ZA.utf8 | 12378 | en_ZW | 11 | 10 | c | t | 6 | en_ZW.utf8 | en_ZW.utf8 | (56 rows)
You will not see any of the collations you installed after the cluster was created. To actually make them visible in the catalog you need to manually add the collations from operating system:
postgres=# select pg_import_system_collations('pg_catalog'); pg_import_system_collations ----------------------------- 24 (1 row)
You actually can create databases with new collations without importing them into the catalog table but we recommend to keep the catalog and the collations provided by the operating system in sync. Asking the catalog table once more the new “de_* collations will show up:
postgres=# select * from pg_collation where collname ~ '^de_'; oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion -------+------------+---------------+-----------+--------------+---------------------+--------------+-------------+------------+------------- 16386 | de_AT | 11 | 10 | c | t | 8 | de_AT | de_AT | 16387 | de_AT@euro | 11 | 10 | c | t | 16 | de_AT@euro | de_AT@euro | 16388 | de_AT.utf8 | 11 | 10 | c | t | 6 | de_AT.utf8 | de_AT.utf8 | 16389 | de_BE | 11 | 10 | c | t | 8 | de_BE | de_BE | 16390 | de_BE@euro | 11 | 10 | c | t | 16 | de_BE@euro | de_BE@euro | 16391 | de_BE.utf8 | 11 | 10 | c | t | 6 | de_BE.utf8 | de_BE.utf8 | 16392 | de_CH | 11 | 10 | c | t | 8 | de_CH | de_CH | 16393 | de_CH.utf8 | 11 | 10 | c | t | 6 | de_CH.utf8 | de_CH.utf8 | 16394 | de_DE | 11 | 10 | c | t | 8 | de_DE | de_DE | 16395 | de_DE@euro | 11 | 10 | c | t | 16 | de_DE@euro | de_DE@euro | 16396 | de_DE.utf8 | 11 | 10 | c | t | 6 | de_DE.utf8 | de_DE.utf8 | 16397 | de_IT | 11 | 10 | c | t | 8 | de_IT | de_IT | 16398 | de_IT.utf8 | 11 | 10 | c | t | 6 | de_IT.utf8 | de_IT.utf8 | 16399 | de_LI.utf8 | 11 | 10 | c | t | 6 | de_LI.utf8 | de_LI.utf8 | 16400 | de_LU | 11 | 10 | c | t | 8 | de_LU | de_LU | 16401 | de_LU@euro | 11 | 10 | c | t | 16 | de_LU@euro | de_LU@euro | 16402 | de_LU.utf8 | 11 | 10 | c | t | 6 | de_LU.utf8 | de_LU.utf8 | 16403 | de_AT | 11 | 10 | c | t | 6 | de_AT.utf8 | de_AT.utf8 | 16404 | de_BE | 11 | 10 | c | t | 6 | de_BE.utf8 | de_BE.utf8 | 16405 | de_CH | 11 | 10 | c | t | 6 | de_CH.utf8 | de_CH.utf8 | 16406 | de_DE | 11 | 10 | c | t | 6 | de_DE.utf8 | de_DE.utf8 | 16407 | de_IT | 11 | 10 | c | t | 6 | de_IT.utf8 | de_IT.utf8 | 16408 | de_LI | 11 | 10 | c | t | 6 | de_LI.utf8 | de_LI.utf8 | 16409 | de_LU | 11 | 10 | c | t | 6 | de_LU.utf8 | de_LU.utf8 | (24 rows)
Creating the new database using one of “de_*” collations does work now:
postgres=# create database tmp2 with LC_COLLATE = "de_DE.UTF-8" template=template0; CREATE DATABASE postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges -----------+-------+----------+-------------+-------------+------------------- postgres | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | template0 | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/pg + | | | | | pg=CTc/pg template1 | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/pg + | | | | | pg=CTc/pg tmp | pg | UTF8 | en_US.UTF-8 | en_US.UTF-8 | tmp2 | pg | UTF8 | de_DE.UTF-8 | en_US.UTF-8 | tmpx | pg | UTF8 | en_NZ.utf8 | en_US.UTF-8 | tmpxx | pg | UTF8 | C | en_US.UTF-8 |
With PostgreSQL you can also specify the collation on a column level, so by now we can have a column using an “en” location and another one using a “de” collation:
postgres=# create table t1 ( a text COLLATE "de_DE.utf8" postgres(# , b text COLLATE "en_US.utf8" postgres(# ); CREATE TABLE postgres=# \d t1 Table "public.t1" Column | Type | Collation | Nullable | Default --------+------+------------+----------+--------- a | text | de_DE.utf8 | | b | text | en_US.utf8 | |
If you want to use another collation which is currently not provided by the operating system the same rules apply as above:
postgres=# create table t1 ( a text COLLATE "de_DE.utf8" postgres(# , b text COLLATE "en_US.utf8" postgres(# , c text COLLATE "fr_FR.utf8" postgres(# ); ERROR: collation "fr_FR.utf8" for encoding "UTF8" does not exist LINE 3: , c text COLLATE "fr_FR.utf8"
You need to install support for that before you can use it. The same applies for queries, of course:
postgres=# SELECT a < b COLLATE "de_DE" FROM t1; ?column? ---------- (0 rows) postgres=# SELECT a < b COLLATE "fr_FR" FROM t1; ERROR: collation "fr_FR" for encoding "UTF8" does not exist LINE 1: SELECT a < b COLLATE "fr_FR" FROM t1;
That’s it for the basics. More to come in another post.
Cet article Collations in PostgreSQL – basics est apparu en premier sur Blog dbi services.