PostgreSQL SQL Tricks II

Z PostgreSQL
Skočit na navigaci Skočit na vyhledávání

SQL is a language where one task can be solved multiple ways with different efficiency. You can see effective use of PostgreSQL's possibilities on this page.

Google AdSense

Older tricks Newer tricks

subscript shift to zero

Implementation of these features is simple in C and it's little bit slow in PL/pgSQL (based on iterated shift). However, we can use the SQL parser capability to recognize an array range:

CREATE OR REPLACE FUNCTION shift_idx(anyarray) 
RETURNS anyarray AS $$
DECLARE 
  mi int := array_upper($1,1); 
  offset int := array_lower($1,1); 
BEGIN
  RETURN '[0:' || mi - offset || '] = ' || $1::varchar; 
END 
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
CREATE FUNCTION

postgres=# SELECT shift_idx(array[1,2,3]);
   shift_idx   
---------------
 [0:2]={1,2,3}
(1 row)

Author: Pavel Stěhule

Record evolution

When we have functions that return a record but cannot use derived tables, we must use following syntax:

CREATE FUNCTION foo(OUT a int, OUT b int, IN c int)
RETURNS record AS $$
BEGIN
  a := c + 1; b := c + 2;
  RETURN;
END;
$$ LANGUAGE plpgsql;

CREATE TABLE t(c int);
INSERT INTO t VALUES(10),(20);

postgres=# SELECT c, foo(c) FROM t;
 c  |   foo   
----+---------
 10 | (11,12)
 20 | (21,22)
(2 rows)

postgres=# SELECT c, (foo(c)).* FROM t;
 c  | a  | b  
----+----+----
 10 | 11 | 12
 20 | 21 | 22
(2 rows)

Conversion between hex and dec numbers

source: http://www.varlena.com/GeneralBits/104.php.

=# select to_hex(11);
    to_hex
   --------
    b
   (1 row)
postgres=# create or replace function to_dec(text) 
           returns integer as $$
           declare r int; 
           begin 
             execute E'select x\''||$1|| E'\'::integer' into r; 
             return r; 
           end
           $$ language plpgsql;
CREATE FUNCTION
postgres=# select to_dec('ff');
 to_dec 
--------
    255
(1 row)

Faster variant (Michael Glaesemann):

CREATE FUNCTION hex2dec(in_hex TEXT)
RETURNS INT
IMMUTABLE STRICT LANGUAGE sql AS $body$
  SELECT CAST(CAST(('x' || CAST($1 AS text)) AS bit(8)) AS INT);
$body$;

Variant that is able to handle arbitrary length input (output as varchar, adapt if needed):

CREATE OR REPLACE FUNCTION hex2dec (varchar)
RETURNS varchar as 
$$
DECLARE 
	_x numeric;
	_count int;
	_digit int;
BEGIN
	_x := 0;
	for _count in 1..length($1) loop 
		EXECUTE E'SELECT x\''||substring($1 from _count for 1)|| E'\'::integer' INTO _digit;
		_x := _x * 16 + _digit ;
	end loop;
	return _x::varchar;
end
;
$$ language plpgsql immutable
;

Fast interval (of time or ip addresses) searching with spatial indexes

Searching intervals is usually slow, because the optimizer don't use an index. The reason lies in the dependency between the start and end columns. One solution is based on spatial indexes: it allows working with two dependent values as if they were a single value:

postgres=# EXPLAIN ANALYZE SELECT * FROM testip WHERE 19999999 BETWEEN startip AND endip;
                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on testip  (cost=0.00..19902.00 rows=200814 width=12) (actual time=3.457..434.218 rows=1 loops=1)
   Filter: ((19999999 >= startip) AND (19999999 <= endip))
 Total runtime: 434.299 ms
(3 rows)

Time: 435,865 ms

postgres=# CREATE INDEX ggg ON testip USING gist ((box(point(startip,startip),point(endip,endip))) box_ops);
CREATE INDEX
Time: 75530,079 ms
postgres=# EXPLAIN ANALYZE 
              SELECT * 
                 FROM testip 
                WHERE box(point(startip,startip),point(endip,endip)) @> box(point (19999999,19999999), point(19999999,19999999));
                                                                                                QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testip  (cost=60.50..2550.14 rows=1000 width=12) (actual time=0.169..0.172 rows=1 loops=1)
   Recheck Cond: (box(point((startip)::double precision, (startip)::double precision), point((endip)::double precision, (endip)::double precision)) @> '(19999999,19999999),(19999999,19999999)'::box)
   ->  Bitmap Index Scan on ggg  (cost=0.00..60.25 rows=1000 width=0) (actual time=0.152..0.152 rows=1 loops=1)
         Index Cond: (box(point((startip)::double precision, (startip)::double precision), point((endip)::double precision, (endip)::double precision)) @> '(19999999,19999999),(19999999,19999999)'::box)
 Total runtime: 0.285 ms
(5 rows)

Time: 2,805 ms

Short form for insert ROW variables to table

postgres=# CREATE TABLE foo(a integer, b integer);
CREATE TABLE
postgres=# CREATE OR REPLACE FUNCTION fx() 
           RETURNS void as $$
             DECLARE r foo; 
           BEGIN 
             SELECT INTO r * FROM foo; 
             INSERT INTO foo VALUES(r.*); 
             RETURN; 
           END; 
           $$ LANGUAGE plpgsql;
CREATE FUNCTION
postgres=# SELECT fx();
 fx 
----
 
(1 row)

ALTER TABLE ALTER COLUMN USING

I hadn't known about the USING clause in ALTER TABLE ALTER COLUMN. I thought the old type had to be castable to the new type. But it's not true. Czech boolean literals cannot be cast directly to PostgreSQL's bool type, but, with the USING clause, I can do an ALTER COLUMN TYPE from varchar to bool:

postgres=# CREATE TABLE foo(a varchar);
CREATE TABLE
postgres=# INSERT INTO foo VALUES ('ano'); -- czech yes
INSERT 0 1
postgres=# ALTER TABLE foo ALTER COLUMN a TYPE boolean ;
ERROR:  column "a" cannot be cast to type "pg_catalog.bool"
postgres=# ALTER TABLE foo 
              ALTER COLUMN a TYPE boolean 
             USING CASE a 
                       WHEN 'ano' THEN true 
                       ELSE false END;
ALTER TABLE
postgres=# SELECT * FROM foo;
 a 
---
 t
(1 row)

Quote_ident for schema.name

Using single quotes is one protection against SQL injection. The function quote_ident checks its argument and, when the argument contains problematic chars, it returns the argument surrounded by single quotes. It's quite easy and useful. Problem is with schema.name pair, because the dot is a watched char:

postgres=# select quote_ident('public.foo');
 quote_ident  
--------------
 "public.foo"
(1 row)

postgres=# select * from "public.foo";
ERROR:  relation "public.foo" does not exist
postgres=# select * from public."foo";
 a 
---
(0 rows)

postgres=# select * from "public"."foo";
 a 
---
(0 rows)

We can get around this limitation with custom functions:

CREATE OR REPLACE FUNCTION quote_array(text[]) 
RETURNS text AS $$
SELECT array_to_string(array(SELECT quote_ident($1[i]) 
                                FROM generate_series(1, array_upper($1,1)) g(i)),
                       '.') 
$$ LANGUAGE SQL IMMUTABLE;

CREATE OR REPLACE FUNCTION quote_schema_ident(text) 
RETURNS text AS $$
SELECT quote_array(string_to_array($1,'.'))
$$ LANGUAGE SQL IMMUTABLE;

postgres=# select quote_schema_ident('public.foo tab');
 quote_schema_ident 
--------------------
 public."foo tab"
(1 row)

Autor: Pavel Stehule.

Fast searching of longer prefix

Searching longest prefix is common task - mainly in communications. We search for the longest prefix for some number from from set of prefixes. I solved this task some years ago with stored procedure and modified method of halving of interval. Maybe little bit slower but simple and robust method was described by Hubert Lubaciewski (Depesz).

--table contains about 5000 prefixes
postgres=# select * from prefixesx limit 5;
   costcode_name    | costcode 
--------------------+----------
 Afghanistan        | 93
 Afghanistan Mobile | 9370
 Afghanistan Mobile | 9379
 Afghanistan Mobile | 9380
 Alaska (USA)       | 1907
(5 rows)

Simple but slow solution with LIKE:

postgres=# SELECT * 
              FROM prefixesx 
             WHERE '420724181000' LIKE costcode || '%' 
             ORDER BY length(costcode) DESC 
             LIMIT 1;
          costcode_name          | costcode 
---------------------------------+----------
 Czech Republic Mobile - EuroTel | 42072

Fast solution is generating all possible prefixes from number and search this prefixes:

CREATE OR REPLACE FUNCTION prefixes(varchar) 
RETURNS varchar[] AS $$
SELECT ARRAY(SELECT substring($1 FROM 1 FOR i) 
                FROM generate_series(1, length($1)) g(i))::varchar[]; 
$$ LANGUAGE sql IMMUTABLE;

and use query:

postgres=# SELECT * 
              FROM prefixesx 
             WHERE costcode = ANY (prefixes('420724191000')) 
             ORDER BY length(costcode) DESC 
             LIMIT 1;
          costcode_name          | costcode 
---------------------------------+----------
 Czech Republic Mobile - EuroTel | 42072

On 5000 rows is this query 4 times faster than LIKE.

Little bit slower is query with LIKE and LIKE firstchar || %

SELECT * 
   FROM prefixesx
  WHERE '420724181000' LIKE costcode || '%'
    AND costcode LIKE substring('420724191000' FROM 1 FOR 1) || '%'  
  ORDER BY length(costcode) DESC 
  LIMIT 1;

Get count of substrings in string

This function is based on popular trick - using replace and length function without cycle, that is typical for interprets where built-in functions are faster than iteration:

CREATE OR REPLACE FUNCTION CountInString(text,text)
RETURNS integer AS $$
 SELECT(Length($1) - Length(REPLACE($1, $2, ''))) / Length($2) ;
$$ LANGUAGE SQL IMMUTABLE;

Sent by Rodrigo E. De León Plicet. Another my (array based) solution:

CREATE OR REPLACE FUNCTION CountInString(text, text) 
RETURNS integer AS $$
  SELECT Array_upper(String_to_array($1,$2),1) - 1;
$$ LANGUAGE SQL IMMUTABLE;

Timestamp recalculation for different timezone and its indexing

The simplest specification of time at some timezone is using AT TIME ZONE operator. We can't indexing timestamp with timezone, because we can change current timezone, so we can create index only for timestamp at some specified timezone. Source: pgsql-general, Tom Lane

CREATE INDEX created_tz_paris ON data((created_on AT TIME ZONE 'Europe/Paris'));

Using IMMUTABLE functions as hints for the optimizer

For highly specific data distributions, we can have a problem with prediction and generation of suboptimal execution plans. These problems we can solve by breaking a SQL query into more dependent queries and evaluating step by step. The following queries use real results (not estimations) precedent queries, and queries are executed optimally. Tom Lane's proposal is based on using IMMUTABLE functions that are evaluated before generating an execution plan:

CREATE OR REPLACE FUNCTION getTypesLieuFromTheme(codeTheme text) 
RETURNS text[] AS $f$
SELECT ARRAY(SELECT codetylieu::text FROM rubtylieu WHERE codeth = $1);
$f$ LANGUAGE SQL IMMUTABLE;

in query
SELECT ...
  WHERE ... AND gl.codetylieu = ANY(getTypesLieuFromTheme('RES'))

look execution plan doesn't contain function call and query to table rubtylieu too
->  Bitmap Heap Scan on genrelieu gl (cost=2.23..6.22 rows=88 width=4) (actual time=0.022..0.075 rows=88 loops=1)
       Recheck Cond: ((codetylieu)::text = ANY ('{RES}'::text[]))
       ->  Bitmap Index Scan on ind_genrelieu2 (cost=0.00..2.21 rows=88 width=0) (actual time=0.016..0.016 rows=88 loops=1)
                Index Cond: ((codetylieu)::text = ANY ('{RES}'::text[]))

Use this trick as last step. Test, please, by increasing the statistic number beforehand.

Autonomous transaction in PostgreSQL

Autonomous transaction is usable feature mainly for logging. This feature isn't supported in PostgreSQL yet. With untrusted languages we are able to emulate it. You can see solution by Jon Roberts.

CREATE OR REPLACE FUNCTION fn_log_error(p_function varchar, p_location int, p_error varchar) 
RETURNS void AS $$
DECLARE
  v_sql varchar;
  v_return varchar;
  v_error varchar;
BEGIN
  PERFORM dblink_connect('connection_name', 'dbname=...');

  v_sql := 'INSERT INTO error_log (function_name, location, error_message, error_time) '
           || 'VALUES (''' || p_function_name || ''', ' 
           || p_location || ', ''' || p_error || ''', clock_timestamp())';
  SELECT INTO v_return * 
     FROM dblink_exec('connection_name', v_sql, false);

  --get the error message
  SELECT INTO v_error * 
     FROM dblink_error_message('connection_name');

  IF position('ERROR' in v_error) > 0 OR position('WARNING' in v_error) > 0 THEN
    RAISE EXCEPTION '%', v_error;
  END IF;

  PERFORM dblink_disconnect('connection_name');
EXCEPTION
  WHEN others THEN
    PERFORM dblink_disconnect('connection_name');
    RAISE EXCEPTION '(%)', SQLERRM;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

Optimization of repetitive assignment to varchar variable with arrays

plpgsql isn't a good language for intensive non SQL operations. It's best as glue for SQL statements or for prototyping. But sometimes plperl or plpython can be significantly faster. Plpgsql does not like cumulative iterations over varchar or array variables. When we can't use Perl, .. we could use SQL:

CREATE OR REPLACE FUNCTION SlowList(int) -- slow function, usable for N <= 100
RETURNS varchar AS $$
DECLARE s varchar = '';
BEGIN
  FOR i IN 1..$1 LOOP
    s := '<item>' || i || '</item>';  -- slow is s := s || ..
  END LOOP;
  RETURN s;
END; $$ LANGUAGE plpgsql IMMUTABLE;

CREATE OR REPLACE FUNCTON FastList(int) -- fast function
RETURNS varchar AS $$
BEGIN
  RETURN array_to_string(ARRAY(SELECT '<item>' || i || '</item>'
                                  FROM generate_series(1, $1) g(i)),
                         '');
END; $$ LANGUAGE plpgsql IMMUTABLE;

For N < 100 execution time is less then 4ms (so isn't any reason for dirty trick). For bigger N is execution time different (SlowList(10000) - 4000ms, FastList(10000) - 52ms). Author - Pavel Stěhule.

Use IS DISTINCT FROM without COALESCE

Sim Zacks reported slow queries after migration to 8.2.4. This version has problem with prediction of result of expression COALESCE(column, false) = false, and then execution plan is suboptimal. After correction PostgreSQL choose optimal execution plan.

SELECT ...
   FROM some_tab
        LEFT JOIN
        some_tab2
        ON sometab2.col IS DISTINCT FROM true; --> join false or NULL 

So don't use COALESCE in WHERE clause

Next week somebody reported similar problem:

-- execution plan is suboptimal
((COALESCE((at_type)::integer, 1) = 1) AND (COALESCE(at_language, 0::numeric) = 0::numeric))

-- execution plan is optimal
(at_type = 1 or at_type is null) AND (at_language = 0 or at_language is null)

Port to Oracle, problems with to_date function

Peter Eisentraut reported incompatibility to_date function between PostreSQL and Oracle.

SQL> select to_date('31-DEC-200700:00:00', 'dd-mon-yyyy hh24:mi:ss') from
dual;

TO_DATE('
---------
31-DEC-07

On PostgreSQL:

select to_date('31-DEC-200700:00:00', 'dd-mon-yyyy hh24:mi:ss');
  to_date
--------------
 200700-12-31

Oracle ignores spaces in format string, but PostgreSQL requires exact format. This behavior can be changed in custom function that drops spaces (Author: Jon Roberts):

CREATE OR REPLACE FUNCTION fn_to_date(p_date varchar, p_format varchar)
RETURNS timestamp AS $$
 SELECT to_timestamp(replace($1, ' ', ''), replace($2, ' ', ''));
$$ LANGUAGE SQL STRICT IMMUTABLE;

# select fn_to_date('31-DEC-200700:00:00', 'dd-mon-yyyy hh24:mi:ss');

    fn_to_date
---------------------
 2007-12-31 00:00:00
(1 row)

Identification and correction of wrong chars in PostgreSQL dump

8.2 is stricter than 8.1 for correct chars in dump. Reason is in SQL injection security fix (via incorrect UTF8 chars). This change of behavior can cause problems with migration. Fix is simple. We have to find wrong char and fix it.

$ iconv -f utf-8 -t utf-8 dump.sql > /dev/null
iconv: illegal input sequence at position 1000
$ head -c 1010 dump.sql | tail -c 20 | od -c
0000000 h . B u t i s n 222 t i t
0000020 h i s ?
0000024

Source: http://lca2007.linux.org.au/att_data/Miniconfs(2f)PostgreSQL/attachments/professional.pdf

Be carefull when using the NOT IN operator with NULL

Some SQL developer dislike NULL. Other like it. I like it, too. NULL is necessary and saves a lot of work. NULL in a result expression clearly signals some problem in the input data. People who dislike NULL don't understand it and use the following argument, citing an "anomaly":

postgres=# SELECT * 
              FROM (VALUES(10),(20),(30)) v(a) 
             WHERE a IN (10, 20, NULL); --> all is ok
 a  
----
 10
 20
(2 rows)

postgres=# SELECT * 
              FROM (VALUES(10),(20),(30)) v(a) 
             WHERE a NOT IN (10, 20, NULL);
 a 
---
(0 rows) --> anomaly, expects 30

But it isn't an anomaly. The NOT IN operator checks each item of a list, because the left-hand value must be different than all of them. Since we can never have a comparison with NULL that results in anything but NULL, no value can fulfill this predicate.

Fast specification of first rows of some table field

My task is specific. I have address book with 100 000 items and I have to get all first chars of surnames. This task is one from a few tasks, where correlated subquery is faster than everything else. Because seq scan and agg function need scan all 100 000 records, but correlated subquery need only 32 access to functional index:

SELECT * 
   FROM (
         SELECT substring('abcdefghijklmnopqrstuvwxyzěščřžýáíé' 
                          FROM i For 1) 
            FROM generate_series(1,32) g(i)
        ) p(onechar) 
  WHERE EXISTS(
               SELECT 1 
                  FROM address_book 
                 WHERE substring(surname from 1 for 1) = p.onechar  
                   and active
              );

Version for support czech char CH (and some others: úůňľó):

   SELECT * 
   FROM
     ( 
         SELECT substring('abcdefghijklmnopqrstuvwxyzěščřžýáíéúůňľó' FROM i For 1) FROM generate_series(1,40) g(i) UNION SELECT 'ch' 
     ) p(onechar) 
   WHERE
     EXISTS( SELECT 1 FROM a_osoba WHERE lower(substring(prijmeni from 1 for char_length(p.onechar))) = p.onechar );

An employee with highest compensation

It's typical school query: select the highest-paid employees in some group. This query can be solved with derivated table or with correlated subquery. We could use the nonstandard clause DISTINCT ON in PostgreSQL:

postgres=# SELECT * FROM employees ;
   name    | surname   | department | payment  
-----------+-----------+------------+-------
 Pavel     | Stehule   |          1 | 10000
 Zdenek    | Stehule   |          1 |  9000
 Vladimira | Stehulova |          2 |  9000
(3 rows)

postgres=# SELECT DISTINCT ON department * 
              FROM employees 
             ORDER BY department, payment DESC;
   name    | surname   | department | payment  
-----------+-----------+------------+-------
 Pavel     | Stehule   |          1 | 10000
 Vladimira | Stehulova |          2 |  9000
(2 rows)

Attention, DISTINCT ON isn't portable. A big disadvantage of this solution is that it shows only one person from every department, even when more employees has the same, highest, payment, yielding an incomplete result.

Any other session variables

PostgreSQL doesn't support server session variables. This mean, so we have to write stored procedures more often, because there are variables. Some substitution is module variables. These variables has to allowed in configuration. Sent by Andreas Kretschmer:


- define in your postgresql.conf:
 custom_variable_classes = 'myvar'

- use within psql:

test=# set myvar.benutzer = 'foo';
SET
test=*# select * from foo;
 id | name
----+------
 1 | foo
 2 | bar
(2 rows)

test=*# select * from foo where name=current_setting('myvar.benutzer');
 id | name
----+------
 1 | foo
(1 row)

test=*# set myvar.benutzer = 'none';
SET
test=*# select * from foo where name=current_setting('myvar.benutzer');
 id | name
----+------
(0 rows)

xpath function indexing

8.3 has integrated xpath function. There is gap in XML support, because XML type isn't supported with GIST or GIN index. So xpath function returns array of xml values. But we can write custom casting to int array:

CREATE OR REPLACE FUNCTION xml_list_to_int(xml[])
RETURNS int[] AS $$
SELECT ARRAY(SELECT to_number(($1[i])::text,'999999.99')::int
                FROM generate_series(1, array_upper($1,1)) g(i))
$$ LANGUAGE SQL IMMUTABLE;

CREATE CAST (xml[] AS int[]) WITH FUNCTION xml_list_to_int(xml[]);

-- array of integers are supported with GIST
CREATE INDEX fx ON foo USING
GIN((xpath('//id/text()',order_in_xml)::int[]));

psql together with less

psql has paging support, but it isn't comparable with the options offered by less. Here is example on how to setup psql to use less as the pager. Author: Merlin Moncure

#to profile
export PAGER=less
export LESS="-iMSx4 -FX"

#to .psqlrc
\timing
\pset pager always

Intersection of arrays

Very nice trick by David Fetter. It's based well known trick (universal sort of array):

CREATE OR REPLACE FUNCTION array_intersect(ANYARRAY, ANYARRAY)
RETURNS ANYARRAY
LANGUAGE SQL
AS $$
  SELECT ARRAY(
               SELECT $1[i] AS "the_intersection"
                  FROM generate_series(
                                       array_lower($1,1),
                                       array_upper($1,1)
                                      ) AS i
               INTERSECT
               SELECT $2[j] AS "the_intersection"
                  FROM generate_series(
                                       array_lower($2,1),
                                       array_upper($2,1)
                                      ) AS j
              );
$$;

An alternative and possibly faster way:

CREATE OR REPLACE FUNCTION array_intersect(ANYARRAY, ANYARRAY)
RETURNS ANYARRAY
LANGUAGE SQL
AS $$
  SELECT ARRAY(SELECT UNNEST($1) INTERSECT SELECT UNNEST($2))
$$;

VOID function in SQL language

SQL functions can contain only SQL statements. The results of the last statements are results of the functions. The type of a result have to be compatible with the function's type. Because VOID is fictive type, no statement can return this type with one exception. We can cast NULL to VOID.

note: void returning functions are not allowed to be used through the binary protocol

CREATE OR REPLACE FUNCTION none() 
RETURNS VOID AS $$ SELECT NULL::VOID; $$
LANGUAGE SQL;

Fast select of next row

For older application (ISAM appliaction works on series of rows) we have be able to search next row specified with some combination of values (a1, b1, c1). Traditional solution:

SELECT * 
   FROM data
  WHERE (a > a1)
     OR (a = a1 AND b > b1)
     OR (a = a1 AND b = b1 AND c > c1)
  ORDER BY a, b, c
  LIMIT 1;

Merlin Moncure noted simplified form that use row equation. This form is more readable and allows using multicolumn index:

SELECT *
   FROM data
  WHERE (a, b, c) > (a1, b1, c1)
  ORDER BY a, b, c
  LIMIT 1; 

ISAM access downgrade SQL serer performance. For effective SQL we have to operate over sets, but when we cannot rewrite application, we have not choice.

Older tricks Newer tricks


Google AdSense