Skip to main content

Overview

Copyright © 2022 PawSQL

Audit Rules

Avoid using =NULL

=null or case when null cannot judge whether the expression is null. To determine whether the expression is null, use is null. The appearance of =null or case when null in SQL is most likely an error made by developers because they are always judged as false, equivalent to 1=0. Therefore, PawSQL checks and reminds developers of this type of writing, and performs rewriting.

For example, the following SQL,

select case c_phone when null then 0 else 1 end from customer;

will be rewritten by PawSQL as

select case when c_phone is null then 0 else 1 end from customer;

Learn more about Avoid using =NULL.

Avoid Using SELECT *

The disadvantages of using SELECT * in queries are as follows:

  • If SELECT * contains unused large fields, especially TEXT/CLOB type fields, it is easy to cause unnecessary disk IO and network overhead.
  • Using SELECT * increases the effort of maintenance, such as inconsistent with the resultMap configuration when adding or deleting fields, and incorrect field mapping when using insert into select *.
  • The database optimizer cannot take advantage of the coverage index.
  • PawSQL index recommendation cannot recommend coverage index.

Learn more about Avoid Using SELECT *.

Avoid Using Random Functions for Sorting

MySQL's rand function or PostgreSQL's random function returns a random floating-point number between 0 and 1.0. Sometimes we use the following query statement to obtain a random sample of the data set.

select * from orders order by rand() limit 1; 

If the customer table has fewer than 10,000 rows, this method works well. But when you have 1,000,000 rows, the cost of sorting becomes unacceptable. The reason is obvious: we sort all rows, but keep only one of them. There are more efficient ways to achieve the goal. Learn more about Avoid Using Random Functions for Sorting.

Use UNION ALL instead of UNION

When a SQL statement use UNION to combine two result sets, the UNION operator will eliminate duplicates in the combined result set, which is implemented internally in the database through sorting or hashing, both of which require significant computing resources. If it can be logically guaranteed that there are no duplicate data in the two result sets, using UNION ALL instead of UNION can significantly improve performance.

Limit Nesting Level of Subqueries

Nesting subqueries make SQL complex, and too complex SQL can make the database optimizer take a long time to generate execution plans and generate poorly performing execution plans. Therefore, PawSQL detects whether the nesting level of subqueries exceeds a certain threshold and issues a warning. In PawSQL, the default value of this threshold is 2, and users can modify it when creating optimization tasks.

Limit Number of Joins

In query execution planning, the order and method of table joins are two of the main parts for the database optimizer. Increasing the number of table joins will exponentially increase the search space for the database optimizer to find the optimal execution plan, leading to longer time to generate the execution plan, and a higher chance of generating execution plans with poor performance. Therefore, PawSQL checks whether the number of table joins in the query exceeds a certain threshold and reminds the user of possible risks. In PawSQL, the default value of the threshold is 5.

Implicit Type Conversion Disables Index Usage

When the data types in the condition expression are different, some implicit data type conversions will be performed during the query execution process. Sometimes type conversion is applied to constants in the condition, and sometimes it is applied to columns in the condition. When type conversion is applied to columns, indexes won't be used, which may cause serious performance problems. For example, in the following SQL:

select count(*) from ORDERS where O_ORDERDATE = current_date();

If the data type of the O_ORDERDATE column is CHAR(16), the index on O_ORDERRuleDiffDataTypeInPredicateWriteDATE will not be used, resulting in a full table scan. There are usually two solutions: one is to use ALTER TABLE to change the data type of O_ORDERDATE, and the other is to force current_date to be converted to the CHAR type (PawSQL provides this rewrite suggestion).

select count(*) ORDERS where ORDERS.O_ORDERDATE = cast(current_date() as CHAR(16));

Learn more about Implicit Type Conversion Disables Index Usage.

Type Mismatch in Join Conditions

When the data types in the condition expression are different, some implicit data type conversions will be performed during the query execution. When type conversion is applied to columns, indexes cannot be used during query execution, which may cause serious performance problems. PawSQL will check for such syntax and issue a warning.

Avoid FOR UPDATE in SELECT

Adding FOR UPDATE in SELECT statements will lock tables or data rows, affect query concurrency, and cause blocking and performance degradation, so it should be used with caution. PawSQL will check for such syntax and issue a warning.

Avoid using LIMIT in UPDATE

Using LIMIT in UPDATE statements may cause unpredictable updated data and should be used with caution. PawSQL will check for such syntax and issue a warning.

Avoid using LIMIT in SELECT w/o ORDER BY

Using LIMIT in a SELECT statement without an ORDER BY clause can cause inconsistent results upon each execution. PawSQL detects this type of usage and issues a warning.

Avoid using LIMIT in UPDELETE w/o ORDER BY

Using LIMIT in UPDATE or DELETE statements without ORDER BY will cause inconsistent results for different execution. PawSQL will check for such syntax and issue a warning.

PostgreSQL/Opengauss does not support the ORDER BY clause in UPDATE or DELETE statements.

Avoid SELECT w/o Conditions and Grouping

Queries without conditions and grouping syntax will cause full table scans and huge result sets, which will consume large portion of system resources. PawSQL will check for such syntax and issue a warning.

Avoid UPDELETE w/o conditions

UPDATE or DELETE statements without query conditions or with conditions that always evaluate to true can update or delete all data records, which is a very dangerous operation. PawSQL will check for such syntax and issue a warning.

Limit Number of Values in INSERT

Bulk insertion of values can effectively improve the efficiency of data insertion, as shown below:

insert into customer(c_custkey, lastname, firstName)
values(1, 'Dan', 'Mike'),(2, 'Chaw', 'Tomas'),(3, 'Wang', 'Nancy');

However, if the amount of data being inserted is too large and exceeds the database limit (MySQL: max_allowed_packet), the database server will report an error. In PawSQL, this syntax will be checked, and an alert will be issued for SQL that exceeds the threshold (default is 500).

No Column Names in INSERT

INSERT statements should specify column names to reduce the possibility of misalignment between inserted values and the columns of the target table.

insert into customer value(1, 'Dan', 'Mike');

The following syntax can reduce the possibility of misalignment between inserted values and the columns of the target table, and making the code easier to maintain.

insert into customer(c_custkey, lastname, firstName) value(1, 'Dan', 'Mike');

Limit OFFSET values

In SQL queries, the LIMIT clause is used to limit the number of query results, while the OFFSET clause is used to specify which row to start returning data from in the query result set. When the OFFSET value is large, the query engine must scan more and more data to find the data rows after the offset. In the case of a large data set, this can cause the query to become very slow and may consume a lot of system resources.

Avoid LIKE Condition starts with %

In SQL queries, the LIKE operator is used to match strings. If the pattern string starts with % (for example, LIKE '%ABC'), the database optimizer cannot use an index to filter data, which can easily lead to a full table scan. In the absence of other filtering conditions, this may have a significant impact on query performance and efficiency. Therefore, you should try to avoid LIKE conditions that start with %. If you must use it, consider creating a full-text index to improve query performance.

Rewrite Optimization

Join Elimination

Join Elimination is an optimization technique that simplifies SQL and improves query performance by removing joins from the query that do not affect the final result. Typically, this optimization can be applied when a query includes a primary-foreign key join and only references the primary key column of the parent table.

Consider the following example:

select o.* from orders o inner join customer c on c.c_custkey=o.o_custkey

In this query, the orders table is joined with the customer table on the primary-foreign key relationship, and c_custkey is the primary key. Therefore, the customer table can be eliminated from the query, and the rewritten SQL would be:

select * from orders where o_custkey

Learn more about Join Elimination;

Outer2Inner Join Conversion

Outer join optimization refers to the conversion of an outer join that satisfies certain conditions (the outer table has a NULL rejection condition) into an inner join, allowing the database optimizer to choose a more optimal execution plan and improve the performance of SQL queries.

Consider the following example:

select c_custkey from orders left join customer on c_custkey=o_custkey where C_NATIONKEY  < 20

C_NATIONKEY < 20 is a NULL rejection condition on the customer table, so the left outer join can be rewritten as an inner join:

select c_custkey from orders inner join customer on c_custkey=o_custkey where C_NATIONKEY  < 20

Learn more about Outer2Inner Join Conversion.

SAT-TC Rewrite

SAT-TC (SATisfiability-Transitive Closure) rewrite optimization refers to analyzing a set of related query conditions to discover contradictory, to simplify, or to infer more conditions, helping the database optimizer choose a better execution plan to improve SQL performance.

Consider the following example:

select c.c_name FROM customer c where c.c_name = 'John' and c.c_name = 'Jessey'

Because the conditions are contradictory, the rewritten SQL is:

select c.c_name from customer as c where  1 = 0 

Learn more about SAT-TC Rewrite.

Query Folding

Query folding refers to extending views, CTEs, or DT subqueries and merging them with the query statement that references them to reduce the serialization of intermediate result sets or trigger better optimization techniques for table joins.

Consider the following example:

SELECT * FROM (SELECT c_custkey, c_name FROM customer) AS derived_t1;

The rewritten SQL is:

SELECT c_custkey, c_name FROM customer

Learn more about Query Folding.

Projection Pushdown

Projection pushdown removes useless columns (not used in the outer query) from DT subqueries to reduce the cost of I/O and network, while increasing the chances of adopting index-only optimization strategy to improve query performance.

Consider the following example:

SELECT count(1) FROM (SELECT c_custkey, avg(age) FROM customer group by c_custkey) AS derived_t1;

The rewritten SQL is:

SELECT count(1) FROM (SELECT 1 FROM customer group by c_custkey) AS derived_t1;

Learn more about Projection Pushdown.

Nullable IN-Subqueries Rewrite

For the following SQL query which intends to find out customers without orders:

select * from customer where c_custkey not in (select o_custkey from orders)

If the result set of the subquery contains NULL rows, this SQL query will always return an empty result. The correct way is to add a is not null condition to the subquery, like this:

select * from customer where c_custkey not in (select o_custkey from orders where o_custkey is not null)

Learn more about Nullable IN Subqueries Rewrite.

HAVING Condition Pushdown

Logically, conditions on the HAVING clause are executed after GROUP BY, and the conditions on the WHERE clause can be executed during table access (index access) or after table access and before GROUP BY, which are less costly.

Consider the following example:

select c_custkey, count(*) from customer group by c_custkey having c_custkey < 100

The rewritten SQL is:

select c_custkey, count(*) from customer where c_custkey < 100 group by c_custkey

Learn more about HAVING Condition Pushdown.

ALL Qualified Subqueries Rewrite

Assuming the following SQL is used to obtain registered users after the order system is closed:

select * from customer where c_regdate > all(select o_orderdate from orders)

If there is a NULL value in the result of the subquery, this SQL always returns empty result. The correct method is to add a is not null condition in the subquery or rewrite to the max/min subquery.

select * from customer where c_regdate > (select max(o_custkey) from orders)

PawSQL uses the second. Learn more about ALL Qualified Subqueries Optimization.

MAX/MIN Subquery Rewrite

For subqueries that use MAX/MIN, such as

select * from customer where c_custkey = (select max(o_custkey) from orders)

can be rewritten in the following way to avoid aggregation and take advantage of the index's ordering:

select * from customer where c_custkey = (select o_custkey from orders order by o_custkey desc null last limit 1)

Learn more about MAX/MIN Subquery Rewrite.

COUNT SSQ Rewrite

For the following scalar subquery(SSQ),

select * from customer where (select count(*) from orders where c_custkey=o_custkey) > 0

it can be rewritten to following EXISTS query which avoids the aggregation:

select * from customer where exists(select 1 from orders where c_custkey=o_custkey)

Learn more about COUNT SSQ Optimization.

SELECT w/ OR Conditions Rewrite

If a query statement uses OR conditions, the database optimizer may not be able to use an index to complete the query. For example:

select * from lineitem where l_shipdate = date '2010-12-01' or l_partkey<100

If both fields have indexes, you can rewrite the query statement as a UNION query to use the index to improve query performance:

select * from lineitem where l_shipdate = date '2010-12-01' 
union select * from lineitem where l_partkey<100

If the database supports INDEX MERGING (please refer to How to create efficient indexes), you can also adjust the database-related parameters to enable the INDEX MERGING optimization strategy to improve database performance.

Learn more about SELECT with OR Conditions Rewrite .

UPDELETE w/OR Conditions Rewrite

If an UPDATE or DELETE statement uses an OR condition, the database optimizer may not be able to take advantage of indexes.

delete from lineitem where l_shipdate = date '2010-12-01' or l_partkey<100

If there are indexes on both l_shipdate and l_shipdate, about statement can be rewritten into multiple DELETE statements, using indexes to improve query performance.

delete from lineitem where l_shipdate = date '2010-12-01';
delete from lineitem where l_partkey<100;

DELETE w/o Condition

DELETE statements without a query condition or with a condition that is always evaluated to true will delete all data in the table. DELETE statements require writing logs for transaction function and data synchronization. For large tables, this may cause database locking and transaction blocking, as well as consuming a large amount of log space. If the data in the table is confirmed to be no longer needed, you can use TRUNCATE table instead of the DELETE statement. TRUNCATE is faster than DELETE because it does not record each deleted row, but directly clears the table and releases space.

delete from lineitem

to

truncate lineitem

Calculation Disables Index Usage

Calculations on indexed columns will disable index usage, which can lead to full table scans and serious performance problems. Therefore, calculations on indexed columns should be converted to the constant side as much as possible. For the following SQL as an example.

select * from tpch.orders where adddate(o_orderdate,  INTERVAL 31 DAY) =date '2019-10-10'  

The adddate function will disable the usage of the index on o_orderdate. Query can be rewritten to the following equivalent SQL and is able to use the index to improve query efficiency.

select * from tpch.orders where o_orderdate = subdate(date '2019-10-10' , INTERVAL 31 DAY);

PawSQL can help convert a large number of functions and operations related to +, -, *, / operators.

Learn more about Calculation Disables Index Usage.

IN Subquery Optimization

An IN subquery is a type of subquery that takes the following form. An IN subquery can be rewritten as an equivalent correlated EXISTS subquery or inner join, which can create a extra filtering condition. If the filtering condition has an appropriate index or is recommended by the PawSQL index recommendation engine, better performance can be achieved.

(expr1, expr2...) [NOT] IN (SELECT expr3, expr4, ...)
  • IN Subquery to EXISTS conversion

For example, the following IN subquery is used to retrieve user information for users who have placed orders within the past year:

select * from customer where c_custkey in (select o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year)

It can be rewritten as an EXISTS subquery, which creates a extra filtering condition (c_custkey = o_custkey):

select * from customer where exists (select * from orders where c_custkey = o_custkey and O_ORDERDATE>=current_date - interval 1 year)
  • IN Subquery to INNER join conversion

If the query result of a subquery is distinct, then the IN subquery can be rewritten as a join between two tables. This allows the database optimizer to plan a better table join sequence and also enables PawSQL to recommend better optimization methods.

For example, consider the following SQL query where c_custkey is the primary key of the customer table:

SELECT * FROM orders WHERE o_custkey IN (SELECT c_custkey FROM customer);

Here, orders and customer are two related tables, where c_custkey is the primary key of the customer table and o_custkey is the foreign key in the orders table that relates to the customer table.

If the query result of the subquery is distinct, we can rewrite the query as a JOIN query, as shown below:

SELECT * FROM orders,customer where orders.o_custkey = customer.c_custkey;

By rewriting the IN subquery as a JOIN query, the database optimizer can plan a better table join sequence and choose a better execution plan while executing the query. In addition, such a query is also easier to understand and maintain.

Learn more about IN Subquery Optimization

Converting Exists Subqueries to Joins

Exists subqueries return a boolean value to the outer query indicating whether any rows meet the condition. Under certain conditions, EXISTS subqueries can be converted to JOINs, allowing the database optimizer more options for choosing the driving table and generating a better query plan.

For example, for the following query:

select * from lineitem l where exists (select * from part p where p.p_partkey=l.l_partkey and p.p_name = 'a')

If the subquery returns at most one row for each l.l_partkey (i.e. there is a uniqueness constraint on the columns (p_partkey, p_name)), then the subquery can be rewritten as:

select l.* from lineitem as l, part as p where p.p_partkey = l.l_partkey and p.p_name = 'a'

Eliminating DISTINCT in Subqueries

For subqueries that only perform existence checks, DISTINCT can usually be eliminated to avoid an extra deduplication operation.

For example, in an IN subquery:

SELECT * FROM customer WHERE c_custkey IN (SELECT DISTINCT o_custkey FROM orders);

Can simplify to:

SELECT * FROM customer WHERE c_custkey IN (SELECT o_custkey FROM orders); 

Eliminating Order By in Subqueries Without LIMIT

If a subquery does not have a LIMIT clause, then ordering is meaningless and can be removed without changing the final results. Some examples:

  • EXISTS subquery:
select * from lineitem l where exists (select * from part p where p.p_partkey=l.l_partkey and p.p_name = 'a' order by p_name)  

Can be rewritten as:

select * from lineitem l where exists (select * from part p where p.p_partkey=l.l_partkey and p.p_name = 'a')

Rewrite Optimization for Order By/Group By from Different Tables

If the ORDER BY or GROUP BY columns are from different tables, the optimizer cannot utilize index ordering to avoid a sort. If equality conditions exist in the WHERE or HAVING clauses, the ORDER BY or GROUP BY columns can be swapped to be from the same table to leverage indexes to avoid a sort.

For example, the following query:

select o_custkey, c_name, sum(o.O_TOTALPRICE) from customer c, orders o where o_custkey = c_custkey group by o_custkey, c_name

Can be rewritten as:

select c_custkey, c_name, sum(o.O_TOTALPRICE) from customer c, orders o where o_custkey = c_custkey group by c_custkey, c_name

Adjust Ordering of Groupby Columns

Adjust the columns in group by clause to the ordering of order by clause to avoid a extra ordering operation.

For example, the following SQL:

select c_custkey, c_name, sum(o.O_TOTALPRICE) from customer c, orders o where o_custkey = c_custkey  group by c_custkey, c_name order by c_name;

It can be rewritten as:

select c_custkey, c_name, sum(o.O_TOTALPRICE) from customer c, orders o where o_custkey = c_custkey  group by c_name, c_custkey order by c_name;

Optimizing ANY/SOME/ALL Subqueries

Subqueries modified by ANY/SOME/ALL originate from the SQL-92 standard and are typically used to check the relationship between a value and all or any values returned by the subquery. They tend to have poor performance since they require row-by-row comparison against the result set. Query rewrites can improve the efficiency.

For example, the following SQL:

select * from orders where o_orderdate < all (select o_orderdate from orders where o_custkey > 100)

For MySQL, can be rewritten as:

select * from orders where o_orderdate < (select o_orderdate from orders where o_custkey > 100 order by o_orderdate asc limit 1)

For PostgreSQL or Oracle, it can be rewritten as:

select * from orders where o_orderdate < (select o_orderdate from orders where o_custkey > 100 order by o_orderdate asc nulls first limit 1)

About PawSQL

PawSQL is dedicated to automatic and intelligent database performance optimization. The products provided by PawSQL include:

  • PawSQL Cloud, an online automated SQL optimization tool that supports SQL auditing, intelligent query rewriting, cost-based index recommendations, suitable for database administrators and data application developers.
  • PawSQL Advisor, an IntelliJ plugin that is suitable for data application developers and can be installed via the IDEA/DataGrip marketplace by searching for "PawSQL Advisor" by name.
  • PawSQL Engine, which is the backend optimization engine of the PawSQL series of products, can be installed and deployed independently, and provides SQL optimization services through http/json interfaces. PawSQL Engine is provided for deployment and installation as a docker image.

Contact Us

Email: service@pawsql.com

Twitter: https://twitter.com/pawsql