Lecture
The considered query transformations were based on the semantics of the query language, but they did not use the semantics of the database to which the query is addressed. Any conversion can be performed no matter what specific database is meant. In fact, with each true relational database, some semantic information is stored, which determines, for example, the integrity of the database.
If we talk about databases supported by System R, then this information is stored in the system database directories in the form of specified integrity constraints. Since the DBMS guarantees the integrity of the database, integrity constraints can be viewed as axioms, in the environment of which database queries are formulated.
As initial examples of query transformations based on semantic information, let us consider the approaches of the well-known System R DBMS and INGRES to ensure work with databases through views. These transformations were not focused on query optimization, but are directly related to it.
The database view (view) in terms of the SQL and QUEL languages is a named cataloged query, which is, from the users' point of view, the same database object as the relation. Views can be used in queries in the same way as stored relationships (with restrictions on their use in database modification operators).
The semantics of representations require that they be accurate: at the time when the query is executed on the representation, the information obtained must exactly correspond to the current state of the stored part of the database. Execution of a query on a view requires its materialization, i.e. calculating the relationship defined by the query that is associated with the view.
The System R and INGRES approach is based on the fact that representations are stored in database directories in an internal form, obtained after performing a grammatical analysis (and also, possibly, logical optimization) of the corresponding query. When processing a request on a view, before completing the logical optimization phase, the internal forms of the request and the presentation are merged. Some new transformed internal form is formed, and all subsequent processing of the request is performed on it, including logical optimization and selection of the optimal execution plan for the request. At the same time, the optimizer gets more information about this query and can make better decisions. Let's give a simple example.
Let the presentation be defined as
DEFINE VIEW V (C2) AS SELECT C1 FROM R WHERE C1> 6.
The request is formulated as follows:
SELECT C2 FROM V WHERE C2 <6.
If the query were compiled to explicitly materialize the view, a method would be chosen based on sequential scanning of V with a choice of tuples that satisfy the condition. The query would be executed in order to eventually produce an empty result. If we merge the internal forms of the request and the presentation, we would receive an internal form equivalent to the request
SELECT C1 FROM R WHERE C1 <6 AND C1> 6.
Already in the logical optimization phase, one could establish that it is equivalent to the query
SELECT C1 FROM R WHERE FALSE,
which means that the result of the query is empty.
Obviously, in some cases, this method of processing queries over database views makes it possible to achieve more efficient query execution by providing the optimizer with more information. The same idea lies in the basis of semantic query optimization: the use of semantic information stored in a database, regardless of a given query, when optimizing a query.
Another example of query transforms that are directly relevant to optimization is that for database modification queries to satisfy existing integrity constraints in a database. This approach was first used in the INGRES DBMS, but is also used in other systems, for example, in System R.
Integrity constraint is a logical expression stored in database directories composed of predicates over database objects. The database is in a consistent state if all specified integrity constraints are satisfied.
If a query is specified to modify the database, including integrity constraints, the satisfaction of which is required when performing any modification, then you can add predicate constraints to the logical condition of the query.
Let the database characterizing the structure of the organization consist of the EMP and DEPT relationship. For EMP, employees of the organization are registered. Diagram of this EMP relationship (EMP #, EMPNAME, DEPT #); The EMP # field contains a unique employee number, the EMPNANE field is the name of the employee, and DEPT # is the department number of the organization in which the employee works. Relationship DEPT stores information about departments and has a DEPT scheme (DEPT #, EMPCOUNT), where the total number of employees of this department is stored in the EMPCOUNT field. Let integrity constraint be set
ASSERT B IMMEDIATE ON EMP: EMP.DEPT # = 6.
This restriction means the prohibition of adding, deleting, and modifying tuples with respect to EMP with a DEPT # field value other than 6. Let the request for deleting a tuple be processed
DELETE EMP WHERE EMPNAME = 'Brown'.
If during the compilation of the request the integrity constraint is not taken into account, then the correct way to execute such a request is the following: sequentially select tuples for which the EMPNAME field value is 'Brown'; check the satisfaction of the next tuple with integrity constraints, and if the test is satisfactory, delete the tuple.
If integrity constraints are taken into account during compilation, then you can merge the internal forms of the query and the corresponding integrity constraints, and then perform a joint optimization. In our case, after the merge, an internal form is formed, equivalent to the request
DELETE EMP WHERE EMPNAME = 'Brown' AND DEPT # = 6.
When executing such a request, no additional calls to the checks of integrity constraints of the second type will be needed, since they are all already included in the logical condition for the removal operation. In addition, the optimizer gets more freedom in choosing how to perform the query. For example, if there are few departments, and for the EMP relationship, the index is maintained on the DEPT # field, then perhaps the most optimal way to perform the query would be to scan the relationship through the DEPT # index with the DEPT # = 6 condition and remove any tuple thus selected with the field value EMPNAME equal to 'Brown'. Thus, actions that transform a query and are not aimed specifically at optimization can contribute to a more efficient execution of the query. The efficiency of query execution can be improved by using knowledge that is independently stored in the database.
The considered examples show that the idea of semantic optimization is in principle not new. There is also a fundamental difference between the query transformations discussed above and those that are produced with semantic optimization. The main difference is that when you merge the internal request form with the internal presentation form or internal integrity constraints, we must fully and uniquely use external information, regardless of whether this will help optimize the request: the view selection condition must be completely added through AND to the query query condition; the same applies to the set of logical conditions for integrity constraints. Otherwise, the semantics of the query on the view or database modification operator will be broken.
Semantic optimization is based on the presence in the database of semantic information, which is not necessary to use when processing a query, but the use of which can lead to its more optimal execution. If semantic optimization deals only with knowledge presented in the form of database integrity constraints, then semantic optimization produces many transformed internal representations of the query; each conversion uses a certain subset of integrity constraints. If, for example, in the database two integrity constraints A and B are defined with logical conditions F1 and F2 and a query with sample condition F is processed, then during semantic optimization internal representations will be obtained that are equivalent to queries with conditions F, F AND F1, F AND F2 and F AND F1 AND F2.
Each of the received internal representations is subjected to complete further processing, including logical optimization and selection of the optimal execution plan; from the received plans (all of them are semantically equivalent, that is, they produce the same result), the cheapest one is chosen, which becomes the real plan for the execution of the original query.
For a sensible use of semantic optimization, it is convenient to present database integrity constraints in an implicative form. Then you can achieve a more meaningful order of transformation. For example, suppose that in our database on the structure of an organization, the EPM relationship is expanded by the STATUS and SALARY fields. The values of the STATUS field characterize the position of the employee, and the SALARY field characterizes his salary. Integrity restrictions may be imposed, resulting in the fact that a salary exceeding 40.000 can be assigned only to heads of departments:
ASSERT A ON EMP: IF SALARY> 40.000 THEN STATUS = 'Manager'.
Suppose a request is being processed
SELECT EMPNAME, STATUS FROM EMP WHERE SALARY = 20.000.
If we do not take into account the implicative nature of integrity constraints, then according to the general rules, the internal representations of the request will be merged and the integrity constraints, which obviously will do nothing. If we consider integrity constraint as a conversion rule, and the left part of the implication as a condition for the rule to be applied, then there will be no merging, which generally saves the time of processing the request. But for the request
SELECT EMPNAME, STATUS FROM EMP WHERE SALARY> 40.000
conversion rule is applicable and results in a semantically equivalent query
SELECT EMPNAME, STATUS FROM EMP WHERE STATUS = 'Manager',
the implementation of which may be more efficient.
After the query is converted according to some rule, another rule may be applicable to the received representation, etc. Perhaps the appearance of cycles of transformations. The problem of building a complete set of semantically equivalent representations of a query based on a given set of rules is in general very complex. An exact solution to this problem may require costs commensurate with the cost of fulfilling a request for the most optimal plan. Therefore, generally speaking, it is necessary to use heuristic algorithms that reduce the search space, i.e. specifying the condition for stopping the generation of new views. Heuristics are based on minimizing the weighted sum of the values of semantic optimization and query execution. An example of coarse heuristics can be the following: optimization is performed until its costs exceed the estimated cost of the most efficient of all the found plans for the query.
Comments
To leave a comment
Databases IBM System R - relational DBMS
Terms: Databases IBM System R - relational DBMS