You get a bonus - 1 coin for daily activity. Now you have 1 coin

6.1. Designing relational databases using normalization

Lecture



First, a classical approach will be considered, in which the entire design process is carried out in terms of the relational data model using the method of successive approximations to a satisfactory set of relationship schemes. The starting point is the representation of the domain in the form of one or more relationships, and at each design step a certain set of relationships schemes are produced with the best properties. The design process is a process of normalization of relationship schemes, each subsequent normal form having better properties than the previous one.

Each normal form corresponds to a certain defined set of constraints, and the relation is in some normal form if it satisfies its own set of constraints. An example of a set of constraints is the constraint of the first normal form — the values ​​of all attribute relations are atomic. Since the requirement of the first normal form is the basic requirement of the classical relational data model, we will assume that the original set of relations already meets this requirement.

In the theory of relational databases, the following sequence of normal forms is usually distinguished:

  • first normal form (1NF);
  • the second normal form (2NF);
  • the third normal form (3NF);
  • normal Boys-Codd form (BCNF);
  • fourth normal form (4NF);
  • the fifth normal form, or the normal form of the projection-compound (5NF or PJ / NF).

The basic properties of normal forms:

  • each following normal form is better in some sense than the previous one;
  • when passing to the next normal form, the properties of the previous normal properties are preserved.

The design process is based on the method of normalization, decomposition of a relation that is in a previous normal form, into two or more relationships that satisfy the requirements of the next normal form.

The most important in practice, the normal forms of relationships based on the fundamental in the theory of relational databases, the concept of functional dependence . For further discussion, we will need several definitions.

Definition 1. Functional Dependence

In relation to R, the attribute Y is functionally dependent on the attribute X (X and Y can be composite) if and only if exactly one value of Y corresponds to each value of X: RX (r) RY

Definition 2. Complete functional dependence

The functional dependence RX (r) RY is called complete if the attribute Y does not depend functionally on any exact subset of X.

Definition 3. Transitive functional dependency.

The functional dependency RX -> RY is called transitive if there is such an attribute Z that there are functional dependencies RX -> RZ and RZ -> RY and there is no functional dependency RZ -> RX (In the absence of the last requirement, we would have uninteresting transitive dependencies in any respect with multiple keys.)

Definition 4. Non-key attribute

A non-key attribute is any relationship attribute that is not part of the primary key (in particular, the primary key).

Definition 5. Mutually Independent Attributes

Two or more attributes are mutually independent if none of these attributes are functionally dependent on the others.

6.1.1. Second normal form

Consider the following example of a relationship diagram:

EMPLOYEES-DEPARTMENTS-PROJECTS

(SOTR_NOMER, SOTR_ZARP, OTD_NOMER, PRO_NOMER, SOTR_ZADAN)

Primary key:

SOTR_NOMER, PRO_NOMER

Functional dependencies:

SOTR_NOMER -> SOTR_ZARP

SOTR_NOMER -> SEND_NUMER

OTD_NOMER -> SOTR_ZARP

SOTR_NOMER, PRO_NOMER -> SOTR_ZADAN

As you can see, although the primary key is the compound attribute SOTR_NOMER, PRO_NOMER, the attributes SOTP_ZARP and OTD_NOMER functionally depend on the part of the primary key, the attribute SOTR_NOM. As a result, we will not be able to insert into the relation EMPLOYEES-DEPARTMENTS-PROJECTS a tuple describing an employee who has not yet completed any project (the primary key cannot contain an undefined value). When you delete a tuple, we not only destroy the connection of this employee with this project, but lose information that he works in a certain department. When transferring an employee to another department, we will be forced to modify all the tuples describing this employee, or we will get an uncoordinated result. Such unpleasant phenomena are called relationship circuit anomalies. They are eliminated by normalization.

Definition 6. The second normal form (in this definition it is assumed that the only key of the relationship is the primary key)

The relation R is in the second normal form (2NF) if and only if it is in 1NF, and each non-key attribute completely depends on the primary key.

It is possible to make the following decomposition of a relationship EMPLOYEES-DEPARTMENTS-PROJECTS into two relationships EMPLOYEES-DEPARTMENTS and EMPLOYEES-PROJECTS:

EMPLOYEES-DEPARTMENTS (SOTR_NOMER, SOTR_ZARP, OTD_NOMER)

Primary key:

SOTR_NOMER

Functional dependencies:

SOTR_NOMER -> SOTR_ZARP

SOTR_NOMER -> SEND_NUMER

OTD_NOMER -> SOTR_ZARP

EMPLOYEES-PROJECTS (SOTR_NOMER, PRO_NOMER, SOTR_ZADAN)

Primary key:

SOTR_NOMER, PRO_NOMER

Functional dependencies:

SOTR_NOMER, PRO_NOMER -> COTR_ZADAN

Each of these two relations is in 2NF, and the anomalies noted above have been eliminated (it is easy to check that all the above operations are performed without problems).

Assuming the presence of several keys, the definition of 6 will take the following form:

Definition 6 ~

The relation R is in the second normal form (2NF) if and only if it is in 1NF, and each non-key attribute completely depends on each key R.

Hereinafter we will not give examples for relations with several keys. They are too cumbersome and relate to situations that are rarely found in practice.

6.1.2. Third normal form

Consider once again the relationship EMPLOYEES-DEPARTMENTS located in 2NF. Note that the COTP_NOMER -> COTP_ZARP functional dependence is transitive; it is a consequence of the functional dependencies COTP_NOMER -> OUT_NUMBER and OTD_NOMER -> COTP_ZARP. In other words, the employee’s salary is actually not a characteristic of the employee, but of the department in which he works (this is not a very natural assumption, but sufficient for example).

As a result, we will not be able to enter into the database information describing the department’s salary until at least one employee appears in this department (the primary key cannot contain an undefined value). When deleting a tuple that describes the last employee of this department, we will lose information about the department’s wages. In order to consistently change the wages of the department, we will be forced to first find all the tuples describing the employees of this department. Those. there are still anomalies with regard to the staff departments. They can be eliminated by further normalization.

Definition 7. The third normal form. (Again the definition is given on the assumption of the existence of a single key.)

The relation R is in the third normal form (3NF) if and only if it is in 2NF and each non-key attribute is non-transitively dependent on the primary key.

You can decompose relationships EMPLOYEES-DEPARTMENTS into two relationships EMPLOYEES and DEPARTMENTS:

EMPLOYEES (COTR_NOMER, OTD_NOMER)

Primary key:

SOTR_NOMER

Functional dependencies:

SOTR_NOMER -> SEND_NUMER

DEPARTMENTS (OTD_NOMER, SOTR_ZARP)

Primary key:

OUT_NUMBER

Functional dependencies:

OTD_NOMER -> SOTR_ZARP

Each of these two relationships is in 3NF and is free of the marked anomalies.

If we abandon the restriction that the relation has a single key, then the definition of 3NF will take the following form:

Definition 7 ~

The relation R is in the third normal form (3NF) if and only if it is in 1NF, and each non-key attribute is not transitively dependent on any key R.

In practice, the third normal form of relationship schemes is sufficient in most cases, and the process of designing a relational database usually ends with a reduction to a third normal form. However, it is sometimes useful to continue the normalization process.

6.1.3. Normal form of Boyce-Codd

Consider the following example of a relationship diagram:

EMPLOYEES-PROJECTS (SOTR_NOMER, SOTR_NAME, PRO_NOMER, SOTR_ZADAN)

Possible keys:

SOTR_NOMER, PRO_NOMER

COTTER_NAME, PRO_NOMER

Functional dependencies:

SOTR_NOMER -> COTR_NAME

SOTR_NOMER -> PRO_NOMER

SOTR_NAME -> COTR_NOMER

SOTR_NAME -> PRO_NOMER

SOTR_NOMER, PRO_NOMER -> COTR_ZADAN

SOTR_NAME, PRO_NOMER -> COTR_ZADAN

In this example, we assume that the employee’s identity is fully determined by both his number and his name (this is again not a very vital assumption, but sufficient for an example).

In accordance with definition 7 ~, the ratio EMPLOYEES-PROJECTS is located in 3NF. However, the fact that there are functional dependencies of the attribute of a relation on an attribute that is part of the primary key leads to anomalies. For example, in order to change the name of an employee with a given number in a consistent manner, we will need to modify all the tuples that include his number.

Definition 8. Determinant

A determinant is any attribute on which some other attribute is fully functionally dependent.

Definition 9. Boys-Codd Normal Form

The relation R is in Boyce-Codd normal form (BCNF) if and only if each determinant is a possible key.

Obviously, this requirement is not fulfilled for the relationship EMPLOYEES-PROJECTS. It can be decomposed into relations EMPLOYEES and EMPLOYEES-PROJECTS:

EMPLOYEES (COTR_NOMER, COTR_NAME)

Possible keys:

SOTR_NOMER

CLEAR_NAME

Functional dependencies:

SOTR_NOMER -> COTR_NAME

SOTR_NAME -> SOTR_NOMER

EMPLOYEES-PROJECTS (SOTR_NOMER, PRO_NOMER, SOTR_ZADAN)

Possible key:

SOTR_NOMER, PRO_NOMER

Functional dependencies:

SOTR_NOMER, PRO_NOMER -> COTR_ZADAN

Alternative decomposition is possible if you choose SOTR_IMA as the basis. In both cases, the resulting relationship EMPLOYEES and EMPLOYEES-PROJECTS are located in BCNF, and these anomalies are not characteristic of them.

6.1.4. Fourth normal form

Consider an example of the following relationship scheme:

PROJECTS (PRO_NOMER, PRO_SOTR, PRO_ZADAN)

Attitudes PROJECTS contains project numbers, for each project a list of employees who can carry out the project, and a list of tasks envisaged by the project. Employees can participate in several projects, and different projects can include the same tasks.

Each tuple of a relationship links a certain project with an employee participating in this project and the task that the employee performs in the framework of this project (we assume that any employee participating in the project performs all the tasks provided for by this project). Due to the conditions formulated above, the only possible key to the relationship is the composite attribute PRO_NOMER, PRO_COTR, ​​PRO_ZADAN, and there are no other determinants. Therefore, the ratio of PROJECTS is in BCNF. But at the same time it has flaws: if, for example, a certain employee joins this project, it is necessary to insert as many tuples as PROJECTS in the relation as there are tasks in it.

Definition 10. Many-valued dependencies

In relation to R (A, B, C), there is a multi-valued relationship RA -> -> RB if and only if the set of values ​​of B, corresponding to a pair of values ​​of A and C, depends only on A and does not depend on C.

Regarding PROJECTS, there are two ambiguous dependencies:

PRO_NUMER -> -> PRO_COMP

PRO_NUMER -> -> PRO_ZADAN

It is easy to show that, in the general case, with respect to R (A, B, C), there is a multi-valued dependence RA -> -> RB if and only if there is a multi-valued dependence RA -> -> RC

Further normalization of relationships similar to those of PROJECTS is based on the following theorem:

Fagin's Theorem

The ratio R (A, B, C) can be projected without loss in the ratio R1 (A, B) and R2 (A, C) if and only if MVD A -> -> B | C.

By lossless projection, we mean such a way of decomposition of a relationship in which the original relationship is completely and without redundancy restored by the natural combination of the relations obtained.

Definition 11. The fourth normal form.

The relation R is in the fourth normal form (4NF) if and only if, in the case of the existence of a multivalued dependence A -> -> B, all other attributes of R functionally depend on A.

In our example, you can decompose the relationship PROJECTS into two relations PROJECTS-EMPLOYEES and PROJECTS-TASKS:

PROJECTS-EMPLOYEES (PRO_NOMER, PRO_COMP)

PROJECTS-TASKS (PRO_NOMER, PRO_ZADAN)

Both of these relationships are in 4NF and free from the marked anomalies.

6.1.5. Fifth normal form

In all the normalizations considered up to this point, a decomposition of one relation into two was performed. Sometimes this cannot be done, but decomposition into a larger number of relations is possible, each of which has the best properties.

Consider, for example, the relationship

EMPLOYEES-DEPARTMENTS-PROJECTS (COTR_NOMER, OTD_NOMER, PRO_NOMER)

Suppose that the same employee can work in several departments and work in each department on several projects. The primary key of this relationship is the complete set of its attributes; there are no functional and multivalued dependencies.

Therefore, the ratio is in 4NF. However, there may be anomalies that can be resolved by decomposing into three relationships.

Definition 12. Connection Dependency.

The ratio R (X, Y, ..., Z) satisfies the dependencies of the compound * (X, Y, ..., Z) if and only if R is recovered without loss by connecting its projections to X, Y, ..., Z.

Definition 13. The fifth normal form.

The relation R is in the fifth normal form (the normal form of the projection-connection is PJ / NF) if and only if any connection dependence in R follows from the existence of some possible key in R.

We introduce the following names of compound attributes:

CO = {SOTR_NOMER, DEPT_NUMER}

SP = {SOTR_NOMER, PRO_NUMER}

OP = {OTD_NUMER, PRO_NOMER}

Suppose that in relation to the EMPLOYEES-DEPARTMENTS-PROJECTS there is a connection dependence:

* (CO, SP, OP)

It is easy to show with examples that problems can arise when inserting and deleting tuples. They can be eliminated by decomposing the original relationship into three new relationships:

EMPLOYEES-DEPARTMENTS (SOTR_NOMER, DEPARTMENT_NUMER)

EMPLOYEES-PROJECTS (SOTR_NOMER, PRO_NUMER)

DEPARTMENTS-PROJECTS (DEPARTMENT NUMBER, PRO_NUMBER)

The fifth normal form is the last normal form that can be obtained by decomposition. Its conditions are rather nontrivial, and in practice 5NF is not used. Note that the dependence of a compound is a generalization of both a multi-valued dependency and a functional dependency.


Comments


To leave a comment
If you have any suggestion, idea, thanks or comment, feel free to write. We really value feedback and are glad to hear your opinion.
To reply

IBM System R — реляционная СУБД

Terms: IBM System R — реляционная СУБД