I. Configuration Techniques
A) Use Raw Partitions instead the file system.
B) Eliminate Fragmentation
1) Defining Fragmentation
2) Preventing Fragmentation
3) Dealing with Existing Fragmentation
C) Table Clustering.
1) Table Index Clustering
a) Clustering about the join of 2 or more tables.
b) Clustering about 1 table.
2) Hash Clustering
D) Maximize Size of System Global Area
II. SQL Tuning and Stored Procedures
A) Reducing Number of Indexed Keys.
1) Index Performance Tradeoffs.
1) Alternatives to Indexes
B) Tuning individual SQL Queries
C) Using Stored Procedures
III. Application Level Techniques
A) Application Cacheing of Smaller Tables
B) Report Data Warehousing
C) Collection of Statistics
IV. Techniques Affecting Data Model
A) Third Normal Form
B) Collections - Implement Many <=> 1 relationships
1) VARRAYS - Atomic MultiValues withing a single column.
2) Nested Tables - MultiValues of another table's rows
a) Evaluating Nested Table Candidates
b) Performance Considerations of Nested Tables
This document takes a full system approach to improving database
performance. A single technique approach, like tuning SQL queries
using tools such as Explain Plan, can yield some
significant improvements in the short term. However, optimizing SQL
queries will only help to the extent that the SQL queries are the
problem. As this document will show, there are many performance
tuning options that range from the relatively quick fix of
Eliminating Fragmentation to the more
involved Report Data Warehousing.
Role of Database Designer
Since this web page is targetted to Database Designers, the first
subject covered will be to define just what it means to do database
design.
The role of the relational database designer is ambiguous. He or she is expected to design the Data Model, consisting of the Entity/Relationship model, if one is used, and the Relational Model (the tables definitions). Outside of the Data Model, the other expectations of the database designer tend to vary with the particular project.
The diagram below is intended to illustrate
the role of the database designer in relation to the various tasks
associated with database design and implementation:
Figure 1 is intended to be a Venn diagram using boxes instread of
circles. The area enclosed within larger gray box represents what
should be the realm of responsibility, or at least involvement, of
the database designer (or architect). The area enclosed within the
green box represents the responsibilities of the DBA. Thus, it is
intended that figure 1 illustrates that Configuration should be a
shared responsibility. (Please note that in this diagram the size
of the separate boxes has no implications. The DBA got a small
box, but also got top billing.)
The arrows in figure 1 are intended to show dependency, such as the Application Code being dependent upon the the Data Model and/or the Functional Model. But even in the application code dependency there is ambiguity because there may or may not be a distinct Functional Model. On some projects, the application developers may access the table directly by using ODBC and SQL, or some similar method. If there is a Functional Model, some of the application code might bypass it. Or another alternative is that there is no Functional model per se, but the database designer may be expected to assist the application developers with database access, or just code it up himself on an ad hoc basis.
The dotted line around the Business Rules is meant to show that altho it would never expected for the database to enforce all the business rules, some of the business rules can be enforced by the database. For example, encapsulating separate credit and debit operations within a single transaction is applying a business rule. Stored procedures can be used to apply business rules.
Consequently, before getting into the particulars of increasing
database performance by using the listed techniques, this document
advocates clear role definitions for the various players that are
involved with the database and its associated application. This
document also promotes the idea that the Database Designer should at
least be involved, if not directly responsible for, all the areas
within the gray outline of figure 1. Achieving
optimum database performance will usually require a holistic approach
to all the factors affecting the database performance. Without
someone given the means to coordinate all the factors affecting database
performance, the odds of achieving optimum performance are remote.
Dependency Considerations
Altho role definitions and development processes may vary or be
ambiguous, functional dependency is clear. For example, one cannot
design the the SQL for queries or updates (ie Functional Model)
to a database before the tables are defined (Relational
Model).
Since dependencies are very relevant to the effort needed to implement a
particular technique, this document will start with the least intrusive
performance techniques and progress to the most intrusive techniques.
Since database configuration is independent of the dependency tree,
changes to the configuration are the least intrusive techniques. The
most intrusive techniques would be at the base of the dependency tree,
those involved with modifying the Data Model.
Configuration Techniques are Dependency Independent
Refering again to figure 1, database
Administration and Configuration, typically the duties of
the DBA, is independent of the dependency tree. This independence
implies that any techniques applied to the Configuration to improve
performance can be done without affecting the application design.
Thus, techniques applied to the Configuration, the least intrusive
techniques, will be covered first.
RAW Partitions
Using raw partitions instead of a file system simply removes a layer
(the file system) between the raw disk blocks and the database package.
The most profound effect of using raw partitions is most likely to be
the lessening of fragmentation.
Eliminate Fragmentation
Defining Fragmentation
The most optimal physical layout of a database is for all of its
tables (and other objects) to be stored in contiguous disk blocks.
Any time there is a gap between the disk blocks of a given
table, fragmentation has occured. The more gaps that exist
within the disk blocks of tables, the more fragmentation there is.
Fragmentation causes additional disk head movements, which significantly increase the amount of time it takes to perform database operations. These head movements are in the form of seeks, which are the moving of the disk head from one disk track to another. A seek will typically take at least 5 milliseconds (1/200th of a second) or longer.
1/200th of a second may not seem like much, but it snowballs. Not
only can there be many seeks be involved in a database transaction,
the disk driver software may not immediately perform a required seek.
The optimization algorithms in the disk driver may interrupt your disk
operation to do other disk operations to minimize disk head movement.
The moral of the story is that fragmentation is a bad thing.
Preventing Fragmentation
The best prevention of fragmentation is the proper setting of the
INITIAL and NEXT parameters of the storage clause,
which itself is part of the extent specs. The storage clause
is a Data Definition (DDL) SQL statement clause. The storage clause
or its extent specs container is found in the CREATE CLUSTER, CREATE
INDEX, CREATE TABLE and CREATE TABLESPACE DDL SQL statements. The
NEXT paramater is also modifiable in the ALTER CLUSTER, ALTER INDEX,
ALTER TABLE and ALTER TABLESPACE DDL SQL statements.
The value of the INITIAL parameter represents the number of disk blocks originally allocated to a database object (table, index, tablespace, or cluster), while the NEXT parameter represents the size in disk blocks of the Extent that object will increase by when it needs more space. An Extent is a logical unit of database storage and is made up of contiguous disk blocks.
The less often new extents are added to a database object, the less
fragmentation. Thus, large INITIAL and NEXT extent sizes are better
than small extent sizes, providing disk space is not wasted that may
needed elsewhere. As is true with all database configuration issues,
the setting of the INITIAL and NEXT parameters is a matter understanding
the nature of the data and of the application.
Dealing With Existing Fragmentation
In an existing system with fragementation, the NEXT clause can be
modified in the ALTER CLUSTER, ALTER INDEX, ALTER TABLE and ALTER
TABLESPACE DDL SQL statements to prevent it from getting worse.
To eliminate existing system fragmentation, the database can
be exported and then imported. If the database is 24/7 and can never
be scheduled for down time, that database would need to be rebuilt in
a separate space and put online using some switchover mechanism.
Table Clustering
The Oracle database gives the database designer or DBA 3 distinct
techniques to store, or cluster, table rows together
within disk blocks. Those 3 ways are Table Clustering about 1
table, Table Clustering about 2 or more tables, and Hash
Clustering. The primary purpose of Table Clustering about 2 or
more tables is to expidite joins and is discussed in a
separate web page.
The factors that determine if a table or tables are good candidates for Table Clustering are:
The way it works is that the key value, possibly comprising a number of columns, will only be stored once for all the rows that share that same key value. Additionally, all the rows that share that same key value will be stored contiguously within the same disk block. Obviously, it makes no sense to apply this technique to unique keys.
The primary advantage to Table Clustering about 1 table
is that when the Table Cluster Index key is used in queries, those
queries will be faster since all the rows with the same key value are
physically stored together. In addition, that duplicate key will only
need to be stored once by the associated index, a significant factor
in index traversal and maintenance performance.
Hash Clustering
A database index can be defined as an auxilary mechanism to a table
that is used to rapidly locate particular rows given a particular key
value. By this definition it would be possible to design an index
based upon a Hashtable object. However, in Oracle, the term 'index'
applies only to B-Tree indexes.
In Oracle, hashtables per se do not exist, but key hashing can be used to locate rows via Hash Clustering. Hash Clustering is similar in concept to Table Index Clustering, but uses a hashing function to determine in which disk block a row of a given key value resides, instead of a B-Tree index.
Altho a Hash Cluster is similar in concept to an Index Cluster,
it has its own criteria for use. Hash Clustering is not about the
efficiency of doing joins or for index keys of low cardinality. The
advantage to using a Hash Cluster is that it replaces a B-Tree index.
Like all Oracle Table Clustering, properly determining the optimal
configuration parameters can be complicated, but it only has to done
once.
Maximize Size of System Global Area
The System Global Area (SGA) is a shared memory
region allocated by Oracle that contains:
In regards to performance, B-Tree indexes are expensive. Indexes can improve the performance of searching by indexed keys, but they degrade the performance of the SQL INSERT, UPDATE and DELETE operations. Consequently, the importance of the queries that benefit from the indexing of a particular key must be weighed against all the other operations that are hurt by it. A common cause of performance degredation is having too many indexes. Indexes should not, but often are, created in a willy-nilly fashion.
A basic criteria for the creation of an index is whether the query
that benefits from it is a fundamental part of the application, or
whether it is only for report generation. If the query that benefits
from an index is not a fundamental part of the application
(ie reports), an alternative to indexing may be a better solution.
Alternatives to B-Tree Indexes
Alternatives to B-Tree indexes include:
Since SQL optimization is well covered by Oracle, it will not be
expanded upon here. However, it is still well worth looking into.
Using Stored Procedures
The use of Stored Procedures provide 3 main advantages:
A common method for an application to access the database is to use ODBC or its Java counterpart, JDBC. Every time an ODBC operation is done, the SQL script is passed to the database. This SQL script must be compiled before it can be used. A precompiled stored procedure eliminates compiling overhead during transactions.
SQL itself is a stateless language. A stored procedure contains SQL within the environment of a logical and algorithmic program. Thus, all the interaction that may be involved between the database and the application's program logic takes place within the database. I/O between the database and the application is greatly reduced.
Previously, all stored procedures were coded in PL/SQL. PL/SQL is a distinct programming language that requires its own learning curve. PL/SQL was modelled after PL/1, which most developers today probably never heard of.
Now that Oracle supports Java, stored procedures can be developed,
supported and understood by anyone who understands Java. Java
programmers are everywhere and Java itself may soon become the most
common programming language.
Application Table Cacheing
A huge boost in database performance can be made by storing
particular tables completely in RAM. This technique involves the
development of application code to read entire tables, to store
those tables completely in RAM, and to access those RAM based tables
for processing. Indexing support for these in RAM tables can be
done using RAM based objects like Hashtables.
Since system RAM has become so cheap and abundant, this technique has become feasible for larger and larger tables. The main criteria for chosing tables for Application Table Cacheing is if those tables are essentially nonTransactional (they don't often change).
A common example of a table that is nonTransactional but yet core to its application is an Item Description table in an inventory oriented application. The Item Description table itself, the table of the descriptions and parameters of the standard items that are offered for sale by department, mail order and web based stores, is essentially nonTransactional. Most stores do not change their standard product line on a minute by minute, or even hour by hour basis.
A mail order business must have its items defined and stable before its catalogs or flyers are printed. A web site has to update the content equivilant of the mail order catalog before a new item can be used. A department store must coordinate a new item with supply, display, stocking it, etc. The point is, an Item Description table does not change often. If a change occurs, such a table can be refreshed in its RAM buffer on a scheduled basis during off hours.
In inventory oriented applications, the Item Description table is
involved in almost every transaction: sales, inventory updates, order
generation, invoice generation, etc. By having the Item Description
table in RAM, not only is its access much faster, it also puts no
burden on the I/O channel from the application to the database, the
disk head movement sequence or the disk block cache.
Report Data Warehousing
A data warehouse is nonTransactional database used for
data mining, the obtaining of data by the running of
elaborate queries. Transactional support is very expensive in terms
of resources and performance, so the data warehouse exists outside of
the transactional system and is used to support data mining.
The basic idea of using data warehousing to increase performance is to separate, at least conceptually, which operations need to be part of the transactional system, and which do not. For example, scheduled daily reports do not need to be transactional. Scheduled reports can be scheduled to run in off hours. If a particular index exists soley to support certain reports, it is possible that in off hours the system can tolerate a less efficient search, even if it's a sequential search, and that index can be eliminated.
It may seem that a read only query, like a report, is not a transaction. But it is. Oracle automatically supports read consistency. Read consistency essentially takes a snapshot of all the tables involved in a query and guarantees that the entire query will see the all data at the point in time that the query started. Thus, if any UPDATEs, INSERTs or DELETEs are done to the targetted tables during a query, that query essentially becomes a read transaction.
It is common that a number of scheduled reports access the same data, but may use different indexed keys for querying. Instead of querying the transactional database separately for each report, intermediate tables can be designed that contain the information needed for those reports. These intermediate tables, which essentially become a temporary little data warehouse, can be optimized for report generation without the need for transactional considerations. To create the intermediate tables, usually only a single query, possibly a sequential search, is the only burden to the transactional system. Again, creation of intermediate tables would best be done in off hours.
Since after its creation, no UPDATEs, INSERTs or DELETEs are done
to these intermediate tables (mini data warehouse), the report queries
that run against it do not generate transactions. These reports can
then be run on an ad hoc basis against their data warehouse. After
the reports that required certain information are generated, the
table rows of the intermediate tables that contain that information,
or the entire tables, can be deleted.
Collection of Statistics
Oracle automatically accumulates a number of statistics on the
database. But more useful information never hurts. Instead of
spending time pursuing actions based upon guesses about some aspects
of a database, that time may be better spent deciding what exactly
what information is needed, often in the form of statistics, and
adding code and/or tables to obtain that information.
An example is using custom statistics can be found in the web page
on Oracle Table Clustering. Table
clustering is usually not recommended even by Oracle reps, not because
it doesn't work, it does work and can be very effective, but because
it is difficult to determine when to properly apply table clustering.
In short, there is more complexity involved that the average rep
wants to deal with. As can be seen in the
Oracle Table Clustering web page, a
methodical approach to obtain and analyze pertinent information can
make the mysterious become clear. This increased understanding will
provide a heightened sense of control which will yield successful
results in many performance related areas.
Techniques Affecting Data Model
As stated in the Introduction, any
performance improvements that involve modifying the
Data Model,
will affect other things downstream to it on the dependency tree.
The best thing to do, obviously, is to consider changes to the Data
Model early in the development phase. The next best thing is to do,
at least in theory, is to encapsulate the Data Model within a
Functional Model so that any subtle Data Model changes are
transparent to the Application Code.
Third Normal Form
A fundamental tenent of Relational Databases is to insure
Data Integrity. One
of the means used to insure Data Integrity is
Normalization.
Altho some purists may argue that a database should be designed
to its maximum level of normalization, anything higher than
Third Normal Form
will probably be unnecessary and serve only to hurt performance.
Collections - Implement Many <=> 1 relationships
Oracle has become an Object Relational database. In other
words, Oracle now supports objects. It would appear that if any of
the the new Object Relational features would benefit performance,
that feature would be the support of Collections. Collections
in Oracle consist of VARRAYs and Nested Tables.
VARRAYS - Atomic MultiValues withing a single column.
The VARRAY data type is essentially an array with the same
properties as an array in any programming language. It has a fixed
size, or given number of slots for storage bins for values, and
each slot is addressable via its whole number index.
The index of and array is not to be confused with a database index.
VARRAYs are useful in cases there is a 1 to MANY relationship between a table row and a single value. Simple examples might be criminal's alias names, or the people listed to drive a particular car. In each case, the array of values is strictly one dimensional. Conversely, a list of phone numbers a person might have would not be one dimensional, since a phone number is incomplete without knowing if a given phone number was for home, work, cell phone, etc.
In regard to performance, the pertinent property of VARRAYs is that unless they are very large (greater that 4K), they are stored in place within the data row. Thus, VARRAYs can significantly improve the performance of a 1 dimensional multivalues.
The downside of VARRAYs that its empty data slots must store a
NULL value.
Nested Tables - MultiValues of another table's rows
Evaluating Nested Table Candidates
Similar to VARRAYs, Nested Tables exist to conveniently
store a 1 to MANY relationship between a table row and
another entity, in this case the rows of a nested table. The term
nested is significant. It is only intended for cases
where the only time the nested table rows are accessed is as part of
the row that contains them. If the nested rows are also expected to
be accessed independently, those rows should not be nested.
Following is an example of 4 normalized tables that comprise the information contained within an Order. These tables are:
Item Description Table
Item # Item Name Misc Item Parameters ~
------ --------- ---- ---- ---- ----- ~
066364 thingagig .... ~
087905 dodad .... ~
037617 gizmo .... ~
053135 thingie .... ~
.... .... .... ~
Customer Description Table
Customer # Customer Other Customer Info ~
Name Addr City State Zip ~
---------- ---------- ---- ---- ---- ---- ~
AA23487 Widget Inc .... ~
GV12464 zipTech .... ~
MT57612 A-OK Corp .... ~
Order Customer Table Order Table
Order # Customer # Order # Item # Qty
------- ---------- ------- ------ ---
02-1123 AA23487 02-1123 066364 39
02-1154 GV12464 02-1123 087905 32
02-2321 MT57612 02-1154 037617 81
02-1154 053135 17
02-2321 006776 6
02-2321 055662 144
02-2321 067587 37
02-2321 087905 55
The base of an order is the join between the Order Customer
Table and the Order Table, which has 1 to many
relationship respectively, as shown below:
Joined Order Table
Order # Customer # Item # Qty
------- ---------- ------ ---
02-1123 AA23487 066364 39
087905 32
02-1154 GV12464 037617 81
053135 17
02-2321 MT57612 006776 6
055662 144
067587 37
087905 55
The Item Description and Customer Description tables
simply add descriptive parameters to this join on a 1 to 1 basis.
In this example, the Order Table exists only to normalize the
1 to many relationship between the Order Customer Table and the
Order Table. There is no reason to access the Order Table other
than as part of the join of the Order Customer Table and the Order
Table. Thus, the Order Table is a candidate to be a Nested Table.
Performance Considerations of Nested Tables
Unfortunately, the nested rows of Nested Tables are stored
separately, just as they would be in the conventional normalized
scenario. When the combined table, the Joined Order Table in
our example, is accessed, a dynamic join still must be done.
Consequently, Nested Tables are really just a fancy View
mechanism and don't offer a performance benefit.
A better solution to use Table Clustering.
Table Clustering not only actually stores the parent row and the
nested rows together, but is even less restrictive since the nested
rows can efficiently be accessed independently.