16 Business Intelligence


001 This chapter describes some of the basic ideas in business intelligence.
 
本章介绍业务智能(business intelligence)领域的基础概念。
 
002 This chapter contains the following topics: 本章包含以下主题:
003

Introduction to Data Warehousing and Business Intelligence

16.1 数据仓库及业务智能简介

004 A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
 
数据仓库是一个专为查询与分析(query and analysis)——而非事物处理(transaction processing)——而设计的关系型数据库。 数据仓库中通常包含来自事务处理系统的历史数据,还可能包含来自其他数据源的数据。数据仓库的作用是将数据分析负载与事务处理负载分离,同时还能将用户 拥有的多个数据源融合为一个整体。
 
005 In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
 
数据仓库系统中除了关系型数据库之外,还应包含数据抽取(extraction),转换(transportation),传输(transformation),加载(loading)方案(即 ETL),联机分析处理(online analytical processing,OLAP)引擎,客户端分析工具,以及其他用于管理数据收集与数据发布过程的应用程序。
 
006

Characteristics of Data Warehousing

16.1.1 数据仓库特性

007 A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: 在介绍数据仓库时,通常都会引用 William Inmon 所阐述的数据仓库应具备的特性:
008

Subject Oriented

16.1.1.1 面向主题的

009 Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
 
数据仓库的作用是帮助用户分析数据。例如,如果用户需要分析企业的销售数据,就应建立一个以销售数据为中心的数据仓库。利用此数据仓库,分析人员能够回答类似“谁是去年对某类产品销售贡献最大的用户?”等问题。由于数据仓库是依据主题而创建的,所以说数据仓库是面向主题的(Subject Oriented)。
 
010

Integrated

16.1.1.2 统一的

011 Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
 
数据仓库的统一性(integration)与其面向主题性(subject orientation)是紧密相关的。数据仓库需要将来自不同数据源的数据整合为统一的形式。在向数据仓库中加载数据时,还要解决命名冲突,度量单位不一致等问题。当上述问题都解决后,数据仓库就可以 被看作是统一的(integrated)。
 
012

Nonvolatile

16.1.1.3 永久的

013 Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
 
永久性(nonvolatile)是指数据进入数据仓库后就不应被修改或删除。永久性的意义在于,数据仓库能够供用户分析历史上发生了什么。
 
014

Time Variant

16.1.1.4 历史的

015 In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
 
为了发现业务中存在的趋势,分析用户需要使用大量数据。这与联机事务处理(online transaction processing,OLTP)系统截然相反,OLTP 系统对性能要求较高,历史数据必须移动到归档中。由于数据仓库关注数据在时间上的变化,因此称其是历史的(time variant)。
 
016 Typically, data flows from one or more online transaction processing (OLTP) databases into a data warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before being added to the data warehouse. Data warehouses commonly range in size from tens of gigabytes to a few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.
 
在一个典型的数据仓库系统中,数据每月,每周或每日从一个或多个 OLTP 数据库汇入数据仓库。数据在加载到数据仓库之前经常存储在中转文件(staging file)内,并进行适当的处理。数据仓库的容量一般在数十 GB 至数 TB 之间。通常,事实表(fact table)占了数据仓库数据量的绝大部分。
 
017

Differences Between Data Warehouse and OLTP Systems

16.1.2 数据仓库系统与 OLTP 系统的区别

018 Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:
 
数据仓库系统与 OLTP 系统的需求差别很大。以下介绍典型的数据仓库系统与 OLTP 系统的区别:
 
019

Workload

16.1.2.1 负载

020 Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.
 
设计数据仓库时要满足即席查询(ad hoc query)的需要。开发者无法预先确定数据仓库的负载,因此在设计数据仓库时应该考虑到所有可能的查询操作 ,并进行优化。
 
021 OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.
 
而 OLTP 系统只支持预定义的操作(predefined operation)。用户的应用系统可以进行有针对性的调优,或只提供特定的操作功能。
 
022

Data Modifications

16.1.2.2 数据变更

023 A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.
 
数据仓库的数据由采用批量数据变更方式(bulk data modification technique)的 ETL 过程周期性地(每天夜间或每周)进行更新。数据仓库的终端用户(end user)不能直接更新其中的数据。
 
024 In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.
 
而在 OLTP 系统中,终端用户会经常性地向数据库提交进行数据变更的语句。OLTP 数据库总是在不断地更新,以反映事务的最新状态。
 
025

Schema Design

16.1.2.3 模式设计

026 Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
 
数据仓库通常采用反范式模式(denormalized schema)或部分反范式模式(partially denormalized schema)(例如星型模式(star schema))来优化查询性能。
 
027 OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.
 
而 OLTP 系统通常采用范式模式(normalized schema)以优化更新/插入/删除等操作的性能,并确保数据一致性。
 
028

Typical Operations

16.1.2.4 典型操作

029 A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month."
 
一个典型的数据仓库查询所扫描数据行的数量级通常是数千,甚至是数百万。例如,“找出所有客户上个月的总销售量”。
 
030 A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
 
而一个典型的 OLTP 操作只需访问个别的记录。例如,“获得某客户的当前订单”。
 
031

Historical Data

16.1.2.5 历史数据

032 Data warehouses usually store many months or years of data. This is to support historical analysis.
 
数据仓库内通常会存储数月甚至数年的数据。这些数据被用于历史分析。
 
033 OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.
 
而 OLTP 系统内通常只存储几周或几个月的数据。OLTP 系统只存储当前事务所需的历史数据。
 
034

Data Warehouse Architecture

16.1.3 数据仓库体系结构

035 Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are: 数据仓库的体系结构是由用户的实际情况决定的。下面介绍三种常见的体系结构:
036

Data Warehouse Architecture (Basic)

16.1.3.1 数据仓库体系结构(基本型)

037 Figure 16-1 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.
 
图 16-1 显示了一个最基本的数据仓库的体系结构。数据仓库的数据来自多个数据源,终端用户直接访问数据仓库。
 
038 Figure 16-1 Architecture of a Data Warehouse
 
图 16-1 数据仓库体系结构
 
039


 


 

040

Figure 16-1 shows a data warehouse with metadata, raw data and summary data. Data going in to the data warehouse comes from data sources (operational systems and flat files) and from users (doing analysis, reporting, and mining).

图 16-1 显示了一个包含元数据,原始数据,及汇总数据的数据仓库。数据仓库存储来自多个数据源(操作型数据库及平面文件)及终端用户(分析,报表,及挖掘)的数据。
041 In Figure 16-1, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales.
 
图 16-1 中,数据仓库内存储了来自传统 OLTP 系统的原始数据(raw data)及元数据(metadata),同时还有汇总数据(summary data)。汇总数据在数据仓库中十分必要,因为其中预先计算了需要长时间执行才能得到结果。例如,一个典型的数据仓库查询可能会获取八月份的总销售情况。
 
042 Summaries in Oracle are called materialized views.
 
在 Oracle 中可以使用物化视图(materialized view)存储汇总数据。
 
043

Data Warehouse Architecture (with a Staging Area)

16.1.3.2 数据仓库体系结构(使用中转区)

044 Figure 16-1, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 16-2 illustrates this typical architecture.
 
图 16-1 所示的系统中,用户在将操作型数据(operational data)放入数据仓库之前,首先要对数据进行清洗及其他处理。上述过程可以完全在数据抽取程序中完成,但大多数数据仓库系统都会利用中转区(staging area)来进行此类操作。中转区能够简化计算汇总数据的过程,并使数据仓库更易于管理。图 16-2 显示了一个使用中转区的数据仓库系统。

 
045 Figure 16-2 Architecture of a Data Warehouse with a Staging Area
 
图 16-2 使用中转区的数据仓库体系结构
 
046


 


 

047

Figure 16-2 shows a data warehouse with metadata, raw data and summary data. Data going into the data warehouse comes from data sources (operational systems and flat files) and from users (doing analysis, reporting, and mining). The operational data from the data sources first goes to a staging area.

图 16-2 显示了一个包含元数据,原始数据,及汇总数据的数据仓库。数据仓库存储来自多个数据源(操作型数据库及平面文件)及终端用户(分析,报表,及挖掘)的数据。 来自数据源的操作型数据首先进入中转区。
048

Data Warehouse Architecture (with a Staging Area and Data Marts)

16.1.3.3 数据仓库体系结构(使用中转区及数据集市)

049 Although the architecture in Figure 16-2 is quite common, you might want to customize your warehouse's architecture for different groups within your organization.
 
图 16-2 所示的系统是较常见的数据仓库体系结构。但有时开发者还需要为组织内部的不同用户群定制特殊的数据仓库。
 
050 Do this by adding data marts, which are systems designed for a particular line of business. Figure 16-3 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.
 
这种定制是通过数据集市(data mart)实现的,数据集市是为不同业务而单独设计的数据仓库系统。在 图 16-3 所示的系统中,采购,销售,库存等系统是相互分离的。在这个例子中,专注于财务的分析用户只需访问采购及销售系统中的历史数据。
 
051 Figure 16-3 Architecture of a Data Warehouse with a Staging Area and Data Marts
 
图 16-3 使用中转区及数据集市的数据仓库体系结构
052


 


 

053

Figure 16-3 shows a data warehouse with metadata, raw data and summary data. Data going in to the data warehouse comes from data sources (operational systems and flat files) and from users (doing analysis, reporting, and mining). The operational data from the data sources first goes to a staging area. The user data is separated into lines of business, such as sales and purchasing.

图 16-3 显示了一个包含元数据,原始数据,及汇总数据的数据仓库。数据仓库存储来自多个数据源(操作型数据库及平面文件)及终端用户(分析,报表,及挖掘)的数据。 来自数据源的操作型数据首先进入中转区。终端用户使用的数据根据业务进行分割,例如销售及采购。
054

See Also:

Oracle Database Data Warehousing Guide

另见:

Oracle Database Data Warehousing Guide
055

Overview of Extraction, Transformation, and Loading (ETL)

16.2 抽取,转换,及加载(ETL)概述

056 You need to load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. We refer to the entire process, including data loading, as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.
 
用户需要定期地向数据仓库中加载数据,才能使其发挥业务分析的作用。数据需要从一个或多个操作型系统(operational system)中抽取并复制到数据仓库中。从源系统(source system)抽取数据再存储到数据仓库的过程被称为 ETL,即抽取(extraction),转换(transformation),及加载(loading)的首字母缩写。 事实上,只用 ETL 描述此过程过于简化,因为其中没有体现出传输(transportation)子过程,也没有体现出各子过程之间的联系。在本文中将介绍完整的 ETL 过程。用户应意识到 ETL 涉及一系列子过程,而不只是 3 个已经严格定义的步骤。
 
057 The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems are the IT backbone of any enterprise. Data has to be shared between applications or systems, trying to integrate them, giving at least two applications the same picture of the world. This data sharing was mostly addressed by mechanisms similar to what we now call ETL.
 
ETL 过程采用的方法论及其中包含的主要工作其实早已为开发者所熟知,且 ETL 并非只应用于数据仓库系统。一个企业的 IT 环境通常由多个数据库系统及定制的应用系统构成。这些系统之间应该可以共享数据,或进行相互的集成,不同系统中的数据应该能够反映相同的业务事实。上述的数据共享工作通常是由与 ETL 类似的机制实现的。
 
058 Data warehouse environments face the same challenge with the additional burden that they not only have to exchange but to integrate, rearrange and consolidate data over many systems, thereby providing a new unified information base for business intelligence. Additionally, the data volume in data warehouse environments tends to be very large.
 
与数据共享工作相比,数据仓库系统除了需要交换数据,还需集成(integrate),重构(rearrange),融合(consolidate)来自多个源系统的数据,从而为业务智能提供一个统一的信息基础。此外,数据仓库系统中的数据量通常十分巨大。
 
059 What happens during the ETL process? During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days/hours and minutes to near real-time. Web server log files for example can easily become hundreds of megabytes in a very short period of time.
 
ETL 过程究竟包含哪些具体工作呢?在抽取阶段,需要从不同的数据源(数据库系统或应用程序)中定位并抽取分析所需的数据。但是准确地定位感兴趣的数据子集比较困难,因此在抽取 阶段通常会抽取多余的数据,定位有用数据的工作将在之后的步骤进行。有些转换工作也是在抽取阶段完成的,这取决于数据源系统的能力(例如,操作系统是否有足够资源)。抽取的数据量是由源系统及业务需求决定的,或许数字节,也可能高达数 GB。两次(逻辑上)完全相同的抽取之间的间隔同样由源系统及业务需求决定,有可能为数天,数小时,数分钟,甚至接近实时。例如,Web 服务器的日志数据在短时间内就可能积累数百 MB。
 
060 After extracting data, it has to be physically transported to the target system or an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too. For example, a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement.
 
数据被抽取后,需要传输(physically transport)到目标系统(target system)或中间系统(intermediate system)中进行进一步处理。在传输的过程中也可以进行某些转换工作,这取决于用户选择得传输方式。例如,使用 SQL 语句通过网关(gateway)直接访问远程目标系统时,可以在 SELECT 语句中将两列数据合并。
 
061 If any errors occur during loading, an error is logged and the operation can continue.
 
如果在加载中发生错误,应该将问题记录到日志中,并使抽取过程继续执行。
 
062

Transportable Tablespaces

16.2.1 可移动表空间

063 Transportable tablespaces are the fastest way for moving large volumes of data between two Oracle databases. You can transport tablespaces between different computer architectures and operating systems.
 
可移动表空间(transportable tablespace)是在两个 Oracle 数据库间移动大量数据时最迅速的方式。用户可以在操作系统或计算机体系结构不同的系统间移动表空间。
 
064 Previously, the most scalable data transportation mechanisms relied on moving flat files containing raw data. These mechanisms required that data be unloaded or exported into files from the source database. Then, after transportation, these files were loaded or imported into the target database. Transportable tablespaces entirely bypass the unload and reload steps.
 
在可移动表空间技术出现之前,将原始数据(raw data)存储在平面文件(flat file)中再进行移动是最具伸缩性(scalable)的数据传输机制。这种方法需要首先将数据从源数据库中导出(unload or export)到平面文件。经过传输(transportation)后,再将文件中的数据导入(load or import)目标数据库。而使用可移动表空间可以完全跳过导入及导出的步骤。
 
065 Using transportable tablespaces, Oracle data files (containing table data, indexes, and almost every other Oracle database object) can be directly transported from one database to another. Furthermore, like import and export, transportable tablespaces provide a mechanism for transporting metadata in addition to transporting data.
 
使用可移动表空间功能时,Oracle 数据文件(包含表数据,索引,及几乎所有类型的 Oracle 数据库对象)可以直接在数据库间传输。此外,可移动表空间功能除了能够进行数据传输,还能进行元数据传输,这同 import 及 export 的元数据传输功能类似。
 
066 The most common applications of transportable tablespaces in data warehouses are in moving data from a staging database to a data warehouse, or in moving data from a data warehouse to a data mart.
 
在数据仓库系统中,可移动表空间功能最常见的应用是将数据从中转数据库(staging database)移动到数据仓库,或将数据从数据仓库移动到数据集市。
 
067

Table Functions

16.2.2 表函数

068 Table functions provide the support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java. Scenarios as mentioned earlier can be done without requiring the use of intermediate staging tables, which interrupt the data flow through various transformations steps.
 
用户可以利用表函数(table function)管道化(pipelined)及并行化(parallel)地执行数据转换(transformation),表函数可以由 PL/SQL,C,或 Java 实现。本章前面讲述的某些抽取案例中使用了起中间作用的中转表(staging table),但这会使数据在各个转换步骤间出现停顿,而采用表函数则能够避免这种停顿。
 
069 A table function is defined as a function that can produce a set of rows as output. Additionally, table functions can take a set of rows as input. Table functions extend database functionality by allowing:
  • Multiple rows to be returned from a function
  • Results of SQL subqueries (that select multiple rows) to be passed directly to functions
  • Functions take cursors as input
  • Functions can be parallelized
  • Returning result sets incrementally for further processing as soon as they are created. This is called incremental pipelining
表函数指能够返回一个结果集的函数。此外,表函数也能够把结果集作为输入参数。表函数在以下几个方面扩展了数据库的能力:
  • 一个函数可以返回多行数据
  • SQL 子查询的结果(即选择多行数据)可以直接传递给表函数
  • 表函数能够把游标作为输入参数
  • 表函数能够并行执行
  • 表函数能够增量地返回结果集。这被称为增量管道
070 Table functions can be defined in PL/SQL using a native PL/SQL interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI).
 
表函数可以使用本地 PL/SQL 接口(native PL/SQL interface)在 PL/SQL 中定义,也可以使用 Oracle Data Cartridge Interface(ODCI)在 Java 或 C 中定义。
 
071

External Tables

16.2.3 外部表

072 External tables let you use external data as a virtual table that can be queried and joined directly and in parallel without requiring the external data to be first loaded in the database. You can then use SQL, PL/SQL, and Java to access the external data.
 
外部表(external table)功能使用户可以将外部数据虚拟为一个表,并可直接对此虚拟表进行并行地查询与联接操作,而无需将外部数据导入数据库。建立了外部表后,用户可以使用 SQL,PL/SQL,及 Java 访问外部数据。
 
073 External tables enable the pipelining of the loading phase with the transformation phase. The transformation process can be merged with the loading process without any interruption of the data streaming. It is no longer necessary to stage the data inside the database for further processing inside the database, such as comparison or transformation. For example, the conversion functionality of a conventional load can be used for a direct-path INSERT AS SELECT statement in conjunction with the SELECT from an external table. Figure 16-4 illustrates a typical example of pipelining.
 
采用外部表后,加载及转换两个阶段即可实现管道化。转换过程及加载过程可以相互融合,其间的数据流无需中断。用户无需首先将数据暂存在数据库内就可直接对其进行处理(例如比较或转换)。例如,用户可以使用 direct-path 的 INSERT AS SELECT 语句,使用 SELECT 语句访问外部表,并在加载的同时实现转换。图 16-4 显示了一个管道化(pipelining)的典型例子。
 
074 Figure 16-4 Pipelined Data Transformation
 
图 16-4 管道化的数据转换
075


 


 

076

Figure 16-4 shows 4 dimension tables (products, customers, times, and channels) connected to one fact table, called sales.

图 16-4 显示了四个维表(产品,客户,时间,及渠道)与名为销售的事实表关联。
077 The main difference between external tables and regular tables is that externally organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE) are possible and no indexes can be created on them.
 
外部表与常规表的区别在于外部表是只读的。在外部表上不能执行 DML 操作(UPDATE/INSERT/DELETE),也不能在其上创建索引。
 
078 External tables are a complement to SQL*Loader and are especially useful for environments where the complete external source has to be joined with existing database objects and transformed in a complex manner, or where the external data volume is large and used only once. SQL*Loader, on the other hand, might still be the better choice for loading of data where additional indexing of the staging table is necessary. This is true for operations where the data is used in independent complex transformations or the data is only partially used in further processing.
 
外部表可以作为 SQL*Loader 的补充,当存在以下情况,例如全部外部数据需要与已有数据库对象结合并进行复杂地转换,或外部数据的数据量很大但只需被使用一次,使用外部表就能很好地解决问题。如果必须在中转表上创建索引,那么还是应使用 SQL*Loader 加载数据。如果外部数据自身需要进行复杂地转换,或外部数据中的某些部分数据需要进行进一步处理,也应先将数据加载到数据库中。
 
079

Table Compression

16.2.4 表压缩

080 You can save disk space by compressing heap-organized tables. A typical type of heap-organized table you should consider for table compression is partitioned tables.
 
用户可以通过压缩堆表(heap-organized table)来节省磁盘空间。分区表(partitioned table)就是一种典型的可以考虑采用表压缩(table compression)的堆表。
 
081 To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression can also speed up query execution. There is, however, a slight cost in CPU overhead.
 
为了减少磁盘使用及内存使用(尤其是数据库缓存(buffer cache)),用户可以将数据库内的表及分区表以压缩形式存储。这有助于提高只读操作的性能。表压缩能够提高查询执行的速度,但会略微增加 CPU 开销。
 
082 Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.
 
当数据表内存在大量冗余数据时适合使用表压缩,例如表内存储了大量外键时。用户应该避免对可能存在大量 DML 操作的表进行压缩。尽管被压缩的表或分区是可更新的,但是更新时存在额外开销,对更新操作较多的表进行压缩可能反而会导致空间浪费。
 
083

See Also:

"Table Compression"

另见:

表压缩
084

Change Data Capture

16.2.5 变化数据捕捉

085 Change Data Capture efficiently identifies and captures data that has been added to, updated, or removed from Oracle relational tables, and makes the change data available for use by applications.
 
变化数据捕捉(Change Data Capture)功能能够识别并捕捉到 Oracle 关系数据表内新增,更新,及被删除的数据,并使这些数据可以被其他应用所使用。
 
086 Oftentimes, data warehousing involves the extraction and transportation of relational data from one or more source databases into the data warehouse for analysis. Change Data Capture quickly identifies and processes only the data that has changed, not entire tables, and makes the change data available for further use.
 
通常,数据仓库需要从一个或多个源数据库中抽取数据,再传输到数据仓库中供分析使用。采用变化数据捕捉功能后,用户能够迅速地识别并处理修改过(而非整个表)的数据以供分析 使用。
 
087 Change Data Capture does not depend on intermediate flat files to stage the data outside of the relational database. It captures the change data resulting from INSERT, UPDATE, and DELETE operations made to user tables. The change data is then stored in a database object called a change table, and the change data is made available to applications in a controlled way.
 
变化数据捕捉功能无需依赖关系数据库外部的中间文件来中转数据。此功能根据对数据表的 INSERTUPDATE,及 DELETE 操作来捕捉变化的数据。变化数据将存储在被称为变化表(change table)的数据库对象,这些数据能够根据用户的控制而提供给有需要的应用程序。
 
088

See Also:

Oracle Database Data Warehousing Guide

另见:

Oracle Database Data Warehousing Guide
089

Overview of Materialized Views for Data Warehouses

16.3 在数据仓库中使用物化视图

090 One technique employed in data warehouses to improve performance is the creation of summaries. Summaries are special kinds of aggregate views that improve query execution times by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. For example, you can create a table to contain the sums of sales by region and by product.
 
在数据仓库系统中,可以通过创建汇总数据(summary)来提升其性能。汇总数据是一种聚合视图(aggregate view),其中可以保存预先计算的复杂关联,聚合,或排序,从而提高相关查询的执行性能。例如,用户可以创建一个表,保存按区域及产品汇总的销售数据。
 
091 The summaries or aggregates that are referred to in this book and in literature on data warehousing are created in Oracle using a schema object called a materialized view. Materialized views can perform a number of roles, such as improving query performance or providing replicated data.
 
在本章内或 Oracle 数据仓库系统中,汇总数据或聚合数据(aggregate)通常是通过物化视图(materialized view) 这种模式对象(schema object)创建的。物化视图的应用有多种,例如提高查询性能或进行数据复制。
 
092 Previously, organizations using summaries spent a significant amount of time and effort creating summaries manually, identifying which summaries to create, indexing the summaries, updating them, and advising their users on which ones to use. Summary management eased the workload of the database administrator and meant that the user no longer needed to be aware of the summaries that had been defined. The database administrator creates one or more materialized views, which are the equivalent of a summary. The end user queries the tables and views at the detail data level.
 
在物化视图出现之前,开发者如果需要使用汇总数据需要花费大量时间进行手工管理,例如需要确定创建哪些汇总数据,对汇总数据进行索引,更新汇总数据,并将汇总数据的信息告知有需求的用户。而(采用物化视图的)汇总管理(summary management)既减轻了数据库管理员的工作,同时使用户无需关心与汇总数据有关的细节。管理员只需要创建起到汇总数据作用的物化视图。而用户则直接在明细级查询数据表或视图。
 
093 The query rewrite mechanism in the Oracle database server automatically rewrites the SQL query to use the summary tables. This mechanism reduces response time for returning results from the query. Materialized views within the data warehouse are transparent to the end user or to the database application.
 
Oracle 数据库的查询重写(query rewrite)机制能够自动地重写查询 SQL 以使用汇总表。此机制减少了查询返回结果的响应时间。数据仓库内的物化视图对终端用户及数据库应用程序是透明的。
 
094 Although materialized views are usually accessed through the query rewrite mechanism, an end user or database application can construct queries that directly access the summaries. However, serious consideration should be given to whether users should be allowed to do this because any change to the summaries will affect the queries that reference them.
 
物化视图通常是通过查询重写机制访问的,但终端用户及数据库应用程序也可以直接对其访问。但是管理员应该仔细限定是否让终端用户直接访问物化视图,因为对物化视图的任何修改都将影响引用她们的所有查询。
 
095 To help you select from among the many possible materialized views in your schema, Oracle provides a collection of materialized view analysis and advisor functions and procedures in the DBMS_ADVISOR package. Collectively, these functions are called the SQL Access Advisor, and they are callable from any PL/SQL program. The SQL Access Advisor recommends materialized views from a hypothetical or user-defined workload or one obtained from the SQL cache. You can run the SQL Access Advisor from Oracle Enterprise Manager or by invoking the DBMS_ADVISOR package.
 
为了帮助用户选择可用的物化视图,Oracle 在 DBMS_ADVISOR 包中提供了一系列用于分析及建议物化视图的函数与过程。上述函数与过程被统称为 SQL Access Advisor(SQL 访问建议器),可以在 PL/SQL 程序中调用。SQL Access Advisor 能够根据假设的负载,用户定义的负载,或从 SQL 缓存(SQL cache)中获得的负载来建议如何使用物化视图。用户可以在 Oracle Enterprise Manager 里运行 SQL Access Advisor,也可以通过 DBMS_ADVISOR 包进行调用。
 
096
See Also:

Oracle Database Performance Tuning Guide for information about materialized views and the SQL Access Advisor
另见:

Oracle Database Performance Tuning Guide 了解更更多关于物化视图及 SQL Access Advisor 的信息
097

Overview of Bitmap Indexes in Data Warehousing

16.4 在数据仓库中使用位图索引

098 Bitmap indexes are widely used in data warehousing environments. The environments typically have large amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:
  • Reduced response time for large classes of ad hoc queries
  • Reduced storage requirements compared to other indexing techniques
  • Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory
  • Efficient maintenance during parallel DML and loads
数据仓库应用(data warehousing application)的特点是数据量巨大,执行的多为即席查询(ad hoc query),且并发 DML 事务较少。这种环境下使用位图索引(bitmap index)具备如下优势:
  • 能够减少大数据量即席查询的响应时间
  • 与其他索引技术相比能够节省大量存储空间
  • 即使硬件配置较低也能显著提高性能
  • 有利于并行 DML(parallel DML)和并行加载(parallel load)
099 Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the indexed data in the table.
 
为一个大表建立传统的 B-tree index(平衡树索引)可能占用极大的存储空间,索引有可能比数据表还要大数倍。而一个位图索引(bitmap index)所占的空间比被索引数据还要小得多。
 
100 An index provides pointers to the rows in a table that contain a given key value. A regular index stores a list of rowids for each key corresponding to the rows with that key value. In a bitmap index, a bitmap for each key value replaces a list of rowids.
 
索引的目标是为用户提供指向包含特定键值(key value)的数据行的指针。在常规的索引中,Oracle 将各行的键值及与此键值对应的一组 ROWID 存储在一起,从而实现了上述目标。而在位图索引(bitmap index)中,只需为每个键值存储一个位图(bitmap),而非一组 ROWID。
 
101 Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so that the bitmap index provides the same functionality as a regular index. If the number of different key values is small, bitmap indexes save space.
 
位图(bitmap)中的每一位(bit)对应一个可能的 ROWID。如果某一位被置位(set),则表明着与此位对应的 ROWID 所指向的行中包含此此位图对应的键值(key value)。Oracle 通过一个映射函数(mapping function)将位信息转化为实际的 ROWID,因此位图索引同样能实现常规索引的功能。当不同值的索引键的数量较少时,位图索引的存储效率相当高。
 
102 Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically. A good candidate for a bitmap index would be a gender column due to the low number of possible values.
 
如果一个查询的 WHERE 子句中包含多个条件,在查询中利用位图索引能够极大地提高效率。不满足全部条件的数据行可以在访问表数据之前被预先过滤掉。这通常会显著地缩短响应时间。 在类似于性别的数据列上适合创建位图索引,因为这样的列中可能的键值的数量较少。
 
103 Parallel query and parallel DML work with bitmap indexes as they do with traditional indexes. Bitmap indexing also supports parallel create indexes and concatenated indexes.
 
位图索引(bitmap index)如同常规索引一样,可以结合并行查询(parallel query)和并行 DML(parallel DML)一起工作。Oracle 还支持并行地创建位图索引,以及创建复合位图索引。
 
104
See Also:

Oracle Database Data Warehousing Guide
另见:

Oracle Database Data Warehousing Guide
105

Overview of Parallel Execution

16.5 并行执行概述

106 When Oracle runs SQL statements in parallel, multiple processes work together simultaneously to run a single SQL statement. By dividing the work necessary to run a statement among multiple processes, Oracle can run the statement more quickly than if only a single process ran it. This is called parallel execution or parallel processing.
 
当 Oracle 并行地(parallel)执行 SQL 语句时,存在多个进程同时工作共同执行同一个 SQL 语句。将一个语句的工作负载分配给多个进程,与单进程相比,同样的语句能够被更快地执行。这被称为并行执行(parallel execution)或并行处理(parallel processing)。
 
107 Parallel execution dramatically reduces response time for data-intensive operations on large databases typically associated with decision support systems (DSS) and data warehouses. Symmetric multiprocessing (SMP), clustered systems, and large-scale cluster systems gain the largest performance benefits from parallel execution because statement processing can be split up among many CPUs on a single Oracle system. You can also implement parallel execution on certain types of online transaction processing (OLTP) and hybrid systems.
 
对于大型数据库中的大数据量操作,并行执行能够显著地缩短响应时间,这个特点在决策支持系统(decision support systems,DDS)及数据仓库系统中表现的尤为突出。对称多处理系统(Symmetric multiprocessing,SMP),集群系统(clustered system),及大型集群系统(large-scale clustered system)能够从并行执行中获得最大的性能提升,因为在这样的系统平台上,一个 Oracle 系统的语句处理工作可以被分配到多个 CPU 上。用户也可以在某些联机事务处理系统(online transaction processing,OLTP)或混合系统(hybrid system)上采用并行执行。
 
108 Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do part of the work at the same time. An example of this is when 12 processes handle 12 different months in a year instead of one process handling all 12 months by itself. The improvement in performance can be quite high.
 
并行(parallelism)指将一个工作分解,多个进程同时执行查询工作的一部分,而非一个进程完成全部工作。例如,12 个进程分别处理 1 年 12 个月的数据,而非 1 个进程独自处理 12 个月的数据。并行带来的性能提升十分显著。
 
109 Parallel execution helps systems scale in performance by making optimal use of hardware resources. If your system's CPUs and disk controllers are already heavily loaded, you need to alleviate the system's load or increase these hardware resources before using parallel execution to improve performance.
 
并行执行能够优化硬件资源的使用,从而使系统性能具备可伸缩性。如果系统的 CPU 或磁盘控制器(disk controller)上的负载已经很重,用户必须首先减轻负载或增减硬件资源,才能使用并行执行来提升系统性能。
 
110 Some tasks are not well-suited for parallel execution. For example, many OLTP operations are relatively fast, completing in mere seconds or fractions of seconds, and the overhead of utilizing parallel execution would be large, relative to the overall execution time.
 
有些工作并不适合并行执行。例如,有些 OLTP 操作执行速度较快,能购在数秒甚至不到一秒的时间内完成,此时采用并行执行的额外开销与总执行时间相比就比较大。
 
111
See Also:

Oracle Database Data Warehousing Guide for specific information on tuning your parameter files and database to take full advantage of parallel execution
另见:

Oracle Database Data Warehousing Guide 了解如何调整参数文件及数据库,以便发挥并行执行的全部优势
112

How Parallel Execution Works

16.5.1 并行执行是如何工作的

113 When parallel execution is not used, a single server process performs all necessary processing for the sequential execution of a SQL statement. For example, to perform a full table scan (such as SELECT * FROM emp), one process performs the entire operation, as illustrated in Figure 16-5.
 
在未使用并行执行(parallel execution)时,只有一个服务进程(server process)完成了一个 SQL 语句的顺序执行所需的全部处理工作。例如,进行全表扫描(SQL 语句可能为 SELECT * FROM emp)时,一个进程完成了所有的工作,如 图 16-5 所示。
 
114 Figure 16-5 Serial Full Table Scan
 
图 16-5 串行全表扫描
 
115


 


 

116 Figure 16-6 illustrates several parallel execution servers performing a scan of the table emp. The table is divided dynamically (dynamic partitioning) into load units called granules and each granule is read by a single parallel execution server. The granules are generated by the coordinator. Each granule is a range of physical blocks of the table emp. The mapping of granules to execution servers is not static, but is determined at execution time. When an execution server finishes reading the rows of the table emp corresponding to a granule, it gets another granule from the coordinator if there are any granules remaining. This continues until all granules are exhausted, in other words, until the entire table emp has been read. The parallel execution servers send results back to the parallel execution coordinator, which assembles the pieces into the desired full table scan.
 
图 16-6 显示了多个并行执行服务器(parallel execution server)对表 emp 进行扫描。表被动态地划分(即动态分区(dynamic partitioning))为称作颗粒(granule)的负载单位(load unit),每个颗粒由一个并行执行服务器负责读取。颗粒是由并行执行协调器(coordinator)划分的。每个颗粒都是 emp 表的一组物理数据块。颗粒与并行执行服务器之间的对应关系并不是静态的,而是在执行时决定的。当一个并行执行服务器完成了对 emp 表某一颗粒的读取后,能够从并行执行协调器获得剩余的颗粒继续工作。这个过程将持续直至所有颗粒全部完成,即整个 emp 表被读取完毕。每个并行执行服务器都将结果返回给并行执行协调器,后者将所有结果汇总为全表扫描的结果。
 
117 Figure 16-6 Parallel Full Table Scan
 
图 16-6 并行全表扫描
 
118


 


 

119 Given a query plan for a SQL query, the parallel execution coordinator breaks down each operator in a SQL query into parallel pieces, runs them in the right order as specified in the query, and then integrates the partial results produced by the parallel execution servers executing the operators. The number of parallel execution servers assigned to a single operation is the degree of parallelism (DOP) for an operation. Multiple operations within the same SQL statement all have the same degree of parallelism.
 
通过观察并行执行的 SQL 查询的执行计划(query plan),可以发现并行查询协调器将 SQL 查询的每个操作都进行了分割,并按查询中指定的顺序执行,最后将所有并行执行服务器的执行结果合并。为一个操作所分配的并行执行服务器的数量被称为此操作的并行度(degree of parallelism,DOP)。同一 SQL 语句内的所有操作的并行度均相同。
 
120
See Also:

Oracle Database Data Warehousing Guide for information on granules as well as how Oracle divides work and handles DOP in multiuser environments
另见:

Oracle Database Data Warehousing Guide 了解关于颗粒的信息,以及 Oracle 如何分割工作,如何在多用户环境中处理 DOP
121

Overview of Analytic SQL

16.6 分析型 SQL 概述

122 Oracle has introduced many SQL operations for performing analytic operations in the database. These operations include ranking, moving averages, cumulative sums, ratio-to-reports, and period-over-period comparisons. Although some of these calculations were previously possible using SQL, this syntax offers much better performance.
 
Oracle 提供了多种 SQL 功能,用于在数据库中进行分析操作。这些功能包括排名(ranking),移动平均值(moving average),累积求和(cumulative sum),百分比(ratio-to-report),以及同期比较(period-over-period comparison)等。尽管有些计算以前的 SQL 也能实现,但新语法能够实现更好的性能。
 
123 This section discusses: 本节将讨论以下内容:
124

SQL for Aggregation

16.6.1 用于聚合运算的 SQL

125 Aggregation is a fundamental part of data warehousing. To improve aggregation performance in your warehouse, Oracle provides extensions to the GROUP BY clause to make querying and reporting easier and faster. Some of these extensions enable you to:
  • Aggregate at increasing levels of aggregation, from the most detailed up to a grand total
  • Calculate all possible combinations of aggregations with a single statement
  • Generate the information needed in cross-tabulation reports with a single query
聚合运算(aggregation)是数据仓库系统的基本功能。为了提升数据仓库中聚合运算的性能,Oracle 对 GROUP BY 字句进行了扩展,使查询及报表生成更简单更迅速。这些扩展使用户可以进行以下操作:
  • 从最明细的数据开始逐层汇总,即可以在低层汇总的基础上再次进行汇总
  • 在一个 SQL 语句中计算所有可能组合的汇总
  • 使用一个 SQL 语句获得旋转表(cross-tabulation reports)所需的数据
126 These extension let you specify exactly the groupings of interest in the GROUP BY clause. This allows efficient analysis across multiple dimensions without performing a CUBE operation. Computing a full cube creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance. CUBE, ROLLUP, and grouping sets produce a single result set that is equivalent to a UNION ALL of differently grouped rows.
 
这些扩展使用户可以利用 GROUP BY 子句获得其感兴趣的汇总结果。用户无需使用 CUBE 操作就能够高效地进行多维分析(analysis across multiple dimensions)。完全计算一个立方体(cube)将带来很重的系统负载,因此以聚合运算的结果集来代替立方体能够显著地提升系统性能。使用 CUBEROLLUP,及聚合运算都能得到一个单一的结果集,其效果与使用 UNION ALL 语句将多个聚合运算得到的数据行进行合并后相同。
 
127 To enhance performance, these extensions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance, and scalability.
 
为了进一步提升系统性能,上述扩展操作可以被并行执行:即多个进程同时执行语句。这使聚合运算更为高效,也提高了数据库的性能及可伸缩性(scalability)。
 
128 One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.
 
多维分析(multidimensional analysis)是决策支持系统(decision support system)中的一个关键概念:其含义是以各种可能的维度(dimension)的组合来观察一个企业。此处使用维度这个术语来代指分析中所使用的不同角度。常用的维度有时间(time),地域(geography),产品(product),部门(department),及渠道(distribution channel)等,但用户可以根据企业自身行为特点总结出各种维度。与一组维度值(imension value)相关的事件(event)或实体(entity)通常被称为事实(fact)。常见的事实有销售数量及金额,利润,客户数量,产量等,任何值得观察的信息都可以作为事实。
 
129 Here are some examples of multidimensional requests:
  • Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000.
  • Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals.
  • List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commissions.
以下是一些多维查询的例子:
  • 按产品维度及各层次地域维度(从州到国家再到地区)对 1999 年及 2000 年的销售进行聚合运算。
  • 创建旋转表,以南美洲各个国家为角度分析 1999 年及 2000 年企业运行的成本。包括所有子汇总(subtotal)。
  • 列出 2000 年创造汽车产品销售利润前 10 名的销售代表,并按佣金排序。
130 All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
 
上述查询都涉及多个维度。大多数多维查询都会按时间,地域等维度汇总数据,并对汇总结果进行比较。
 
131
See Also:

Oracle Database Data Warehousing Guide
另见:

Oracle Database Data Warehousing Guide
132

SQL for Analysis

16.6.2 用于分析运算的 SQL

133 Oracle has advanced SQL analytical processing capabilities using a family of analytic SQL functions. These analytic functions enable you to calculate:
  • Rankings and percentiles
  • Moving window calculations
  • Lag/lead analysis
  • First/last analysis
  • Linear regression statistics
Oracle 内含有一套分析型 SQL 函数(analytic SQL function),提供了高级 SQL 分析处理