CHAPTER 1-DATA MODELING CONCEPTS
The Role of Data Modeling
Data modeling tasks provide the most benefit when performed early in the development lifecycle. The model provides information critical to understanding the scope of a project for iterative development phases. Beginning the implementation phase without a clear understanding of the data requirements might cause your project to incur costly overruns or end up on the scrap heap.
An Introduction to Project Development
Many publications discuss project development, and this text does not cover this subject in detail. I included this section to assist modelers in understanding the role of data modeling in project development and to provide an understanding of when modeling should occur.
Most companies follow a methodology that outlines the development lifecycle selected to guide the development process. To some degree, most adhere to the same order of high-level concepts:
1. Problem definition
2. Requirements analysis
3. Conceptual design
4. Detail design
5. Implementation
6. Testing
This development method is generally referred to as the waterfall method. As you can see in Figure 1.1, each phase is completed before moving to the next, creating a “waterfall” effect.
Figure 1.1
The waterfall method of project development. Note that the results of each phase cascade into the next.
Many projects are developed using iterations or phases. An iterative development approach decreases risk by breaking the project into discrete manageable phases. Each phase includes analysis, detail design, implementation, and testing. Subsequent phases build upon and leverage the functionality of the preceding phase. However, within each phase, the waterfall method applies.
As with most engineering projects, you create a data model by following a set of steps.
1. Problem and scope definition
2. Requirements gathering
3. Analysis
4. Logical data model creation
5. Physical data model creation
6. Database creation
Figure 1.2 illustrates how each step provides input for the next.
Figure 1.2
Logical data model creation can occur prior to selecting a database platform (Oracle, DB2, Sybase, and so on). ERwin can provide support for specific physical properties if the physical data model is produced after the database platform is selected.
Problem and Scope Definition
Begin logical data modeling by defining the problem. This step is sometimes rederred to as writing a mission or scoping statement. The problem definition can be a simple pragraph or it can be a complex document that outlines a series of business objectives. The problem definition defines the scope, or boundary, of the data model, much the way a survey defines property boundaries.
Gathering Information Requirements
Most industry experts agree that the most critical task in a development project is an accurate and complete definition of the requirements. In fact, an incomplete or inaccurate understanding of requirements can cause expensive re-work and significant delay.
Gathering information requirements is the act of discovering and documenting the information necessary to identify and define the entities, attributes, and business rules for the logical model. There are two well-recognized methods for gathering requirements: facilitated sessions and interviews. Most development methodologies recommend facilitated sessions. The sections that follow provide high-level guidelines for gathering information requirements using facilitated sessions. A later exercise demonstrates how to use the information gathered to create a data model using ERwin.
Analysis
You must analyze and research the data requirements and business rules to produce a complete logical model. Analysis tasks should provide accurate and complete definitions for all entities, attributes, and relationships. Metadata, data about the data, is collected and documented during the analysis phase.
The analysis can be performed by the modeler or by a business analyst. Either the model or the business analyst works with users to document how users intend to use the data. These tasks drive out the corporate business objects needed to support the information requirements. Corporate business objects are also called code, reference, or classification data structures. This is also the opportunity to document code values that will be used. You should carefully document any derived data, data that is created by manipulating or combining one or more other data elements, and data elements used in the derivation.
Logical Data Model
A logical data model is a visual representation of data structures, data attributes, and business rules. The logical model represents data in a way that can be easily understood by business users. The logical model design should be independent of platform or implementation language requirements or how the data will be used.
The modeler uses the data requirements and the results of analysis to produce the logical data model. The modeler also resolves the logical model to third normal form and validates against the enterprise data model, if available. Later sections provide a description of a complete logical model, resolving a logical model to third normal form, an overview of an enterprise model, and provide some tips on validating a logical model against an enterprise model.
After you compare the logical model and enterprise data model and make any necessary changes, it is important to review the model for accuracy and completeness. The best practice includes a peer review as well as a review with the business partners and development team.
Entities
Entities represent the things about which the enterprise is interested in keeping data. An entity can be a tangible object such as a person or a book, but it can also be conceptual such as a cost center or business unit. Entities are nouns and are expressed in singular form, CUSTOMER as opposed to CUSTOMER, for clarity and consistency.
You should describe an entity using factual particulars that make it uniquely identifiable. Each instance of an entity must be separate and cleary identifiable from all other instances of that entity. For example, a data model to store information about customers must have a way of distinguishing one customer from another.
Figure 1.3 provides some examples of entities.
Figure 1.3
Here are examples of using ERwin to display entities in their simplest form.
Attributes
Attributes represent the data the enterprise is interested in keeping about objects. Attributes are nouns that describe the characteristics of entities.
Relationships
Relationships represent the associations between the objects about which the enterprise is interested in keeping data. A relationship is expressed as a verb or verb phrase that describes the association. Figure 1.5 provides some examples using ERwin’s Information Engineering (IE) notation to represent relationships.
Normalization
Normalization is the act of moving attributes to appropriate entities to satisfy the normal forms. Normalization is usually presented as a set of complex statements that make it seem a complicated concept. Actually, normalization is quite straightforward: “One fact in one place,” as stated by C.J.Date in his 1999 books An Introduction to Database Systems. Normalizing data means you design the data structures in such a way as to remove redundancy and limit unrelated structures.
Five normal forms are widely accepted in the industry. The forms are simply named first normal form, second normal form, third normal form, fourth normal form, and fifth normal form. In practice, many logical models are only resolved to third normal form.
Formal Definitions of Normal Forms
The following normal form definitions might seem intimidating; just consider them formulas for achieving normalization. Normal forms are based on relational algebra and should be interpreted as mathematical functions.
Business Normal Forms
In his 1992 book, Strategic Systems Development, Clive Finklestein takes a different approach to normalization. He defines business normal forms in terms of the resolution to those forms. Many modelers, myself include, find this business approach more intuitive and practical.
First business normal form (1BNF) removes repeating groups to another entity. This entity takes its name, and primary (compound) key attributes, from the original entity and forms the repeating group.
Second business normal form (2BNF) removes attributes that are partially dependent on the primary key to another entity. The primary (compound) key of this entity is the primary key of the entity in which it originally resided, together with all additional keys on which the attribute is wholly dependent.
Third business normal form (3BNF) removes attributes that are not dependent at all on the primary key to another entity where they are wholly dependent on the primary key of that entity.
Fourth business normal form (4BNF) removes attributes that are dependent on the values of the primary key or that are optional to a secondary entity where they wholly depend on the value of the primary key or where they must (it is mandatory) exist in that entity.
Fifth business normal form (BNF) exists as a structure entity if recursive or other associations exist between occurrences of secondary entities or if recursive associations exist between occurrences of their principal entity.
A Complete Logical Data Model
A complete logical model should be in third business normal form and include all entities, attributes, and relational ships required to support the data requirements and the business rules associated with the data.