Overview of the Entity Relationship Approach
Background and Origins
Almost all data modeling as practiced today uses some form or variation of the entity-relationship analysis. Almost all CASE tools are either based on Entity-Relationship models, or support the creation of Entity-Relationship models. The various forms of Information Engineering all use some form of the Entity-Relationship data model for their data side activities. The Entity-Relationship (ER) Approach consists of both an analytical and design method and a modeling technique.
The ER Approach was first described by Dr. Peter Chen in the March 1976 issue of the ACM publication Transactions On Database Systems (Volume 1, Number 1). Since that time it has evolved into one of the most important tools in the data analysis and systems design tool kit. Today there are many practitioners of entity-relationship analysis and modeling, and many books and articles on design methodology and technique routinely include at least some mention of it.
Prior to the introduction and use of this approach, most data analysis activities focused either on determining data requirements as a by-product of analyzing the processes to be performed, or on data elements presumed to be needed by the user. Some concentrated on trying to fit lists of data elements into one of the data structure models which can be implemented by a DBMS, others on designing from reports, screens and files, and still others on following trails of transactions through their various processing stages. From these processes, flows, data elements and/or outputs they attempted to recreate the real world. Many attempted to recreate the processes from the desired results.
Today many designers and data analysts attempt to combine both old and new by developing record mode entity-relationship models and then switching to older analysis methods to populate those models with data elements derived from lists of elements gleaned from their analysis documentation and from user requirements interviews. This approach not only results in ineffective data design choices and decisions, but fails to use the full power of the entity-relationship approach.
What contributes to the notable lack of success in following this path is the failure to realize that concentrating on record structures, and following formula rules for the placement of data elements in records, ignores the inherent meaning of data as a means of describing things and of recording information about things. Without a proper understanding of those things themselves, neither the system designer nor the data analyst can make reasoned decisions as to what data is needed about those things.
In the business environment, examining only transactions, or processes, or outputs, or data flows, or even a combination of all four, produces a picture which is correct as far as it goes, but which is neither true nor complete. Business environments are populated by people, using things, and both people and things are located in places. Any business description must not only include these people, places, and things, but it must also start with them. These people, places and things are called entities.
These people, either individually or in groups, work with things, or provide services, for other people. Since both the people and the things are real (they physically exist) they can be described, and they must be located somewhere (in some place). Additionally, relationships which exist between people and things, people and places, things and places, and between different types of things, different types of places, and different types of people themselves must be described.
Processes are the actions which people entities (or their mechanical or electromechanical surrogates) perform with other people entities and with other thing entities, in place entities. Transactions and reports are mechanisms for recording those processes, or for communicating between entities, and data flows are the paths that these transactions, reports etc., take between and among the people entities and the location entities where such records are stored.
These entities may be well defined, in that the firm may know a great number of things about them, or vaguely defined, in that the firm may know very little about them. In some cases, such as with either prospective customers or employees, the firm may only know or suspect that they exist, but not who they are, or where they are.
These entities may exist in large homogeneous groups where all members are capable of being described in the same manner, or they may be fragmented into many different subtypes, each with descriptions which are either slightly different, or in some cases radically different, from the other members of the same group.
The relationships that exist between these entities are real. And as with the entities these relationships themselves may be well defined, in that the firm may know a great number of things about them, or vaguely defined, in that the firm may know very little about them, again as little as knowledge or suspicion of their existence.
The power of the ER Approach lies in its ability to focus on describing these entities of the real world of the business, and the relationships between them. By describing these real world entities through the identification and definition and assignment of attributes to them, and their relationships, the designer is describing how the business should operate.
Although the business itself may change, sometimes dramatically, these types of changes occur much less frequently than changes in the routine processes and activities. Regardless of the business changes, the entities of the business rarely change. What may change however, is the firm's perception of which attributes of those entities are currently of interest. Some relationships between these business entities may also change but even these relationship changes occur infrequently. Thus by understanding and properly describing these entities and the relationships between them, the designer can form a very stable foundation for understanding and redesigning the business itself, and for properly recording the results of, or changes caused by, the processes of the business.
Constraints on the ER Approach
As with any analytical method, the effectiveness of the ER Approach is limited, or constrained, by three factors, all of which have to do with the designer's understanding of the business environment. These closely related factors are:
Entity identification and definition consists of recognizing the various entities, determining why they are of interest to the firm, and naming them. The identification and definition process must specify the entity at the exact level of precision which ensures that it is not so general as to be meaningless, and yet not so specific that it fragments into too many subsets. For example:
Entity description consists of identifying which attributes of the identified entities are needed by the firm, and why those attributes are of interest. For example, is the firm interested in the attribute "hobbies" or "clothing sizes" for the employees? If the firm is a sporting goods firm, the answer to the former might be yes. If on the other hand the firm provides uniforms for its employees, the answer to the latter might be yes.
Business context involves identifying and defining the relationships which exist between the identified and defined entities, and their relative importance to the firm as a whole, and to each specific part of the firm. Business context also involves identifying and defining the use or role of each of the entities within the firm. An entity's appearance, role or use in one firm may be entirely different in another firm, and yet the entity itself is the same.
Just as an entity may have different roles or uses between firms, so also, each part of the firm may have a different perspective on the business, and each part of the firm may have a different perspective on the entities of the firm. This perspective does not change the fact of the entity's existence, only the attributes and relationships of those entities which are of interest to that portion of the firm and their role or use in that firm.
The specific description of these entities and their relationships with other entities within the firm is relevant only within the context of that firm and is totally dependent upon the attributes of the entities which are of interest to the firm. An entity within one firm may be only an attribute of an entity within another firm, and vice versa.
Entities, Relationships and Attributes
The importance of identification and definition, description and context can be seen when one looks at the formal definitions of the three key elements (figure 8-1) which form the heart of the ER Approach. These definitions form the basis for both the data analysis method and the data modeling technique of the Entity Relationship Approach.
An attribute must be capable of being defined in terms of words or numbers. That is, the attribute must have one or more data elements associated with it. An attribute may be the name of the entity or relationship. It may describe what the entity looks like, where it is located, how old it is, how much it weighs, etc. An attribute may describe why a relationship exists, how long it has existed, how long it will exist, or under what conditions it exists.
It is important to note at this point that relationships exist only between entities, not between attributes of entities.
To illustrate: the entity "person" could be anyone.
When the attributes name, age and sex are added, we can distinguish men from women, adults from children, and one person from another.
When the relationships married to, parent of, child of, member of, and works for are added, we know whether we are talking about a group of unrelated people, a family, or a corporation.
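The illustration above can be sketched in code. What follows is only a minimal sketch and is not part of the ER Approach itself: the class names Entity and Relationship, the helper functions, the sample people and the age threshold are all hypothetical, chosen only to show that attributes describe an entity while relationships connect one entity to another.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A thing of interest; without attributes we know only that it exists."""
    kind: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A named association between two entities, never between attributes."""
    name: str
    source: Entity
    target: Entity


# Hypothetical sample data: three "person" entities.
ann = Entity("person", {"name": "Ann", "age": 41, "sex": "F"})
bob = Entity("person", {"name": "Bob", "age": 43, "sex": "M"})
cy = Entity("person", {"name": "Cy", "age": 9, "sex": "M"})

# Adding relationships tells us this group of people is a family.
relationships = [
    Relationship("married to", ann, bob),
    Relationship("parent of", ann, cy),
    Relationship("parent of", bob, cy),
]


def is_adult(person: Entity, threshold: int = 18) -> bool:
    """Use the age attribute to tell adults from children (threshold is assumed)."""
    return person.attributes["age"] >= threshold


def related(name: str, source: Entity) -> list:
    """Names of the entities that `source` reaches through the named relationship."""
    return [r.target.attributes["name"] for r in relationships
            if r.name == name and r.source is source]
```

In this sketch, related("parent of", ann) follows a relationship from one entity to another, whereas is_adult(ann) looks only at an attribute of a single entity.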
To describe the entity, we must describe it in terms of its attributes and its relationships with other entities. An entity description consists of a series of statements which complete a phrase such as "the entity is...", "the entity has...", "the entity contains...", or "the entity does...." Each attribute relates to the entity in hierarchic terms, that is, all attributes of the entity are fully dependent upon the entity itself because individually and together they are the entity.
The question can still be asked however, "How can we begin to identify these entities?" Is, for example, the entity identified as Customer (representing all customers), or is it the specific types of customer (such as mail order, or retail) or is it a single customer? The answer is that it can be all of these, none of these, or more than these.
The specific identification and definition of the entity has meaning only within the context of that firm. However, most businesses can be described using a fairly restricted set of generic entity types such as Customer, Product, Machine, Employee, Location, Organizational Unit, etc.
An entity (figure 8-2) is whatever the business defines it to be, and that definition must make sense within the context of the firm. Thus, an entity in one firm may be a subset of entities included in the entity definition of another firm, or may be the global definition of the entity used within another firm.
These differences in identification and definition can be illustrated by the following example:
A town planning board, with responsibility for community planning and zoning, would describe that community in terms of each of its buildings, and further subdefine those buildings into residential, office, stores, warehouses, and factories.
The board might be interested in which people or which firms occupied or owned those buildings, but for their purposes that information would be an attribute of the building, just as the size of the building, the number of floors, the number of windows and doors, and the cost of the building were attributes.
On the other hand, the local Chamber of Commerce, doing a census or community directory, would be interested in the people and the firms who lived, worked or were located in the community. In that case they would be interested in the names of the people, their incomes, length of residence, amount of taxes paid, and where within the community they lived or were located (the buildings). Here the buildings become attributes of the people.
Neither the buildings nor the people have changed. Both still exist, physically unchanged. The perspective, however, has changed, and the things of interest about those buildings and people have changed.
The town council, by contrast, would need to know all the information about both the people and the buildings, along with information about roads, utilities, etc. In this case, both the buildings and the people become entities in their own right, along with the relationships between them (who lives or works where, who owns what, etc.).
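The three perspectives in this example can be sketched as three views over the same underlying facts. This is a minimal sketch under stated assumptions: the occupancy data, the function names and the dictionary layouts are hypothetical and serve only to show that the facts do not change, only what is treated as an entity and what as an attribute.

```python
# Hypothetical facts about one small community: (person, building, building type).
occupancy = [
    ("Ann", "12 Main St", "residential"),
    ("Bob", "12 Main St", "residential"),
    ("Cy", "40 Mill Rd", "factory"),
]


def planning_board_view(facts):
    """Buildings are the entities; occupants are an attribute of each building."""
    buildings = {}
    for person, building, building_type in facts:
        entry = buildings.setdefault(
            building, {"type": building_type, "occupants": []})
        entry["occupants"].append(person)
    return buildings


def chamber_view(facts):
    """People are the entities; the building is an attribute of each person."""
    return {person: {"located_at": building} for person, building, _ in facts}


def council_view(facts):
    """People and buildings are both entities, linked by an explicit relationship."""
    people = sorted({person for person, _, _ in facts})
    buildings = sorted({building for _, building, _ in facts})
    occupies = [(person, building) for person, building, _ in facts]
    return people, buildings, occupies
```

Only the third view keeps the relationship as something that can be described in its own right; the first two fold it into an attribute of whichever entity that body cares about.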
This need for both attributes and relationships is consistent with the accepted dictionary definition (figure 8-3) of an entity which defines it as "the fact of existence; being. The existence of something considered apart from its properties." Thus although the entity exists, its true form and role is only apparent after its attributes are added.
Figure 8-3 What is an Entity
Without attributes, all we know about the entity (figure 8-4) is that it exists. The distinction between the entity and its attributes, and the relationship between the entity and its attributes, is so important that the ER diagram distinguishes between them by using different symbols for each.
Figure 8-4 Definition of an Entity
The attributes of an entity could be contained in a single record, or they may require a large collection of records, each containing the data elements of a single attribute. Either way, an entity should not be equated to a record, a logical data record or a table. Records and logical data records are the means for storing related items of data in the data processing environment.
ER Models versus Data Structure Models
Records hold individual data elements or groups of data elements. Logical data records are representations as to how selected groups of record types are logically and physically connected. Table rows may also be viewed as records from a data processing viewpoint.
Logical data records, also called logical data models, fall into three main formats: hierarchic, network and relational. These traditional data models represent implementations, specifically Data Base Management System (DBMS) implementations, of logical data records.
Each one models a different view of the structure of data and in that light they are more properly data structure models. The data structure models are creatures of data processing. The real world when it considers data, does not look at data structures, but rather it looks at things (usually paper things) which contain data.
The Entity relationship model represents a conceptual view of the world, which incorporates all three data models, is independent of any DBMS or data processing considerations, and is a representation of the business environment. The ER model contains the major aspects of all three structural models.
Although we speak of entities as if they were singular, an entity is in reality that set of persons, places or things all of which have a common name, a common definition and a common set of descriptors (properties or attributes). This conforms to the relational model and is equivalent to placing all attributes in third normal form (3NF).
The entity representation in the model, while it may represent a single instance, usually represents numerous people, places or things all of whom have a common name and common descriptors and thus can be treated as a set, again in conformance with the relational model.
These entities interact (relate) with other entities. These interactions form a complex set of named, discrete relationships, as in the network model.
An entity, although it exists physically, only has physical substance when it is described in terms of what it looks like, where it is, what it does, and how it relates to other entities. Each component of that description is a property or an attribute of the entity. The sum of the properties is the entity. This association of attributes to entities, if diagrammed hierarchically, would appear as a flat (two level) structure, with the root being no more than an anchor which names and types, or subtypes, the entity. In a network diagram it would appear as a key-only owner record, with multiple set relationships. In relational form, the entity would be the name of the primary relation, and the attributes might be subsumed in that relation or might be separate secondary relations.
Since the entity relationship model incorporates all three data structure models, it can view data in a more complete and realistic manner. The ER model, although it can be translated quickly and easily into any or all of the data structure models, is not a data structure model, but instead seeks to identify and describe things and how they relate, rather than just data (used to describe those things) and how it can be stored.
The meld of the three data structure models within the ER model reflects the fact that each of these models captures a portion of the way in which those real world things actually occur.
These entities are physically real and their real properties can be described. These people perform actions, using and transforming both things and information (which is contained on things as data). The common characteristic of all entities is that they can all be described, and the medium we use for that description is words and numbers. These words and numbers collectively are data, and individually are data elements.
The fact that entities, especially in the data processing environment, are described by data does not make them data objects, nor is every collection of data elements an entity.
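The three renderings described above can be sketched side by side. This is a loose illustration only, not a DBMS implementation: the Customer entity, its key and attribute values, the function names and the dictionary shapes are all hypothetical, chosen to mirror the wording of the paragraph.

```python
# A hypothetical entity with a key and two attributes.
entity = {
    "name": "Customer",
    "key": "C-001",
    "attributes": {"name": "Acme Ltd", "city": "Leeds"},
}


def as_hierarchy(e):
    """Flat two-level tree: the root only names the entity; attributes are children."""
    return {"root": e["name"],
            "children": [{k: v} for k, v in e["attributes"].items()]}


def as_network(e):
    """Key-only owner record, with one set of member records per attribute."""
    return {"owner": {"key": e["key"]},
            "sets": {k: [v] for k, v in e["attributes"].items()}}


def as_relations(e, separate=()):
    """Primary relation named after the entity. Attributes listed in `separate`
    become secondary relations keyed by the entity key; the rest are subsumed
    in the primary relation."""
    primary = {"key": e["key"]}
    secondary = {}
    for k, v in e["attributes"].items():
        if k in separate:
            secondary[f"{e['name']}_{k}"] = [{"key": e["key"], k: v}]
        else:
            primary[k] = v
    return {e["name"]: [primary], **secondary}
```

Calling as_relations(entity, separate=("city",)) places the city in a secondary relation, while as_relations(entity) subsumes every attribute in the primary relation. The entity being described is the same in every case.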
Some writers have suggested that data entities are built from collections of data elements in the same manner that a car is built from a collection of parts. In fact, an ER model can be complete and meaningful with no traditional data elements at all. The parts of a car were specifically chosen because each contributes something to the overall design of the vehicle. Any number of different sets of parts could be assembled and would result in a car, but a specific car can only be built from a specific set of parts.
A car is a thing; it is a subtype of the larger group of things called vehicles, and part of another subtype called self-powered vehicles for transporting people and things. Just as there are many different types of vehicles, not all of which are cars (some may be boats, planes, or trains), so too there are many different types of entities.
A final type of attribute needs to be discussed: attributes which describe not the thing itself, but what it does, how it is used, or why it is used. The things that an entity does are called activities, and collectively they are called processes. The attributes which describe these are called processing related attributes.
The processes, or activities, of the business are in reality the actions that people take with respect to things or places, or other people. These actions usually result in some change in the physical appearance, state or condition of one or more other entities, or sometimes in the creation of a new entity itself.
We can use the entity called Car as an example:
The physical characteristics of the car, its size, weight, year, make and model, color and parts list, represent the car itself. Whatever happens to it, so long as it remains a car, these characteristics (except possibly color) will never change. Whether or not it is owned by anyone, whether it is new or used, in good repair or falling apart, driven one mile or 100,000 miles, does not change these facts of its existence.
However, the fact that that car exists is meaningless unless we put it in context which tells us why the firm is interested in it.
If we were a new and used car dealer, or a company fleet manager, we might want to know other things about it, such as ownership, use and usage, options and accessories, etc. We might also want to know how many miles it was driven, how much gas it uses, how many times it was maintained, what was done to it at each maintenance, how many times it was in an accident, how many different people have driven it, what it cost new, what it costs now, how much it costs to maintain, etc. These latter attributes are really process attributes. They are part of the description of the car, in some cases even part of the physical description, but these attributes tell us about what was done to or with the car, not about the car as a thing.
If we were an auto parts dealer, we might be interested in the parts of the car themselves, both new and used, in which case the year, make, model and color of the car become attributes of the part, along with its usage characteristics (if it is not a new part), its cost, size, shape, weight, how many are used in a specific year, make and model, etc.
A specific part could be an elemental part such as a bolt, tire rim, windshield, etc., or it could be a complex subassembly such as a transmission, radio, motor, etc. It could fit one year, make, and model of car, or any car. By combining several of these parts into a subassembly we have in effect created a new "part".
All entities and most relationships have these types of process attributes associated with them. In many cases, these process attributes tell us why the firm is interested in this entity. In other cases, they help to define how the firm distinguishes between different classes of the entity, or why the firm is interested in some instances of the entity and not in others. These attributes also assist in determining when the firm first becomes interested in an entity, or ceases to be interested in the entity.
These process attributes are variable in that their values change frequently, and these changes usually involve the participation of some other entity. Thus, since they relate what one entity did to another, or where or how many of one entity are contained in another entity, they are normally descriptive of the relationship between the two, rather than descriptive of one entity or the other, although obviously they could be.
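The point that process attributes usually describe a relationship rather than either entity can be sketched with the car example. This is a minimal sketch under stated assumptions: the Car, Garage and Maintenance classes, the sample car and the maintenance history are hypothetical. The stable physical attributes stay on the car, while the process attributes (date, work done, cost) are recorded on the relationship between the car and the garage that serviced it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Car:
    """Physical attributes that describe the car as a thing."""
    make: str
    model: str
    year: int


@dataclass(frozen=True)
class Garage:
    name: str


@dataclass(frozen=True)
class Maintenance:
    """Relationship between a car and a garage; it carries the process attributes."""
    car: Car
    garage: Garage
    performed_on: date
    work_done: str
    cost: float


# Hypothetical sample data.
car = Car("Ford", "Escort", 1988)
garage = Garage("Elm Street Garage")
history = [
    Maintenance(car, garage, date(1990, 3, 1), "oil change", 25.0),
    Maintenance(car, garage, date(1991, 6, 15), "new brakes", 180.0),
]


def times_maintained(c: Car, events: list) -> int:
    """How many times the car was maintained, derived from the relationships."""
    return sum(1 for e in events if e.car == c)


def maintenance_cost(c: Car, events: list) -> float:
    """Total cost to maintain the car, derived from the relationships."""
    return sum(e.cost for e in events if e.car == c)
```

Each new maintenance event adds an occurrence of the relationship and changes the derived values, but the description of the car as a thing is untouched.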
The processes of identification and definition, description and contextual placement of the entities are vital to any understanding of the business, and to any effort directed at either application development or file design. Processes like data normalization (a much discussed concept) cannot be meaningful unless we know what those entities are, what the difference is between an entity and an attribute of an entity, and further what relationships exist between those entities.
Data Analysis, Data Modeling and Classification
Written by Martin E. Modell
Copyright © 2007 Martin E. Modell