Data Element Analysis


Element Analysis

Each of the preceding sections dealt with various aspects of company data model development. The common thread in those sections was an emphasis on attributes as our lowest level data aggregate. The final result, however, is a data model as yet devoid of data element detail. It is just that elemental detail that is needed to complete the design.

The chapter on entity-relationship attribute modeling concluded with the statement that data elements are only assigned to attributes. This reinforced the clear distinction which the Entity Relationship approach makes between attributes and data elements. The development of an entity-relationship model through the attribution phase is a deductive process, applying rules of analysis and classification to develop the data groupings which must be in place to describe and record data about the members of each entity family.

Because a top-down decomposition approach is used, this process is relatively divorced from considerations of existing systems data file content. Data attribution is developed from analysis of data events, data entity life cycle activities and observation, experience and empirical knowledge. The data classification techniques develop successively smaller and finer groups of data, using both usage and existence considerations. These attributes are also developed from an analysis of the various characteristics of each distinct kind of member within the family, the characteristics of the family itself and the characteristics of each intermediate grouping between the family and the lowest meaningful kind of member group within the family.

On a family by family basis, these characteristics and groupings also serve to identify and guide the identification of non-characteristic physical, logical and action based attributes which are common to every member of a family, to each level of grouping within the family and to each peculiar kind of member group within the family. Since the members of each family can be classified by multiple concurrent means (i.e. location, size, and any characteristic which all members share as a family), each classification hierarchy also serves to identify another set of attributes which serve to expand, qualify and describe the family as classified within that hierarchy.

Because the attributes are developed in a top-down manner, and because each attribute or characteristic data grouping represents a distinct, concise and specific idea or concept, this process can be thought of as a before-the-fact (i.e. before elements are added) normalization of data as it is practiced by data base designers. This process, however, is based upon business usage and business views rather than processing efficiency and data file maintenance considerations.

Attribute Decomposition

Each attribute in its lowest decomposed form is a representation of a group of data which is either completely existence interdependent (i.e. all of the data comes into being at the same time, gets updated as a whole - and previous sets of values are always archived or discarded) or concept dependent (i.e. all of the data within the attribute may arise from different sources and may be updated randomly, but all of the data elements are interrelated and interdependent because they describe or pertain to a single idea or aspect of the entity).

The system design process is, as previously discussed, in reality always a redesign process, and as such must deal with and account for all current system items (including data elements), all item changes, and must predict to the extent possible the need for new items (including data elements).

The design process may, and usually does, result in these items (especially data) being rearranged, regrouped, relocated, changed or modified in some way (split, combined, recoded, etc.) and supplemented. New groupings of data elements (into attributes, versus old file records and forms) may and probably will result in new grouping identifiers and distinguishing characteristics, and may also result in date and time stamps being added to existence dependent data groups.

Because entity attributes in a data model represent logical data groupings, each attribute of each entity and each relationship must contain at least one data element; otherwise it serves no purpose in the data model. Usually, however, an attribute represents some group of elements which, when viewed as a whole, describes that aspect, characteristic or action of an entity, or qualifies some relationship between data entities.

Each attribute is complete and self-contained in that all data elements within it relate to the same action, aspect or characteristic. Attributes are complete in that the contents of each attribute should be all that is needed to describe that action, aspect, or characteristic. Since all attributes relate directly to the data entity, they are represented as being directly dependent on, part of, or contained within the entity. Attributes may be repetitive, in that there may be multiple possible occurrences.

Generalized versus Specific Attributes

An attribute may also have variable content, depending upon whether it is generalized to describe all possible members of the entity family, or specialized to describe some specific group of entity family members. Attributes might also be variable in that there might be multiple possible variations of content of a single attribute. For instance, there may be several shipping addresses defined for a customer, each of which is complete and each of which is active, depending upon when and what is being shipped to the customer. Again, there might be multiple sets or definitions of customer demographic data, depending upon whether that customer is a doctor, lawyer, teacher, school or company. Although this representation may be desirable for modeling purposes to show commonality of data, it is usually avoided by giving each specialized combination of elements for a common attribute a different name, in effect making it into a different attribute.

Because the entity is viewed as a family in some cases and as groups of specialized members in others, the names of the data elements in each attribute must be consistent within the entity family, across all common attributes used to describe the various members of the family and across the attributes which are necessary to describe each specific and specialized grouping of members within the family. Entity families are created based upon the role of their members within the organization or based upon a commonality of characteristics. All members of the entity family share the attributes and relationships common to every entity within the family. Specific family members may have specific attributes or relationships unique to them.

Conditional Attribute Existence

Each data entity and each attribute of each data entity is conditionally existent. These conditions are defined by the entity context, its relationship to its parent, siblings, children and cousins, and to other data entities within other entity families.

Each relationship is constructed such that it defines and supports a specific association or connection between the owner or source (the subject entity) and the target (the object entity). Subject and object are used in the traditional grammatical sense. Each attribute of a relationship contains the data elements necessary to support, qualify and otherwise define that relationship and the conditions under which that relationship between those two entities is active and valid. The use of each relationship is dependent upon how, when and under what conditions an association or connection must be established between the two entities.

By far the most powerful and complex portion of the data model is its relationships, and thus they are the most difficult to define.

For data elements, unlike attributes, it is not enough to determine the name, definition, description and use of the elements of a particular attribute. The design team must also ascertain the valid format and, in the case of representational elements (codes, etc.), the valid and permissible values. These values, while they can be generalized, can be as multitudinous as are the variations in the unique entities which they describe.

The Attribute Equation

A data element is the lowest unit of meaningful information within the data model; however, few elements have meaning or usability by themselves. If an attribute is equated to a grouping of data elements, then the data elements relate to that group much as arguments relate to an equation. Thus, the attribute is equal to the sum of its data element components. Each attribute can thus be expressed as an equation in the form:

Attribute-A = (data element(1), data element(2), ... data element(n))

Thus the questions to be asked for each element are:

Can this element be derived from other data?
Is this data element part of the equation of the attribute within which it resides?

A datum is a fact. Data are multiple facts. The combination of facts yields information; so too the combination of data elements in an attribute yields information about that attribute.

Data Element Aggregation

Data elements are aggregated to groups below the attribute level as well. Thus, month, day of month, and year aggregate into date. Within this context, since month, day and year are all meaningful in their own right, or could be meaningful, we define each separately, and then define the aggregate, with its meaning, and as being composed of three lower level items. In the same manner phone number is an aggregate consisting of the separate elements area code, exchange, and extension; zip code is an aggregate consisting of sectional center facility (SCF) number, Post Office Number and Zip Code Extension (Plus Four).

As a general rule, if an item of data can be broken down into further meaningful parts, each lower level element should be defined separately and the aggregate element should be defined as well. Thus a data element may be both an aggregate of lower level elements and a sub-element of a higher level aggregate.
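As a minimal sketch of this rule, the Python fragment below defines month, day of month and year as separate elements and then defines date as an aggregate composed of the three. The class name, field names and presentation format are illustrative assumptions, not part of any dictionary standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DateAggregate:
    """Aggregate element 'date', composed of three lower level elements."""
    month: int         # sub-element, meaningful in its own right
    day_of_month: int  # sub-element, meaningful in its own right
    year: int          # sub-element, meaningful in its own right

    def as_text(self) -> str:
        """Present the aggregate as a single value (assumed MM/DD/YYYY)."""
        return f"{self.month:02d}/{self.day_of_month:02d}/{self.year:04d}"


birth = DateAggregate(month=7, day_of_month=4, year=1961)
print(birth.as_text())  # the aggregate used as a whole
print(birth.year)       # one sub-element used by itself
```

Both the aggregate and each of its parts remain addressable, which is the point of defining them separately.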

Data in Atomic or Elemental Form

Another general rule states that where possible and feasible, the data model should represent elemental data as opposed to derived data or information. In other words, do specific data elements have to be recorded, or can the system "know" that something has occurred simply by the presence, or absence, of some other data element, combination of data elements, or attribute?

As an illustration, if element A is defined as the result of element B divided by element C, only element B and element C should be defined in the data model, since element A can easily be computed.
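The same illustration can be sketched in Python, assuming a hypothetical record type named Ratio. Only elements B and C are stored; element A is computed each time it is requested.

```python
from dataclasses import dataclass


@dataclass
class Ratio:
    """Hypothetical attribute holding only the elemental data."""
    element_b: float  # recorded in the data model
    element_c: float  # recorded in the data model (assumed to be non-zero)

    @property
    def element_a(self) -> float:
        """Derived on demand from B and C; never stored."""
        return self.element_b / self.element_c


r = Ratio(element_b=10.0, element_c=4.0)
print(r.element_a)
```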

When in doubt it is generally safer to record the variables or arguments in a computation, rather than only the result of that computation. Sometimes, however, it is more efficient and necessary to record both the arguments and the result, although this is usually done for processing efficiency reasons.

The net amount of an invoice can be recomputed accurately each time by recomputing each line item and adding or subtracting additional charges or discounts, respectively, but only if the item price does not change, or if the item price is recorded by line item. In any other case the net invoice would change each time any item price changed, whereas orders are usually price-fixed as of the time the order is taken.
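A minimal sketch of the invoice case follows, assuming the item price is recorded on each line item at the time the order is taken. The type names, item codes and amounts are made up for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class InvoiceLine:
    item_code: str
    quantity: int
    unit_price: Decimal  # price fixed at order time and recorded per line


def net_amount(lines, charges=Decimal("0"), discounts=Decimal("0")):
    """Recompute the net invoice amount from its recorded arguments."""
    gross = sum((line.quantity * line.unit_price for line in lines), Decimal("0"))
    return gross + charges - discounts


lines = [
    InvoiceLine("A-100", 3, Decimal("19.95")),
    InvoiceLine("B-200", 1, Decimal("5.00")),
]
total = net_amount(lines, charges=Decimal("4.50"), discounts=Decimal("2.00"))
print(total)
```

Because each line carries its own price, a later change to the catalog price of an item does not alter the recomputed total.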

Net account balance can easily be recomputed by adding or subtracting all account entries - debit or credit. Point in time net account balance can only be consistently computed by time and date stamping each entry.
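The point-in-time case can be sketched as follows. The sketch assumes that every entry carries a date and time stamp and that credits are stored as positive amounts and debits as negative amounts; both conventions, and all names and values, are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class AccountEntry:
    posted_at: datetime  # date and time stamp recorded with each entry
    amount: Decimal      # assumed convention: credit positive, debit negative


def balance_as_of(entries, as_of):
    """Net balance counting only entries stamped at or before 'as_of'."""
    return sum((e.amount for e in entries if e.posted_at <= as_of), Decimal("0"))


entries = [
    AccountEntry(datetime(2024, 1, 5, 9, 0), Decimal("100.00")),
    AccountEntry(datetime(2024, 1, 20, 14, 30), Decimal("-40.00")),
    AccountEntry(datetime(2024, 2, 2, 11, 15), Decimal("25.00")),
]
print(balance_as_of(entries, datetime(2024, 1, 31, 23, 59)))
print(balance_as_of(entries, datetime(2024, 12, 31, 23, 59)))
```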

The composite data attribute model of each entity family within the data model is constructed to contain the complete variety of individual family members. As the model was developed the accompanying narratives described each family member and each family group variation within the family. The data event models described the data that was to be recorded or retrieved from each attribute. The data event model access sequences to each entity and entity attribute described those data elements available in the trigger, or those attributes and those elements which had to be retrieved and which could then be used to access additional entities and their attributes. Each of those expected data elements should have been identified in the accompanying narrative.

Domain versus Context Data Elements

Because data models represent data in the context of their use in describing entities, and because certain kinds of elements are frequently used throughout the corporate model, most designs distinguish between these two kinds of element usages in the way in which they are documented. This distinction between element types also allows for the creation of a common and model-wide definition which can be used by inclusion wherever the same element is used, but with a different name. These two kinds of elements are called domain elements and context elements.

A domain element is defined as:

a data element in its atomic or lowest level form, which is used to describe the specific type of values to be contained in the element wherever it is used or referenced.

Some examples of a domain element would be:

Social Security Number
Telephone Number
Zip Code

A context element is defined as:

an element that describes a specific named usage of a domain element in a specific context.

The description of a context element must include a "See Reference" to the domain element of which it is a specific usage instance.

Some examples of a context element are:

Social Security Number of the Spouse of the Employee
Social Security Number of the Employee
Social Security Number of a child of the Employee.
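One way to picture the relationship between the two kinds of elements is sketched below in Python. The class names, field names and definition wording are assumptions made for illustration; the essential feature is that each context element carries a "See Reference" to a single shared domain element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainElement:
    name: str
    value_format: str  # the type of values held wherever the element is used
    definition: str


@dataclass(frozen=True)
class ContextElement:
    name: str
    see_reference: DomainElement  # the domain element this is a usage of
    context_meaning: str          # what is specific to this usage

    def full_definition(self) -> str:
        """Context meaning plus the shared domain definition, by inclusion."""
        domain = self.see_reference
        return f"{self.context_meaning} See: {domain.name} ({domain.definition})"


ssn = DomainElement(
    name="Social Security Number",
    value_format="nine digits",
    definition="Government-issued identifying number.",
)
employee_ssn = ContextElement(
    "Social Security Number of the Employee", ssn, "Identifies the employee.")
spouse_ssn = ContextElement(
    "Social Security Number of the Spouse of the Employee", ssn,
    "Identifies the employee's spouse.")

print(employee_ssn.full_definition())
```

Both context elements point at the same domain element, so the format and base definition are written once.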

Naming Data Elements

Each data element, both domain and context, should be assigned a unique name. Name assignment should be in a standard form. One of the most commonly used forms of naming elements employs a format known as the "of-language". Names in this form identify the element and its next higher level aggregate, plus any identifiers or qualifiers which are needed to make that name unique. A typical name might be:

Date of birth of Employee
Number of line of a purchase order of a customer
First name of employee
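Names of this kind can be assembled mechanically. The sketch below is one hypothetical way to do so in Python; the function names and the case-insensitive uniqueness check are assumptions, not a prescribed dictionary feature.

```python
def of_language_name(element: str, *qualifiers: str) -> str:
    """Join an element with its higher level qualifiers using 'of'."""
    return " of ".join((element, *qualifiers))


def register(name: str, dictionary: set) -> None:
    """Add a name to the dictionary, rejecting duplicates (case-insensitive)."""
    key = name.lower()
    if key in dictionary:
        raise ValueError(f"duplicate element name: {name}")
    dictionary.add(key)


names = set()
dob = of_language_name("Date of birth", "Employee")
line_no = of_language_name("Number", "line", "a purchase order", "a customer")
register(dob, names)
register(line_no, names)
print(dob)
print(line_no)
```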

The prime considerations when creating an element name, or element name standard format are that:

the name be fully qualified
the name be fully descriptive
the name be unique across the universe of names
the name be meaningful to the user
the name, where possible, be descriptive of its contents or the values contained within
the name, where possible, be indicative of the format, usage or type of data contained within it (e.g. date, name, code, etc.)

Naming Considerations

The above considerations are by no means the only ones, others being:

how much context identification is to be included within the name
how names are to be grouped when sorted alphabetically within the dictionary (by entity, by type, etc.)
whether domain and context elements are to be used
dictionary physical limitations on name format and name length
dictionary flexibility
how much meaning the firm wants to incorporate in the name, etc.

Other considerations have to do with the programming languages used, the Data Base Management Systems (DBMS) used, and the personal preference and philosophy of the data administration staff.

Data element names are probably the most frequently used items from the system design, and the most numerous. Many firms attempt to abbreviate them or otherwise shorten the names. While this may be necessary, it is also necessary to ensure that users who need information about data availability can access the dictionary in their context, and find the information they need with a minimum of technical knowledge or dictionary training.

The identification and naming of data elements is almost purely an inductive and enumerative process, since each and every needed data element has to be identified, defined, described and placed. The omission of one element may in some cases severely impact the design. The misspecification of one element may have an equally severe impact. Although the design team has the inventory of current data elements to work with, the process of renaming each item to the new format, examining and in some cases redefining (or defining for the first time) those elements, changing some of the elements, discarding some existing elements (no longer needed or replaced by new ones), and developing the names and definitions of new ones is a time-consuming and tedious task. It is, however, as critical to the design as the development of user task procedures, and since in many cases those procedures will reference the data elements individually, this task must be completed before procedure design can begin.

Data Element Issues

As elements are assigned to the attributes of each entity, that placement becomes a determinant of all context element names, and most of the domain element names. It may in some cases be a determinant of the definition and description of that element as well. The following issues should be examined for each element:

Is it a function of that attribute and of that entity?
Is its value dependent solely on that entity and that attribute?
Were that attribute represented as an equation, would that data element be part of that equation?
Could that attribute be fully described or be meaningful without that data element?

Classification Model Impact

If for instance, the classification model identified twelve distinct customer types, with twelve distinct demographic constructs, each demographic construct or attribute can occur in the composite entity, although only one will appear on any given unique entity member. Each classification variable form of a common attribute should be identified in some manner by a classification characteristic, attribute, identifier or code.

All archival data or time dependent attributes should be dated, indicating both date effective and date of replacement or discontinuance.

Data Element Formats

Data element format or representation consistency should be ensured by type. Format issues include:

Should dates be recorded in Julian format (yyddd) or in Gregorian (mmddyy) format, or both? If recorded in Gregorian format, should that format be standard MMDDYY format?

YYMMDD format?
YYYYMMDD format?
MMDDYYYY format?

(Note: Remember that as we approach the end of the century special consideration must be given to the representation of dates as they span the century mark.)
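The effect of these format choices can be sketched in Python. The two parsing functions below are hypothetical helpers; the century must be supplied by the caller because neither yyddd nor mmddyy records it, which is exactly the difficulty at the century mark.

```python
from datetime import date, timedelta


def parse_julian_yyddd(text: str, century: int) -> date:
    """Parse a five-digit yyddd value; the century is not in the data."""
    year = century + int(text[:2])
    day_of_year = int(text[2:])
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)


def parse_gregorian_mmddyy(text: str, century: int) -> date:
    """Parse a six-digit mmddyy value; the century is not in the data."""
    return date(century + int(text[4:6]), int(text[0:2]), int(text[2:4]))


# The same calendar day in both formats.
print(parse_julian_yyddd("99365", 1900))
print(parse_gregorian_mmddyy("123199", 1900))

# "00001" is ambiguous: the first day of 1900 or of 2000.
print(parse_julian_yyddd("00001", 1900), parse_julian_yyddd("00001", 2000))

# Two-digit years also break ordering: in YYMMDD form, 1 January 2000
# sorts before 31 December 1999.
print(sorted(["991231", "000101"]))
```

A four-digit year (YYYYMMDD, for example) removes both the ambiguity and the ordering problem, at the cost of two more characters per date.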

For numeric data:

Should the values be recorded to a standard decimal place, i.e., .0, .00 or .000 etc., or should it vary?
Should each distinct type of numeric data have a standard format within that type, i.e. all percentages to three decimal places, or four?
Should all monetary data be recorded in dollars and cents, or in dollars only?

Identifier Attributes

Each entity family must have at least one common attribute whose unique values can serve as the identifier for all members within the family. Many families may have more than one of these common uniquely valued attributes, and groups within the family may also have their own identifier attributes which identify members of the group uniquely. Although identifier attributes are usually more manageable when they consist of a single data element, there is no requirement that they be so. In some cases, it may be necessary and preferable to use multiple data elements as such an identifier.

As an example, each member of the employee family within the firm has a unique identifier called "Identifier of the Employee." Within the family of employees Salesmen are assigned a unique identifier "Identifier of Salesmen." This identifier is created from a combination of sales region, sales territory and sales office identifiers plus a sequential number within each office. This combination of data elements is unique across the firm.
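A composite identifier of this kind might be sketched as follows. The element names follow the example above, while the code values, the hyphenated presentation and the four-digit sequence width are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesmanIdentifier:
    """Identifier built from several data elements taken together."""
    sales_region: str
    sales_territory: str
    sales_office: str
    sequence_number: int  # sequential only within one office

    def as_text(self) -> str:
        """Assumed presentation form of the combined identifier."""
        return (f"{self.sales_region}-{self.sales_territory}-"
                f"{self.sales_office}-{self.sequence_number:04d}")


first = SalesmanIdentifier("NE", "T03", "BOS", 17)
second = SalesmanIdentifier("NE", "T03", "HFD", 17)  # same sequence, other office

# The sequence number repeats, but the combination is still unique.
roster = {first: "first salesman", second: "second salesman"}
print(first.as_text(), len(roster))
```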

Primary Identifiers

The design team must identify at least one primary identifier attribute for each family, and should also identify any potential secondary identifiers (which may potentially be nonunique, or not common to all members of the family). They must also determine:

Does the proposed identifier attribute always occur, or only occur most of the time?
Can the potential identifier's values change over time or are they fixed?
Can the same identifier value be reassigned to other family members or is the identifier a single use value? For instance, the social security number of an employee is a single use number, but an internally assigned number may be reused some number of years after the employee leaves the firm or dies.
Can nonunique identifiers be qualified to entity member uniqueness?

Since each entity in the full data model represents a composite of all members of an entity family, the designers may want to consider extracting certain common data, which are not entity family candidates, into a pseudo-entity. This pseudo-entity is essentially similar to the process control entity we discussed in the chapter on the classification model, but is used to contain the common rates and tables which help to define and control the operations of the business system and its environment.

This "rates and tables" entity family can be used to contain the expanded descriptions or values of common data, such as state code to state name translations, extensive code lists for editing, verification and presentation purposes, etc. It can also be used for highly dynamic data which is frequently changed and constantly referenced, such as interest rate tables.

The use of this type of entity family will remove many conversions and tabular searches from the application code and individual processing procedures. These rates and tables become part of the common system data and ease maintenance. Each table entry must however be dated with the dates of effectivity and discontinuance.
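A minimal sketch of such a "rates and tables" lookup follows. It assumes that each entry records a date of effectivity and an optional date of discontinuance, and that the discontinuance date is the first day on which the entry no longer applies. All table names, codes and rate values are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class TableEntry:
    table_name: str
    code: str
    value: object
    effective: date
    discontinued: Optional[date] = None  # None means still in effect


def lookup(entries, table_name, code, on):
    """Return the value of the entry in effect for 'code' on date 'on'."""
    for e in entries:
        if (e.table_name == table_name and e.code == code
                and e.effective <= on
                and (e.discontinued is None or on < e.discontinued)):
            return e.value
    raise KeyError(f"no entry for {table_name}/{code} on {on}")


entries = [
    TableEntry("STATE", "NY", "New York", date(1990, 1, 1)),
    TableEntry("RATE", "BASE", Decimal("0.0850"), date(2024, 1, 1), date(2024, 7, 1)),
    TableEntry("RATE", "BASE", Decimal("0.0800"), date(2024, 7, 1)),
]
print(lookup(entries, "STATE", "NY", date(2024, 3, 1)))
print(lookup(entries, "RATE", "BASE", date(2024, 3, 1)))
print(lookup(entries, "RATE", "BASE", date(2024, 7, 1)))
```

Application code asks the table for a value as of a date instead of embedding the conversion, so a rate change becomes a data change.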

The alternative to this type of entity is to include code value and expansion lists with each coded element. This approach is viable when the list of codes is relatively small and static, and is an important part of the system documentation. Entity member attribute values are not ordinarily documented in the system design, since they are not considered part of the design.

Once all data elements within each attribute have been identified, the design team must define each data element. Each element, element aggregate and attribute should be fully defined and include a narrative or discussion of expected content, expected usage, and potential usage.

Data Element Definition

It is not enough to identify a data element, or to assign it a location or even a name, however meaningful. For all of the preceding analysis and work to be meaningful we must define and describe each data element, so that it can be used, processed and above all be supportive of the business. These definitions support the business requirement for understanding the meaning of data and support the firm's ability to use its data in a consistent and reliable manner.

If data is a corporate asset, that data must have value. The value of data lies in its timeliness, its accuracy, its relationship to other data, and most of all in the ability of the firm's management and operations personnel to understand the meaning, derivation and source of the data.

From a data processing perspective, which is to a large degree one of data manipulation and data presentation, the primary interest in a data element is its size and format, shape and value set. From a business perspective, much more is needed. The additional information needed includes:

What it is
Where it is
How it is used
Its format in reports and other presentation methods
When it is used
Who uses it
Where it comes from
Who can change it, when can it be changed and under what conditions can it be changed
Who can see it
When can it be removed from the files, who can remove it
Its legal value ranges.

Each data element has been assigned a name which represents its location and value type. These names were descriptive in and of themselves. Some data elements named were context specific, while others did not reflect a specific context. These non-context specific data elements were more generalized, both in name and content. The definition of a domain element (not context specific) describes its values and derivation, while its specific data model meaning is only available from the context element versions. For instance, date represents a specific value range and format. Date of birth of employee represents a particular date, that of a specific employee. Taken as a whole, date of birth represents the value range of all dates for all employees. The definition of date is constant; only the meaning of the particular date varies. Date could be any valid date in the calendar.

Context Element Definition

The context element definition describes what is unique about this data element occurrence and refers to the domain element for all definitional aspects of date. This approach allows the firm to make one change to the domain element which has the effect of automatically changing the definitions of all context elements. While consistency is desirable it is nevertheless permissible and sometimes desirable to treat each element independently, creating definitions which are appropriate rather than uniform. Thus the full range of data element definitions may become a combination of treatments and the choice is dictated by the elements rather than adherence to rules.

Each data element is described at its lowest level of occurrence as a self-contained unit. In addition, any grouping of elements which has meaning below the level of attribute should be defined as well and related upwards and downwards to its superior or subordinate parts.

To illustrate, the data element "Date of Birth of Employee" is composed of three sub-elements, "Month of Birth of Employee", "Day of Birth of Employee", and "Year of Birth of Employee". All four elements have meaning and all four should be defined.

What then are the components of an element definition? Another way to phrase the question, and perhaps a more meaningful one, is: what do the managers and operational personnel of the firm need to know about an individual data element, and moreover, what should they know about that element?

Definition Components

This phase of data element analysis is the one which ties together all of the documentation narratives from all other design products. The definition of each data element consists of the following parts:

physical and logical format definition
narrative definition
cross reference information
value ranges and code lists
source and use information
context specific meaning
derivation or calculation
time reference or time frame information
individual element editing, verification and validation information
multi-data element or context editing, verification and validation
user alias and context alias names
attribute, entity group and entity family context
default values
data content, usage, source or other critical design assumptions with respect to this element
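The parts listed above can be pictured as the fields of a single dictionary record. The Python sketch below is one hypothetical layout; the field names are paraphrases of the list, and the completeness check is an added convenience, not part of the method described here.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class DataElementDefinition:
    name: str
    format_definition: str = ""           # physical and logical format
    narrative: str = ""                   # narrative definition
    cross_references: list = field(default_factory=list)
    value_ranges_or_codes: str = ""       # value ranges and code lists
    source_and_use: str = ""
    context_meaning: str = ""
    derivation: Optional[str] = None      # formula, if calculated
    time_reference: str = ""
    element_edits: list = field(default_factory=list)  # single-element edits
    context_edits: list = field(default_factory=list)  # multi-element edits
    aliases: list = field(default_factory=list)
    entity_context: str = ""              # attribute, group and family
    default_value: Optional[str] = None
    assumptions: list = field(default_factory=list)

    def missing_parts(self) -> list:
        """Names of the parts that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


dob = DataElementDefinition(
    name="Date of birth of Employee",
    format_definition="MMDDYYYY, eight digits",
    narrative="The calendar date on which the employee was born.",
    cross_references=["Date"],
)
print(dob.missing_parts())
```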

Each data element must be examined both from an overall design perspective and from a context specific perspective. Data elements must be examined for consistency of form and definition. Derived elements must be consistent with the elements from which they were derived or which were used to derive them.

Code structures must be examined to determine if all possible values have been accounted for (both current and projected).

Each data element should have a narrative description that is as complete as possible. That description (or definition) should indicate exactly what the contents of the element represent. The description should reflect a single occurrence of the element if it is a repeating element. These narratives should reflect all of the assumptions, comments and specifications from all the narratives generated in all previous design products.

Derived Element Definition

For each calculated or derived data element, the formula for calculation or derivation should be included in the data element documentation description. Each element that participates in the calculation, or that is a function of it, should be listed. The definition of each of those data elements should in turn indicate the name of each calculated data element that references them.
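The two-way cross reference described here can be generated rather than maintained by hand. In the sketch below, the derivation entry and the element names are hypothetical; the function inverts the list of inputs so that each source element shows the derived elements that reference it.

```python
# Hypothetical derivation documentation, keyed by derived element name.
derivations = {
    "Net amount of invoice": {
        "formula": "sum(quantity * unit price) + charges - discounts",
        "inputs": [
            "Quantity of line of invoice",
            "Unit price of line of invoice",
            "Charges of invoice",
            "Discounts of invoice",
        ],
    },
    "Extended price of line of invoice": {
        "formula": "quantity * unit price",
        "inputs": [
            "Quantity of line of invoice",
            "Unit price of line of invoice",
        ],
    },
}


def referenced_by(derivations: dict) -> dict:
    """For each input element, list the derived elements that use it."""
    index = {}
    for derived_name, spec in derivations.items():
        for source_name in spec["inputs"]:
            index.setdefault(source_name, []).append(derived_name)
    return index


index = referenced_by(derivations)
print(index["Quantity of line of invoice"])
```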

Once the data element definitions and descriptions are complete, they must be examined within the context of each place this element is used, if it appears more than once. The design team must ensure that the definition, exactly as written, is applicable wherever this element is used.

Any inconsistencies in definition or usage should be resolved. If necessary, new elements should be created to ensure complete consistency of meaning and use.

Format definition describes the length, number of characters or digits, number of decimal digits, precision, and format editing criteria (punctuation, dollar sign usage, use of commas and periods, use of spaces, justification, etc.) of the data element from the user's perspective.

The length of each element, and its precision if applicable, should reflect its current and anticipated value ranges. The design team must determine:

the largest value the element can assume
the greatest number of characters in the longest name
the largest amount that can be anticipated
the greatest number of characters necessary

For identifiers:

what is the largest projected number of entity occurrences?

For attribute identifiers:

what is the largest number of occurrences of this subset we can have?

For coded elements:

what is the maximum anticipated range of each code?

Element Sizing

As a general rule, if more than seventy-five percent of the capacity of an element with known data values has been used, it should be expanded, unless by its nature, it can be determined that the current capacity is sufficient.
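For a numeric identifier or code, the rule can be sketched as a simple check. The interpretation of capacity as the number of distinct values a given number of digits can hold is an assumption made for this example, as are the function and parameter names.

```python
def needs_expansion(values_in_use: int, digits: int, threshold: float = 0.75) -> bool:
    """True when more than 'threshold' of an element's capacity is used.

    Capacity is taken to be 10 ** digits, the number of distinct values
    a purely numeric element of that length can hold (an assumption).
    """
    capacity = 10 ** digits
    return values_in_use / capacity > threshold


print(needs_expansion(800, digits=3))   # 800 of 1000 values in use
print(needs_expansion(4200, digits=4))  # 4200 of 10000 values in use
```

The check only flags candidates; whether the current capacity is in fact sufficient still depends on the nature of the element.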

For each element:

Must it be present, and if it is present, what are its editing criteria?
What are the maximum and minimum values permissible?
What is the default value if no user supplied value is given?
What actions are to be taken if the element does not pass the editing tests?
If this element value is dependent upon or derived from the value of some other element, what are those dependencies? What are those derivations?
If this element is conditionally present, based either on the presence or absence of some other element or elements, what are those conditions and data elements? Conditional presence and conditional contents should be noted both in this element and in the definition of the other conditional elements.
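Several of these questions translate directly into an editing rule that can be recorded with the element. The sketch below covers presence, minimum and maximum values and the default; the rule structure, the error wording and the choice to reject a failing value are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditRule:
    required: bool
    minimum: Optional[int] = None
    maximum: Optional[int] = None
    default: Optional[int] = None  # used when no user value is supplied


def edit(value: Optional[int], rule: EditRule):
    """Return (accepted value or None, list of error messages)."""
    if value is None:
        if rule.default is not None:
            return rule.default, []
        if rule.required:
            return None, ["value is required"]
        return None, []
    errors = []
    if rule.minimum is not None and value < rule.minimum:
        errors.append(f"value {value} is below minimum {rule.minimum}")
    if rule.maximum is not None and value > rule.maximum:
        errors.append(f"value {value} is above maximum {rule.maximum}")
    # Assumed action on failure: reject the value and report the errors.
    return (None, errors) if errors else (value, [])


percent = EditRule(required=True, minimum=0, maximum=100)
print(edit(50, percent))
print(edit(150, percent))
print(edit(None, percent))
print(edit(None, EditRule(required=True, minimum=0, maximum=100, default=10)))
```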

Where an element is coded, all possible values or functions of that code should be listed in tabular form with the code value as the argument and the description or definition or meaning of that code as the function.

All of the data element definition and description documentation should be reviewed by all user areas of the organization. This will ensure that all users agree that the definitions supplied are correct, accurate, and complete. All code value lists and all derivation calculations or rules must be verified with the users. All documented assumptions should be reviewed and approved by the user community.

Any discrepancies, misrepresentations or incorrect definitions should be revised, corrected and submitted for reapproval. Each user area should "sign-off" on the documentation. Any later amendments should be reviewed by any users who have previously "signed-off".


Data Analysis, Data Modeling and Classification
Written by Martin E. Modell
Copyright © 2007 Martin E. Modell
All rights reserved. Printed in the United States of America. Except as permitted under United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the author.