1                 Introduction

1.1        Purpose

The purpose of this document is to define the Information Quality Metrics for the Information Quality Management Framework. This is a subcomponent of the overall Information Quality Management Framework being implemented through the Performance Management Program.

This document provides the detail of each of the specific components that need to be defined, developed and implemented to support the initial adoption of better Information Quality Management and that are essential to the Program’s Project Delivery Stage. In most cases, specific documentation will be developed for each component.

Articulation of the overall vision, strategy, program scope and critical success factors as well as the overall program governance is covered in the Program Management Plan.

1.2        Background

An organisation’s approach to Information Management is critical to its success. Information Management provides the framework that the business needs in order to maintain, verify, and take control of the flow of data from source to destination, and to plan its archival or deprecation.

Implementing Information Quality Management provides measurable confidence in reporting and information analysis as the final output: the information in the report can be trusted and tracked. If necessary, data lineage can be reported to show the information’s source and any transformations that have occurred during its journey to the final user.

The Information Management Methodology describes the framework and approach to improving how information is managed, measured and improved upon across the organisation. The methodology consists of components such as principles, policies, practices, tools, processes and procedures, and governance that work together to manage information throughout its lifecycle, including maintenance, validation and archival or disposal.

The methodology also involves the application of management techniques to collect information, communicate it within and outside the organisation, and process it to enable managers to make quicker and better decisions.

The Information Management Methodology is the vehicle for addressing Information Quality issues and establishing the associated information governance structures and processes. Quality issues are measured against the metrics described in this document.

1.3        Objectives

This document provides the Information Quality Metrics subcomponent of the Information Quality Management Framework and describes the definition of each metric. It is a self-contained subcomponent of the overall Quality framework.

2                 Information Quality Metrics

Data Governance Metrics provide the information quality objectives that the organisation plans to achieve. Just like other business performance measures, metrics should be managed and tracked at the executive level. Metrics are created by either executives or data stewards with input from data analysts. Each Key Data Element (KDE) is measured against the defined metric category through the appropriate measurement technique.

There are several metric categories that can be measured in varying fashions: Accuracy, Integrity, Consistency, Completeness, Validity, Accessibility, and Timeliness.

This task defines each of these categories and the measurement techniques and processes that will be used for measuring each KDE. Sometimes, less tangible metrics are also assessed. A rating scale should be defined for metrics at an aggregate level that is supported by the detail of the assessment.

Output / Artefacts:

·        Definition of Metric Categories and Measurement Techniques

·        Current-State Metrics on KDEs

·        Target Metrics on KDEs

Typically, organisations do not only measure along quantitative dimensions, but also include softer/intangible dimensions to justify their investment and measure success. Defining data quality KPIs requires the interplay of organisational support, governance and accountability, processes, policies and standards, as well as overall support from a set of tools (either automated or through analysis by members of the data governance team). The following section outlines an approach to measuring data quality.
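As an illustration of how dimension-level measurements could roll up to an aggregate rating for a KDE, the sketch below combines per-dimension scores using weights and a Green/Amber/Red scale. The dimension weights, thresholds and scores shown are assumptions for illustration only, not values prescribed by this framework.

```python
# Illustrative sketch only: aggregate per-dimension scores for a KDE into a
# single rating. Dimension names, weights, and thresholds are assumptions,
# not values mandated by this framework.

DIMENSION_WEIGHTS = {
    "accuracy": 0.25,
    "completeness": 0.25,
    "validity": 0.20,
    "consistency": 0.15,
    "timeliness": 0.15,
}

RATING_SCALE = [(0.95, "Green"), (0.80, "Amber"), (0.0, "Red")]

def aggregate_score(dimension_scores: dict) -> float:
    """Weighted average of dimension scores (each expressed as 0.0-1.0)."""
    return sum(DIMENSION_WEIGHTS[d] * s for d, s in dimension_scores.items())

def rating(score: float) -> str:
    """Map an aggregate score onto the illustrative Green/Amber/Red scale."""
    for threshold, label in RATING_SCALE:
        if score >= threshold:
            return label
    return "Red"

if __name__ == "__main__":
    # Hypothetical measurements for a single KDE, e.g. "Customer Address".
    kde_scores = {
        "accuracy": 0.92,
        "completeness": 0.98,
        "validity": 0.96,
        "consistency": 0.88,
        "timeliness": 1.00,
    }
    score = aggregate_score(kde_scores)
    print(f"Aggregate score: {score:.2%} -> rating {rating(score)}")
```

The aggregate rating is only as meaningful as the underlying detail; each dimension score should remain traceable to the individual assessments that support it.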

2.1        Detailed Information Quality Metrics

How are Data Governance Metrics Measured?

| Metric Category | Description | How is Metric Measured? |
| --- | --- | --- |
| Accuracy | Does the data accurately represent reality or a verifiable source? | Audit |
| Integrity | Do broken links exist between data that should be related? | Profiling / Business Rules |
| Consistency | Is there a single representation of data? | Profiling / Business Rules |
| Completeness | Is any key information missing? | Profiling / Business Rules |
| Validity | Is the data stored in an acceptable format and does it contain valid values? | Profiling / Business Rules |
| Accessibility | Is the data easily accessible, understandable, and used consistently? | Survey |
| Timeliness | Is information recorded and made available to systems as rapidly as is required? | Survey |

2.1.1    Accuracy/Correctness

Description:

·        The degree of agreement between a data value (or set of values) and a source assumed to be correct. The source may be a reference set obtained by comparison to “real world” data, or by reference to a data set on another system or file that is deemed “correct”.

Typical issues uncovered during assessment:

·        It is expected that some names and addresses will be spelt incorrectly, or will not be an accurate representation of the real party or address. Other free-form text, including customer description information, may not correctly represent the party.

·        Effectiveness of profiling to assess this area:

·        Accuracy is very difficult to assess without verifying customer details directly with the customer, or using a “correct” 3rd party data source. Where a reference data set is available, e.g. a list of postcodes or addresses, this may be assessed; otherwise the source system data is assumed to be accurate.
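Where a reference data set is available, an accuracy measure can be taken by comparing records against it. The sketch below is a minimal example assuming a suburb/postcode reference list; the file layout and field names are hypothetical.

```python
# Minimal sketch, assuming a reference list of valid suburb/postcode pairs is
# available. File layout and field names are hypothetical.
import csv

def load_reference_postcodes(path: str) -> set:
    """Load the 'correct' reference set, e.g. an official postcode file."""
    with open(path, newline="") as f:
        return {(row["suburb"].strip().upper(), row["postcode"].strip())
                for row in csv.DictReader(f)}

def accuracy_rate(records: list, reference: set) -> float:
    """Share of records whose suburb/postcode pair matches the reference set."""
    if not records:
        return 1.0
    matched = sum(
        1 for r in records
        if (r["suburb"].strip().upper(), r["postcode"].strip()) in reference
    )
    return matched / len(records)

if __name__ == "__main__":
    reference = {("RICHMOND", "3121"), ("CARLTON", "3053")}   # stand-in reference set
    sample = [
        {"suburb": "Richmond", "postcode": "3121"},
        {"suburb": "Carltn", "postcode": "3053"},             # misspelt suburb
    ]
    print(f"Address accuracy: {accuracy_rate(sample, reference):.0%}")
```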

2.1.2    Completeness

Description:

·        The degree that the full values are present in the attributes that require them, and the degree that the attributes cover the user data requirements.

Typical issues uncovered during assessment:

·        Free-text fields not completed (e.g. name, address, phone numbers, types, descriptions, and null values). Target mandatory fields not populated, or not present in the source system extract file. The source systems may not have all the attributes required to load the target systems.

·        Effectiveness of profiling to assess this area:

·        The completeness of the field values is tested as part of the specific TBA tests (a simple illustration is sketched below). The completeness of the attributes in satisfying the user requirements is assessed in the source-to-target mapping documents, which highlight where a target attribute cannot be populated because a source field is not available.
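A basic completeness measure simply counts populated values per mandatory field in a source extract. The following sketch shows one way this could be done; the field names are hypothetical.

```python
# Illustrative sketch only: measure completeness of mandatory fields in a
# source extract. Field names are hypothetical.

MANDATORY_FIELDS = ["customer_id", "name", "date_of_birth"]

def completeness_by_field(rows: list) -> dict:
    """Percentage of rows where each mandatory field is present and non-blank."""
    if not rows:
        return {}
    totals = {field: 0 for field in MANDATORY_FIELDS}
    for row in rows:
        for field in MANDATORY_FIELDS:
            value = row.get(field)
            if value is not None and str(value).strip() != "":
                totals[field] += 1
    return {field: totals[field] / len(rows) for field in MANDATORY_FIELDS}

if __name__ == "__main__":
    extract = [
        {"customer_id": "C001", "name": "A Smith", "date_of_birth": "1980-02-14"},
        {"customer_id": "C002", "name": "", "date_of_birth": None},  # incomplete record
    ]
    for field, pct in completeness_by_field(extract).items():
        print(f"{field}: {pct:.0%} complete")
```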

2.1.3    Timeliness / Currency

Description:

·        Currency measures how up-to-date the data is, and whether the data required can be provided by the required time.

Typical issues uncovered during assessment:

·        The source extracts may not be provided by the required time. Product and type codes that change over time may mean that products or types assigned before a particular cut-over date differ slightly from those assigned after it, even though they share the same code, or may now have a different type.

·        Effectiveness of profiling to assess this area:

·        The source system product codes are expected to remain unchanged over the product life. If this assumption does not hold, the change in product code is handled during the transformation of source data in the target data acquisition.

·        Changes to customer type, product types, and other codes will be treated the same way as product codes. Delays in providing the source system extracts by the specified time are identified as part of the general project reporting.
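Where extract delivery timestamps are captured, the timeliness of the source extracts can be checked against the agreed cut-off. The sketch below assumes a hypothetical 06:00 cut-off and illustrative extract names.

```python
# A small sketch, assuming the agreed delivery cut-off and the extracts'
# delivery timestamps are known; the values shown here are hypothetical.
from datetime import datetime, time

CUTOFF = time(6, 0)   # extracts assumed due by 06:00 on the load date

def late_extracts(deliveries: dict) -> list:
    """Return the extracts whose delivery time missed the agreed cut-off."""
    return [
        name for name, delivered in deliveries.items()
        if delivered.time() > CUTOFF
    ]

if __name__ == "__main__":
    deliveries = {
        "customer_extract": datetime(2024, 7, 1, 5, 45),
        "product_extract": datetime(2024, 7, 1, 7, 10),   # delivered late
    }
    print("Late extracts:", late_extracts(deliveries))
```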

2.1.4    Consistency / Uniqueness (No Duplicates, Integrity)

Description:

·        Consistency is the extent that there is a single representation of data. Consistency also includes the extent that data is duplicated within a system, e.g. customer duplicates.

·        The ability to establish the uniqueness of a data record (and data key values).

Typical issues uncovered during assessment:

·        Duplicate customer records

·        Effectiveness of profiling to assess this area

·        Determining whether there are customer duplicates in large-volume data sets requires specialist tools (a simple illustration is sketched below).

·        De-duplication also requires a dedicated data quality improvement project that includes the appropriate governance and sponsorship, as well as appropriate tools to establish and implement the de-duplication.
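For illustration only, the sketch below shows a crude normalised-key match that flags candidate customer duplicates; a production de-duplication exercise would rely on the specialist matching tools noted above. The match key and field names are hypothetical.

```python
# Illustrative sketch only: a simple normalised-key match to flag likely
# customer duplicates. Real de-duplication would use specialist matching
# tools; field names here are hypothetical.
from collections import defaultdict

def normalised_key(record: dict) -> tuple:
    """Crude match key: uppercased surname plus date of birth."""
    return (record["surname"].strip().upper(), record["date_of_birth"])

def candidate_duplicates(records: list) -> list:
    """Group records sharing the same match key; groups of 2+ are candidates."""
    groups = defaultdict(list)
    for record in records:
        groups[normalised_key(record)].append(record["customer_id"])
    return [ids for ids in groups.values() if len(ids) > 1]

if __name__ == "__main__":
    customers = [
        {"customer_id": "C001", "surname": "Nguyen", "date_of_birth": "1975-03-02"},
        {"customer_id": "C087", "surname": "NGUYEN ", "date_of_birth": "1975-03-02"},
        {"customer_id": "C150", "surname": "Patel", "date_of_birth": "1990-11-30"},
    ]
    print("Candidate duplicate groups:", candidate_duplicates(customers))
```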

2.1.5    Validity

Description:

·        The data is stored in an acceptable format, and is within a reasonable range of possible values.

Typical issues uncovered during assessment:

·        Target-enforced formats, lengths, and data types not implemented in the source systems (e.g. a YYYY-MM-DD date is stored as a character field in most host systems, and if these fields contain other characters they cannot be transformed into the target date format).

·        Target codes and types may not map 1-to-1 to the source system codes and types. Values for some attributes may not be within an acceptable range.

·        Effectiveness of profiling to assess this area

·        It is generally not practical to check that all the values in every attribute loaded are of the correct type, length, and format. The most important data types to assess are date and numeric fields.

·        Check, and where possible convert, source system dates to the standard YYYY-MM-DD target format.

·        Check the format of identification numbers.

·        Check the range of important balance and rate fields. These are atomic value checks only; they will not net accounts and then check the range.

·        Compound attribute tests: check the value of one attribute that is dependent on another attribute (see the sketch after this list).
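The sketch below illustrates the three kinds of validity check listed above: a date-format check, an atomic range check, and a compound attribute rule. The specific fields, the rate range and the closure-date rule are hypothetical examples.

```python
# A minimal sketch of the validity checks described above: date format,
# numeric range, and a compound (cross-attribute) rule. The specific fields
# and the rate range are hypothetical examples.
from datetime import datetime

def valid_date(value: str) -> bool:
    """True if the value parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (ValueError, TypeError):
        return False

def valid_rate(value: str, low: float = 0.0, high: float = 25.0) -> bool:
    """Atomic range check on an interest-rate field (no netting of accounts)."""
    try:
        return low <= float(value) <= high
    except (ValueError, TypeError):
        return False

def valid_compound(record: dict) -> bool:
    """Compound rule (hypothetical): a closed account must carry a closure date."""
    if record.get("account_status") == "CLOSED":
        return valid_date(record.get("closure_date"))
    return True

if __name__ == "__main__":
    row = {"open_date": "2023-02-30", "rate": "4.75",
           "account_status": "CLOSED", "closure_date": ""}
    print("open_date valid:", valid_date(row["open_date"]))   # False (no 30 Feb)
    print("rate valid:", valid_rate(row["rate"]))             # True
    print("compound rule valid:", valid_compound(row))        # False (missing closure date)
```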

2.1.6    Accessibility

Description:

·        Ability for users to extract the existing data they require. Users must not have different interpretations of the same data.

Typical issues uncovered during assessment:

·        The meaning of each file attribute, and the interrelationships between attributes and files, can be difficult to determine. This requires specific product system and business knowledge.

·        This may cause errors in the mapping of source fields to target fields.

·        The layout and extract process documentation may not match the actual source system files delivered.

·        Effectiveness of profiling to assess this area:

·        This is a manual check of source system data and metadata availability.

·        If the metadata does not match the actual file, an issue is raised.

The relevant subject matter experts will be consulted to confirm, or explain the definition of ambiguous or conflicting attributes.

3                 Data Profiling

Possible outcomes for information quality enhancement:

1.        Do nothing (measure the quality problem)

2.        Correct the information in transit (ETL data correction)

3.        Root Cause Analysis of Data Governance Issues

Preventing data quality issues involves analysing the process activities or application automation that can stop those issues from occurring in the first place. Root Cause Analysis of Data Governance Issues is concerned with correcting root causes as opposed to addressing the symptoms.

Outputs:

·        Data Governance Issues related to Source System Edits, Business Process, and Technology Architecture.

·        Recommended Changes for Improved Data Governance

3.1        Objectives of Data Profiling

·        Identify data quality issues - measurements are taken against a number of dimensions to help identify issues at the individual attribute level, at the table level and between tables.

·        Capture metadata as a by-product of the process – Metadata is the key to the success of subsequent phases of the data load including the ongoing use and maintenance of the data. Useful metadata is identified as part of the data profiling exercise and the tool-based approach will ensure it is captured in a reusable format.

·        Identify business rules – The next step is to perform the data mapping. Data profiling will assist in gaining an understanding of the data held in the system and in identifying business rules for handling the data. This will feed into the data mapping exercise.

·        To assess the ‘fitness for purpose’ of the source system data to satisfy the business requirements. This assessment is often done in the context of a Data Warehouse, CRM, or ERP system initiative. Therefore, the focus is on gaining a very detailed understanding of the source data that will feed these target systems, to ensure that the quality level is sufficient to meet the requirements of the target system.

This section provides guidelines on the tests that are commonly used and the approach taken to measure data quality. Each project should consider these a starting point only and should ensure that the specific tests that will deliver the greatest value for a client are identified as part of the discovery process for that client.

Key Output / Artefacts:

·        Data Quality Report (per Source Systems)

·        Definition of Mapping Rules and Business Rules

3.2        Different Types of Profiling

Data sources are profiled in multiple dimensions: down columns (Column Profiling); across rows (Table Profiling); and across tables (Multi-Table Profiling). The output from Data Profiling can then be used in a number of fashions, including data cleansing and mapping to a target environment.

 

3.2.1    Column Profiling

Column profiling typically examines the data found in a single column/field in either a table or a flat file. This analysis can (1) examine the actual values found in the column (e.g. produce a complete frequency report listing each and every value found), (2) look for the various “data patterns” for the values in the column or (3) discover the underlying “parsed patterns” that might exist within more complex, free-form columns. Assessments are typically conducted to verify the validity/uniqueness, accuracy and completeness; column profiling can help answer the following questions:

·        What do the fields mean?

·        Which ones are of interest to the project?

·        What is the quality of data in each field?

·        And, given this, is the data of sufficient quality to load into the target information environment?

Analysis on complex fields such as names, addresses, and contact information can be done to determine the patterns and consistency of the data. This provides information on the consistency of the values in a standard format and if the format is one that has been used already across similar fields on other tables within the database. From a vendor perspective, separate tools may be used to profile simple fields (ones that do not require parsing) as opposed to complex fields.
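As a simple illustration of column profiling, the sketch below produces a value-frequency report and a “data pattern” report for one column. The pattern convention used (digits mapped to 9, letters to A) is a common approach and is shown here as an assumption rather than the behaviour of any particular vendor tool.

```python
# Illustrative sketch only: value-frequency and "data pattern" analysis for a
# single column. The pattern rules (digit -> 9, letter -> A, other characters
# kept) are an assumed convention, not a specific vendor tool's behaviour.
from collections import Counter

def data_pattern(value: str) -> str:
    """Map each character to a pattern symbol: digit -> 9, letter -> A."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile_column(values: list) -> tuple:
    """Return (value frequencies, pattern frequencies) for one column."""
    return Counter(values), Counter(data_pattern(v) for v in values)

if __name__ == "__main__":
    phone_numbers = ["03 9555 1234", "0395551234", "9555-1234", "unknown"]
    value_freq, pattern_freq = profile_column(phone_numbers)
    print("Distinct values:", len(value_freq))
    for pattern, count in pattern_freq.most_common():
        print(f"pattern {pattern!r}: {count}")
```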

3.2.2    Table Profiling

Table Analysis is used to determine the key information and also relationship and dependency patterns between the data in each field within the table. It is done across every field within each table in the source system. Table Analysis is important as there are some situations where it is absolutely essential to examine values in two or more columns. A blank value in Column A might be acceptable if there is a valid value in Column B, but unacceptable if Column B is also blank. Similarly, a valid value in Column A (e.g., product code “HL”) and a valid value in Column B (e.g., sub-product code “521”) might be an invalid code combination. In such situations you must look at value combinations across columns rather than just examine values within a single column. A tools-based approach is undoubtedly the best choice for examining these cross-column or “table based” dependencies.
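The sketch below illustrates such a cross-column check using the product / sub-product example above; the set of valid combinations is hypothetical.

```python
# A small sketch of a cross-column (table-level) check, using the product /
# sub-product example above. The list of valid combinations is hypothetical.

VALID_COMBINATIONS = {
    ("HL", "520"),
    ("HL", "522"),
    ("SV", "100"),
}

def invalid_combinations(rows: list) -> list:
    """Rows whose product / sub-product pairing is not an allowed combination."""
    return [
        row for row in rows
        if (row["product_code"], row["sub_product_code"]) not in VALID_COMBINATIONS
    ]

if __name__ == "__main__":
    accounts = [
        {"account": "A1", "product_code": "HL", "sub_product_code": "520"},
        {"account": "A2", "product_code": "HL", "sub_product_code": "521"},  # invalid pairing
    ]
    for row in invalid_combinations(accounts):
        print("Invalid combination:", row)
```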

3.2.3    Multi-Table Profiling

Multi-Table Analysis looks for relationships among columns existing in different tables, rather than just within a single table. Multi-Table profiling shows the relationships between tables using key fields as links to bridge the multiple tables. The aim is to analyse and determine the existence of referential integrity issues between each table analysed, e.g. orphans and clients without address and contact information. If the profiling analysis reveals that two columns in different tables share the same set of values and one is not a foreign key and the other a primary key, then redundant data exists between the two tables. For example, an employee table and an employee demographics table might both contain a “home phone number” column and the same values within these columns. Thus, an employee’s home phone number might be stored redundantly in two separate tables.
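A minimal sketch of a referential-integrity (orphan) check between two tables follows; the table and column names are hypothetical.

```python
# Illustrative sketch only: a referential-integrity (orphan) check between two
# tables, e.g. loans that reference a customer_id with no matching customer
# record. Table and column names are hypothetical.

def orphan_rows(child_rows: list, parent_rows: list, key: str) -> list:
    """Child rows whose key value does not appear in the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]

if __name__ == "__main__":
    customers = [{"customer_id": "C001"}, {"customer_id": "C002"}]
    loans = [
        {"loan_id": "L10", "customer_id": "C001"},
        {"loan_id": "L11", "customer_id": "C999"},   # orphan: no such customer
    ]
    print("Orphan loans:", orphan_rows(loans, customers, "customer_id"))
```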

3.3        The Iterative Profiling Process

It is important to note that there is an iterative approach to profiling within each of the analysis steps below. Each step involves:

·        Running the analysis within the appropriate Vendor Tool

·        Analysing the results of each analysis

·        Verifying the results with the Source System SME

·        Documenting the results (both in deliverables and within the profiling tools)

·        Planning further analysis based on the results

The data investigation process should verify what the source system owners say about the data against the actual data; it should also verify what the data says against what the source system owners say.

This process is iterative – for example, a source system owner may say that all customer names must have a full first name and full surname. However, when this rule is checked against the data, it may show that 10% of the records have only a first initial. In this case, this must be discussed with the source system owner. This type of anomaly may be explained by a business rule that was applied to new data but not to historical data. Further analysis is performed in this case to verify that all anomalous records were created before the expected date. Data Re-Engineering also follows an iterative process to standardise, correct, match and enrich data.
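The sketch below illustrates this kind of iterative rule check: it measures how many records break the stated “full first name” rule and whether the exceptions all pre-date an assumed rule cut-over date. The cut-over date and field names are hypothetical.

```python
# A minimal sketch of the iterative check described above: test the stated rule
# ("all customer names have a full first name") and, for the exceptions, see
# whether they pre-date an assumed rule cut-over date. Dates and field names
# are hypothetical.
from datetime import date

RULE_CUTOVER = date(2015, 1, 1)   # assumed date the business rule was introduced

def initial_only(first_name: str) -> bool:
    """True if the first name looks like a bare initial, e.g. 'J' or 'J.'."""
    return len(first_name.strip().rstrip(".")) == 1

def anomaly_report(customers: list) -> dict:
    exceptions = [c for c in customers if initial_only(c["first_name"])]
    return {
        "exception_rate": len(exceptions) / len(customers) if customers else 0.0,
        "all_created_before_cutover": all(c["created"] < RULE_CUTOVER for c in exceptions),
    }

if __name__ == "__main__":
    sample = [
        {"first_name": "J.", "created": date(2012, 6, 3)},
        {"first_name": "Maria", "created": date(2021, 9, 17)},
    ]
    print(anomaly_report(sample))
```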

3.4        Typical Investigations Performed

3.4.1    Column Profiling (Simple Fields)

Column Profiling of Simple Fields is most typically done to check for Completeness, Uniqueness and Validity. An example would be the Account Number field within the Client_Accounts table. The Account Number column is analysed for:

 

Column Profiling (Simple Fields)

| Field | Test | Description |
| --- | --- | --- |
| Account Number | Completeness | All records contain a non-zero value |
| | Uniqueness | All records contain a different account number |
| | Validity | Ensure all account numbers are numeric or follow the source system's account number structure |
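As an illustration, the sketch below applies the three tests above to an Account Number column; the eight-digit account number structure is a hypothetical example of a source system's format.

```python
# Illustrative sketch only: the three simple-field tests above applied to an
# Account Number column. The 8-digit structure is a hypothetical example of a
# source system's account number format.
import re

ACCOUNT_PATTERN = re.compile(r"^\d{8}$")   # assumed structure: exactly 8 digits

def profile_account_numbers(values: list) -> dict:
    non_blank = [v for v in values if v and str(v).strip() not in ("", "0")]
    return {
        "completeness": len(non_blank) / len(values) if values else 0.0,
        "uniqueness": len(set(non_blank)) / len(non_blank) if non_blank else 0.0,
        "validity": (sum(1 for v in non_blank if ACCOUNT_PATTERN.match(str(v))) / len(non_blank)
                     if non_blank else 0.0),
    }

if __name__ == "__main__":
    accounts = ["12345678", "12345678", "8765432X", "", "23456789"]
    for test, result in profile_account_numbers(accounts).items():
        print(f"{test}: {result:.0%}")
```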

3.4.2    Column Profiling (Complex Fields)

Column Profiling of Complex Fields is most typically conducted to verify the validity/uniqueness, format accuracy and completeness. An example would be looking at Customer Records:

 

Column Profiling (Complex Fields)

| Field | Test | Description |
| --- | --- | --- |
| Multiple Fields | Completeness | As per Client Name: all the name fields should contain a title, first name, and last name. Not all person names contain middle names or suffixes such as Jnr. |
| | Uniqueness | Presence of a title at the start of the name; has a first name and a last name |
| | Validity | No special characters (e.g. #, @, %) present in any of the name fields |

3.4.3    Table Profiling

Table Profiling is most typically conducted to verify the validity/uniqueness, format accuracy and completeness. An example would be looking at Customer Records:

 

Table Profiling

| Field | Test | Description |
| --- | --- | --- |
| Customer ID | Completeness | Is a unique value populated within this field to uniquely identify each customer record? If so, is there a high population ratio on this field (i.e. 100% populated)? |
| | Uniqueness | Presence of a Primary Key or Foreign Key in relational databases. Are these keys unique to each record? |
| | Validity | Are these key fields populated with valid (i.e. unique) values? |
| | Format | Are these values in a consistent format (i.e. numeric), or do they contain leading alphas? |

3.4.4    Multi-Table Profiling

Multi-Table Profiling is conducted to verify the validity/uniqueness and accuracy of the key fields to ensure referential integrity. An example would be to assess the Customer table against a Customer Loan table:

 

Multi-Table Profiling: Customer Table

| Field | Test | Description |
| --- | --- | --- |
| Customer ID | Completeness | Is a unique value populated within this field to uniquely identify each customer record within the Customer table? If so, is there a high population ratio on this field (i.e. 100% populated)? |
| | Uniqueness | Presence of a Primary Key or Foreign Key in relational databases. Are these keys unique to each record? |
| | Validity | Are these key fields populated with valid (i.e. unique) values? |
| | Format | Are these values in a consistent format (i.e. numeric), or do they contain leading alphas? |
