Data Mining (5 page)


Authors: Mehmed Kantardzic


At this early time in the evolution of data warehouses, it is not surprising to find many projects floundering because of the basic misunderstanding of what a data warehouse is. What does surprise is the size and scale of these projects. Many companies err by not defining exactly what a data warehouse is, the business problems it will solve, and the uses to which it will be put. Two aspects of a data warehouse are most important for a better understanding of its design process: the first is the specific types (classification) of data stored in a data warehouse, and the second is the set of transformations used to prepare the data in the final form such that it is useful for decision making. A data warehouse includes the following categories of data, where the classification is accommodated to the time-dependent data sources:

1. old detail data

2. current (new) detail data

3. lightly summarized data

4. highly summarized data

5. meta-data (the data directory or guide).

To prepare these five types of elementary or derived data in a data warehouse, the fundamental types of data transformation are standardized. There are four main types of transformations, and each has its own characteristics:

1. Simple Transformations. These transformations are the building blocks of all other more complex transformations. This category includes manipulation of data that are focused on one field at a time, without taking into account their values in related fields. Examples include changing the data type of a field or replacing an encoded field value with a decoded value.

2. Cleansing and Scrubbing. These transformations ensure consistent formatting and usage of a field, or of related groups of fields. This can include a proper formatting of address information, for example. This class of transformations also includes checks for valid values in a particular field, usually checking the range or choosing from an enumerated list.

3. Integration. This is a process of taking operational data from one or more sources and mapping them, field by field, onto a new data structure in the data warehouse. The common identifier problem is one of the most difficult integration issues in building a data warehouse. Essentially, this situation occurs when there are multiple system sources for the same entities, and there is no clear way to identify those entities as the same. This is a challenging problem, and in many cases it cannot be solved in an automated fashion. It frequently requires sophisticated algorithms to pair up probable matches. Another complex data-integration scenario occurs when there are multiple sources for the same data element. In reality, it is common that some of these values are contradictory, and resolving a conflict is not a straightforward process. Just as difficult as having conflicting values is having no value for a data element in a warehouse. All these problems and corresponding automatic or semiautomatic solutions are always domain-dependent.

4. Aggregation and Summarization. These are methods of condensing instances of data found in the operational environment into fewer instances in the warehouse environment. Although the terms aggregation and summarization are often used interchangeably in the literature, we believe that they do have slightly different meanings in the data-warehouse context. Summarization is a simple addition of values along one or more data dimensions, for example, adding up daily sales to produce monthly sales. Aggregation refers to the addition of different business elements into a common total; it is highly domain dependent. For example, aggregation is adding daily product sales and monthly consulting sales to get the combined, monthly total.
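The four transformation types above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, decode table, list of valid states, and the 0.8 similarity threshold are all invented here, and the string-similarity check is a deliberately crude stand-in for the "sophisticated algorithms" that the common identifier problem usually requires.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical operational records; all field names and values are invented.
sales = [
    {"date": "2024-01-05", "product": "A", "amount": "120.50", "state": "ky"},
    {"date": "2024-01-19", "product": "A", "amount": "80.00", "state": "KY "},
    {"date": "2024-02-02", "product": "B", "amount": "200.00", "state": "XX"},
]
PRODUCT_NAMES = {"A": "Widget", "B": "Gadget"}  # decode table (assumed)
VALID_STATES = {"KY", "OH", "TN"}               # enumerated list (assumed)


def simple_transform(row):
    """Simple transformation: one field at a time (type change, decoding)."""
    out = dict(row)
    out["amount"] = float(row["amount"])
    out["product"] = PRODUCT_NAMES[row["product"]]
    return out


def cleanse(row):
    """Cleansing: consistent formatting plus a check against valid values."""
    out = dict(row)
    state = row["state"].strip().upper()
    out["state"] = state if state in VALID_STATES else None
    return out


def probable_match(name_a, name_b, threshold=0.8):
    """Integration: crude common-identifier check via string similarity."""
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold


def summarize_monthly(rows):
    """Summarization: add values along the time dimension (daily to monthly)."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["date"][:7]] += r["amount"]
    return dict(totals)


def aggregate(product_monthly, consulting_monthly):
    """Aggregation: add different business elements into a common total."""
    months = sorted(set(product_monthly) | set(consulting_monthly))
    return {m: product_monthly.get(m, 0.0) + consulting_monthly.get(m, 0.0)
            for m in months}


clean_rows = [cleanse(simple_transform(r)) for r in sales]
monthly_products = summarize_monthly(clean_rows)
consulting = {"2024-01": 1000.0, "2024-02": 500.0}  # hypothetical monthly consulting sales
combined = aggregate(monthly_products, consulting)
```

Here the invalid state code "XX" is replaced with None rather than guessed, which mirrors the point that a missing value is itself a problem the warehouse designer must resolve in a domain-dependent way.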

These transformations are the main reason why we prefer a warehouse as a source of data for a data-mining process. If the data warehouse is available, the preprocessing phase in data mining is significantly reduced, sometimes even eliminated. Do not forget that this preparation of data is the most time-consuming phase. Although the implementation of a data warehouse is a complex task, described in many texts in great detail, in this text we are giving only the basic characteristics. A three-stage data-warehousing development process is summarized through the following basic steps:

1. Modeling. In simple terms, to take the time to understand business processes, the information requirements of these processes, and the decisions that are currently made within processes.

2. Building. To establish requirements for tools that suit the types of decision support necessary for the targeted business process; to create a data model that helps further define information requirements; to decompose problems into data specifications and the actual data store, which will, in its final form, represent either a data mart or a more comprehensive data warehouse.

3. Deploying. To implement, relatively early in the overall process, the nature of the data to be warehoused and the various business intelligence tools to be employed; to begin by training users. The deploy stage explicitly contains a time during which users explore both the repository (to understand data that are and should be available) and early versions of the actual data warehouse. This can lead to an evolution of the data warehouse, which involves adding more data, extending historical periods, or returning to the build stage to expand the scope of the data warehouse through a data model.

Data mining represents one of the major applications for data warehousing, since the sole function of a data warehouse is to provide information to end users for decision support. Unlike other query tools and application systems, the data-mining process provides an end user with the capacity to extract hidden, nontrivial information. Such information, although more difficult to extract, can provide bigger business and scientific advantages and yield higher returns on “data-warehousing and data-mining” investments.

How is data mining different from other typical applications of a data warehouse, such as structured query languages (SQL) and online analytical processing tools (OLAP), which are also applied to data warehouses? SQL is a standard relational database language that is good for queries that impose some kind of constraints on data in the database in order to extract an answer. In contrast, data-mining methods are good for queries that are exploratory in nature, trying to extract hidden, not so obvious information. SQL is useful when we know exactly what we are looking for, and we can describe it formally. We will use data-mining methods when we know only vaguely what we are looking for. Therefore these two classes of data-warehousing applications are complementary.
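The contrast can be made concrete with a small sketch. The loan table, its column names, and its rows are invented for illustration, and the exhaustive one-rule threshold search is only a toy stand-in for a data-mining method, not a technique taken from this text. The SQL query states its constraints in advance; the search is told nothing about which attribute matters.

```python
import sqlite3

# Hypothetical loan table: (customer_id, income, debt_ratio, defaulted).
rows = [
    (1, 22000, 0.10, 0),
    (2, 25000, 0.55, 1),
    (3, 61000, 0.60, 1),
    (4, 58000, 0.15, 0),
    (5, 30000, 0.20, 0),
    (6, 75000, 0.70, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loans (customer_id INTEGER, income REAL, "
    "debt_ratio REAL, defaulted INTEGER)"
)
conn.executemany("INSERT INTO loans VALUES (?, ?, ?, ?)", rows)

# SQL: we know exactly what we are looking for and describe it formally.
sql_answer = [r[0] for r in conn.execute(
    "SELECT customer_id FROM loans "
    "WHERE income < 40000 AND defaulted = 1 ORDER BY customer_id"
)]


def best_split(records, attributes):
    """Exploratory search: try every attribute and threshold and keep the
    rule 'value > threshold means default' with the fewest errors."""
    best = None
    for name, idx in attributes.items():
        for threshold in sorted({rec[idx] for rec in records}):
            errors = sum((rec[idx] > threshold) != bool(rec[3])
                         for rec in records)
            if best is None or errors < best[2]:
                best = (name, threshold, errors)
    return best


# Returns (attribute name, threshold, number of misclassified records).
found = best_split(rows, {"income": 1, "debt_ratio": 2})
```

On this invented data the query returns only what was asked for, while the search reports that debt ratio, an attribute the query never mentioned, separates the defaults better than income does.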

OLAP tools and methods have become very popular in recent years as they let users analyze data in a warehouse by providing multiple views of the data, supported by advanced graphical representations. In these views, different dimensions of data correspond to different business characteristics. OLAP tools make it very easy to look at dimensional data from any angle or to slice-and-dice it. OLAP is part of the spectrum of decision support tools. Traditional query and report tools describe what is in a database. OLAP goes further; it is used to answer why certain things are true. The user forms a hypothesis about a relationship and verifies it with a series of queries against the data. For example, an analyst might want to determine the factors that lead to loan defaults. He or she might initially hypothesize that people with low incomes are bad credit risks and analyze the database with OLAP to verify (or disprove) this assumption. In other words, the OLAP analyst generates a series of hypothetical patterns and relationships and uses queries against the database to verify them or disprove them. OLAP analysis is essentially a deductive process.
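The loan-default example can be sketched as a roll-up along a dimension the analyst chooses in advance. The records, the region dimension, and the 30,000 cutoff for "low income" are assumptions made for this illustration; the point is only that the hypothesis comes from the analyst and the data are used to check it.

```python
from collections import defaultdict

# Hypothetical loan records: (income, region, defaulted); values are invented.
loans = [
    (18000, "north", 1), (24000, "north", 1), (27000, "south", 0),
    (52000, "south", 0), (66000, "north", 0), (71000, "south", 1),
]


def income_band(income, cutoff=30000):
    """Assumed banding: the analyst's own definition of 'low income'."""
    return "low" if income < cutoff else "high"


def default_rate_by(records, key):
    """Roll up the default rate along one dimension chosen by the analyst."""
    counts = defaultdict(lambda: [0, 0])  # key -> [defaults, total]
    for rec in records:
        cell = counts[key(rec)]
        cell[0] += rec[2]
        cell[1] += 1
    return {k: defaults / total for k, (defaults, total) in counts.items()}


# Hypothesis stated first, then verified (or disproved) with a query.
by_income = default_rate_by(loans, lambda r: income_band(r[0]))
hypothesis_supported = by_income["low"] > by_income["high"]

# The same data viewed along a different dimension.
by_region = default_rate_by(loans, lambda r: r[1])
```

Each call answers one question the analyst already had in mind; nothing in the code proposes a new relationship on its own, which is the deductive character described above.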

Although OLAP tools, like data-mining tools, provide answers that are derived from data, the similarity between them ends here. The derivation of answers from data in OLAP is analogous to calculations in a spreadsheet; because they use simple and given-in-advance calculations, OLAP tools do not learn from data, nor do they create new knowledge. They are usually special-purpose visualization tools that can help end users draw their own conclusions and decisions, based on graphically condensed data. OLAP tools are very useful for the data-mining process; they can be a part of it, but they are not a substitute.

1.6 BUSINESS ASPECTS OF DATA MINING: WHY A DATA-MINING PROJECT FAILS

Data mining in various forms is becoming a major component of business operations. Almost every business process today involves some form of data mining. Customer Relationship Management, Supply Chain Optimization, Demand Forecasting, Assortment Optimization, Business Intelligence, and Knowledge Management are just some examples of business functions that have been impacted by data-mining techniques. Even though data mining has been successful in becoming a major component of various business and scientific processes as well as in transferring innovations from academic research into the business world, the gap between the problems that the data-mining research community works on and real-world problems is still significant. Most business people (marketing managers, sales representatives, quality assurance managers, security officers, and so forth) who work in industry are only interested in data mining insofar as it helps them do their job better. They are uninterested in technical details and do not want to be concerned with integration issues; a successful data-mining application has to be integrated seamlessly into an application. Bringing an algorithm that is successful in the laboratory to an effective data-mining application with real-world data in industry or the scientific community can be a very long process. Issues like cost effectiveness, manageability, maintainability, software integration, ergonomics, and business process reengineering come into play as significant components of a potential data-mining success.

Data mining in a business environment can be defined as the effort to generate actionable models through automated analysis of a company’s data. In order to be useful, data mining must have a financial justification. It must contribute to the central goals of the company by, for example, reducing costs, increasing profits, improving customer satisfaction, or improving the quality of service. The key is to find actionable information, or information that can be utilized in a concrete way to improve the profitability of a company. For example, credit-card marketing promotions typically generate a response rate of about 1%. Practice shows that this rate is improved significantly through data-mining analyses. In the telecommunications industry, a big problem is churn, which occurs when customers switch carriers. When dropped calls, mobility patterns, and a variety of demographic data are recorded, and data-mining techniques are applied, churn is reduced by an estimated 61%.

Data mining does not replace skilled business analysts or scientists but rather gives them powerful new tools and the support of an interdisciplinary team to improve the job they are doing. Today, companies collect huge amounts of data about their customers, partners, products, and employees as well as their operational and financial systems. They hire professionals (either locally or outsourced) to create data-mining models that analyze collected data to help business analysts create reports and identify trends so that they can optimize their channel operations, improve service quality, and track customer profiles, ultimately reducing costs and increasing revenue. Still, there is a semantic gap between data miners, who talk about regressions, accuracy, and ROC curves, and business analysts, who talk about customer retention strategies, addressable markets, profitable advertising, and so on. Therefore, in all phases of a data-mining process, a core requirement is understanding, coordination, and successful cooperation between all team members. The best results in data mining are achieved when data-mining experts combine their experience with that of organizational domain experts. While neither group needs to be fully proficient in the other’s field, it is certainly beneficial to have a basic background across areas of focus.

Introducing a data-mining application into an organization is essentially not very different from any other software application project, and the following conditions have to be satisfied:

  • There must be a well-defined problem.
  • The data must be available.
  • The data must be relevant, adequate, and clean.
  • The problem should not be solvable by means of ordinary query or OLAP tools only.
  • The results must be actionable.

A number of data mining projects have failed in the past years because one or more of these criteria were not met.

The initial phase of a data-mining process is essential from a business perspective. It focuses on understanding the project objectives and business requirements, and then converting this knowledge into a data-mining problem definition and a preliminary plan designed to achieve the objectives. The first objective of the data miner is to understand thoroughly, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The data miner’s goal is to uncover important factors at the beginning that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions. Data-mining projects do not fail because of poor or inaccurate tools or models. The most common pitfalls in data mining involve a lack of training, overlooking the importance of a thorough pre-project assessment, not employing the guidance of a data-mining expert, and not developing a strategic project definition adapted to what is essentially a discovery process. A lack of competent assessment, environmental preparation, and resulting strategy is precisely why the vast majority of data-mining projects fail.
