Basic Concepts of KDD & DM

  1. Data Mining & KDD

    • Definition of KDD

      • Data -> Pattern -> Knowledge
    • Definition of Data Mining

      • Data -> Pattern
  2. Process of KDD (Goal -> Collection -> Transformation -> Mining -> Evaluation)
    [Figure: Process of KDD]
  • Step1: Goal setting

    • What kind of knowledge would you like to learn?
  • Step2: Data collection

    • Not all the data is useful; filter away those that are not needed.
  • Step3: Data transformation (preparation, cleaning)

    • Transform the data into a format that is convenient for an algorithm to work on.
  • Step4: Data mining

    • Apply an algorithm on the data to find some patterns
  • Step5: Interpretation & Evaluation

    • Find actionable patterns and translate them into knowledge
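
The five steps can be sketched as a small pipeline. This is a minimal illustration only: the toy records, the function names (`select`, `transform`, `mine`, `evaluate`), and the choice of a single income threshold as the "pattern" are all assumptions made for the example, not part of the course material.

```python
# Step 1 (goal, assumed for this sketch): learn what predicts a purchase.
# Hypothetical toy records; None marks a missing value.
raw_records = [
    {"age": 25, "income": 30000, "bought": "no"},
    {"age": 47, "income": 82000, "bought": "yes"},
    {"age": None, "income": 51000, "bought": "yes"},
    {"age": 52, "income": 91000, "bought": "yes"},
    {"age": 31, "income": 38000, "bought": "no"},
]

def select(records):
    """Keep only usable records (here: records without missing values)."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Put each record into a (features, label) form the algorithm can use."""
    return [((r["age"], r["income"] / 1000), r["bought"]) for r in records]

def mine(dataset):
    """Find the income threshold that separates buyers from non-buyers best."""
    best = None
    for (_, threshold), _ in dataset:
        correct = sum(
            (income >= threshold) == (label == "yes")
            for (_, income), label in dataset
        )
        if best is None or correct > best[1]:
            best = (threshold, correct)
    return best

def evaluate(pattern, dataset):
    """Translate the mined pattern into a human-readable statement."""
    threshold, correct = pattern
    return f"income >= {threshold}k predicts purchase ({correct}/{len(dataset)} correct)"

dataset = transform(select(raw_records))
pattern = mine(dataset)
print(evaluate(pattern, dataset))
```

If the resulting statement is not useful, one would go back and adjust an earlier step (for example, keep different attributes) and run the pipeline again.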
  1. Goal of KDD

    1. Find valid, novel, useful, understandable patterns
    2. Find knowledge through patterns
  2. Characteristics of KDD

    • Iterative

      • A single pass through the KDD process may not produce the result you want. Very often, some steps need to be refined and the whole process repeated.
    • Interactive

      • A certain amount of human involvement is needed to monitor and fine-tune the steps
  3. Traditional database system

    • A database system that supports only simple aggregation functions, for example, should not be considered to provide data mining functionality

      • Data Mining vs. Traditional Database Systems
      • KDD for Knowledge Discovery
  4. Prediction & Description of knowledge

    • The knowledge resulting from data mining should carry a certain degree of predictive ability, descriptive (explanatory) ability, or both

      • Prediction involves using the database records that describe information about past behavior to automatically generate a model (or rule) that can predict future behavior
      • Description involves deriving patterns that summarize the underlying relationships in the data and describe its characteristics
  5. DSS & Data Warehouse

    • DSS & Data Warehouse

      • A decision-support system (DSS) is a system that helps decision makers make important decisions for an organization or business
      • To be effective, a data mining application must have access to organization-wide data. Departmental databases can be integrated into a data warehouse for this purpose
  6. Data Warehousing

    • A data warehouse is an integration of various departmental databases so that access to organization-wide data is possible
    • A data warehouse is a convenient place where KDD and data mining applications are performed
    • A data warehouse can also be used to support other DSS tools, for example On-Line Analytical Processing (OLAP)
  7. OLAP

    • The OLAP approach allows users to view data in a multi-dimensional model (a Data Cube), supporting fast aggregation and summarization operations
      [Figure: OLAP Cube]
      [Figure: Aggregating]
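
As a rough illustration of aggregation over a data cube, the sketch below sums a sales measure over every dimension the user does not ask to keep (a roll-up). The fact table, the three dimension names, and the `roll_up` helper are hypothetical and chosen only for this example; a real OLAP engine would precompute and index such aggregates.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter, sales)
facts = [
    ("East", "TV", "Q1", 100),
    ("East", "TV", "Q2", 150),
    ("East", "Radio", "Q1", 40),
    ("West", "TV", "Q1", 80),
    ("West", "Radio", "Q2", 60),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(facts, keep):
    """Sum the sales measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)

by_region = roll_up(facts, ("region",))
by_region_product = roll_up(facts, ("region", "product"))
grand_total = roll_up(facts, ())
print(by_region)    # {('East',): 290, ('West',): 140}
print(grand_total)  # {(): 430}
```

Each call corresponds to viewing the cube at a different level of detail: fewer kept dimensions means a coarser summary.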
  8. Types of Data Mining Tasks

    • Classification
    • Clustering
    • Association Analysis
    • Regression Analysis
  9. Classification

    • Given: a collection of records (the training set).
    • Each record contains a set of attributes; one of the attributes is the class (also called the label or dependent variable).
    • Find a model that describes the class attribute as a function of the values of other attributes (independent variables).
    • Goal: previously unseen records should be assigned a class as accurately as possible.
      [Figure: Classification]
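
One very simple way to build such a model is a 1-nearest-neighbour classifier, sketched below. The notes do not prescribe this algorithm; it is used here only because it is short. The training records and the `classify` function are assumptions for the example.

```python
import math

# Hypothetical training set: (height_cm, weight_kg) -> class label
training_set = [
    ((150, 50), "small"),
    ((155, 55), "small"),
    ((180, 85), "large"),
    ((185, 90), "large"),
]

def classify(record, training_set):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    nearest = min(training_set, key=lambda pair: math.dist(record, pair[0]))
    return nearest[1]

# Previously unseen records receive the label of their nearest neighbour.
print(classify((152, 53), training_set))  # small
print(classify((178, 80), training_set))  # large
```

The class attribute is treated as a function of the other attributes: the model never sees the label of the new record, only its independent variables.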
  10. Regression

    • Predict the value of a numerical variable based on the values of other variables.
    • Extensively studied in statistics.
    • Examples:

      • Predicting sales amounts of a new product based on advertising expenditure.
      • Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
        [Figure: Linear Regression]
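
For a single predictor, ordinary least squares gives the slope as the covariance of x and y divided by the variance of x. The sketch below fits such a line; the advertising and sales numbers are made up for illustration and the `fit_line` helper is not from the course material.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 5.0 + intercept  # predicted sales for a new expenditure of 5.0
print(slope, intercept, predicted)
```

Because the toy data lie exactly on the line y = 2x + 1, the fit recovers that line; real data would leave residual error.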
  11. Clustering Definition

    • Definition

      • Given a set of data objects, each having a set of attributes, and a similarity measure among them, find clusters such that

        • Objects in one cluster are more similar to one another.
        • Objects in separate clusters are less similar to one another.
      • Typically, cluster analysis requires a user to define a similarity measure between records. Clustering is then performed based on the principle of maximizing the intra-cluster similarity and minimizing the inter-cluster similarity (distance-based clustering).
    • Differences between Clustering and Classification:

      1. Objective:

        • Clustering: The goal is to group data objects into clusters where objects within the same cluster are more similar to each other, and those in different clusters are less similar. Clustering aims to discover the natural structure and patterns in data without prior knowledge of the labels.
        • Classification: The objective is to assign a known category label to a given data sample. Classification is a supervised learning task, requiring a pre-existing set of labeled training data for the model to learn the features of each class and predict labels for new data.
      2. Supervised vs. Unsupervised:

        • Clustering: It is an unsupervised learning method, as the true labels of data are not known initially, and the organization of data is done based on similarity measures.
        • Classification: This is a supervised learning method, as the model training involves using a labeled dataset to learn the characteristics of different classes.
      3. Labels:

        • Clustering: Typically, specific labels are not of primary interest; the focus is on understanding similarities and structures among data points.
        • Classification: The main concern is accurate prediction and assignment of labels, aiming to correctly classify new data into known categories.
      4. Applications:

        • Clustering: Suitable for discovering hidden structures, grouping users, products, or entities, and can be applied in tasks such as image segmentation and text clustering.
        • Classification: Applied for predicting the category of unknown samples, such as spam filtering, handwriting recognition, and other tasks where known labels are essential.

    • In summary, clustering and classification are distinct yet complementary approaches in data analysis. The choice between them depends on the specific task and the nature of the data being analyzed.
  12. Illustrating Clustering
    [Figure: Illustrating Clustering]
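
Distance-based clustering can be illustrated with k-means, one common algorithm of this kind (the notes do not name a specific algorithm, so this choice is an assumption). The points, the starting centroids, and the `k_means` function below are hypothetical.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(clusters)
```

Euclidean distance plays the role of the similarity measure here: a smaller distance means higher similarity. No labels are used at any point, which is what makes the method unsupervised.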
  13. Association Rule Discovery

    • Given a set of records, each of which contains some items from a given collection,
    • produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
      [Figure: Association Rule Discovery]
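
Rules of this kind are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The notes do not define these measures, so the sketch below is an assumption about how a rule might be evaluated; the transactions are made up.

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions with the antecedent, the fraction that also have the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)

# Candidate rule: {diaper} -> {beer}
rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data, diapers and beer appear together in 3 of 5 transactions, and 3 of the 4 transactions with diapers also contain beer.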
  14. Sequence Analysis

    • A sequence database contains sequences of events. Sequence analysis is about finding interesting, frequently occurring (sub)sequences to predict future behavior.
    • Examples: renting movies, buying habits, web surfing behavior, web log analysis.
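
A minimal sketch of finding frequently occurring subsequences: count, for every ordered pair of events, how many sequences contain the first event somewhere before the second. The click sequences, the `frequent_pairs` helper, and the restriction to length-2 subsequences are simplifying assumptions for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical web-log sessions, each an ordered list of page visits
sequences = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart", "checkout"],
    ["search", "product", "reviews"],
    ["home", "search", "product"],
]

def frequent_pairs(sequences, min_support):
    """Return ordered pairs (a, b), a before b, found in at least `min_support` sequences."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves order; set() counts each pair once per sequence
        counts.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_support}

frequent = frequent_pairs(sequences, min_support=3)
print(frequent)
```

A pair such as ("search", "product") being frequent suggests that a user who has searched is likely to view a product later in the session.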
  15. Challenges of Data Mining

    • Technical challenges

      • Scalability
      • Dimensionality
      • Data streams
    • Data challenges

      • Complex and heterogeneous data
      • Data quality
    • Legal challenges

      • Data ownership and distribution
      • Privacy protection and regulations (e.g., GDPR)
      • Algorithmic biases
    • Results

      • Interpretation of patterns
      • Explainability
  16. Privacy

    • The “fair information practice”
      “The primary purpose of the collection must be clearly understood by the consumer and identified at the time of the collection. Data mining, however, is a secondary, future use. As such it requires the explicit consent of the data subject or consumer.”

Citation:

  1. HKU - COMP7103 - Materials