DATA-CENTRIC.md

Data-centric AI: The Underdog Game Changer in AI's Evolution

https://www.kaggle.com/code/muhammadirfanakbar/data-centric-ai-the-underdog-game-changer-in-ai by Muhammad Irfan Akbar

1. Introduction

Data-Centric AI: A Game Changer in Artificial Intelligence

Introduction:

  • Exponential growth of AI driven by data and computational power
  • Different methodologies for optimal AI performance
  • Emphasis on data-centric approach in this report [1]

Background:

  • Traditional AI development: model-centric approach
  • Situations where data is scarce or large-scale computational power is impractical
  • Importance of a data-centric strategy in such cases [1]

Key Concept:

  • Data quality, cleanliness, relevance: crucial for superior performance [1]

Significance:

  • Effective model design through Exploratory Data Analysis (EDA) [1]
  • Useful outputs from well-crafted inputs in language models [1]

Why Data-Centric AI?:

  • Quality data crucial for improving performance [1]
  • Less complex models can achieve superior results with high-quality data [1]

Tasks Involved:

  • Data preprocessing: cleaning, normalization, transformation [1]
  • Data labeling: annotation, tagging, classification [1]
  • Data augmentation: generating new data from existing [1]

Advancements:

  • Recent developments in data preprocessing techniques [1]
  • Data augmentation strategies for improving model performance [1]

Impact on Kaggle Community:

  • Data preprocessing and engineering competitions [1]
  • Increased focus on data quality for better results [1]

Conclusion:

  • Data-centric AI: crucial component in AI research and development [1]
  • Continuous advancements improving performance [1]

2. Why Data-centric AI?

A shift from algorithm-centric to data-centric AI is underway. The recognition that data is key to successful AI models is driving this change. Let's explore why data-centric AI is gaining attention and transforming the field.

2.1 Understanding the Problem: Kaggle's Insight

Kaggle and Data Science Community

  • Kaggle: widely recognized for machine learning competitions
  • Reflects changing trends in data science community utilizing AI

Hypothesis:

  • In the Kaggle community, participants attribute their success more to the models they use than to the data they process

Testing Hypothesis:

  • Analyzed meta Kaggle dataset [2]
  • Contrast observed between topics discussed in forums and winners' write-ups (Fig.1)

Findings:

  • Stark contrast between how often key terms appear in winners' write-ups and in forum discussions
  • Data-related discussion is prevalent in the forums, yet winners' write-ups attribute success mostly to models
  • Data-related terms appear less often in winners' write-ups than in forum posts, indicating that models tend to overshadow data when success is attributed
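
A comparison of this kind can be sketched as a term-frequency count over two text collections. The snippet below is only an illustration: the term lists and the sample sentences are invented for the example and are not the keywords or texts used in the original analysis of the Meta Kaggle dataset [2].

```python
import re
from collections import Counter

# Hypothetical term lists; the source does not state which terms were counted.
DATA_TERMS = {"data", "feature", "cleaning", "augmentation", "label"}
MODEL_TERMS = {"model", "ensemble", "xgboost", "cnn", "transformer"}


def term_share(texts, terms):
    """Return the fraction of all word tokens in `texts` that belong to `terms`."""
    tokens = [tok for text in texts for tok in re.findall(r"[a-z]+", text.lower())]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[t] for t in terms) / len(tokens)


# Toy stand-ins for forum posts and winners' write-ups (not real Kaggle text).
forum_posts = ["How do I clean the data and fix label noise?",
               "Any feature ideas for this data?"]
writeups = ["Our ensemble of a CNN model and an XGBoost model won.",
            "Final model: transformer ensemble."]

for name, corpus in [("forum", forum_posts), ("write-ups", writeups)]:
    print(name,
          "data share:", round(term_share(corpus, DATA_TERMS), 3),
          "model share:", round(term_share(corpus, MODEL_TERMS), 3))
```

Using shares rather than raw counts keeps the two collections comparable even when one is much larger than the other.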

Fig.1. Comparison of Frequent Words Used in Winners' Write-ups and Forum Discussions

Implications:

  • Data-centric challenges often cause significant difficulties, yet models frequently receive the spotlight
  • Simplicity can often outperform complexity in data processing, especially with proper handling and quality of data (Philosophy of data-centric AI).

2.2 The Art and Science of Data Techniques

Importance of High-Quality Data for AI Models

  • A state-of-the-art model with poor quality data will yield subpar results
  • High-quality data is a non-negotiable ingredient in successful AI models
  • Securing high-quality data is challenging:
    • Each dataset brings its own challenges
    • Data preparation requires careful management
      • Collection and preprocessing
      • Feature engineering and data augmentation

Art vs. Science of Data Analysis

  • Data analysis involves both art and science
  • Qualitative data requires careful interpretation, like creating a narrative
  • Presenting data is an artistic endeavor to tell a clear story
  • Data analysis is often exploratory, probing and refining the data

Shift Towards Data-Centric AI

  • Data-centric AI is shifting the field towards a more systematic and empirical approach
  • Traditional data augmentation was intuitive; it is now being systematized in comprehensive papers
  • Other aspects of data preprocessing are also seeing scientific transformations

Fig.2. The Systematic Diagram of Data Augmentation

  • The diagram illustrates the systematic way of carrying out data augmentation, irrespective of data shape.

3. Tasks in Data-centric AI

The operational framework of data-centric AI is non-linear and consists of three interdependent tasks: Training Data Development, Inference Data Development, and Data Maintenance, which can be addressed concurrently.

3.1 Training Data Development

Process:

  • Collection: gathering data from various sources
  • Labeling: assigning tags or categories to data points
  • Preparation: cleaning, transforming, and formatting data for machine learning
  • Reduction: reducing the size of data to improve performance and reduce computational requirements
  • Augmentation: generating new data by applying transformations, such as rotation, scaling, and flipping
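
The augmentation step can be illustrated with a minimal sketch. It assumes an image is a small grid of pixel values stored as nested lists, which is a simplification chosen here so the example needs no third-party library; real pipelines typically use an array or imaging library. The function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the grid 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image):
    """Return the original image plus a flipped copy and three rotated copies."""
    variants = [image, flip_horizontal(image)]
    rotated = image
    for _ in range(3):
        rotated = rotate_90_clockwise(rotated)
        variants.append(rotated)
    return variants


image = [[1, 2],
         [3, 4]]
for variant in augment(image):
    print(variant)
```

Each variant would keep the label of the original image, which only holds for tasks where flips and rotations do not change the class.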

Importance:

  • High-quality input data essential for accurate and reliable output from models
  • "Garbage in, garbage out" principle applies
  • Ensures machine learning models learn meaningful patterns.

Fig.3. Training data development process (from [5])

3.2 Inference Data Development

Data-Centric AI (DCAI)

  • Inference data: unseen data used to evaluate model performance
  • Model evaluation focuses on providing finer details beyond just metrics
  • Importance of effective models on unseen data

Inference Data Development:

  1. In-distribution Evaluation: assesses the model's performance on data drawn from the same distribution as the training data
  2. Out-of-distribution Evaluation: measures how well the model performs on data that diverges from the training distribution
    • Autonomous vehicles as a real-world application: encountering unfamiliar routes calls for out-of-distribution evaluation
  3. Prompt Engineering: devising effective inputs to obtain desired outputs from models

Out-of-distribution Evaluation:

  • Example in autonomous vehicles: navigating unfamiliar routes (Fig.4)
  • Model may not have encountered such data during training
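
The gap between the two kinds of evaluation can be sketched with a toy one-dimensional classifier. Everything here is an assumption made for illustration: the synthetic data generator, the midpoint-threshold model, and the size of the shift are not taken from the source.

```python
import random


def make_split(rng, n, shift=0.0):
    """Draw n labeled points: class 0 centered at 0, class 1 at 2, both moved by `shift`."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label + shift, 0.5)
        data.append((x, label))
    return data


def fit_threshold(train):
    """Toy model: the midpoint between the two class means."""
    means = []
    for cls in (0, 1):
        xs = [x for x, y in train if y == cls]
        means.append(sum(xs) / len(xs))
    return (means[0] + means[1]) / 2


def accuracy(threshold, data):
    """Fraction of points whose side of the threshold matches their label."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)


rng = random.Random(0)
train = make_split(rng, 500)
in_dist = make_split(rng, 500)              # same distribution as training
out_dist = make_split(rng, 500, shift=1.5)  # every input moved by 1.5

threshold = fit_threshold(train)
print("in-distribution accuracy:    ", round(accuracy(threshold, in_dist), 3))
print("out-of-distribution accuracy:", round(accuracy(threshold, out_dist), 3))
```

Reporting both numbers side by side is the point of the exercise: a single aggregate metric would hide how sharply the model degrades once inputs move away from what it saw in training.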

Prompt Engineering:

  • Crucial aspect of data-centric AI
  • Art and science of crafting optimal prompts for desired outputs.

3.3 Data Maintenance

Real-world data evolves continuously, unlike Kaggle's static datasets. Data Maintenance ensures that this influx is processed and integrated so that the training data stays up to date. This process, referred to here as Continual Learning, enables AI systems to adapt to changing data and maintain accuracy. Together, these three tasks form the foundation of a data-centric AI approach.
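
One way to picture a maintenance loop is a check that compares each incoming batch with the data already collected before merging it. The sketch below is an assumption-laden illustration, not a method described in the source: the drift rule (batch mean more than three standard errors from the pool mean) and the function names are hypothetical.

```python
from statistics import mean, stdev


def drifted(reference, batch, z_limit=3.0):
    """Flag a batch whose mean lies more than z_limit standard errors from the reference mean."""
    standard_error = stdev(reference) / len(batch) ** 0.5
    return abs(mean(batch) - mean(reference)) > z_limit * standard_error


def maintain(training_pool, batch, z_limit=3.0):
    """Append the batch to the pool and report whether a retrain looks warranted."""
    needs_retrain = drifted(training_pool, batch, z_limit)
    training_pool.extend(batch)
    return needs_retrain


pool = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
print(maintain(pool, [10.1, 9.9, 10.3, 9.7]))    # batch resembles the pool
print(maintain(pool, [14.0, 15.0, 14.5, 15.5]))  # batch has moved away from the pool
```

The check runs before the batch is merged so that the new data cannot mask its own drift.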

4. Recent Advancements in Data-centric AI

Data-centric AI has grown rapidly due to breakthroughs in various fields over the past two years. Some key developments highlight the shift towards a data-centric approach.

4.1 Prompt Engineering for Large Language Models (LLMs) and Text-to-Image Models

Prompt Engineering:

  • Gained significant traction in recent years
  • Importance of optimizing inputs to maximize model performance
  • Accessible LLMs like ChatGPT and text-to-image models (e.g., Dall-E, Stable Diffusion) require prompt engineering to enhance efficiency

Value of Prompt Engineering:

  • Refines the inference data, since the underlying model is fixed and cannot be altered by the user
  • Shifts focus towards data-centric AI
  • Established specialized roles for "prompt engineering"

Examples:

  • ChatGPT:
    • Well-designed prompt: "Generate a code to show the trend of column A using a bar graph in matplotlib" (accurate code)
    • Poorly-constructed prompt: "show the trend of dataset A" (ambiguous request that tends to yield poor code)
  • Dall-E:
    • Well-designed prompt: "An armchair in the shape of an avocado" (exact image generated)
    • Poorly-constructed prompt: "A unique chair" (images may not align with intended concept)
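
The difference between the two ChatGPT prompts above is that the good one names the library, the chart type, and the column. A small template function can make those fields explicit. This is a sketch only: `build_plot_prompt` and its parameters are hypothetical names, and the function does not call any model.

```python
def build_plot_prompt(library, chart_type, column, goal):
    """Assemble a specific code-generation prompt from explicit fields."""
    required = {"library": library, "chart_type": chart_type,
                "column": column, "goal": goal}
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError("underspecified prompt, missing: " + ", ".join(missing))
    return (f"Generate code to show the {goal} of column {column} "
            f"using a {chart_type} in {library}.")


print(build_plot_prompt("matplotlib", "bar graph", "A", "trend"))
```

Rejecting empty fields turns the vagueness of a prompt like "show the trend of dataset A" into an explicit error before anything is sent to a model.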

Impact on AI Community:

  • Critical to LLMs and text-to-image models
  • Represents a substantial shift towards data-centric AI problem-solving

4.2 "Segment Anything": A Leap in Segmentation and Data Engine

Data-Centric AI in Segment Anything Model (SAM)

Introduction:

  • Groundbreaking innovation in segmentation with Meta AI's "Segment Anything" [12]
  • Focus on data rather than model complexity for impressive results [12]
  • Data engine and dataset components crucial to SAM's success [22]

Data Engine and Dataset Components:

  • SA-1B Dataset: collection of 11 million images and 1.1 billion segmentation masks [22]
  • Three stages for efficient data gathering:

  1. Assisted Manual: annotators work with SAM to capture masks within an image
  2. Semi-Automatic: annotators annotate masks where SAM is unable to generate confident predictions
  3. Full-Auto: SAM fully predicts masks, differentiating ambiguous masks through a comprehensive sweep [22]

  • Data-centric approach: focuses on quality, diversity, and application-specific nature of data used rather than model sophistication [22]
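
The division of labor in the semi-automatic stage can be pictured as a confidence-based routing step. The sketch below is not SAM's code or API: the `MaskProposal` class, the `route_proposals` function, and the 0.9 threshold are all hypothetical and exist only to illustrate the idea of sending low-confidence masks to annotators.

```python
from dataclasses import dataclass


@dataclass
class MaskProposal:
    mask_id: str
    confidence: float  # model's own confidence score in [0, 1]


def route_proposals(proposals, threshold=0.9):
    """Split proposals into auto-accepted masks and masks sent to annotators."""
    accepted = [p for p in proposals if p.confidence >= threshold]
    needs_review = [p for p in proposals if p.confidence < threshold]
    return accepted, needs_review


proposals = [MaskProposal("m1", 0.97), MaskProposal("m2", 0.42), MaskProposal("m3", 0.91)]
accepted, needs_review = route_proposals(proposals)
print([p.mask_id for p in accepted])      # masks kept without human input
print([p.mask_id for p in needs_review])  # masks handed to annotators
```

As the model improves across stages, more proposals clear the threshold and annotator effort concentrates on the hard cases.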

Model:

  • Composed of three key components: Image Encoder, Prompt Encoder, Mask Decoder [23]
  • Image Encoder: pre-trained Vision Transformer (ViT) for high-resolution image inputs [23]
  • Prompt Encoder: manages sparse and dense prompts using positional encodings, learned embeddings, and convolutions [23]
  • Mask Decoder: maps image and prompt embeddings to a mask using a Transformer decoder block and dynamic mask prediction head [23]

Application Scenarios:

  • SAM accepts specific user prompts for precise image segmentation [23]
  • Accommodates three types of prompts: points, bounding boxes, masks [23]
  • Addresses aspects of image segmentation, localization, and classification in a unified manner [23]
  • Empowers features like assisted labeling, real-time augmented reality, and biomedical image segmentation [23]

4.3 General Trend: Data-centric Paper Trend for the Past Two Years

Data-Centric AI: Increasing Research Interest

Observable Trends in Data-Centric AI Research:

  • Remarkable increase in related research papers over past few years as shown on arXiv and Kaggle [15]
  • Number of papers published with keywords related to data-centric field: [13]
    • Selected keywords extracted using LLM from research community
    • Sweeping overview of field's escalating popularity
  • The 2023 figure is a projection: the average monthly paper count observed so far in 2023, multiplied by 12
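
The 2023 projection is simple arithmetic, shown here as a short sketch. The input numbers are invented for illustration and are not the counts behind Fig.8; the function name is hypothetical.

```python
def project_annual_count(papers_so_far, months_observed):
    """Extrapolate a full-year count from a partial year: monthly average times 12."""
    if not 1 <= months_observed <= 12:
        raise ValueError("months_observed must be between 1 and 12")
    return papers_so_far / months_observed * 12


# Made-up numbers for illustration only; not the counts behind Fig.8.
print(project_annual_count(papers_so_far=300, months_observed=6))  # 600.0
```

This kind of linear extrapolation assumes the publication rate stays constant for the rest of the year, so the projected bar is not directly comparable to the fully observed years.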

Fig.8. Number of Data-centric Papers Extracted from arXiv [15]:

  • Remarkable increase in number of papers related to data-centric AI between 2021 and 2023 (projected)
  • Surge in interest within the research community, shift from model-centric to data-centric focus
  • Continued advancements and innovations expected as more researchers explore the power of data.

5. The Impact of Data-centric AI on the Kaggle Community

Data-centric AI techniques revolutionize problem-solving by emphasizing data quality, curation, and intelligent processing, helping uncover complex patterns and derive insights.

5.1 Kaggle Competitions: Then and Now

Data-Centric AI Transformation in Kaggle's "Titanic - Machine Learning from Disaster" Competition

Past Approach:

  1. Exploratory Data Analysis (EDA):
    • Manual exploration of Titanic data patterns
    • Handling missing values: 'Age', 'Cabin', and 'Embarked' columns
    • Understanding relationships between variables like 'Sex', 'Pclass', and 'Survived'
  2. Data Preprocessing:
    • Treating missing values
    • Filling 'Age' column with median age
    • Handling outliers based on EDA findings
  3. Feature Engineering:
    • Manual creation of new features
    • Creating 'Family Size' feature based on 'SibSp' and 'Parch'
    • Generating an 'IsAlone' feature for passengers traveling alone
  4. Model Selection and Tuning:
    • Experimentation with machine learning models: Logistic Regression, Random Forests
    • Model choice not highly impactful due to dataset size and nature
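
The preprocessing and feature engineering steps of the past approach can be sketched in a few lines. The rows below are made up and only reuse the Titanic column names; counting the passenger as a family member (the `+ 1`) is a common convention assumed here, since the source only says the feature is based on 'SibSp' and 'Parch'.

```python
from statistics import median

# Four made-up rows using the Titanic column names; not real passenger records.
passengers = [
    {"Age": 22.0, "SibSp": 1, "Parch": 0},
    {"Age": None, "SibSp": 0, "Parch": 0},
    {"Age": 38.0, "SibSp": 1, "Parch": 2},
    {"Age": 30.0, "SibSp": 0, "Parch": 0},
]


def preprocess(rows):
    """Fill missing ages with the median and add FamilySize and IsAlone."""
    median_age = median(r["Age"] for r in rows if r["Age"] is not None)
    result = []
    for r in rows:
        row = dict(r)  # copy so the input rows stay unchanged
        if row["Age"] is None:
            row["Age"] = median_age
        # Assumption: the passenger counts as one family member.
        row["FamilySize"] = row["SibSp"] + row["Parch"] + 1
        row["IsAlone"] = int(row["FamilySize"] == 1)
        result.append(row)
    return result


for row in preprocess(passengers):
    print(row)
```

In practice the same steps are usually written against a dataframe library; plain dictionaries are used here only to keep the example dependency-free.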

Present Approach:

  1. ChatGPT-assisted EDA:
    • Using AI for comprehensive exploratory data analysis
    • AI suggests investigating survival rates for different passenger classes or age groups in Titanic dataset
  2. Data Preprocessing & Feature Engineering with ChatGPT:
    • Leveraging AI to suggest strategies and generate code for data preprocessing and feature engineering
    • Recommending imputation techniques for missing 'Age' values
    • Creating a feature to indicate if a passenger was alone or not
  3. Prompt Engineering with ChatGPT:
    • Shifting focus from model selection to effective data handling prompts
    • AI helps design prompts to guide the model on understanding variable influences: 'Pclass', 'Sex', and 'Age' on survival.
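
A prompt for assistant-supported EDA works better when it carries a compact description of the table. The sketch below builds such a prompt from a per-column missing-value summary. It is an illustration under stated assumptions: the function names are hypothetical, the rows are invented, and no model or API is called.

```python
def summarize_columns(rows):
    """Count missing (None) values per column in a list of dict rows."""
    columns = list(rows[0].keys())
    return {col: sum(1 for r in rows if r[col] is None) for col in columns}


def build_eda_prompt(rows, target):
    """Describe the table to an assistant and ask for target-focused EDA ideas."""
    missing = summarize_columns(rows)
    lines = [f"- {col}: {count} missing of {len(rows)}" for col, count in missing.items()]
    return (
        f"I have a table with {len(rows)} rows and these columns:\n"
        + "\n".join(lines)
        + f"\nThe target column is '{target}'. "
        "Suggest three exploratory analyses and an imputation strategy "
        "for each column with missing values."
    )


# Made-up rows with Titanic column names; not real passenger records.
rows = [
    {"Pclass": 3, "Sex": "male", "Age": 22.0, "Survived": 0},
    {"Pclass": 1, "Sex": "female", "Age": None, "Survived": 1},
]
print(build_eda_prompt(rows, target="Survived"))
```

Sending a summary rather than raw rows keeps the prompt short and avoids sharing individual records.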

Key Insight: In small, artificial datasets like those in Getting Started Prediction Competitions, data quality and prompt crafting are significantly more crucial than complex model choice.

5.2 The Art of Winning Big Competitions: It's All About Data

Data-Centric AI in Kaggle Competitions

Winning Solutions:

  • Often rely on simple or state-of-the-art models easily accessible online
  • Differentiating factor lies in the art of data processing

Top 10 Frequently Employed Methods:

  • Depicted in Figure 10
  • Widely known model techniques (e.g., ensembles, CNNs) are straightforward to implement
  • Data-related techniques (e.g., feature engineering, outlier detection, missing data handling) are less well-understood and more complex

Importance of Data Processing:

  • The heart of winning solutions often lies in the sophisticated data processing techniques
  • Methods like advanced feature engineering, outlier detection, missing data handling, and exploratory data analysis are crucial
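
Outlier detection, one of the techniques named above, can be as simple as the interquartile-range rule. The source does not say which detection method the winning solutions used, so the IQR rule and the multiplier of 1.5 are assumptions chosen as a familiar baseline; the sample values are invented.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up fare-like values with one obvious extreme.
fares = [7.25, 8.05, 7.9, 13.0, 26.0, 10.5, 8.0, 512.33]
print(iqr_outliers(fares))  # [512.33]
```

Whether a flagged value should be dropped, capped, or kept is a judgment about the data, which is where much of the effort in data processing goes.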

Data-Centric AI Approach:

  • Shifting focus from model complexity to data quality
  • Leading to innovations in handling and processing data
  • Reshaping the approach of Kaggle competitors and the wider data science community

6. Conclusion

The growing prominence of data-centric AI marks a shift from model complexity to data quality. This shift emphasizes the importance of high-quality, well-prepared data over complex models, promising more nuanced understandings and innovation. A robust data-centric approach will be crucial for AI's future success, highlighting data as a powerful tool.