The Conference for Machine Learning Innovation

Curating Quality Datasets for Machine Learning

Workshop
Infos
Booking note:
Data Quality Workshop

In contemporary machine learning, data is the new oil. For state-of-the-art machine learning
algorithms to work their magic, they need access to relevant data. Although large volumes of
crude data are available on the web, we still need the ability to identify that data and
extract it into meaningful datasets.
This workshop presents one of the most fundamental aspects of machine learning: dataset
curation, which often does not get its due despite being highly relevant. You’ll learn why
dataset curation is important in specific industry use cases and, through hands-on Python
examples, how to construct good-quality datasets.
The methods and tips shared in this workshop have served the instructors well when
publishing high-grade research papers, in their current work at Twitter, Inc., and in
prior engagements in industry and academia.

I. Introduction (Session 1)

      A. Modern bloom and dominance of Machine Learning

      B. Significance of data in Machine Learning

      C. Why should one hone the skill of dataset curation?

      D. What will this workshop really teach?

II. Dataset Extraction (Session 1)

      A. Planning Phase

          1. Where is the data? Overview of the data sources

          2. Situational Analysis

              a) Guided Search: Having a specific problem statement

              b) Unguided Search: Looking to solve a novel problem

      B. Extraction Phase

          1. Overview of tools

          2. Introduction of sample datasets

          3. Step-by-step process to extract data

              a) Understanding the structure of the data source/website

              b) Difference in extracting static and dynamic content

              c) Obtaining key information – product details and reviews

              d) Process automation

              e) Error handling for robustness

              f) Overcoming limitations due to request rate

              g) Additional Tips for efficient and safe data extraction
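
As an illustration of items e) and f) above, error handling and request-rate limits are often combined into a retry loop with exponential backoff. The following is a minimal sketch under stated assumptions, not the workshop's own code: the names `fetch_with_retry` and `flaky_fetch` and the URL are hypothetical, and the fetcher is passed in as a plain callable so the example runs without network access.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url); on an OSError, wait and retry with a doubling delay."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except OSError:  # covers ConnectionError, timeouts, urllib's URLError
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the caller
            # Back off: base_delay, then 2x, 4x, ... to ease load on the server.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Demo with a stand-in fetcher (hypothetical) that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>page for {url}</html>"


waits = []  # record the delays instead of actually sleeping
page = fetch_with_retry(flaky_fetch, "https://example.com/products", sleep=waits.append)
print(page)
print(waits)
```

In real use, the `fetch` argument would wrap whatever HTTP client is chosen, and the default `time.sleep` would enforce the pauses between attempts.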

III. Dataset Preparation (Session 2)

      A. Data Cleaning

      B. Data Anonymizing

      C. Data Standardization

      D. Data Integration

      E. Data Transformation

      F. Dataset Balancing

      G. Dataset Structuring
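
To make the first three preparation steps concrete, here is a small sketch of cleaning, anonymizing, and standardization on review-like records. It is only an illustration under assumptions: the field names, the sample records, the `SALT` value, and the helper names `pseudonymize` and `prepare` are all hypothetical. Salted hashing is strictly pseudonymization, which may not be enough where full anonymization is required.

```python
import hashlib
from datetime import datetime

# Hypothetical salt; a real one should be secret and kept out of source control.
SALT = "replace-with-a-secret-salt"


def pseudonymize(user_id: str) -> str:
    """Replace an identifier with a short salted SHA-256 digest."""
    digest = hashlib.sha256((SALT + user_id).encode("utf-8")).hexdigest()
    return digest[:12]


def prepare(records):
    """Clean, pseudonymize, and standardize a list of raw review dicts."""
    seen = set()
    prepared = []
    for rec in records:
        # Cleaning: collapse runs of whitespace and trim the ends.
        text = " ".join((rec.get("review") or "").split())
        if not text:
            continue  # cleaning: drop empty reviews
        key = (rec["user"], text)
        if key in seen:
            continue  # cleaning: drop exact duplicates
        seen.add(key)
        prepared.append({
            "user": pseudonymize(rec["user"]),   # anonymizing step
            "review": text,
            "rating": float(rec["rating"]),      # standardize to a numeric type
            # Standardize US-style dates (MM/DD/YYYY) to ISO 8601.
            "date": datetime.strptime(rec["date"], "%m/%d/%Y").date().isoformat(),
        })
    return prepared


# Made-up sample records for demonstration only.
raw = [
    {"user": "alice@example.com", "review": "  Great   fit ", "rating": "5", "date": "03/14/2019"},
    {"user": "alice@example.com", "review": "Great fit", "rating": "5", "date": "03/14/2019"},
    {"user": "bob@example.com", "review": "   ", "rating": "2", "date": "04/01/2019"},
]
prepared = prepare(raw)
print(prepared)
```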

IV. Data Pre-processing (Session 3)

      A. Reflecting on the core differences between crude data and features

      B. Data Reduction – Filtering and Cleaning

          a. Processing text and strings

          b. Processing timestamps and dates

      C. Feature Engineering

          a. Definition and Importance

          b. Step-by-step formal process of manual Feature Engineering

          c. Accounting for missing data via Feature Imputation

          d. Maintaining the quality of features via Feature Manipulation

          e. Yielding more powerful features via Feature Interaction

          f. Understanding complex networks via Feature Visualization

          g. Excessive Feature Engineering

          h. Automated Feature Engineering via modern tools
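
Two of the feature engineering topics listed above, feature imputation and feature interaction, can be sketched in a few lines. This is an assumed, simplified illustration rather than the workshop material: the rows are made up, median imputation is just one common choice, and the product feature `price_x_rating` is a hypothetical name.

```python
from statistics import median


def impute_median(values):
    """Fill None entries with the median of the observed values.

    Raises statistics.StatisticsError if every value is missing.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Made-up rows with missing entries marked as None.
rows = [
    {"price": 20.0, "rating": 4.0},
    {"price": None, "rating": 5.0},
    {"price": 40.0, "rating": None},
    {"price": 30.0, "rating": 3.0},
]

# Feature imputation: fill each column independently.
prices = impute_median([r["price"] for r in rows])
ratings = impute_median([r["rating"] for r in rows])

# Feature interaction: combine two features into a new one (here, a product).
features = [
    {"price": p, "rating": r, "price_x_rating": p * r}
    for p, r in zip(prices, ratings)
]
print(features)
```

The median is used here because it is less sensitive to outliers than the mean; which statistic is appropriate depends on the data.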

V. Algorithmic Applications, Takeaways, and Q&A (Session 4)

      A. Identifying problem type from the data

          1. Supervised vs Unsupervised

      B. Applying Supervised Algorithms

      C. Applying Unsupervised Algorithms

      D. Takeaways

      E. Some personal anecdotes and recommendations as ML Engineers

      F. Q&A

Target group: Developers, aspiring developers, and technical (project) managers.
Requirements: Own laptop with Python and a Jupyter notebook environment installed.
Attendees should be able to run the starter code provided in the
github.com/rishabhmisra/Curating-Quality-ML-Datasets repo before attending the session.
