Preparing Text Input for Machine Learning

Wednesday, June 20, 2018
09:00 - 10:00
Asam 2

Deep down, ML is a pure numbers game. With very few exceptions, the actual input to an ML model is a collection of float values. This is straightforward for numerical, spreadsheet-like input, for images, where pixels are just numerical color values, or for audio samples, but how do ML algorithms work on words and letters? Since proper preprocessing is often the most crucial part of a successful ML project, it is important to understand how to handle textual input correctly. We will look at the two most important jobs when handling text in ML: preprocessing/normalization and vector representations of text. We will first navigate the minefield of correct Unicode normalization of our input and then – after we have tamed our strings – see how to convert normalized and sanitized strings into various vector representations, from simple one-hot encodings to embeddings produced by algorithms like Word2Vec.
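To make the Unicode pitfall concrete: the same visible string can be encoded in several ways (e.g. a precomposed "é" versus "e" plus a combining accent), and naive string comparison treats them as different tokens. A minimal sketch of such a normalization step, using Python's standard `unicodedata` module (the function name `normalize_text` and the exact pipeline are illustrative, not the speaker's):

```python
import unicodedata

def normalize_text(s: str) -> str:
    # NFC composes combining character sequences into their canonical
    # precomposed form, so visually identical strings compare equal.
    s = unicodedata.normalize("NFC", s)
    # casefold() is a more aggressive, Unicode-aware lower() for matching.
    return s.casefold().strip()

# "e" + combining acute accent vs. the precomposed "é":
# different code points, identical after normalization.
assert normalize_text("Cafe\u0301") == normalize_text("Caf\u00e9")
```

Depending on the use case, NFKC (which also folds compatibility characters such as ligatures) may be a better choice than NFC.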
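The contrast between the two vector representations mentioned above can be sketched in a few lines. The vocabulary and the embedding values below are toy placeholders; in practice embedding vectors are learned by an algorithm such as Word2Vec rather than written by hand:

```python
# Toy vocabulary for illustration only.
vocab = ["the", "cat", "sat", "mat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> list[float]:
    # A sparse vector: all zeros except a single 1.0 at the word's index.
    # Its length grows with the vocabulary size.
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec

# An embedding replaces the sparse one-hot vector with a short dense vector.
# These values are dummies; real ones are learned from data (e.g. Word2Vec).
embeddings = {
    "the": [0.1, -0.2],
    "cat": [0.8, 0.5],
    "sat": [-0.3, 0.9],
    "mat": [0.7, 0.4],
}

print(one_hot("cat"))     # [0.0, 1.0, 0.0, 0.0]
print(embeddings["cat"])  # [0.8, 0.5]
```

The key difference: one-hot vectors are high-dimensional and carry no notion of similarity, while learned embeddings are low-dimensional and place related words near each other in vector space.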