PALS0039 Introduction to Deep Learning for Speech and Language Processing
UCL Division of Psychology and Language Sciences

Week 4 - Preparation of Text and Speech for Machine Learning

In which we look at how to prepare text and speech materials to make them compatible with machine learning approaches to classification and regression.

Learning Objectives

By the end of the session the student will be able to:


  1. Data preparation for machine learning
     We present a high-level overview of the problems of converting text and speech data into a form suited to machine learning. We discuss a general approach for summarising variable-length sequences as fixed-length vectors.

  2. Text preparation
     We discuss the pre-processing steps needed to convert text into a numerical form suitable for machine learning: (i) Tokenisation, (ii) Stop word removal, (iii) Normalisation, stemming and lemmatisation, (iv) Building a dictionary, (v) One-hot coding of words, (vi) Bag of words document model.

  3. Speech preparation
     We discuss the pre-processing steps needed to convert speech recordings into a form suitable for machine learning: (i) Recording, (ii) Segmentation, (iii) Short-time analysis, (iv) Summarisation.
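
The common thread in the text and speech steps listed above is turning inputs of varying length into vectors of one fixed length. The sketch below illustrates that idea using only the Python standard library; it is not the method used in the lecture or notebooks. The stop-word list, the example sentences, the synthetic sine-wave "recordings", the 25 ms frame and 10 ms hop, and all function names are assumptions chosen for illustration. Stemming, lemmatisation and real audio loading are left out.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "and", "of"}

def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_dictionary(documents):
    """Map each non-stop word seen in the documents to an integer index."""
    words = sorted({w for doc in documents for w in tokenise(doc)
                    if w not in STOP_WORDS})
    return {w: i for i, w in enumerate(words)}

def bag_of_words(text, dictionary):
    """Count dictionary words in the text: one fixed-length vector per document."""
    counts = Counter(w for w in tokenise(text) if w in dictionary)
    # The dictionary was built in index order, so iterating it follows the indices.
    return [counts[w] for w in dictionary]

def frame_features(samples, frame_len, hop):
    """Short-time analysis: log energy and zero-crossing rate for each frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append([math.log(energy + 1e-10), crossings / (frame_len - 1)])
    return features

def summarise(features):
    """Reduce any number of frames to per-feature mean and standard deviation."""
    n = len(features)
    summary = []
    for column in zip(*features):
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        summary.extend([mean, math.sqrt(variance)])
    return summary

# Text: two documents of different length become two vectors of equal length.
docs = ["The film is a delight", "The plot is thin and the acting is wooden"]
dictionary = build_dictionary(docs)
text_vectors = [bag_of_words(d, dictionary) for d in docs]

# Speech: two synthetic signals of different duration, sampled at 16 kHz.
rate = 16000
short_rec = [math.sin(2 * math.pi * 200 * t / rate) for t in range(rate // 4)]
long_rec = [math.sin(2 * math.pi * 400 * t / rate) for t in range(rate // 2)]
speech_vectors = [summarise(frame_features(s, frame_len=400, hop=160))
                  for s in (short_rec, long_rec)]

print(len(dictionary), [len(v) for v in text_vectors])
print([len(v) for v in speech_vectors])
```

Both documents map to vectors as long as the dictionary, and both signals map to four numbers (mean and standard deviation of each of the two frame features) regardless of duration. In practice a library tokeniser and richer short-time features such as spectral coefficients would probably replace these hand-written helpers.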

Research Paper of the Week

Web Resources


Be sure to read one or more of these discussions of text and speech processing

Tutorial Notebooks


Implement answers to the problems described in the notebooks below. Save your completed notebooks into your personal Google Drive account.

    1. Sentiment Analysis from text
    2. Age prediction from speech

Last modified: 22:45 11-Mar-2022.