
Week 9


In this lesson we learn how to preprocess text-based data and train deep learning models on that data.

Objectives

After completing this week, you should be able to:

  • Transform text input into tokens and convert those tokens into numeric vectors using one-hot encoding and feature hashing.
  • Build basic text-processing models using recurrent neural networks (RNNs).
  • Understand how word embeddings such as Word2Vec can help improve the performance of text-processing models.
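
The feature hashing named in the first objective can be sketched as follows. This is a minimal illustration rather than the course's reference implementation: the function name hash_encode, the dimensionality parameter, and the choice of MD5 as a stable hash are all assumptions made for this example.

```python
import hashlib


def hash_encode(tokens, dimensionality=16):
    """Map tokens to a fixed-size count vector with the hashing trick.

    Hypothetical helper for illustration; not part of the assignment.
    """
    vector = [0] * dimensionality
    for token in tokens:
        # hashlib gives the same digest on every run, unlike the built-in
        # hash(), which is randomized per process for strings.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % dimensionality
        vector[index] += 1
    return vector


encoded = hash_encode(["the", "cat", "sat", "on", "the", "mat"])
print(len(encoded), sum(encoded))  # 16 6
```

Unlike one-hot encoding, this approach needs no vocabulary dictionary, but distinct tokens can collide in the same slot when the dimensionality is small.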

Readings

  • Read chapter 6 in Deep Learning with Python

Weekly Resources

Assignment 9

9.1

In the first part of the assignment, you will implement basic text-preprocessing functions in Python. These functions do not need to scale to large text documents and will only need to handle small inputs.

a.

Create a tokenize function that splits a sentence into words. Ensure that your tokenizer removes basic punctuation.

def tokenize(sentence):
    tokens = []
    # tokenize the sentence
    return tokens
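
One possible way to fill in the stub above is sketched here. It assumes that "basic punctuation" means the ASCII characters in string.punctuation and that lowercasing is acceptable; neither assumption comes from the assignment text. The name tokenize_sketch is hypothetical and is used only to keep the example separate from the stub.

```python
import string


def tokenize_sketch(sentence):
    """Split a sentence on whitespace after stripping ASCII punctuation."""
    # Remove every character listed in string.punctuation (e.g. . , ! ? ').
    table = str.maketrans("", "", string.punctuation)
    cleaned = sentence.translate(table)
    # Lowercasing is an assumption; the assignment text does not require it.
    return cleaned.lower().split()


print(tokenize_sketch("Hello, world! This is a test."))
# ['hello', 'world', 'this', 'is', 'a', 'test']
```

Note that this approach also removes apostrophes, so "don't" becomes "dont". Whether that is acceptable depends on how the tokens are used later.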

b.

Implement an ngram function that splits tokens into N-grams.

def ngram(tokens, n):
    ngrams = []
    # Create ngrams
    return ngrams
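
A minimal sketch of one way to complete the stub above. It assumes each N-gram is returned as a tuple of consecutive tokens and that inputs shorter than n produce an empty list; the assignment does not specify either choice. The name ngram_sketch is hypothetical.

```python
def ngram_sketch(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # range() is empty when len(tokens) < n, so short inputs yield [].
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngram_sketch(["the", "cat", "sat", "on", "the", "mat"], 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

Tuples are used here because they are hashable, which makes the N-grams usable as dictionary keys when counting or indexing them.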

c.

Implement a one_hot_encode function that creates a numerical vector from a list of tokens.

def one_hot_encode(tokens, num_words):
    token_index = {}
    results = []
    # Build token_index and fill results with one-hot vectors
    return results
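
The following sketch shows one interpretation of the stub above. It assumes the result is a list holding one vector of length num_words per token and that indices are assigned in order of first appearance; the assignment may intend a different output shape. The name one_hot_encode_sketch is hypothetical.

```python
def one_hot_encode_sketch(tokens, num_words):
    """Return one one-hot vector of length num_words per token."""
    token_index = {}
    for token in tokens:
        # Assign indices in order of first appearance.
        if token not in token_index:
            token_index[token] = len(token_index)
    if len(token_index) > num_words:
        raise ValueError("num_words is smaller than the vocabulary")

    results = []
    for token in tokens:
        vector = [0] * num_words
        vector[token_index[token]] = 1
        results.append(vector)
    return results


for row in one_hot_encode_sketch(["the", "cat", "the"], 4):
    print(row)
# [1, 0, 0, 0]
# [0, 1, 0, 0]
# [1, 0, 0, 0]
```

Repeated tokens map to the same vector, and every vector has exactly one non-zero entry.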

9.2

Using listings 6.16, 6.17, and 6.18 in Deep Learning with Python as a guide, train a sequential model with embeddings on the IMDB data found in data/external/imdb/. Save the model performance metrics and training and validation accuracy curves in the dsc650/assignments/assignment09/results/model_1 directory.
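
Saving the metrics can be done in many ways; the sketch below shows one of them using only the standard library. It assumes the training history is available as a dict of metric name to per-epoch values, which is the shape of the history attribute on the object Keras returns from fit(). The function name save_metrics and the file names history.json and metrics.json are hypothetical choices, not requirements of the assignment.

```python
import json
from pathlib import Path


def save_metrics(history, results_dir):
    """Write per-epoch metrics and a final-epoch summary as JSON files.

    `history` is assumed to be a dict of metric name -> list of per-epoch
    values. File names are illustrative, not prescribed by the assignment.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)

    # Cast to float so NumPy scalars serialize cleanly.
    per_epoch = {k: [float(v) for v in vals] for k, vals in history.items()}
    # The summary keeps only the last epoch's value for each metric.
    summary = {k: vals[-1] for k, vals in per_epoch.items() if vals}

    (results_dir / "history.json").write_text(json.dumps(per_epoch, indent=2))
    (results_dir / "metrics.json").write_text(json.dumps(summary, indent=2))
    return summary
```

A call such as save_metrics(history.history, results_dir) after training would then populate the model_1 results directory. The same per-epoch lists can be passed to a plotting library to draw the accuracy curves.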

9.3

Using listing 6.27 in Deep Learning with Python as a guide, fit the same data with an LSTM layer. Save the model performance metrics and training and validation accuracy curves in the dsc650/assignments/assignment09/results/model_2 directory.
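
To build intuition for what an LSTM layer computes at each time step, the sketch below implements a single LSTM unit with scalar input and state in plain Python. It is a teaching simplification, not the code from the book or from any framework: the function names, the weights dict layout, and the weight values are all assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM with scalar input and state.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # proposed new content
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state
    return h, c


# Arbitrary illustrative weights, not trained values.
gates = ("input", "forget", "output", "candidate")
weights = {name: (0.5, 0.5, 0.0) for name in gates}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

A framework layer performs the same update with weight matrices instead of scalars, so that each step operates on an embedding vector and a vector-valued state.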

9.4

Using listing 6.46 in Deep Learning with Python as a guide, fit the same data with a simple 1D convnet. Save the model performance metrics and training and validation accuracy curves in the dsc650/assignments/assignment09/results/model_3 directory.
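
As a rough picture of what a 1D convnet does to a sequence, the sketch below slides a single kernel over a list of numbers and then applies global max pooling. It is a deliberately simplified illustration: a real 1D convolution layer works on multi-channel inputs such as embedding vectors, learns many kernels, and adds a bias and an activation. The function names and the example values are assumptions made for this example.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` over `sequence` with stride 1 and no padding."""
    width = len(kernel)
    return [
        sum(k * x for k, x in zip(kernel, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


def global_max_pool(feature_map):
    """Keep only the strongest response in the feature map."""
    return max(feature_map)


signal = [0, 1, 3, 1, 0, 0, 2]
feature_map = conv1d_valid(signal, kernel=[1, 2, 1])
print(feature_map)                   # [5, 8, 5, 1, 2]
print(global_max_pool(feature_map))  # 8
```

The kernel responds most strongly where the local pattern in the sequence matches it, regardless of position, which is why this kind of layer can pick up short word patterns anywhere in a review.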

Submission Instructions

For this assignment, you will submit a zip archive containing the contents of the dsc650/assignments/assignment09/ directory. Use the naming convention of assignment09_LastnameFirstname.zip for the zip archive. You can create this archive in Bash (or a similar Unix shell) using the following commands.

cd dsc650/assignments
zip -r assignment09_DoeJane.zip assignment09

Likewise, you can create a zip archive using Windows PowerShell with the following command.

Compress-Archive -Path assignment09 -DestinationPath 'assignment09_DoeJane.zip'
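
If neither shell is convenient, the same archive can be produced from Python with the standard library. This is an optional alternative sketched under assumptions: the function name make_submission and its parameters are hypothetical, and the student name is a placeholder.

```python
import shutil
from pathlib import Path


def make_submission(assignments_dir, student="DoeJane"):
    """Zip the assignment09 directory found inside `assignments_dir`."""
    archive_base = Path(assignments_dir) / f"assignment09_{student}"
    # make_archive appends ".zip" and returns the path of the file it wrote.
    return shutil.make_archive(
        str(archive_base), "zip", root_dir=assignments_dir, base_dir="assignment09"
    )
```

Calling make_submission("dsc650/assignments") from the repository root writes assignment09_DoeJane.zip next to the assignment09 directory, with paths inside the archive starting at assignment09/.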

Discussion Board

For this discussion, pick one of the following topics and write a 250- to 750-word discussion board post. Use the DSC 650 Slack channel for discussion and replies. For grading purposes, copy and paste your initial post and at least two replies to the Blackboard discussion board.

Topic 1

Compare and contrast using MapReduce, Spark, and deep learning frameworks (e.g., TensorFlow) for performing text preprocessing and building text-based models. Are there use cases where it makes sense to use one over another?

Topic 2

How might you combine stream processing such as Spark's stream processing framework with deep learning models? Provide use cases that are relevant to your professional or personal interests.


Last update: March 12, 2023