R Shiny App: NLP Next Word Prediction
This article shows how I built an R Shiny app for natural language processing and next-word prediction.
1. Data Processing
- First, all non-English characters are removed, along with numbers, punctuation, and extra whitespace. All text is then converted to lowercase.
- Profane words are also removed, using Carnegie Mellon University’s Offensive/Profane Word List.
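
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's own R code (which is not shown in this article). The function name `clean_text` and the two-word `PROFANE_WORDS` set are hypothetical placeholders; the real app uses the Carnegie Mellon list, and "non-English characters" is assumed here to mean anything outside the letters a to z.

```python
import re

# Hypothetical stand-in for the profanity list. The real app uses the
# Carnegie Mellon Offensive/Profane Word List, which is not reproduced here.
PROFANE_WORDS = {"darn", "heck"}


def clean_text(text, profane_words=PROFANE_WORDS):
    """Return lowercase, letters-only text with the listed words removed."""
    text = text.lower()
    # Assumption: keep only the letters a-z and whitespace. This drops
    # non-English characters, digits, and punctuation in one pass.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of spaces, tabs,
    # and newlines, so extra whitespace disappears here.
    words = [w for w in text.split() if w not in profane_words]
    return " ".join(words)


print(clean_text("Hello, World! 123 ñ darn  good"))  # hello world good
```

Replacing unwanted characters with a space, rather than deleting them, keeps words that were separated only by punctuation from being glued together.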
2. Prediction Algorithm
- Tokenization: the text is split into tokens to count the frequency of five types of n-gram: unigrams (single words), bigrams (two-word phrases), trigrams (three-word phrases), quadgrams (four-word phrases), and quintgrams (five-word phrases).
- N-grams: indicate which words appear together in the text. The more often an n-gram occurs in the corpus, the more likely that word sequence is to occur in new text.
- The predictive algorithm uses the n-gram frequencies to predict the next word based on the user's input. The model checks the length of the input phrase and starts with the longest n-gram that fits, up to the quintgram. If no match is found, it moves on to the quadgram, and so on. The model is a version of a ‘back-off’ model.
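
To make the back-off idea concrete, here is a minimal sketch in Python. It is illustrative only and is not the app's R implementation. The names `build_model` and `predict_next` are hypothetical, the input is assumed to be already cleaned text, and the sketch simply picks the most frequent continuation by raw count, without the discounting or weighting that fuller back-off schemes apply.

```python
from collections import Counter, defaultdict

MAX_N = 5  # quintgrams are the longest n-grams counted


def build_model(sentences, max_n=MAX_N):
    """Map each context (tuple of preceding words) to counts of the next word."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                # The last word of the n-gram is the prediction target;
                # the n - 1 words before it form the context.
                *context, next_word = words[i:i + n]
                model[tuple(context)][next_word] += 1
    return model


def predict_next(model, phrase, max_n=MAX_N):
    """Try the longest usable context first, then back off to shorter ones."""
    words = phrase.lower().split()
    # An n-gram supplies n - 1 words of context, so use at most max_n - 1.
    longest = min(len(words), max_n - 1)
    for k in range(longest, -1, -1):
        context = tuple(words[len(words) - k:])
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the model holds no n-grams at all


corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ate the fish",
]
model = build_model(corpus)
print(predict_next(model, "the cat"))       # sat
print(predict_next(model, "a dog sat on"))  # the
```

In the second call, neither the four-word context nor the three-word context appears in the toy corpus, so the lookup backs off until the two-word context "sat on" matches. With an empty context (k = 0) the function falls back to the most frequent single word.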