11.1 First Week

The first week involves understanding the text mining infrastructure in R and exploring the data sets provided by the course. The process I followed to understand the subject is as follows:

  1. Read the material.
  2. Apply the learned material by exploring the data sets.
  3. Note the questions that the exploration raises, such as:
    • Should we optimize for speed or for accuracy?
    • Are there any other frameworks that can do this better?
    • What technologies are available for NLP?

The exploration leads to more questions. The goal is to optimize the algorithm for either speed or accuracy, and finding the balance between the two is difficult.

The packages below make up the environment needed to examine our data sets.

library(bibtex)
library(knitr)
library(rvest)
library(tidyverse)
library(glue)
library(stringi)
library(caret)
library(spacyr)
library(tidytext)
library(htmlwidgets)
library(echarts4r)
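
Because `library()` stops at the first package that is not installed, it can help to check up front which of the packages above are available. The following is a minimal sketch using only base R; the variable names `required_pkgs`, `is_installed`, and `missing_pkgs` are illustrative and not part of the original setup.

```r
# Packages loaded in the environment above.
required_pkgs <- c(
  "bibtex", "knitr", "rvest", "tidyverse", "glue", "stringi",
  "caret", "spacyr", "tidytext", "htmlwidgets", "echarts4r"
)

# requireNamespace() returns FALSE (instead of raising an error)
# when a package cannot be loaded, so it is safe to call on each name.
is_installed <- vapply(
  required_pkgs,
  requireNamespace,
  logical(1),
  quietly = TRUE
)

# Illustrative variable: the packages that still need to be installed.
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Any packages reported as missing can then be installed with `install.packages(missing_pkgs)` before running the `library()` calls.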

The data sets can be found here, but before we dive into the data, let us define some terms that are often used in NLP, such as Text Mining, Corpus, and Tokenization.