Clean the Text
Last updated on 2022-11-09
Overview
Questions
- What is clean text?
Objectives
- Clean the example documents with the package tidytext
- Calculate the word frequencies in the example documents
About Text Mining and Text Analysis
Depending on how it is organized, data can be grouped into two categories: structured data and unstructured data. Structured data has been predefined and formatted into a tabular layout with rows and columns, such as data stored in a relational database or membership information housed in an Excel spreadsheet. Unstructured data, by contrast, has no predefined format and comes in many forms, such as email, presentations, and images. A third category, semi-structured data, blends the two: it is essentially unstructured data that also carries metadata identifying certain characteristics. Common examples of semi-structured data are XML, JSON, and HTML files.
Text mining, or text analytics, is the process of exploring and analyzing unstructured or semi-structured text data to identify key concepts, patterns, relationships, or other attributes of the data. Text mining grew out of the computational and information management fields, whereas text analysis originated in the humanities with the manual analysis of texts such as newspaper indexes and Bible concordances. Today the two terms are used interchangeably and generally refer to the use of computational methods to explore and analyze unstructured text data.
A simplified process for a typical text mining study includes four steps: data gathering, text preprocessing or cleaning, text analysis, and integration of the findings with the study.
For data gathering, we may create a dataset or select existing datasets. Once a dataset is assembled, we usually need to preprocess, or clean, the text to get it ready for analysis. Common preprocessing techniques include converting text to lower case, removing punctuation and non-alphanumeric characters, removing stop words, tokenization, part-of-speech tagging, word replacement, stemming, and lemmatization. The next step is the text mining or analysis itself. Common text mining methods include topic modelling, sentiment analysis, term frequency and TF-IDF, and collocation analysis. Finally, we integrate the findings from the text mining into the study. Different preprocessing techniques and text mining methods serve different research purposes. This lesson demonstrates how to use the R package tidytext to preprocess text data from an existing dataset in preparation for a sentiment analysis.
Preprocess and Clean Text
R is powerful at processing structured, or tabular, data, where values are arranged in rows and columns. R can also handle unstructured and semi-structured data such as text. Following the tidy data principles articulated by Hadley Wickham, Julia Silge and David Robinson developed the package tidytext to preprocess and analyze textual data.
Tidy data sets can be manipulated with a standard set of “tidy” tools, including popular packages such as dplyr (Wickham and Francois 2016), tidyr (Wickham 2016), ggplot2 (Wickham 2009), and broom (Robinson 2017). These packages extend the capabilities of tidytext for exploring and visualizing textual data, and users can move fluidly between them by keeping the input and output in tidy formats.
Token and Tokenization
In the package tidytext, tidy text is defined as a one-token-per-row data frame, where a token is a semantically meaningful unit of text, such as a word, a sentence, or a paragraph, that we are interested in analyzing. Tokenization is the process of segmenting running text into a list of tokens to create a table with one token per row.
Here is a simple example to explain how to use tidytext to tokenize textual data. In R, textual data can be stored as character vectors. For example:
R
lyrics <- c("How many roads must a man walk down",
            "Before you call him a man?",
            "How many seas must a white dove sail",
            "Before she sleeps in the sand?",
            "Yes, and how many times must the cannonballs fly",
            "Before they're forever banned?")
lyrics
OUTPUT
[1] "How many roads must a man walk down"
[2] "Before you call him a man?"
[3] "How many seas must a white dove sail"
[4] "Before she sleeps in the sand?"
[5] "Yes, and how many times must the cannonballs fly"
[6] "Before they're forever banned?"
To tokenize this character vector, we first need to put it into a data frame. We use the function tibble from the package tidyverse to convert the character vector into a tibble.
R
library(tidyverse)
lyrics_df <- tibble(line = 1:6, lyrics)
lyrics_df
OUTPUT
# A tibble: 6 × 2
line lyrics
<int> <chr>
1 1 How many roads must a man walk down
2 2 Before you call him a man?
3 3 How many seas must a white dove sail
4 4 Before she sleeps in the sand?
5 5 Yes, and how many times must the cannonballs fly
6 6 Before they're forever banned?
The next step is tokenization, where we split the text into units, or tokens, for further analysis. We will use the function unnest_tokens to break the lyrics into words and strip the punctuation.
The function unnest_tokens has three primary arguments:
- tbl: the data frame to be tokenized.
- output: the column to be created as string or symbol.
- input: the column that gets split as string or symbol.
R
library(tidytext)
unnest_tokens(tbl = lyrics_df,
              output = word,
              input = lyrics)
OUTPUT
# A tibble: 41 × 2
line word
<int> <chr>
1 1 how
2 1 many
3 1 roads
4 1 must
5 1 a
6 1 man
7 1 walk
8 1 down
9 2 before
10 2 you
# … with 31 more rows
The result of unnest_tokens is a tibble. In our case, the lyrics are split into 41 words, with each word on its own row. The input column lyrics is removed; a new output column, word, is added; and the column line is kept unchanged.
Beyond these three primary arguments, the function unnest_tokens also has several optional arguments. The default token is “words”, but it can also be set to “characters”, “sentences”, “ngrams”, “lines”, “paragraphs”, and so on. unnest_tokens automatically converts tokens to lowercase and drops the input column unless told otherwise, and punctuation is stripped during tokenization.
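For example, here is a minimal sketch (reusing lyrics_df from above) of tokenizing by sentence while keeping the original capitalization; token and to_lower are documented arguments of unnest_tokens:
R
lyrics_df %>%
  unnest_tokens(sentence, lyrics,
                token = "sentences",  # split into sentences instead of words
                to_lower = FALSE)     # keep the original capitalization
Setting token = "ngrams" with n = 2 would instead produce two-word sequences (bigrams), which is useful for collocation analysis.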
Since the first argument of unnest_tokens is a data frame, we can also use pipes to send a data frame to it and obtain the same results:
R
lyrics_df %>%
  unnest_tokens(word, lyrics)
Project Gutenberg collection
Project Gutenberg is a collection of free electronic books, or eBooks, available online. The R package gutenbergr, developed by David Robinson, allows users to download public domain works from the Project Gutenberg collection as well as search and filter works by author, title, language, subject, and other metadata. Project Gutenberg ID numbers are listed in this metadata, which lets us download the text of each novel with the function gutenberg_download(). Let’s use The Time Machine, The War of the Worlds, and The Invisible Man.
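If we did not already know these ID numbers, one way to find them is to filter the metadata bundled with gutenbergr using gutenberg_works(); this is a sketch, and it assumes the author's name is recorded as "Wells, H. G. (Herbert George)" in that metadata:
R
library(gutenbergr)
# list H. G. Wells works in the Project Gutenberg metadata, with their ID numbers
gutenberg_works(author == "Wells, H. G. (Herbert George)") %>%
  select(gutenberg_id, title)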
R
library(gutenbergr)
hgwells <- gutenberg_download(c(35, 36, 5230))
R
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text)
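To check that the tokenization worked, we can peek at the first rows of the new one-word-per-row tibble (output not shown here):
R
# inspect the first rows of the tokenized text
head(tidy_hgwells)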
Word Frequencies
One of the first steps in text analysis is word frequency, which looks at how often words appear in a text. Before counting words, we usually remove extremely common words called stop words, such as “the”, “have”, “is”, and “are” in English. The package tidytext bundles these in a data frame called stop_words, shown below.
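Printing stop_words shows the list; it is a data frame provided by tidytext that combines several standard stop-word lexicons:
R
# load and print the stop word list bundled with tidytext
data(stop_words)
stop_words
Using the H.G. Wells books, we can remove these words from our tokenized text with dplyr's anti_join(), which drops every row whose word appears in stop_words: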
R
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
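With the stop words removed, we can count how often each remaining word occurs; count() from dplyr sorts the most frequent words to the top (output not shown):
R
tidy_hgwells %>%
  count(word, sort = TRUE)
We can then pipe this count into ggplot2 to visualize the most common words across the three novels: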
R
tidy_hgwells %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%                  # keep words that appear more than 600 times
  mutate(word = reorder(word, n)) %>%  # order the bars by frequency
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()