Spell checking using hunspell

text mining
The hunspell package is a high-performance stemmer, tokenizer, and spell checker for R. The underlying Hunspell library is also used by LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, macOS, InDesign, Opera, RStudio, and many other applications, and it supports many languages, including Italian. Hunspell uses a special dictionary format that defines which characters, words, and conjugations are valid in a given language. In this post I illustrate how to use the spell checker.
Author

Angelo Maria Sabatini

Published

January 2, 2025

Spell checking

Spell checking text consists of the following steps:

  • Parse a document by extracting (tokenizing) words that we want to check
  • Analyze each word by breaking it down into its root (stemming) and conjugation affix
  • Look up in a dictionary whether the word+affix combination is valid for the stated language
  • (optional) For incorrect words, suggest corrections by finding similar (correct) words in the dictionary
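The four steps above map onto functions of the hunspell package roughly as in the minimal sketch below. It assumes the en_US dictionary that ships with the package and an invented English sentence, so that it runs without installing an Italian dictionary; it is an illustration, not the workflow used later in this post.

```r
library(hunspell)

# Assumption: the bundled en_US dictionary is used so the sketch runs
# without installing any extra dictionary.
dict <- dictionary("en_US")

# Hypothetical sample text with one deliberate misspelling
text <- "I lovve spell checking"

# 1. Tokenize: hunspell_parse() returns a list, one character vector per string
words <- hunspell_parse(text, dict = dict)[[1]]

# 2. Stem: look up the root form(s) of each word
stems <- hunspell_stem(words, dict = dict)

# 3. Check: TRUE for every word the dictionary accepts
ok <- hunspell_check(words, dict = dict)

# 4. Suggest: candidate corrections for the rejected words only
suggestions <- hunspell_suggest(words[!ok], dict = dict)

print(words[!ok])
print(suggestions)
```

Each step returns one element per input word, so the results can be lined up by position.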

Check individual words

Suppose the text to be spell checked is stored in a character vector, with each element being a single word in the chosen language (Italian, here). A non-default dictionary can be passed through the dict parameter when the hunspell_* functions are called.
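Before calling the hunspell_* functions with a specific language, it can help to confirm that the dictionary is actually available. The sketch below is one possible way to do that. It assumes that list_dictionaries() reports names in the form that dictionary() accepts, and it falls back to the bundled en_US dictionary when it_IT is not found. The two custom words are arbitrary examples.

```r
library(hunspell)

# Dictionaries that hunspell can find on this system
available <- list_dictionaries()
print(available)

# Assumption: "it_IT" is listed only if an Italian dictionary is installed;
# otherwise fall back to en_US so the sketch still runs.
lang <- if ("it_IT" %in% available) "it_IT" else "en_US"

# Load the dictionary once and reuse it; add_words extends it with custom
# entries (the two words below are arbitrary examples)
dict <- dictionary(lang, add_words = c("hunspell", "tidytext"))

custom_ok <- hunspell_check(c("hunspell", "tidytext"), dict = dict)
print(custom_ok)
```

Loading the dictionary into a variable once avoids re-creating it in every call.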

library(hunspell)

# load the Italian dictionary once and reuse it
dict_it <- dictionary("it_IT")

# check individual words
words   <- c("amore", "ammore", "prof", "professsore")
correct <- hunspell_check(words, dict = dict_it)
print(correct)
#> [1]  TRUE FALSE  TRUE FALSE

# incorrect words
incorrect.words <- words[!correct]
# find suggestions for incorrect words (a list, one element per word)
suggested.list <- hunspell_suggest(incorrect.words, dict = dict_it)
# keep the first suggestion for each word (NA if there is none)
suggested.words <- unlist(lapply(suggested.list, function(x) x[1]))

Note that hunspell_suggest() returns a list with one character vector of candidate corrections per input word. A vector of corrected words can then be built by picking one candidate for each misspelled word. The first candidate is usually the best alternative, but not always, so data quality can be improved by inspecting the full list of suggestions.
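One way to explore the list is to keep every candidate next to the first one, so that doubtful cases can be reviewed by hand. The following is a minimal sketch under stated assumptions: it uses the bundled en_US dictionary and two invented English misspellings so that it runs without an Italian dictionary, and the names first_choice and review are hypothetical.

```r
library(hunspell)

# Assumption: bundled en_US dictionary and invented English misspellings
dict <- dictionary("en_US")
misspelled <- c("lovve", "proffessor")

# One character vector of candidates per input word
candidates <- hunspell_suggest(misspelled, dict = dict)

# vapply() guarantees exactly one value per word; NA marks words for which
# no candidate was returned
first_choice <- vapply(
  candidates,
  function(x) if (length(x) > 0) x[1] else NA_character_,
  character(1)
)

# Keep the complete candidate list next to the first choice for manual review
review <- data.frame(
  word = misspelled,
  first_choice = first_choice,
  all_candidates = vapply(candidates, paste, character(1), collapse = ", "),
  stringsAsFactors = FALSE
)
print(review)
```

Using vapply() instead of unlist(lapply(...)) makes the one-value-per-word assumption explicit and fails loudly if it is violated.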

Spell checking text

word          best alternative
ammore        amore
professsore   professore

An example

Suppose we want to analyze the lyrics of the song “Il mio canto libero”. One occurrence each of the words “emozioni”, “nuda”, and “pianto” was intentionally misspelled as “emozzioni”, “nudda”, and “painto”, respectively.

To get the list of words used in the lyrics, I first split the text into individual words. To do so, I use unnest_tokens() from the tidytext package.

library(tidyverse)
library(stringi)
library(tidytext)
library(hunspell)

# data upload: one element per line of the lyrics file
lyrics <- stri_read_lines("Il mio canto libero.txt")
df_pre <- data.frame(text = lyrics) %>%
  mutate(id = row_number()) %>%
  select(id, text)
# tokenization (unnest_tokens() lowercases words by default) and word counts
tokenized_text_pre <- df_pre %>%
  unnest_tokens(output = word, token = "words", input = text) %>%
  count(word, sort = TRUE)
# check spelling, loading the Italian dictionary once
dict_it <- dictionary("it_IT")
correct <- hunspell_check(tokenized_text_pre$word, dict = dict_it)
# incorrect words
incorrect.words <- tokenized_text_pre$word[!correct]
# find suggestions for incorrect words and keep the first one (NA if none)
suggested.words <- hunspell_suggest(incorrect.words, dict = dict_it)
suggested.words <- unlist(lapply(suggested.words, function(x) x[1]))
# list of incorrect and suggested words
word.list <- data.frame(word = incorrect.words,
                        `best alternative` = suggested.words,
                        check.names = FALSE)
Spell checking text - 'Il mio canto libero'

word        best alternative
emozzioni   emozioni
nudda       nuda
painto      pianto

The misspelled words in the song lyrics can be replaced with the suggested alternatives using the code below.

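# match whole words only, so that substrings of longer words are left untouched
incorrect.whole.words <- paste0("\\b", word.list$word, "\\b")
# apply each pattern/replacement pair in turn to every line of the lyrics
lyrics <- stri_replace_all_regex(lyrics, incorrect.whole.words, suggested.words,
                                 vectorize_all = FALSE)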
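The behavior of the whole-word replacement can be checked on a small, self-contained input. The sketch below uses two invented lines (not the actual lyrics) containing the three misspellings. It also illustrates one caveat: because the tokenizer lowercases words, a capitalized misspelling in the original text is not matched unless the regex is made case-insensitive, in which case the replacement is inserted in lowercase.

```r
library(stringi)

# Hypothetical two-line text (not the real lyrics) with the three misspellings
lines <- c("Nudda sei tu, le mie emozzioni", "il painto, nudda")
bad   <- c("emozzioni", "nudda", "painto")
good  <- c("emozioni", "nuda", "pianto")

patterns <- paste0("\\b", bad, "\\b")

# Case-sensitive: the capitalized "Nudda" is left untouched
fixed <- stri_replace_all_regex(lines, patterns, good, vectorize_all = FALSE)
print(fixed)
#> [1] "Nudda sei tu, le mie emozioni" "il pianto, nuda"

# Case-insensitive: "Nudda" is matched too, but replaced by lowercase "nuda"
fixed_ci <- stri_replace_all_regex(
  lines, patterns, good,
  vectorize_all = FALSE,
  opts_regex = stri_opts_regex(case_insensitive = TRUE)
)
print(fixed_ci)
#> [1] "nuda sei tu, le mie emozioni" "il pianto, nuda"
```

With vectorize_all = FALSE, every pattern is applied in turn to every line, which is why a single call can fix several different words.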

After spell checking and replacement, the counts of the affected words are updated as expected: the misspelled forms disappear and their occurrences are counted under the correct forms.

Before spell checking

word        n
emozioni    1
emozzioni   1
nuda        1
nudda       1
painto      1

After spell checking

word        n
emozioni    2
nuda        2
pianto      1
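A before/after comparison like the one above can be produced by tokenizing and counting both versions of the text and joining the two count tables. The sketch below is one possible way to do it, assuming two short invented texts rather than the actual lyrics; the helper count_words() is hypothetical and not part of any package.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: tokenize a character vector of lines and count words
count_words <- function(lines) {
  data.frame(text = lines, stringsAsFactors = FALSE) %>%
    unnest_tokens(output = word, input = text, token = "words") %>%
    count(word, sort = TRUE)
}

# Hypothetical texts before and after replacing the misspelled words
before <- c("emozioni e emozzioni", "nuda, nudda")
after  <- c("emozioni e emozioni", "nuda, nuda")

# A full join keeps words that appear in only one of the two versions;
# NA in n_after marks a misspelled form that no longer occurs
comparison <- full_join(
  count_words(before), count_words(after),
  by = "word", suffix = c("_before", "_after")
)
print(comparison)
```

Rows with NA in n_after are the misspelled forms that were removed, and rows whose count increased are the correct forms that absorbed them.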

The thumbnail image is credited to Text mining icons created by Freepik - Flaticon