The hunspell package is a high-performance stemmer, tokenizer, and spell checker for R. The underlying Hunspell library is used by LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, Opera, RStudio and many others, and dictionaries are available for many languages, including Italian. Hunspell uses a special dictionary format that defines which characters, words and conjugations are valid in a given language. In this post I illustrate how to use the spell checker.
Author
Angelo Maria Sabatini
Published
January 2, 2025
Spell checking
Spell checking text consists of the following steps:
Parse a document by extracting (tokenizing) words that we want to check
Analyze each word by breaking it down into its root (stemming) and conjugation affix
Look up in a dictionary whether the word+affix combination is valid for the stated language
(optional) For incorrect words, suggest corrections by finding similar (correct) words in the dictionary
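The four steps above can be sketched with the corresponding hunspell functions. This is only an illustration: the sample sentence is made up, and the en_US dictionary bundled with the package is used (an assumption, chosen so that the sketch runs without installing an additional dictionary).

```r
library(hunspell)

# Assumption: the bundled en_US dictionary is used purely for illustration
dict <- dictionary("en_US")

# Hypothetical sample text containing one misspelled word
text <- "The professsor loves reading old books"

# 1. Parse: hunspell_parse() returns one character vector of words per input string
words <- hunspell_parse(text, format = "text", dict = dict)[[1]]

# 2. Analyze: stems and affix information for each word
stems <- hunspell_stem(words, dict = dict)
analysis <- hunspell_analyze(words, dict = dict)

# 3. Look up: TRUE where the word is valid in the dictionary
ok <- hunspell_check(words, dict = dict)

# 4. (optional) Suggest corrections for the rejected words only
suggestions <- hunspell_suggest(words[!ok], dict = dict)
names(suggestions) <- words[!ok]
```

Each step returns one result per input word, so the outputs can be lined up against `words` for inspection.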
Check individual words
Let us suppose that the text to be spell checked is stored in a character vector, with each element being a single word in the chosen language (Italian, here). A custom dictionary can be passed through the dict argument of the hunspell_* functions.
Note that hunspell_suggest() returns a list, with one character vector of candidate corrections for each input word. A vector of corrected words can then be built by picking one candidate for each misspelled word. Not every candidate is a sensible replacement; the first one is usually the best alternative, but the quality of the data can be improved by exploring the full list carefully.
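A minimal sketch of this procedure follows. The helper name `best_alternatives` is hypothetical, and taking the first suggestion is a simplification that follows the observation above.

```r
library(hunspell)

# Hypothetical helper: returns the misspelled words together with the first
# suggestion offered by hunspell for each of them
best_alternatives <- function(words, dict) {
  correct <- hunspell_check(words, dict = dict)
  incorrect_words <- words[!correct]

  # hunspell_suggest() returns a list: one character vector per word
  suggestions <- hunspell_suggest(incorrect_words, dict = dict)

  # Keep the first candidate; NA when no suggestion is available
  best <- vapply(
    suggestions,
    function(x) if (length(x) > 0) x[1] else NA_character_,
    character(1)
  )

  data.frame(word = incorrect_words, `best alternative` = best,
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Usage for Italian, assuming an it_IT dictionary is installed where hunspell
# can find it (it is not bundled with the package):
# word.list <- best_alternatives(c("ammore", "professsore"), dictionary("it_IT"))
```

Passing the dictionary as an argument keeps the helper independent of the language being checked.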
Table: Spell checking text

word         best alternative
-----------  ----------------
ammore       amore
professsore  professore
An example
Let us suppose that we want to analyze the lyrics of the song titled “Il mio canto libero”. One occurrence each of the words “emozioni”, “nuda” and “pianto” was intentionally misspelled as “emozzioni”, “nudda” and “painto”, respectively.
To get the list of words used in the lyrics, I first split the text into individual words. To do so, I use unnest_tokens() from the tidytext package.
```r
library(tidyverse)
library(stringi)
library(tidytext)
library(hunspell)

# load the lyrics, one line of text per element
lyrics <- stri_read_lines("Il mio canto libero.txt")
df_pre <- data.frame(text = lyrics) %>%
  mutate(id = row_number()) %>%
  select(id, text)

# tokenization
tokenized_text_pre <- df_pre %>%
  unnest_tokens(output = word, token = "words", input = text) %>%
  count(word, sort = TRUE)

# check spelling
correct <- hunspell_check(tokenized_text_pre$word, dict = dictionary("it_IT"))

# incorrect words
incorrect.words <- tokenized_text_pre$word[!correct]

# find suggestions for incorrect words, keeping the first one for each word
suggested.words <- hunspell_suggest(incorrect.words, dict = dictionary("it_IT"))
suggested.words <- unlist(lapply(suggested.words, function(x) x[1]))

# list of incorrect and suggested words
word.list <- as.data.frame(cbind(word = incorrect.words, `best alternative` = suggested.words))
```
Table: Spell checking text - 'Il mio canto libero'

word       best alternative
---------  ----------------
emozzioni  emozioni
nudda      nuda
painto     pianto
The misspelled words in the song lyrics can be replaced with the suggested alternative using the code below.
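One possible way to do this is sketched here, under a few assumptions: the helper name `replace_misspellings` is hypothetical, matching is restricted to whole words and ignores case, and the two demo lines are made up rather than taken from the song.

```r
library(stringi)

# Hypothetical helper: replace each misspelled word with its suggested alternative
replace_misspellings <- function(text, incorrect, suggested) {
  keep <- !is.na(suggested)          # skip words for which no suggestion exists
  if (!any(keep)) return(text)

  # \b restricts matches to whole words; \Q...\E quotes the word literally
  patterns <- paste0("\\b\\Q", incorrect[keep], "\\E\\b")

  stri_replace_all_regex(
    text, patterns, suggested[keep],
    vectorize_all = FALSE,           # apply every pattern to every line of text
    opts_regex = stri_opts_regex(case_insensitive = TRUE)
  )
}

# Made-up demo lines (not the actual lyrics)
demo <- c("tante emozzioni", "una parola nudda")
fixed <- replace_misspellings(demo, c("emozzioni", "nudda"), c("emozioni", "nuda"))
```

With the objects built earlier, the call would be `replace_misspellings(lyrics, incorrect.words, suggested.words)`. Because unnest_tokens() lowercases the tokens, a capitalized misspelling is replaced by a lowercase suggestion, so capitalization may need to be restored by hand.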