[ale] GoLUG online presentation: DIY Spellchecker: My Adventures and Misadventures

Sun Jul 10 16:52:19 EDT 2022

Bob Toxen via Ale said on Sat, 9 Jul 2022 15:51:56 -0400

>I'm sorry I missed this?  Any recorded video?

Hi Bob,

I've decided not to allow recordings because I'd need written
permissions from every participant.

>
>I'm sure you and I could teach those idiots (IMHO) at Apple about it.
>It's so disappointing on my iPhone!  It will change ALL 5 of the
>letters I typed in a word and come up with something completely
>different without even asking me if I want that.  General rule is to
>add or delete a letter or swap two.

I actually came up on something like what you're talking about. I first
tried soundex algorithms, which are really meant for interpreting
erroneous speech to text using soundalikes, rather than interpreting
typos or 1 or 2 erroneous vowels. My result using soundex was, as you
said, "something completely different".

I then switched to the Levenshtein Distance algorithm, meant to
calculate how many inserts+deletes+substitutions are needed to convert
one word to another. This gave a fairly good list of suggestions, using
a Levenshtein maximum of 3. There are versions of Levenshtein that
count erroneously reversed adjacent letters as 1 distance instead of 2.
I hope someday to use that algorithm and bring the maximum down to 2,
which I believe will bring up optimal suggestions.

You mention them automatically switching. Only place I've seen such
behavior is in instantaneous as-you-type spellchecking, which can only
be implemented as a part of or plugin to the particular authoring
environment. My spellchecker will check every word of every paragraph
segment in an HTML file that's also well-formed XML (which is how I
write my HTML), so my spellchecker is whole-document.

At the meeting I mentioned that one bottleneck was that my spellchecker
too about 1 second to spellcheck 1000 correctly-spelled words. This is
because my program brute-forced one-by-one comparison with each line of
each dictionary. Since then I used the extremely quick and
collision-avoidant Jenkins One At A Time algorithm to build a sorted
file of hash values to check on, and a binary search algorithm to check
that list, in RAM, to detect misspellings. Using enough identical "Mary
Had A Little Lamb" poems to make 1386128 words to check, the entire
check plus loading the hash array from disk took 3.9 seconds, or better
than 355,000 words per second. 355,000 words is the size of 710 page
book assuming the industry standard of 500 words per page.

However, one more bottleneck looms: The handoffs from Python to C and
back for each paragraph segment. If this becomes a problem, I might
need to keep the C part running fulltime, processing paragraph
segments, and sending them back with revisions or not. I might need to
treat these segments like network packets, and might need to implement
a UDP-like order-corrector. Or, if I can ever find a C language XML
parser I really trust, I could eliminate Python from the mix entirely.

SteveT

Steve Litt 
Summer 2022 featured book: Making Mental Models: Advanced Edition
http://www.troubleshooters.com/mmm