[ale] GoLUG online presentation: DIY Spellchecker: My Adventures and Misadventures

Bob Toxen transam at verysecurelinux.com
Sun Jul 10 18:24:35 EDT 2022


I'm not completely sure but I suspect that you do NOT need written
permission from everyone to record.

Just announce at the start of the recording:

  "This presentation will be recorded, both speakers and questions
  and comments from the audience.  Anyone who doesn't approve should
  leave now."

Bob

On Sun, Jul 10, 2022 at 04:52:19PM -0400, Steve Litt via Ale wrote:
> Bob Toxen via Ale said on Sat, 9 Jul 2022 15:51:56 -0400
> 
> >I'm sorry I missed this.  Any recorded video?
> 
> Hi Bob,
> 
> I've decided not to allow recordings because I'd need written
> permissions from every participant.
> 
> >
> >I'm sure you and I could teach those idiots (IMHO) at Apple about it.
> >It's so disappointing on my iPhone!  It will change ALL 5 of the
> >letters I typed in a word and come up with something completely
> >different without even asking me if I want that.  General rule is to
> >add or delete a letter or swap two.
> 
> I actually ran into something like what you're talking about. I first
> tried soundex algorithms, which are really meant for interpreting
> erroneous speech to text using soundalikes, rather than interpreting
> typos or 1 or 2 erroneous vowels. My result using soundex was, as you
> said, "something completely different".
> 
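For illustration, here is a minimal Python sketch of the classic American
Soundex code, which was designed to group names that sound alike. The
function name soundex and the sample words are assumptions made for this
example, not taken from Steve's program. It hints at why the suggestions can
look unrelated: words with quite different spellings share a code, while a
simple typo in the first letter changes the code entirely.

```python
# Map each consonant to its Soundex digit; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code, or "" for no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not separate same-coded letters
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code        # new consonant sound: record its digit
        prev = code               # a vowel resets prev, allowing a repeat
    return (result + "000")[:4]   # pad or truncate to letter + 3 digits


if __name__ == "__main__":
    for w in ("there", "tiara", "the", "hte"):
        print(w, soundex(w))
```

Here "there" and "tiara" collapse to the same code, while the transposition
typo "hte" no longer matches "the" at all.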
> I then switched to the Levenshtein Distance algorithm, meant to
> calculate how many inserts+deletes+substitutions are needed to convert
> one word to another. This gave a fairly good list of suggestions, using
> a Levenshtein maximum of 3. There are versions of Levenshtein that
> count erroneously reversed adjacent letters as 1 distance instead of 2.
> I hope someday to use that algorithm and bring the maximum down to 2,
> which I believe will bring up optimal suggestions.
> 
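The two distances described above can be sketched in Python roughly as
follows. This is only an illustration under stated assumptions: the names
levenshtein, osa_distance and suggest are hypothetical, and the variant that
counts a swap of adjacent letters as 1 is assumed here to be the "optimal
string alignment" form of Damerau-Levenshtein, which may not be the version
Steve has in mind.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: each insert, delete or substitution costs 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute (or match)
        prev = cur
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Like levenshtein, but swapping two adjacent letters costs 1, not 2."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # Adjacent transposition: "eh" in the typo versus "he" in the word.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(word, dictionary, max_distance=2, distance=osa_distance):
    """Return dictionary words within max_distance, closest first."""
    scored = sorted((distance(word, w), w) for w in dictionary)
    return [w for dist, w in scored if dist <= max_distance]


if __name__ == "__main__":
    print(levenshtein("teh", "the"), osa_distance("teh", "the"))
    print(suggest("teh", ["the", "ten", "tea", "them", "zebra"]))
```

With the transposition rule, "teh" is at distance 1 from "the" instead of 2,
which is what would make a lower maximum workable.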
> You mention them automatically switching. Only place I've seen such
> behavior is in instantaneous as-you-type spellchecking, which can only
> be implemented as a part of or plugin to the particular authoring
> environment. My spellchecker will check every word of every paragraph
> segment in an HTML file that's also well-formed XML (which is how I
> write my HTML), so my spellchecker is whole-document.
> 
> At the meeting I mentioned that one bottleneck was that my spellchecker
> took about 1 second to spellcheck 1000 correctly-spelled words. This is
> because my program brute-forced one-by-one comparison with each line of
> each dictionary. Since then I used the extremely quick and
> collision-avoidant Jenkins One At A Time algorithm to build a sorted
> file of hash values to check on, and a binary search algorithm to check
> that list, in RAM, to detect misspellings. Using enough identical "Mary
> Had A Little Lamb" poems to make 1386128 words to check, the entire
> check plus loading the hash array from disk took 3.9 seconds, or better
> than 355,000 words per second. 355,000 words is the size of a 710-page
> book, assuming the industry standard of 500 words per page.
> 
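A rough Python sketch of the hash-plus-binary-search lookup described above.
Steve's version is in C and reads a sorted hash file from disk; here the
table is built in memory, and the names build_hash_table and is_known, the
lowercasing, and the UTF-8 encoding are all assumptions made for the example.
Because only 32-bit hashes are stored, a misspelling whose hash happens to
collide with a dictionary word would be accepted as correct.

```python
from bisect import bisect_left

MASK32 = 0xFFFFFFFF  # emulate C's 32-bit unsigned arithmetic


def jenkins_one_at_a_time(data: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash, returning a 32-bit value."""
    h = 0
    for byte in data:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


def build_hash_table(words):
    """Hash every dictionary word and return the sorted, de-duplicated hashes."""
    return sorted({jenkins_one_at_a_time(w.lower().encode("utf-8"))
                   for w in words})


def is_known(word: str, table) -> bool:
    """Binary-search the sorted hash table for the word's hash."""
    h = jenkins_one_at_a_time(word.lower().encode("utf-8"))
    i = bisect_left(table, h)
    return i < len(table) and table[i] == h


if __name__ == "__main__":
    table = build_hash_table(["Mary", "had", "a", "little", "lamb"])
    for w in ("Mary", "little", "lamm"):
        print(w, is_known(w, table))
```

Each lookup costs one hash computation plus about log2(N) comparisons,
instead of one string comparison per dictionary line.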
> However, one more bottleneck looms: The handoffs from Python to C and
> back for each paragraph segment. If this becomes a problem, I might
> need to keep the C part running fulltime, processing paragraph
> segments, and sending them back with revisions or not. I might need to
> treat these segments like network packets, and might need to implement
> a UDP-like order-corrector. Or, if I can ever find a C language XML
> parser I really trust, I could eliminate Python from the mix entirely.
> 
> SteveT
> 
> Steve Litt 
> Summer 2022 featured book: Making Mental Models: Advanced Edition
> http://www.troubleshooters.com/mmm
