[ale] GoLUG online presentation: DIY Spellchecker: My Adventures and Misadventures
Bob Toxen
transam at verysecurelinux.com
Sun Jul 10 18:24:35 EDT 2022
I'm not completely sure but I suspect that you do NOT need written
permission from everyone to record.
Just announce at the start of the recording:
"This presentation will be recorded, both speakers and questions
and comments from the audience. Anyone who doesn't approve should
leave now."
Bob
On Sun, Jul 10, 2022 at 04:52:19PM -0400, Steve Litt via Ale wrote:
> Bob Toxen via Ale said on Sat, 9 Jul 2022 15:51:56 -0400
>
> >I'm sorry I missed this? Any recorded video?
>
> Hi Bob,
>
> I've decided not to allow recordings because I'd need written
> permissions from every participant.
>
> >
> >I'm sure you and I could teach those idiots (IMHO) at Apple about it.
> >It's so disappointing on my iPhone! It will change ALL 5 of the
> >letters I typed in a word and come up with something completely
> >different without even asking me if I want that. General rule is to
> >add or delete a letter or swap two.
>
> I actually came up on something like what you're talking about. I first
> tried soundex algorithms, which are really meant for interpreting
> erroneous speech to text using soundalikes, rather than interpreting
> typos or 1 or 2 erroneous vowels. My result using soundex was, as you
> said, "something completely different".
>
> I then switched to the Levenshtein Distance algorithm, meant to
> calculate how many inserts+deletes+substitutions are needed to convert
> one word to another. This gave a fairly good list of suggestions, using
> a Levenshtein maximum of 3. There are versions of Levenshtein that
> count erroneously reversed adjacent letters as 1 distance instead of 2.
> I hope someday to use that algorithm and bring the maximum down to 2,
> which I believe will bring up optimal suggestions.
>
> You mention them automatically switching. Only place I've seen such
> behavior is in instantaneous as-you-type spellchecking, which can only
> be implemented as a part of or plugin to the particular authoring
> environment. My spellchecker will check every word of every paragraph
> segment in an HTML file that's also well-formed XML (which is how I
> write my HTML), so my spellchecker is whole-document.
>
> At the meeting I mentioned that one bottleneck was that my spellchecker
> too about 1 second to spellcheck 1000 correctly-spelled words. This is
> because my program brute-forced one-by-one comparison with each line of
> each dictionary. Since then I used the extremely quick and
> collision-avoidant Jenkins One At A Time algorithm to build a sorted
> file of hash values to check on, and a binary search algorithm to check
> that list, in RAM, to detect misspellings. Using enough identical "Mary
> Had A Little Lamb" poems to make 1386128 words to check, the entire
> check plus loading the hash array from disk took 3.9 seconds, or better
> than 355,000 words per second. 355,000 words is the size of 710 page
> book assuming the industry standard of 500 words per page.
>
> However, one more bottleneck looms: The handoffs from Python to C and
> back for each paragraph segment. If this becomes a problem, I might
> need to keep the C part running fulltime, processing paragraph
> segments, and sending them back with revisions or not. I might need to
> treat these segments like network packets, and might need to implement
> a UDP-like order-corrector. Or, if I can ever find a C language XML
> parser I really trust, I could eliminate Python from the mix entirely.
>
> SteveT
>
> Steve Litt
> Summer 2022 featured book: Making Mental Models: Advanced Edition
> http://www.troubleshooters.com/mmm
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> https://mail.ale.org/mailman/listinfo/ale
> See JOBS, ANNOUNCE and SCHOOLS lists at
> http://mail.ale.org/mailman/listinfo
More information about the Ale
mailing list