Friday, November 5, 2021

Tiny Spell Checker (part 1)

Inspired by a post I read about how designing a spell checker used to be a major software development effort[1] I designed a coding challenge of sorts for the teams I work with. We have these types of technical exchanges semi-regularly as a way to collaborate and explore technology problems; they also serve as a great team building exercise.         

I'd like to introduce the problem here and some of the solution space we considered. I plan on providing a bit more detail about my final solution in a future post. 

As an aside, we generally avoid making this competitive; it decreases participation and distracts from the main purpose of exploring the problem space and learning about others' approaches (which may be entirely non-technical).



Below are some of the topics we discussed during the showcase. The topics are primarily centered on various compresssion mechanisms (lossless and otherwise) as you might expect. However, there were additional conversations that considered a broader context. For example:
  • translating to another language which had better compression properties 
  • using phonetics (stenography)
  • a literature survey of compression techniques
  • trade-offs in time, space, and accuracy of each approach
It is those last two bullets that really highlight the benefit of these exchanges. 

It is one thing to be able to implement a decent compression algorithm; but learning how to consider the problem under extended or different constraints allows us to apply these ideas to the projects we are working on and generally makes us better developers.

Asymmetric Numeral Systems (ANS)

A set of entropy encoding methods that include probability distributions and finite state machines which combine to achieve performant compression. In particular, we discussed the table-based version of ANS (tANS).

Huffman Coding

Lossless compression technique that uses symbol frequencies (probability) to build a table for encoding symbols relative to their frequency: the more frequent a symbol the fewer the bits used to encode it.

Some additional discussion was the cost of using a well-known table or building it from the input dictionary and storing it in the encoded (compressed) output.

n-grams

Encoding common groups of letters into a single symbol/value. For example, encoding the sequence "ing" as a single value. 

This is very similar to Huffman but operating on n-grams instead of a single byte.

Rabin-Karp

Is an algorithm for searching for a string in a body of text. At a high level it leverages a hash to do approximate string matching and only does full string matching if there is an approximate match.

Trie

A data structure that encodes words in a prefix tree and indicates whether a particular word exists in the tree (i.e. is valid) when all letters follow a path in the tree and end at a word terminator node.

Bloom filter

A data structure that, given a hash function, can answer either: an element may be encoded in the set; or it is definitely not in the set. That is, there is the possibility for false positives but not false negatives when checking for existence of elements.

A nice property of this data structure is that it has known properties regarding the false positive rate given a size and hashing strategy used so it can be tuned to a specific accuracy given size constraints.