I've been looking at ASCII art recently - at least in the context of
converting images into ASCII-rendered versions of the original. This has
resulted in a few simple projects including one where I rendered a webcam
video feed as ASCII in a terminal. One of the challenges I faced in that
project was finding the correct size of the font to get reasonable results; a
standard 80 x 24 terminal left a lot to be desired in terms of the final text
rendering of a video frame. It was easy enough to simply resize the
font/terminal to get what I wanted but it wasn't a very elegant solution and
didn't translate very well to other projects (where the render target was not
a terminal, for example).
To address that limitation I ended up writing a dynamic font 'renderer'[1] -
this post describes some aspects of that project.
The parts I found most interesting include the font rendering/scaling, the
tiling of the image into regions, and the different ways the two can be combined
when rendering.
Dealing With Fonts
This was mainly an exercise in familiarizing myself with the FreeType font
library and how to load and render fonts in memory. I could have just taken one of the many
already existing ASCII-to-grayscale mappings but I wanted the ability to
provide a few additional features that required more than that (e.g. custom
character sets and non-fixed-width fonts).
So, rather than use a fixed set of characters, I dynamically compute the
average brightness of each glyph in a particular character set (for a given
font) and use that to select glyphs to replace image pixels.
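As a rough sketch of how that ranking might look (using the freetype-py
bindings; the font path and character set below are just illustrative):

```python
# Sketch of per-glyph brightness ranking with freetype-py; the font path
# and character set are illustrative placeholders.
import freetype

def glyph_brightness(face, char):
    """Average coverage (0..1) of a glyph's bitmap over its advance cell."""
    face.load_char(char)                     # rasterizes to an 8-bit bitmap
    bmp = face.glyph.bitmap
    total = float(sum(bmp.buffer))           # sum of per-pixel coverage
    cell = (face.glyph.advance.x >> 6) * (face.size.height >> 6)
    return total / (255.0 * cell) if cell else 0.0

face = freetype.Face("DejaVuSansMono.ttf")
face.set_pixel_sizes(0, 32)                  # reference size for the ranking
charset = " .:-=+*#%@"
brightness = {c: glyph_brightness(face, c) for c in charset}
peak = max(brightness.values()) or 1.0

def char_for(gray):
    """Pick the glyph whose relative brightness best matches `gray` (0..255)."""
    target = gray / 255.0 * peak             # rescale into the charset's range
    return min(charset, key=lambda c: abs(brightness[c] - target))
```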
In addition to choosing which font and glyphs to use, I also dynamically scale
the glyphs based on properties of each region of an image. The FreeType
font library supports this directly, allowing me to render a glyph at whatever
size I specify; I can then 'copy' those pixels directly into the translated image.
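A minimal sketch of that scaling-and-copying step, again with freetype-py plus
numpy; the `blit_glyph` helper and its edge clipping are my illustration rather
than the project's exact code:

```python
import freetype
import numpy as np

def blit_glyph(face, char, px_height, canvas, top, left):
    """Render `char` at `px_height` pixels and copy it into `canvas` (uint8)."""
    face.set_pixel_sizes(0, px_height)       # FreeType rescales the outlines
    face.load_char(char)
    bmp = face.glyph.bitmap
    rows = np.array(bmp.buffer, dtype=np.uint8).reshape(bmp.rows, bmp.pitch)
    glyph = rows[:, :bmp.width]              # pitch may include padding bytes
    h = min(bmp.rows, canvas.shape[0] - top) # clip at the canvas edges
    w = min(bmp.width, canvas.shape[1] - left)
    canvas[top:top + h, left:left + w] = glyph[:h, :w]
```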
Image Tiling
I used a windowing mechanism to select the size of the font to use. Given an
image, I subdivide it into N rows and M columns and select a font size for
each cell (more on how a size is selected later).
Then, within that cell, I compute the properties of the region under each
character glyph and translate that source region to the ASCII character that
most closely matches its average grayscale value.
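In outline, the tiling loop looks something like the sketch below, where
`pick_size` and `char_for` stand in for the size-selection heuristic
(described later) and the brightness matching (described above):

```python
import numpy as np

def tile_to_ascii(gray, n_rows, m_cols, pick_size, char_for):
    """Subdivide `gray` (2-D uint8 array) into n_rows x m_cols cells, choose
    a glyph size per cell, and map each glyph-sized region to a character."""
    placements = []
    ch, cw = gray.shape[0] // n_rows, gray.shape[1] // m_cols
    for i in range(n_rows):
        for j in range(m_cols):
            cell = gray[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            gh, gw = pick_size(cell)         # glyph height/width for this cell
            for y in range(0, ch - gh + 1, gh):
                for x in range(0, cw - gw + 1, gw):
                    region = cell[y:y + gh, x:x + gw]
                    placements.append((i * ch + y, j * cw + x,
                                       char_for(region.mean())))
    return placements                        # (row, col, char) triples
```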
This nested tiling approach allows for selecting font sizes in any number of
ways. For example, selecting a random font size within each N,M cell of the
image results in the following.
Zooming in on a section of that image highlights a few things about this
approach:
- there is flexibility in rendering each region in isolation
- clipping at the edges of regions will start to produce artifacts in the
output image due to discontinuities between the glyphs
- as you approach a glyph size of 1x1 you approach a pixel-level copy of
the original image region
Of course, echoing the original impetus for this project, the resolution of
the final image still leaves something to be desired. Select a size too large
and there is loss of detail; too small and the 'ascii' effect is diminished.
So I started experimenting with ways to quantify the amount of resolution
in a section of the image and how to translate that to font size.
Rendering
Finding the right way to encode resolution turned out to be a bit of a rabbit
hole for me. I started with the idea of using the entropy of each window but
broadened my search to frequency- and thresholding-based techniques.
RMS
A basic root mean square (RMS) approach captured some of the contrast in the
image, but it wasn't sufficient for what I wanted.
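For reference, the RMS measure here is just the standard deviation of
normalized intensity in each window:

```python
import numpy as np

def rms_contrast(window):
    """RMS contrast: standard deviation of intensities normalized to 0..1."""
    return (window.astype(np.float64) / 255.0).std()
```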
Frequency Domain
After a bit of research I came across a paper "Contrast in Complex Images" by
Eli Peli which discussed how the human eye perceives contrast and various ways
to compute that spatially across an image. It fit nicely into the notion of a
subdivided image and provided an equation to compute sub-cell contrast based on
high-frequency components in an image.
Essentially this consists of taking the DFT of an image, running that through
a high-pass filter, and computing the inverse DFT of the filtered content.
This is closely related to edge detection but maintains more high-frequency
information (as you can see in the examples below).
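A crude sketch of that pipeline using numpy's FFT; note I'm substituting an
ideal (hard-cutoff) high-pass filter for Peli's band-limited filters, and the
`cutoff` value is an arbitrary assumption:

```python
import numpy as np

def highpass_contrast(window, cutoff=0.1):
    """DFT -> ideal high-pass -> inverse DFT; score the window by the mean
    magnitude of the residual high-frequency image."""
    f = np.fft.fftshift(np.fft.fft2(window.astype(np.float64)))
    h, w = window.shape
    yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    radius = np.sqrt((yy / (h / 2.0)) ** 2 + (xx / (w / 2.0)) ** 2)
    f[radius < cutoff] = 0.0                 # drop the low-frequency content
    residual = np.fft.ifft2(np.fft.ifftshift(f)).real
    return np.abs(residual).mean()
```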
This worked well for some images, but the residual information created some
noise for certain image types. Consider the two images: kitten and wave.
The contrast is easier to identify in the kitten but the wave is a bit more
challenging.
Thresholding
Finally, I ended up using a tunable threshold technique: I convert the image
to B/W based on some threshold brightness and use each pixel's proximity to
that threshold as an indicator of the amount of contrast in each window. This
ultimately worked fairly well for the effect I was looking for, preserving
much of the detail of the frequency-based approach but without the residual
noise.
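A small sketch of the idea; the `threshold` and `band` parameters are
illustrative knobs, not tuned values from the project:

```python
import numpy as np

def threshold_contrast(window, threshold=128, band=32):
    """Binarize around `threshold`; score the window by the fraction of
    pixels whose intensity lies within `band` of the threshold."""
    near = np.abs(window.astype(np.int32) - threshold) < band
    return near.mean()                       # high score => lots of detail
```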
[1] I say renderer, but I use libfreetype to do all the heavy lifting of
pixel-level character rendering.