[Discussion] Super-Resolution for Text

In this week’s online meetup, I mentioned looking at machine-learning-based super-resolution, that is, the various ML approaches to image enhancement and upscaling.

Here are a couple of examples, but there’s been a lot of research in the area of upscaling photos - see Papers With Code:

The use-case I had in mind was optimising printed content, both images and text (hopefully reducing processing time and/or improving quality).

I hadn’t considered the use-case for accessibility, but it sounds like an interesting area if a near realtime system could be used to help with readability of printed & scanned documents.

Knowing that text and photographs have different characteristics, and would (at the very least) benefit from different training sets tuned for either general photos or textual content, I had a look around for existing research on text-optimised super-resolution.

Quite a few papers concentrate on image enhancement to improve Optical Character Recognition (OCR):

Some image restoration approaches for common artifact/fault types have been found to work well with text:

There are also some datasets and techniques focussing on general text enhancement within photos:

I haven’t done anything with this yet, but it’s an interesting area with meaningful use-cases & a lot of research to base any work on.

If anyone has any other tools, use-cases or specific issues this could help with, it would be interesting to hear your thoughts.



I’ve only just joined this web site today and am very interested in this particular topic and what was discussed.

Regarding “I hadn’t considered the use-case for accessibility, but it sounds like an interesting area if a near realtime system could be used to help with readability of printed & scanned documents”: this is the exact use-case I’m interested in, i.e. improving the legibility of low-resolution printed and on-screen text for people with visual impairments, by training a neural network to perform both text image super-resolution and image denoising when used with a webcam and small printed text - e.g. removing webcam salt-and-pepper noise and JPEG compression artifacts.

I’ve had fairly promising results with a shallow-ish network consisting of a few res blocks followed by a couple of upsampling and convolutional layers, but I think I can do better by devising a custom loss function. As pointed out in the “On-Device Text Image Super Resolution” paper you linked to, using per-pixel mean squared error as the loss function results in slightly blurred edges on the text characters (though it removes noise nicely). Using L1 loss seems to make the edges slightly sharper, but I think I should be able to do better.
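For anyone curious what “a few res blocks followed by a couple of upsampling and convolutional layers” might look like, here’s a minimal Keras sketch. The layer counts, filter sizes and pixel-shuffle upsampling are my assumptions, not the poster’s actual architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def res_block(x, filters=64):
    """Simple residual block: two 3x3 convs with an additive skip connection."""
    skip = x
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    return layers.Add()([x, skip])

def build_text_sr_model(scale=2, num_res_blocks=4, filters=64):
    """Shallow SR network: a few res blocks, then upsampling + conv layers."""
    inputs = layers.Input(shape=(None, None, 1))  # grayscale low-res patch
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(inputs)
    for _ in range(num_res_blocks):
        x = res_block(x, filters)
    # Pixel-shuffle upsampling: expand channels, then rearrange into space.
    x = layers.Conv2D(filters * scale**2, 3, padding="same")(x)
    x = layers.Lambda(lambda t: tf.nn.depth_to_space(t, scale))(x)
    outputs = layers.Conv2D(1, 3, padding="same")(x)
    return Model(inputs, outputs)

model = build_text_sr_model()
model.compile(optimizer="adam", loss="mse")  # or "mae" for L1
```

Training against pairs of downsampled/noisy patches and their clean originals would reproduce the MSE-blur vs. L1-sharpness trade-off described above.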

I was thinking of a custom loss function that somehow blends the L2 loss with a term computed by applying a Laplace or Sobel filter to the predicted output and taking the variance over the pixel values, since low variance usually suggests a blurred image. Or perhaps pre-train the network using L2 loss and then fine-tune the model’s weights using a GAN (which, being new to machine learning, I haven’t got round to attempting yet).
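One way the blended loss idea could be sketched (my interpretation, with a hypothetical `edge_weight` that would need tuning, and note that subtracting a variance term makes the loss unbounded below, so the weight would need to be kept small):

```python
import tensorflow as tf

def edge_variance_loss(y_true, y_pred, edge_weight=0.1):
    """L2 loss minus a reward for high edge variance in the prediction.

    Low variance of a Laplacian-filtered image usually indicates blur,
    so rewarding edge variance should encourage sharper character edges.
    """
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    # 3x3 Laplacian kernel, shaped [height, width, in_channels, out_channels].
    lap = tf.constant([[0., 1., 0.],
                       [1., -4., 1.],
                       [0., 1., 0.]], dtype=tf.float32)
    lap = tf.reshape(lap, [3, 3, 1, 1])
    edges = tf.nn.conv2d(y_pred, lap, strides=1, padding="SAME")
    edge_var = tf.math.reduce_variance(edges)
    return mse - edge_weight * edge_var
```

Swapping the Laplacian kernel for Sobel kernels (one per gradient direction) would be the other variant mentioned.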

I noticed in the “On-Device Text Image Super Resolution” paper that they pass the input tensor through a Sobel filter, convolve it, and concatenate the result with the output of the first few convolutional layers and the other dense feature maps accumulated so far. I’m very new to deep learning, but to me this sounds like manual rather than learned feature selection? Still, if they got sharp text edges, trained using L2 loss alone, and achieved state-of-the-art results, that approach must work.
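A rough sketch of that “hand-crafted edge branch alongside learned features” wiring, assuming grayscale input; this is my guess at the general idea, not the paper’s exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_sobel_concat_model(filters=32):
    """Mix a fixed Sobel edge branch with learned conv features."""
    inputs = layers.Input(shape=(None, None, 1))  # grayscale image
    # Learned branch: a couple of ordinary conv layers.
    feat = layers.Conv2D(filters, 3, padding="same", activation="relu")(inputs)
    feat = layers.Conv2D(filters, 3, padding="same", activation="relu")(feat)
    # Hand-crafted branch: Sobel gradients of the raw input.
    # tf.image.sobel_edges returns (batch, h, w, channels, 2) for (dy, dx);
    # squeezing the channel axis leaves the two gradient maps as channels.
    edges = layers.Lambda(
        lambda t: tf.squeeze(tf.image.sobel_edges(t), axis=3))(inputs)
    edges = layers.Conv2D(filters, 3, padding="same", activation="relu")(edges)
    # Concatenate edge features with the learned feature maps.
    x = layers.Concatenate()([feat, edges])
    outputs = layers.Conv2D(1, 3, padding="same")(x)
    return Model(inputs, outputs)
```

The Sobel branch injects an explicit edge prior the network would otherwise have to learn, which may be why sharp edges emerge even under plain L2 loss.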

So currently I’m trying my own approach: devising a custom loss function that promotes sharp edges rather than the slightly blurred, smooth edges I’m getting now (albeit better than bicubic interpolation). Right now, though, I’m trying to devise a better learning rate schedule just to see if I can squeeze a little more out of my currently trained model. I’m open to suggestions, as I’m a complete newbie to machine learning and have only trained perhaps half a dozen models using Tensorflow/Keras/Python, learning deep learning by reading and then applying it to problems I’m interested in solving.
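On the learning rate schedule: two common starting points in Keras are plateau-based reduction (well suited to squeezing more out of an already-trained model) and cosine decay. The specific factors, patience and step counts below are illustrative guesses:

```python
import tensorflow as tf

# Option 1: halve the learning rate whenever validation loss stalls -
# useful when fine-tuning an existing model.
plateau = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5, min_lr=1e-6)
# Pass via: model.fit(..., callbacks=[plateau])

# Option 2: smooth cosine decay from an initial rate over a fixed budget.
cosine = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3, decay_steps=100_000)
optimizer = tf.keras.optimizers.Adam(learning_rate=cosine)
```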

Hi Gareth. Welcome to the Tech Shed Frome!

I’ve not done much ML for a while, so I’m a bit out of practice and don’t think I can advise well right now (I’ve covered a lot in various courses, exercises and reading, but not much real-world or from-scratch work).

I’m wondering whether a mixed approach (though probably not the most efficient) might be useful, e.g. after an initial pass to improve image quality, a pass to isolate the text, then something like this (intended for isolated text rather than full images) to further clean it up:

Sorry if that’s a bit vague.
If you have a Git repo you’re willing to share, that may make it easier for others to help out.

Also worth mentioning the BathML Meetup group - they’ve not met physically for a while due to the pandemic, but there are a lot of experienced people there and the Bath ML Slack group is active in the meantime.
