Text Massager

2011-01-03 : In 2002, I wrote a little JavaScript called HTMLify that would convert plain text into nicely formatted HTML. Since then, there have been a lot of improvements I've wanted to make, but a plain JavaScript program is not really up to the task. So I'm going to promote it to full Java and see what I can do.

Ideas

Input

(done) Idea for specifying blocks: Allow regexps for specifying markers. Marker tokens are inserted into the token stream at the beginning of each match. Each regexp gets a group name ("A"), and then each instance gets an instance number ("1"). This gives each marker a unique name ("A1", "A5", "B2"). For example, a default set of markers might be placed before every newline. Ranges can then be specified from marker to marker. Mixing marker groups is allowed, and ranges are in standard half-open format.
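
As a rough illustration of the marker idea, here is a minimal Java sketch. It is not the actual TextMassager code: the class and method names (MarkerSketch, findMarkers, range) are made up for this example, markers are stored as character offsets rather than tokens in a stream, and numbering instances from 1 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkerSketch {
    /** A named marker and the character offset it sits in front of (hypothetical type). */
    public static final class Marker {
        public final String name;
        public final int offset;

        Marker(String name, int offset) {
            this.name = name;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return name + "@" + offset;
        }
    }

    /** Places one marker at the start of every match, named group+1, group+2, ... */
    public static List<Marker> findMarkers(String group, String regex, String text) {
        List<Marker> markers = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        int instance = 1; // assumption: instance numbers start at 1
        while (m.find()) {
            markers.add(new Marker(group + instance, m.start()));
            instance++;
        }
        return markers;
    }

    /** Returns the text in the half-open range [from, to): from is included, to is not. */
    public static String range(String text, List<Marker> markers, String from, String to) {
        return text.substring(offsetOf(markers, from), offsetOf(markers, to));
    }

    private static int offsetOf(List<Marker> markers, String name) {
        for (Marker m : markers) {
            if (m.name.equals(name)) {
                return m.offset;
            }
        }
        throw new IllegalArgumentException("No such marker: " + name);
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree\n";
        // Default-style markers: one before every newline.
        List<Marker> markers = findMarkers("A", "\n", text);
        System.out.println(markers);
        System.out.println(range(text, markers, "A1", "A2"));
    }
}
```

With the sample text, the range from A1 to A2 starts at the first newline and stops just before the second one. To mix groups, add the markers from a second findMarkers call (say, group "B") to the same list and ask for a range such as A1 to B2.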

Output

˜ ™

2011-02-08 : Well, at this point, I have a Java program that can process plain text files and produce the same results as the original JavaScript. Now it's time to start doing some improvements!

˜ ™

2011-06-03 : I've put in a bunch of work recently. It now produces very nice looking HTML files. The configuration scripts are a bit peculiar, but I'm quite happy with the results. It fixes ellipses and quotes. It supports epigraphs and hierarchical sections (chapters, etc.). You can do arbitrary search and replace to fix typos and things. You can control paragraph detection and use different settings in different regions.
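
To give a feel for what "fixing ellipses and quotes" can involve, here is a small Java sketch. The rules are assumptions chosen for the example (three dots become one ellipsis character; a double quote at the start of the text or after whitespace or an opening bracket is treated as an opening quote), and TypographySketch is a made-up name. The real program's rules may well differ.

```java
import java.util.regex.Pattern;

public class TypographySketch {
    // Exactly three consecutive periods.
    private static final Pattern ELLIPSIS = Pattern.compile("\\.{3}");

    // A straight double quote that is NOT preceded by a non-space, non-bracket character,
    // i.e. one at the start of the text or right after whitespace, "(" or "[".
    private static final Pattern OPEN_QUOTE = Pattern.compile("(?<![^\\s(\\[])\"");

    public static String fix(String text) {
        String s = ELLIPSIS.matcher(text).replaceAll("\u2026");   // horizontal ellipsis
        s = OPEN_QUOTE.matcher(s).replaceAll("\u201C");           // left double quote
        return s.replace("\"", "\u201D");                         // the rest are closing quotes
    }

    public static void main(String[] args) {
        System.out.println(fix("\"Wait...\" she said."));
    }
}
```

A heuristic this simple will get nested or unbalanced quotes wrong, which is one reason to allow different settings in different regions.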

It would probably be nice to create a manual. :)

Next stop, EPUB!

˜ ™

2011-06-09 : It now produces nice looking EPUB files. However, it takes quite a bit of fiddling with the script to get all the headers and titles to look good. Still, I like the results. Now I'm spending more time building config scripts than tweaking the code. :)

One challenge is producing markup that looks good in all viewers. My original plan was to keep the HTML as simple as possible. However, getting spacing and line breaks right is starting to add lots of fiddly markup. Sigh. I'll do my best to let it go.

˜ ™

2011-10-31 : Preliminary HTML source support added. It works well for the simple files I'm converting initially. It'll probably need more work in the long run.

The config file language is particularly bizarre. It's about time to refactor it into something better. Unfortunately, I don't really have any ideas for what a better language would look like.

Also, a little perf work would be good. I suspect it's the regexp engine but I need to profile to find out.

˜ ™

2011-11-10 : Lots of good fixes. There's MOBI (Kindle) support. I added some syntactic sugar and clarified how the boundaries of Ranges work, so they're now much easier to use.

I've had some problems with regexps moving markers around. I need a way to indicate that something needs to be matched but is not going to be replaced.
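
One possible way to match something without replacing it is the lookahead and lookbehind syntax that java.util.regex already supports. The sketch below is only an illustration of that technique, not necessarily the approach the program ended up using; LookaroundSketch and its methods are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundSketch {
    // Matches only the apostrophe. The letters on either side must be present,
    // but they are not part of the match, so they are never rewritten.
    private static final Pattern APOSTROPHE = Pattern.compile("(?<=\\p{L})'(?=\\p{L})");

    public static String fixApostrophes(String text) {
        return APOSTROPHE.matcher(text).replaceAll("\u2019"); // right single quote
    }

    /** Length of the first match, or -1 if there is none. */
    public static int firstMatchLength(String text) {
        Matcher m = APOSTROPHE.matcher(text);
        return m.find() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(fixApostrophes("don't 'quote'"));
        System.out.println(firstMatchLength("don't"));
    }
}
```

Because the match covers a single character, anything sitting in the surrounding text (such as a marker) is outside the replaced span.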

I'm also worried about stack depth. I'm wondering if I can switch from a "recursive" approach to an "iterative" approach. It would be nice if that allowed me to run stages of the pipeline concurrently too.

˜ ™

2011-11-29 : Cover art! Much awesomeness. Automatically lays out a cover with title, authors, and optional artwork, automatically adjusting font size to fit.
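
The "adjust font size to fit" step can be sketched as a simple shrink-until-it-fits loop. This is a guess at the general shape, not the actual cover code. FitFontSketch, fitFontSize, and the width-measuring callback are hypothetical, and the crude width estimate in main stands in for a real measurement such as java.awt.FontMetrics.stringWidth. The numbers 48 and 520 are the nominal title size and Kindle width listed under Other Notes.

```java
import java.util.function.IntUnaryOperator;

public class FitFontSketch {
    /**
     * Starts at the nominal size and shrinks one point at a time until the
     * measured width fits, or until minSize is reached.
     *
     * @param widthAtSize returns the rendered text width for a given font size
     */
    public static int fitFontSize(int nominalSize, int minSize, int maxWidth,
                                  IntUnaryOperator widthAtSize) {
        int size = nominalSize;
        while (size > minSize && widthAtSize.applyAsInt(size) > maxWidth) {
            size--;
        }
        return size;
    }

    public static void main(String[] args) {
        String title = "A Fairly Long Title For A Book"; // 30 characters
        // Stand-in measurement (assumption): each character is about 0.6 of the font size wide.
        IntUnaryOperator estimate = size -> title.length() * size * 6 / 10;
        System.out.println(fitFontSize(48, 8, 520, estimate));
    }
}
```

A linear scan is plenty for a handful of sizes; a binary search would only matter if measuring were expensive.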

I based the background on blue-curves.jpg from icantgetpublished.com. Doing a Google image search for "blue background" is rather fun!

˜ ™

2014-04-05 : Started work on a Spritz mode.

˜ ™

2016-05-15 : Time to dust this project off and do a little more work on it!

I just got an Apple Watch, and I was trying to have Siri read me some results off the web. She really doesn't want to do it. I did search and find out how you can have Siri on your phone read web pages if you turn on Accessibility mode. I had her read a story to me, and it was pretty good, except she kept getting some things wrong. She would pause longer for a comma than a period, and she'd mispronounce some words. A little reformatting could easily work around that, and I have just the tool.

Also, I'm a FimFiction fanatic, and I wanted her to be able to read stories to me. So I decided to start there. I've worked on the HTML parser to handle more complicated web pages. Plus I added an HTML-hierarchy-based filter to pick out regions of a complex web page. Once I get the text cleaned up, I'll give Siri mode a shot. (Plus, maybe I'll get Spritz mode to a working state!)

Config File Reference

Syntax

System Filters - These filters will generally go in a file-type config file, not a project-specific file.

User Filters - These filters will generally go in your project-specific file.

Other Notes

Kindle width is 520. Kindle DX width is 745.

Calibre generated cover size is 590x740.

Nominal font size: Title: 48 Author: 32

The Result

Source: https://latenighthacking.com:8443/svn/code_root/2011/TextMassager
