For a publication library Win32 app, I am looking to extract data from its proprietary and undocumented file format.
* The format is multilingual, predates Unicode, and is most likely compressed.
* The individual documents are search indexed (indices in separate files).
* They are sorted into a hierarchy: Publication (file-level) > Chapter > Content.
* Some publications are magazines; their hierarchy is: Publication (file-level) > Year > Issue > Chapter > Content.
* The documents are interconnected with hyperlinks.
* Few publications contain images; most of the content is formatted text.
I will provide you with the entire library viewer app, including all of its publication files (over 1 GB).
* The tool you develop should read the files and convert them to the following format.
* I can work from a set of legible, interconnected HTML (and JPG) files with their TOC files, sorted into nested folders.
* All formatting, links and footnotes need to be retained.
* Indices should be in a separate file per publication, linking to HTML anchor tags <a id="uniqueID"> in the content files.
* I should be able to use the same tool on more files of the same specification.
* Delivering a command line tool for Win32, x64, Linux or macOS is fine.
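
To illustrate the output layout described above, here is a minimal Python sketch. It is only an assumption about how the result could be organized, not a required design: the function name `write_publication`, the file names `content.html` and `index.html`, and the shape of the `entries` tuples are all hypothetical, and the sketch says nothing about how the proprietary files are actually decoded.

```python
import html
from pathlib import Path
from urllib.parse import quote

# Hypothetical page wrapper; output is written as UTF-8.
PAGE = ('<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head><body>\n'
        '{body}\n</body></html>\n')


def write_publication(root, publication, entries):
    """Write nested content files plus one index file for a publication.

    entries: iterable of (path_parts, anchor_id, heading, body_html), where
    path_parts is e.g. ("ch01",) for a book or ("1998", "03", "ch01") for a
    magazine (Year > Issue > Chapter). Folder names are assumed to be
    filesystem-safe, and anchor IDs unique within the publication.
    """
    pub_dir = Path(root) / publication
    pages = {}       # relative file path -> list of HTML fragments
    index_rows = []  # (heading, link target) pairs for the index file

    for path_parts, anchor_id, heading, body_html in entries:
        rel = "/".join(path_parts) + "/content.html"
        fragment = (f'<h2><a id="{html.escape(anchor_id, quote=True)}"></a>'
                    f'{html.escape(heading)}</h2>\n{body_html}')
        pages.setdefault(rel, []).append(fragment)
        index_rows.append((heading, f"{quote(rel)}#{quote(anchor_id)}"))

    # One content file per folder in the hierarchy.
    for rel, fragments in pages.items():
        target = pub_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(PAGE.format(body="\n".join(fragments)),
                          encoding="utf-8")

    # One separate index file per publication, linking to the anchors.
    items = "\n".join(
        f'<li><a href="{html.escape(href, quote=True)}">'
        f'{html.escape(term)}</a></li>'
        for term, href in sorted(index_rows))
    index_path = pub_dir / "index.html"
    index_path.parent.mkdir(parents=True, exist_ok=True)
    index_path.write_text(PAGE.format(body=f"<ul>\n{items}\n</ul>"),
                          encoding="utf-8")
    return index_path
```

For example, calling `write_publication("out", "SampleMagazine", [(("1998", "03", "ch01"), "sec-0001", "Editorial", "<p>Hello</p>")])` would create `out/SampleMagazine/1998/03/ch01/content.html` and an `out/SampleMagazine/index.html` whose entry links to `1998/03/ch01/content.html#sec-0001`. The names in this call are placeholders, not real publication data.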