Question

For an application I'm creating, I'm looking to extract the text from open source ePubs, and manipulate the text. However, I don't want the table of contents. I'd just like from Chapter 1 or the Prologue/Preface on.

Take Tom Sawyer on Project Gutenberg for example: http://www.gutenberg.org/ebooks/74

ePubs are pretty much just a ZIP file with a bunch of HTML documents. So I open the first HTML file in that above link after unzipping the ePub, and I get the first chapter as well as a bunch of table of contents that I don't want.

That's where I'm curious. Is it possible, via some metadata that I'm missing, or Regex, to remove the table of contents/detect it?

To be clear, I'm talking programmatically.


Solution

In EPUB 2, the table of contents lives in its own file. To find it, start with container.xml, which is always located at META-INF/container.xml inside an ePub.

$ unzip -p /Users/mwu/Downloads/9781434705211.epub META-INF/container.xml
<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
    <rootfile full-path="OPS/package.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>

That tells you that the ePub package metadata is located in OPS/package.opf. The package metadata contains a manifest of all of the files in the ePub and a spine that lists the order in which they appear in the book. The toc attribute on the spine tag identifies the table of contents. The items listed in the spine are the files that make up the book itself; anything marked linear="no" is auxiliary content rather than primary content. According to the specification, the first item with linear="yes" (the default value) begins the main reading order. However, the main reading order can itself contain a table of contents as part of the book, as is the case here.

<manifest>
...
<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
...
</manifest>
<spine toc="ncx">
<itemref idref="my-html-cover" linear="no"/>
<itemref idref="title"/>
<itemref idref="f1"/>
<itemref idref="ded"/>
<itemref idref="contents"/>
<itemref idref="ack"/>
<itemref idref="f2"/>
<itemref idref="chapter1"/>
<itemref idref="chapter2"/>
<itemref idref="chapter3"/>
<itemref idref="chapter4"/>
<itemref idref="chapter5"/>
<itemref idref="chapter6"/>
<itemref idref="chapter7"/>
<itemref idref="b1"/>
<itemref idref="b2"/>
<itemref idref="b3"/>
<itemref idref="b4"/>
<itemref idref="copyright"/>
</spine>

This tells you that the table of contents is identified by the ncx item in the manifest, which references the toc.ncx file. Note that the path is relative to the package.opf file, so the file can be found at OPS/toc.ncx.
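The lookup described above can be sketched in Python with only the standard library. This is a minimal sketch, not a complete ePub reader: the function name read_spine is made up for illustration, and it assumes the package file uses the usual OPF namespace (http://www.idpf.org/2007/opf), that manifest hrefs are not percent-encoded, and that the ePub is not DRM-protected.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
# Assumption: the package file declares the standard OPF namespace.
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def read_spine(epub):
    """Return (toc_path, spine_paths) for an open zipfile.ZipFile.

    read_spine is a hypothetical helper name, not part of any library.
    """
    # container.xml always lives at this fixed path and points at the OPF file.
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    opf_path = rootfile.get("full-path")
    opf_dir = posixpath.dirname(opf_path)

    package = ET.fromstring(epub.read(opf_path))
    # Map manifest ids to archive paths; hrefs are relative to the OPF file.
    manifest = {
        item.get("id"): posixpath.normpath(posixpath.join(opf_dir, item.get("href")))
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }

    spine = package.find("opf:spine", OPF_NS)
    # The spine's toc attribute names the manifest item for the NCX file.
    toc_path = manifest.get(spine.get("toc"))
    # linear defaults to "yes"; skip auxiliary items marked linear="no".
    spine_paths = [
        manifest[ref.get("idref")]
        for ref in spine.findall("opf:itemref", OPF_NS)
        if ref.get("linear", "yes") != "no"
    ]
    return toc_path, spine_paths
```

Typical use would be to open the book with zipfile.ZipFile("book.epub") and pass the open archive in; the returned paths can then be handed to the archive's read method.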

The toc.ncx file contains a navMap tag which lists navPoint tags defining the different parts of the book and references to them.
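As a rough illustration of reading that navMap, the sketch below lists each navPoint's label and target. The helper name read_nav_points is hypothetical, and the code assumes the file uses the standard NCX namespace (http://www.daisy.org/z3986/2005/ncx/) with navMap as a direct child of the root element.

```python
import xml.etree.ElementTree as ET

# Assumption: the NCX file declares the standard DAISY NCX namespace.
NCX_URI = "http://www.daisy.org/z3986/2005/ncx/"
NCX_NS = {"ncx": NCX_URI}


def read_nav_points(ncx_xml):
    """Return [(label, src), ...] for every navPoint, in document order.

    read_nav_points is a hypothetical helper name. Nested navPoints are
    included, flattened into the same list.
    """
    root = ET.fromstring(ncx_xml)
    nav_map = root.find("ncx:navMap", NCX_NS)
    points = []
    for nav_point in nav_map.iter("{%s}navPoint" % NCX_URI):
        label = nav_point.findtext(
            "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
        )
        content = nav_point.find("ncx:content", NCX_NS)
        # src is relative to the NCX file and may carry a #fragment.
        src = content.get("src") if content is not None else None
        points.append((label.strip(), src))
    return points
```

The labels are publisher-supplied free text, so matching them against words such as "Contents" or "Preface" is a heuristic at best.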

Both the <spine> tag in the package.opf file and the toc.ncx file give you a listing of the parts of the book and the order they go in. Both also list contents.html, which I think is what you want to exclude. Nothing consistently identifies that in-spine table of contents, nor is it guaranteed to exist in a book at all. You can try scanning the spine tag, as well as the contents of each spine item file, for words that commonly identify a table of contents or for a series of links that reference other spine items in the book, but that will not catch every case.
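One way to try the link-scanning idea is sketched below: count how many distinct other spine files a document links to and flag it when that count reaches a threshold. Everything here is an assumption made for illustration, including the name looks_like_toc and the default threshold of 3, and it only classifies whole files, so it cannot split a file that mixes a table of contents with chapter text.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def looks_like_toc(html_text, own_path, spine_paths, min_targets=3):
    """Guess whether a spine file is an in-book table of contents.

    Hypothetical heuristic: True when the file links to at least
    min_targets distinct *other* spine files. own_path and spine_paths
    are archive paths such as "OPS/contents.html".
    """
    collector = _LinkCollector()
    collector.feed(html_text)
    base = posixpath.dirname(own_path)
    targets = set()
    for href in collector.hrefs:
        # Drop the #fragment; a pure fragment link stays in the same file.
        path = unquote(urldefrag(href).url)
        if not path:
            continue
        full = posixpath.normpath(posixpath.join(base, path))
        if full != own_path and full in spine_paths:
            targets.add(full)
    return len(targets) >= min_targets
```

The threshold is a trade-off: too low and a chapter with a few cross-references is flagged, too high and a short book's table of contents slips through.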

Generally, files like that are considered part of the book and removing them is considered incorrect (accessibility is one of the bigger reasons why).

Also, note that the ePub 2 file specifications can be found at http://idpf.org/epub/201. The ePub 3 specifications are at http://idpf.org/epub/30

OTHER TIPS

Well, I suppose you can try something like pulling out PREFACE and everything after it:

~.*\KPREFACE\n(.*)$~ms

This expression matches anything up to PREFACE and then forgets it. Then it matches PREFACE followed by a newline and anything after it all the way up to the end.

I get the feeling, though, that you may want the stuff before the table of contents as well. In that case, you can do something like this to grab the parts before and after the match:

~(.*)(?:CONTENTS\n.*?\n{3,})(.*)~ms

This would capture everything before the CONTENTS and store it into \1. Everything after it would be stored in \2.

In PHP, I'd use preg_replace to put the parts before and after the table of contents together.

<?php

$string = preg_replace('~(.*)(?:CONTENTS\n.*?\n{3,})(.*)~ms', '$1$2', $string);
print $string;


While I would personally not recommend a string-based approach whenever a DOM-based approach is possible, I don't see a DOM-based approach being possible in this case.

I was able to achieve the desired result in 2 lines of JavaScript code which you can test within your browser console.

var dbody = document.body, start = dbody.innerHTML.indexOf("PREFACE");
if (start !== -1) dbody.innerHTML = "<h2>" + dbody.innerHTML.substring(start); // leave the page alone if PREFACE is absent

This code should remove everything within the document body before the PREFACE.

How about using sed:

sed '2,/PREFACE/d' fileName > newFile

or, if you want to keep the "PREFACE" line intact:

sed '2,/PREFACE/{/PREFACE/!d;}' fileName > newFile

or, even better, delete only from CONTENTS up to (but not including) PREFACE:

sed '/CONTENTS/,/PREFACE/{/PREFACE/!d;}' fileName > newFile

Licensed under: CC-BY-SA with attribution