Question

I am using Ubuntu 12.04, Python 2.7

My code for getting the contents from a given URL:

import urllib

def get_page(url):
    '''Gets the contents of a page from a given URL'''
    try:
        f = urllib.urlopen(url)
        page = f.read()
        f.close()
        return page
    except:
        # On any failure (bad URL, network error), return an empty page
        return ""

To filter the content of a page provided by get_page(url):

import re

def filterContents(content):
    '''Filters the content from a page'''
    filteredContent = ''
    regex = re.compile('(?<!script)[>](?![\s\#\'-<]).+?[<]')
    for words in regex.findall(content):
        word_list = split_string(words, """ ,"!-.()<>[]{};:?!-=/_`&""")
        for word in word_list:
            filteredContent = filteredContent + word
    return filteredContent

def split_string(source, splitlist):
    return ''.join([w if w not in splitlist else ' ' for w in source])
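For reference, split_string() just replaces every character that appears in splitlist with a space, so filterContents() accumulates the cleaned-up text it finds between tags:

>>> split_string('a,b.c', ',.')
'a b c'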

How do I index the filteredContent in Xapian so that when I query, I get back the URLs the query was present in?


Solution

I'm not completely clear what your filterContents() and split_string() are actually trying to do (throwing away some HTML tag contents and then word splitting), so let me talk through a similar problem that doesn't have that complexity folded into it.

Let's assume we have a function strip_tags() which returns just the textual content of an HTML document (a minimal sketch follows the list below), and your get_page() function. We want to build up a Xapian database where

  • each document represents the resource representation pulled from a particular URL
  • the "words" in that representation (having been passed through strip_tags()) become searchable terms that index those documents
  • each document carries, as its document data, the URL it was pulled from.
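For completeness, here is one possible, deliberately minimal strip_tags() built on Python 2's HTMLParser. It is not part of the original answer, so treat it as a sketch and swap in whatever HTML-to-text step you prefer (lxml, BeautifulSoup, etc.):

from HTMLParser import HTMLParser

class _TextExtractor(HTMLParser):
    '''Collects text nodes, skipping the contents of <script> and <style>.'''
    def __init__(self):
        HTMLParser.__init__(self)
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

def strip_tags(html):
    '''Returns just the textual content of an HTML document.'''
    parser = _TextExtractor()
    parser.feed(html)
    return ' '.join(parser.chunks)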

So you could index as follows:

import xapian
def index_url(database, url):
    text = strip_tags(get_page(url))
    doc = xapian.Document()

    # TermGenerator will split text into words
    # and then (because we set a stemmer) stem them
    # into terms and add them to the document
    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))
    termgenerator.set_document(doc)
    termgenerator.index_text(text)

    # We want to be able to get at the URL easily
    doc.set_data(url)
    # And we want to ensure each URL only ends up in
    # the database once. Note that if your URLs are long
    # then this won't work; consult the FAQ on unique IDs
    # for more: http://trac.xapian.org/wiki/FAQ/UniqueIds
    idterm = 'Q' + url
    doc.add_boolean_term(idterm)
    database.replace_document(idterm, doc)

# then index an example URL
db = xapian.WritableDatabase("exampledb", xapian.DB_CREATE_OR_OPEN)

index_url(db, "https://stackoverflow.com/")
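As the comment in index_url() notes, Xapian terms have a length limit, so very long URLs can't be used directly as the 'Q' ID term. The usual workaround described in the linked FAQ is to hash the URL first; a hypothetical variant of the ID-term lines (not from the original answer):

import hashlib

# Hash the URL so the unique ID term stays within Xapian's term length limit
idterm = 'Q' + hashlib.md5(url).hexdigest()
doc.add_boolean_term(idterm)
database.replace_document(idterm, doc)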

Searching is then simple, although it can obviously get more sophisticated if needed:

qp = xapian.QueryParser()
qp.set_stemmer(xapian.Stem("en"))
qp.set_stemming_strategy(qp.STEM_SOME)
query = qp.parse_query('question and answer')
enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print match.document.get_data()

which will display 'https://stackoverflow.com/', since 'question and answer' is on the homepage when you aren't logged in.
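One practical note, not from the original answer: if searching happens in a different script or process from indexing, open the same database read-only rather than as a WritableDatabase:

import xapian

# Open the database without write access, just for searching
db = xapian.Database("exampledb")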

I'd recommend checking out the Xapian getting started guide both for concepts and code.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow