Question

I'm working on an academic project (a search engine). The main functions of this search engine are:

1. Crawling
2. Storing
3. Indexing
4. Page ranking

All the sites that my search engine will crawl are available locally, which means it's an intranet search engine.

After the crawler stores the files it finds, these files need to be served quickly for caching purposes.

So I wonder: what is the fastest way to store and retrieve these files?

The first idea that came up was to use FTP or SSH, but these are connection-based protocols: the time to connect, locate the file, and fetch it is too long.

I've already read Google's "Anatomy of a Search Engine" paper and saw that they use a data repository. I'd like to do the same, but I don't know how.

NOTES: I'm using Linux/Debian, and the search engine back-end is coded in C/C++. Help!

Solution

Storing individual files is quite easy: `wget -r http://www.example.com` will store a local copy of example.com's entire (crawlable) content.
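If you do end up storing the page contents yourself, one way to serve them quickly afterwards, loosely in the spirit of the data repository the question mentions, is to append every page to a single file with a length prefix and keep an in-memory table of offsets, so retrieval is one seek and one read rather than a new FTP/SSH connection. Here is a minimal C++ sketch under that assumption; the `Repository` class, the record layout, and the file name `repository.dat` are illustrative, not the format from Google's paper:

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

// Illustrative append-only repository: each record is a 64-bit length
// followed by the raw page bytes. An in-memory table maps a document id
// to the record's offset, so a cached page is retrieved with one seek
// and one read instead of a fresh connection.
class Repository {
public:
    explicit Repository(const std::string& path)
        : path_(path), out_(path, std::ios::binary) {}

    // Append a page and remember where its record starts.
    void store(int doc_id, const std::string& content) {
        offsets_[doc_id] = out_.tellp();
        std::uint64_t len = content.size();
        out_.write(reinterpret_cast<const char*>(&len), sizeof(len));
        out_.write(content.data(), static_cast<std::streamsize>(len));
        out_.flush();  // make the record visible to readers immediately
    }

    // Fetch a page by document id with a single seek + read.
    std::string retrieve(int doc_id) const {
        std::ifstream in(path_, std::ios::binary);
        in.seekg(offsets_.at(doc_id));
        std::uint64_t len = 0;
        in.read(reinterpret_cast<char*>(&len), sizeof(len));
        std::string content(len, '\0');
        in.read(&content[0], static_cast<std::streamsize>(len));
        return content;
    }

private:
    std::string path_;
    std::ofstream out_;
    std::unordered_map<int, std::streampos> offsets_;
};

int main() {
    Repository repo("repository.dat");
    repo.store(1, "<html><body>an intranet page</body></html>");
    std::cout << repo.retrieve(1) << "\n";
}
```

In a real system you would also persist the offset table (or rebuild it by scanning the file at startup) and probably compress each record, but the basic store/retrieve path stays this simple.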

Of course, beware of generated pages, where the content is different depending on when (or from where) you access the page.

Another thing to consider is that you may not really want to store all the pages yourself, but just forward users to the site that actually contains them. That way, you only need to store a reference to which page contains which words, not the entire page. Since many pages contain a lot of repeated content, you only really need to store each unique word in your database along with a list of the pages that contain it. If you also filter out words that occur on nearly every page (such as "if", "and", "it", "to", "do", etc.), you can further reduce the amount of data you need to store. Count the occurrences of each word on each page, then compare pages to find which words appear almost everywhere and are therefore meaningless to search on.
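A minimal sketch of that idea, assuming an in-memory inverted index keyed by word; the `InvertedIndex` class name, the stop-word list, and the whitespace tokenizer are illustrative choices, not part of the original answer:

```cpp
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <unordered_map>

// Illustrative in-memory inverted index: word -> set of document ids.
class InvertedIndex {
public:
    void add_document(int doc_id, const std::string& text) {
        std::istringstream words(text);
        std::string word;
        while (words >> word) {
            // Skip words that occur on nearly every page.
            if (stop_words_.count(word) == 0)
                index_[word].insert(doc_id);
        }
    }

    // Ids of the documents that contain the word (empty set if none).
    std::set<int> lookup(const std::string& word) const {
        auto it = index_.find(word);
        return it == index_.end() ? std::set<int>{} : it->second;
    }

private:
    std::set<std::string> stop_words_{"if", "and", "it", "to", "do", "the", "a"};
    std::unordered_map<std::string, std::set<int>> index_;
};

int main() {
    InvertedIndex index;
    index.add_document(1, "intranet search engine crawling and indexing");
    index.add_document(2, "page ranking for the intranet");
    for (int id : index.lookup("intranet"))
        std::cout << "\"intranet\" found in document " << id << "\n";
}
```

A lookup then returns only the list of pages to forward the user to; the page text itself never has to be stored.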

OTHER TIPS

Well, if the program is going to run constantly during operation, you could just store the pages in RAM: grab a gigabyte of RAM and you'd be able to store a great many pages. This would be much faster than caching them to the hard disk.
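A minimal sketch of that idea, assuming the crawled pages fit in memory and are keyed by URL; the `PageCache` class and its `put`/`get` functions are illustrative names:

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative in-RAM page cache: URL -> page content, kept for the
// lifetime of the search-engine process so retrieval is a hash lookup.
class PageCache {
public:
    void put(const std::string& url, std::string content) {
        pages_[url] = std::move(content);
    }

    // Returns the cached page, or std::nullopt if it was never stored.
    std::optional<std::string> get(const std::string& url) const {
        auto it = pages_.find(url);
        if (it == pages_.end())
            return std::nullopt;
        return it->second;
    }

private:
    std::unordered_map<std::string, std::string> pages_;
};

int main() {
    PageCache cache;
    cache.put("http://intranet/home.html", "<html>home</html>");
    if (auto page = cache.get("http://intranet/home.html"))
        std::cout << *page << "\n";
}
```

The trade-off is that the cache disappears when the process stops, so this only works if the engine really does run continuously (or can re-crawl the intranet at startup).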

I gather from the question that the user is on a different machine from the search engine, and therefore from the cache as well. Perhaps I am overlooking something obvious here, but couldn't you just send them the HTML over the connection already established between the user and the search engine? Text is very light data-wise, after all, so it shouldn't be too much of a strain on the connection.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow