There are several parts to your project:
- A bot that crawls the data from the web and saves it in the database (assuming you plan to build your repository from the web). Google "web crawler" or "web scraper" for existing tools.
- A data extractor/cleanser that cleans the data and extracts the relevant information about each document. (This is important so that you can tag each document with the information that describes it.)
- Then there is the search engine part, which lets you retrieve relevant data from the repository. Try a vector similarity algorithm for that.
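
To give an idea of what the vector similarity part could look like, here is a minimal sketch using TF-IDF weights and cosine similarity with only the Python standard library. The function names (`tokenize`, `build_index`, `search`) and the sample documents are made up for illustration, and it assumes the crawler and cleanser have already produced plain-text documents. For a real repository you would probably use an existing library or search engine instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def to_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unknown to the index are dropped.
    tf = Counter(tokens)
    return {t: count * idf[t] for t, count in tf.items() if t in idf}


def build_index(docs):
    # Compute smoothed inverse document frequencies and one vector per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [to_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, idf, vectors, top_k=3):
    # Rank documents by similarity to the query; skip documents with no overlap.
    q = to_vector(tokenize(query), idf)
    scored = sorted(
        ((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True
    )
    return [(docs[i], score) for score, i in scored[:top_k] if score > 0.0]


# Hypothetical cleaned documents, standing in for the repository contents.
docs = [
    "Python web scraping tutorial",
    "Cooking pasta at home",
    "Building a search engine with vectors",
]
idf, vectors = build_index(docs)
results = search("search engine", docs, idf, vectors)
for text, score in results:
    print(round(score, 3), text)
```

Each document and the query become a weighted word vector, and the documents whose vectors point in the most similar direction to the query come back first. The tags produced by the extractor step could be added to the text of each document before indexing so they count toward the match.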