Question

I have about 5000 Word files online, and I need to search all of them for arbitrary keywords, for example "Human Resource".

So I wrote functions to read Word files, but I suspect that processing every file on each search will exhaust the server's memory.
Example code:

<?php 
function doc_to_text($input_file){ //for doc files 
    $stream_text = @file_get_contents($input_file); //read the whole file as binary, no handle left open
    if($stream_text === false){ //unreadable file: nothing to extract
        return "";
    }
    $stream_line = explode(chr(0x0D),$stream_text);
    $output_text = "";
    foreach($stream_line as $single_line){
        $line_pos = strpos($single_line, chr(0x00));
        if(($line_pos !== FALSE) || (strlen($single_line)==0)){
            $output_text .= "";
        }else{
            $output_text .= $single_line." ";
        }
    }
    $output_text = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $output_text);
    return $output_text;
}


function docx_to_text($input_file){ //for docx files
    $xml_filename = "word/document.xml"; //content file name
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
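            $xml_handle = new DOMDocument(); //loadXML() must be called on an instance in PHP 8
            //LIBXML_NOENT and LIBXML_XINCLUDE are omitted: they enable entity substitution and XInclude on untrusted input
            if($xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING)){
                $output_text = strip_tags($xml_handle->saveXML());
            }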
        }else{
            $output_text .="";
        }
        $zip_handle->close();
    }else{
        $output_text .="";
    }
    return $output_text;
}





?>

Then I will loop over the files, check each one for the keyword with stristr(), and print the file name whenever stristr() finds a match.

Is there a better solution?

Reference: stristr()


Solution

You need to build a structure called an inverted index, which maps each word (or, if you want, each phrase) to the documents that contain it. The Wikipedia page documents the process well, and it is really straightforward.
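As a rough illustration of the idea, here is a minimal in-memory sketch. The function names (tokenize, build_inverted_index, search_index) and the sample documents are hypothetical, and the input is assumed to be text already extracted by functions such as doc_to_text() and docx_to_text(). It matches documents that contain every query word, in any order, rather than an exact phrase.

```php
<?php
// Hypothetical helper: split text into lowercase alphanumeric word tokens.
function tokenize(string $text): array {
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Build the inverted index: word => list of document names containing it.
// $documents maps a document name to its already-extracted plain text.
function build_inverted_index(array $documents): array {
    $index = [];
    foreach ($documents as $name => $text) {
        foreach (array_unique(tokenize($text)) as $word) {
            $index[$word][] = $name;
        }
    }
    return $index;
}

// Return the documents that contain every word of the query.
function search_index(array $index, string $query): array {
    $result = null;
    foreach (tokenize($query) as $word) {
        $docs = $index[$word] ?? [];
        $result = $result === null
            ? $docs
            : array_values(array_intersect($result, $docs));
    }
    return $result ?? [];
}

// Illustrative data only.
$documents = [
    'a.docx' => 'Human Resource policy',
    'b.doc'  => 'Resource planning for servers',
];
$index = build_inverted_index($documents);
print_r(search_index($index, 'Human Resource')); // only a.docx contains both words
```

Each file is read once while the index is built, so a search no longer touches the Word files at all.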

Then you can store this structure in your database. This is done only once, in a preprocessing step, and the index is updated later when you add new .doc or .docx files.

When a user enters search words, you search not in the files but in your database, which is fast and makes use of the database's indexes.
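One possible way to do this is sketched below, assuming the pdo_sqlite extension is available. The postings table, its columns, and the index_file() and find_files() helpers are all hypothetical names chosen for illustration; any relational database with an index on the word column would serve the same purpose.

```php
<?php
// Assumption: pdo_sqlite is installed. An in-memory database keeps the sketch self-contained;
// a real setup would use a file or a database server.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// The primary key doubles as the index used for word lookups.
$db->exec('CREATE TABLE postings (word TEXT NOT NULL, file TEXT NOT NULL, PRIMARY KEY (word, file))');

// Hypothetical helper: distinct lowercase word tokens of a text.
function words_of(string $text): array {
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Preprocessing step: store one row per (word, file) pair.
function index_file(PDO $db, string $file, string $text): void {
    $insert = $db->prepare('INSERT OR IGNORE INTO postings (word, file) VALUES (?, ?)');
    foreach (words_of($text) as $word) {
        $insert->execute([$word, $file]);
    }
}

// Query step: files that contain every word of the query.
function find_files(PDO $db, string $query): array {
    $words = words_of($query);
    if ($words === []) {
        return [];
    }
    $marks = implode(',', array_fill(0, count($words), '?'));
    // The word count is an integer computed here, so formatting it with %d is safe.
    $sql = sprintf(
        'SELECT file FROM postings WHERE word IN (%s)
         GROUP BY file HAVING COUNT(DISTINCT word) = %d ORDER BY file',
        $marks,
        count($words)
    );
    $stmt = $db->prepare($sql);
    $stmt->execute($words);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Illustrative data only; real text would come from doc_to_text() / docx_to_text().
index_file($db, 'a.docx', 'Human Resource policy');
index_file($db, 'b.doc', 'Resource planning for servers');
print_r(find_files($db, 'Human Resource'));
```

When a new file is uploaded, calling index_file() for that file alone keeps the index current without reprocessing the other documents.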

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow