Question

Hi, I am feeding content to Zend_Search_Lucene. It can find words that appear before any special characters, but nothing after them is searchable.

For example:

    very well to the other job boards � one of the main things that has impressed is the variety of the applications, especially with regards to the background of the candidates" manoj � Head 

If I search for 'boards' I get a match, but if I search for 'one' or any other string after the unreadable characters, nothing is found.

How can I remove these characters and get plain text? I get these kinds of characters when converting .docx/PDF files to text.

Alternatively, let me know how to feed only plain text to Zend_Search_Lucene.

Please help.

Solution

You can use the following preg_replace() call to remove all non-ASCII (so-called special) characters from your string:

    $replaced = preg_replace('/[^\x00-\x7F]+/', '', $str);
    // produces this converted text:
    //    "very well to the other job boards  one of the main things that has impressed
    // is the variety of the applications, especially with regards to the background of the
    // candidates" manoj  Head"
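
One side effect of replacing with an empty string is that two words separated only by a special character get glued together. Below is a minimal sketch of a variant that substitutes a space instead and then collapses the leftover whitespace. The helper name toPlainAscii is made up for illustration; it is not part of PHP or Zend_Search_Lucene.

```php
<?php
// Hypothetical helper (illustration only): reduce a string to plain ASCII.
function toPlainAscii(string $str): string
{
    // Replace each run of non-ASCII bytes with one space so that
    // neighbouring words are not joined.
    $ascii = preg_replace('/[^\x00-\x7F]+/', ' ', $str);

    // Collapse runs of whitespace into a single space and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $ascii));
}

// Sample input containing a UTF-8 en dash (bytes E2 80 93).
$str = "job boards \xE2\x80\x93 one of the main things";
echo toPlainAscii($str), "\n"; // prints "job boards one of the main things"
```

Keep in mind that this also throws away accented letters, so it only suits text where plain ASCII is enough.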

OTHER TIPS

You may need to convert the string's character encoding to match the encoding of the HTML document that displays it.

For example, if your HTML document uses UTF-8, you could run the string through utf8_encode(). Note that this function assumes the input is ISO-8859-1 and is deprecated as of PHP 8.2. If you are not sure which character set the input uses, try mb_convert_encoding() with some of the more common source charsets.
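
Here is a rough sketch of the mb_convert_encoding() route. It assumes the mbstring extension is loaded and that text which is not valid UTF-8 came from Windows-1252, which is only a guess that is often right for text extracted from Office documents. The helper name toUtf8 and its default argument are hypothetical.

```php
<?php
// Hypothetical helper (illustration only): normalise a string to UTF-8.
// $assumedSource is a guess at the original encoding, not something detected.
function toUtf8(string $str, string $assumedSource = 'Windows-1252'): string
{
    // Leave the string alone if it is already valid UTF-8.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }

    // Otherwise reinterpret the bytes as the assumed source encoding.
    return mb_convert_encoding($str, 'UTF-8', $assumedSource);
}

// 0x96 is the en dash in Windows-1252; on its own it is not valid UTF-8.
$raw  = "job boards \x96 one of the main things";
$utf8 = toUtf8($raw);
echo bin2hex(substr($utf8, 11, 3)), "\n"; // prints "e28093", the UTF-8 en dash
```

If the converted text still shows odd characters, try a different value for $assumedSource, such as 'ISO-8859-1'.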

Licensed under: CC-BY-SA with attribution