Find indexOf a byte array within another byte array

Question 1

Java strings are composed of 16-bit chars, not of 8-bit bytes. A char can hold a byte, so you can always make your byte arrays into strings, and use indexOf: ASCII characters, control characters, and even zero characters will work fine.

Here is a demo:

byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
String bigStr = new String(big, StandardCharsets.UTF_8);
String smallStr = new String(small, StandardCharsets.UTF_8);
System.out.println(bigStr.indexOf(smallStr));

This prints 7.

However, considering that your large array could be up to 10,000 bytes, and the small array is only ten bytes, this solution may not be the most efficient, for two reasons:

It requires copying your big array into an array that is twice as large (same capacity, but with char instead of byte). This triples your memory requirements.
String search algorithm of Java is not the fastest one available. You may get sufficiently faster if you implement one of the advanced algorithms, for example, the Knuth–Morris–Pratt one. This could potentially bring the execution speed down by a factor of up to ten (the length of the small string), and will require additional memory that is proportional to the length of the small string, not the big string.

Question 2

The simpelst way would be to compare each element:

public int indexOf(byte[] outerArray, byte[] smallerArray) {
    for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
        boolean found = true;
        for(int j = 0; j < smallerArray.length; ++j) {
           if (outerArray[i+j] != smallerArray[j]) {
               found = false;
               break;
           }
        }
        if (found) return i;
     }
   return -1;  
}

Some tests:

@Test
public void testIndexOf() {
  byte[] outer = {1, 2, 3, 4};
  assertEquals(0, indexOf(outer, new byte[]{1, 2}));
  assertEquals(1, indexOf(outer, new byte[]{2, 3}));
  assertEquals(2, indexOf(outer, new byte[]{3, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}

As you updated your question: Java Strings are UTF-16 Strings, they do not care about the extended ASCII set, so you could use string.indexOf()

Question 3

Google's Guava provides a Bytes.indexOf(byte[] array, byte[] target).

Question 4

Is this what you are looking for?

public class KPM {
    /**
     * Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
     * @param data
     * @param start First index in data
     * @param stop Last index in data so that stop-start = length
     * @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
     * @return
     */
    public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
        if( data == null || pattern == null) return -1;

        int[] failure = computeFailure(pattern);

        int j = 0;

        for( int i = start; i < stop; i++) {
            while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
                j = failure[j - 1];
            }
            if (pattern[j] == '*' || pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j>0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}

Question 5

To save your time in testing:

http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence

gives you code that works if you make computeFailure() static:

public class KPM {
    /**
     * Search the data byte array for the first occurrence 
     * of the byte array pattern.
     */
    public static int indexOf(byte[] data, byte[] pattern) {
    int[] failure = computeFailure(pattern);

    int j = 0;

    for (int i = 0; i < data.length; i++) {
        while (j > 0 && pattern[j] != data[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == data[i]) { 
            j++; 
        }
        if (j == pattern.length) {
            return i - pattern.length + 1;
        }
    }
    return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
    int[] failure = new int[pattern.length];

    int j = 0;
    for (int i = 1; i < pattern.length; i++) {
        while (j>0 && pattern[j] != pattern[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == pattern[i]) {
            j++;
        }
        failure[i] = j;
    }

    return failure;
    }
}

Since it is always wise to test the code that you borrow, you may start with:

public class Test {
    public static void main(String[] args) {
        do_test1();
    }
    static void do_test1() {
      String[] ss = { "",
                    "\r\n\r\n",
                    "\n\n",
                    "\r\n\r\nthis is a test",
                    "this is a test\r\n\r\n",
                    "this is a test\r\n\r\nthis si a test",
                    "this is a test\r\n\r\nthis si a test\r\n\r\n",
                    "this is a test\n\r\nthis si a test",
                    "this is a test\r\nthis si a test\r\n\r\n",
                    "this is a test"
                };
      for (String s: ss) {
        System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
      }

    }
}

Question 6

Using the Knuth–Morris–Pratt algorithm is the most efficient way.

StreamSearcher.java is an implementation of it and is part of Twitter's elephant-bird project.

It is not recommended to include this library since it is rather sizable for using just a single class.

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher
{
    private byte[] pattern_;
    private int[] borders_;

    // An upper bound on pattern length for searching. Results are undefined for longer patterns.
    @SuppressWarnings("unused")
    public static final int MAX_PATTERN_LENGTH = 1024;

    StreamSearcher(byte[] pattern)
    {
        setPattern(pattern);
    }

    /**
     * Sets a new pattern for this StreamSearcher to use.
     *
     * @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
     */
    public void setPattern(byte[] pattern)
    {
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
    }

    /**
     * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
     * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
     * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
     * another reasonable default, i.e. leave the stream unchanged.
     *
     * @return bytes consumed if found, -1 otherwise.
     */
    long search(InputStream stream) throws IOException
    {
        long bytesRead = 0;

        int b;
        int j = 0;

        while ((b = stream.read()) != -1)
        {
            bytesRead++;

            while (j >= 0 && (byte) b != pattern_[j])
            {
                j = borders_[j];
            }
            // Move to the next character in the pattern.
            ++j;

            // If we've matched up to the full pattern length, we found it.  Return,
            // which will automatically save our position in the InputStream at the point immediately
            // following the pattern match.
            if (j == pattern_.length)
            {
                return bytesRead;
            }
        }

        // No dice, Note that the stream is now completely consumed.
        return -1;
    }

    /**
     * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
     * and aids in implementation of the Knuth-Moore-Pratt string search.
     * <p>
     * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    private void preProcess()
    {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length)
        {
            while (j >= 0 && pattern_[i] != pattern_[j])
            {
                j = borders_[j];
            }
            borders_[++i] = ++j;
        }
    }
}

Question 7

Copied almost identical from java.lang.String.

indexOf(char[],int,int,char[]int,int,int)

static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    byte first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first)
                ;
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                ;

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}

Question 8

package org.example;

import java.util.List;

import org.riversun.finbin.BinarySearcher;

public class Sample2 {

    public static void main(String[] args) throws Exception {

        BinarySearcher bs = new BinarySearcher();

        // UTF-8 without BOM
        byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");

        byte[] searchBytes = "world".getBytes("utf-8");

        List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);

        System.out.println("indexList=" + indexList);
    }
 }

so it results in

indexList=[6, 25]

So,u can find the index of byte[] in byte[]

Example here on Github at: https://github.com/riversun/finbin

Question 9

Several (or all?) of the examples posted here failed some Unit tests so I am posting my version along with the aforementioned tests over here. All of the Unit tests are BASED upon the requirement that Java's String.indexOf() always gives us the right answer!

// The Knuth, Morris, and Pratt string searching algorithm remembers information about
// the past matched characters instead of matching a character with a different pattern
// character over and over again. It can search for a pattern in O(n) time as it never
// re-compares a text symbol that has matched a pattern symbol. But, it does use a partial
// match table to analyze the pattern structure. Construction of a partial match table
// takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n).

public class KMPSearch {

    public static int indexOf(byte[] haystack, byte[] needle)
    {
        // needle is null or empty
        if (needle == null || needle.length == 0)
            return 0;

        // haystack is null, or haystack's length is less than that of needle
        if (haystack == null || needle.length > haystack.length)
            return -1;

        // pre construct failure array for needle pattern
        int[] failure = new int[needle.length];
        int n = needle.length;
        failure[0] = -1;
        for (int j = 1; j < n; j++)
        {
            int i = failure[j - 1];
            while ((needle[j] != needle[i + 1]) && i >= 0)
                i = failure[i];
            if (needle[j] == needle[i + 1])
                failure[j] = i + 1;
            else
                failure[j] = -1;
        }

        // find match
        int i = 0, j = 0;
        int haystackLen = haystack.length;
        int needleLen = needle.length;
        while (i < haystackLen && j < needleLen)
        {
            if (haystack[i] == needle[j])
            {
                i++;
                j++;
            }
            else if (j == 0)
                i++;
            else
                j = failure[j - 1] + 1;
        }
        return ((j == needleLen) ? (i - needleLen) : -1);
    }
}



import java.util.Random;

class KMPSearchTest {
    private static Random random = new Random();
    private static String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    @Test
    public void testEmpty() {
        test("", "");
        test("", "ab");
    }

    @Test
    public void testOneChar() {
        test("a", "a");
        test("a", "b");
    }

    @Test
    public void testRepeat() {
        test("aaa", "aaaaa");
        test("aaa", "abaaba");
        test("abab", "abacababc");
        test("abab", "babacaba");
    }

    @Test
    public void testPartialRepeat() {
        test("aaacaaaaac", "aaacacaacaaacaaaacaaaaac");
        test("ababcababdabababcababdaba", "ababcababdabababcababdaba");
    }

    @Test
    public void testRandomly() {
        for (int i = 0; i < 1000; i++) {
            String pattern = randomPattern();
            for (int j = 0; j < 100; j++)
                test(pattern, randomText(pattern));
        }
    }

    /* Helper functions */
    private static String randomPattern() {
        StringBuilder sb = new StringBuilder();
        int steps = random.nextInt(10) + 1;
        for (int i = 0; i < steps; i++) {
            if (sb.length() == 0 || random.nextBoolean()) {  // Add literal
                int len = random.nextInt(5) + 1;
                for (int j = 0; j < len; j++)
                    sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
            } else {  // Repeat prefix
                int len = random.nextInt(sb.length()) + 1;
                int reps = random.nextInt(3) + 1;
                if (sb.length() + len * reps > 1000)
                    break;
                for (int j = 0; j < reps; j++)
                    sb.append(sb.substring(0, len));
            }
        }
        return sb.toString();
    }

    private static String randomText(String pattern) {
        StringBuilder sb = new StringBuilder();
        int steps = random.nextInt(100);
        for (int i = 0; i < steps && sb.length() < 10000; i++) {
            if (random.nextDouble() < 0.7) {  // Add prefix of pattern
                int len = random.nextInt(pattern.length()) + 1;
                sb.append(pattern.substring(0, len));
            } else {  // Add literal
                int len = random.nextInt(30) + 1;
                for (int j = 0; j < len; j++)
                    sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
            }
        }
        return sb.toString();
    }

    private static void test(String pattern, String text) {
        try {
            assertEquals(text.indexOf(pattern), KMPSearch.indexOf(text.getBytes(), pattern.getBytes()));
        } catch (AssertionError e) {
            System.out.println("FAILED -> Unable to find '" + pattern + "' in '" + text + "'");
        }
    }
}

Question 10

For a little HTTP server I am currently working on, I came up with the following code to find boundaries in a multipart/form-data request. Hoped to find a better solution here, but i guess I'll stick with it. I think it is as efficent as it can get (quite fast and uses not much ram). It uses the input bytes as ring buffer, reads the next byte as soon as it does not match the boundary and writes the data after the first full cycle into the output stream. Of course can it be changed for byte arrays instead of streams, as asked in the question.

    private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)
    {
        try
        {
            String n = "--"+boundary;
            byte[] bc = n.getBytes("UTF-8");
            int s = bc.length;
            byte[] b = new byte[s];
            int p = 0;
            long l = 0;
            int c;
            boolean r;
            while ((c = is.read()) != -1)
            {
                b[p] = (byte) c;
                l += 1;
                p = (int) (l % s);
                if (l>p)
                {
                    r = true;
                    for (int i = 0; i < s; i++)
                    {
                        if (b[(p + i) % s] != bc[i])
                        {
                            r = false;
                            break;
                        }
                    }
                    if (r)
                        break;
                    os.write(b[p]);
                }
            }
            os.flush();
            return true;
        } catch(IOException e) {e.printStackTrace();}
        return false;
    }