Why do .index files exist in the kafka-log directory?

Question 1

Every segment of a log (the *.log files) has its corresponding index (the *.index files) with the same name, since the name represents the base offset.

For context, the log file contains the actual messages in a structured message format. For each message in this file, the first 64 bits hold the incremented offset. Looking up a message with a specific offset in this file is expensive, since log files can grow to gigabytes. And to be able to produce messages, the broker has to do this kind of lookup to determine the latest offset and assign offsets to incoming messages correctly.

This is why there is an index file. Each entry in the index file consists of only 2 fields, each 32 bits long:

4 Bytes: Relative Offset
4 Bytes: Physical Position

As described before, the file name represents the base offset. In contrast to the log file, where the offset is incremented for each message, the entries in the index file contain offsets relative to the base offset. The second field holds the physical position in the log file of the related message (the one at base offset + relative offset). Because the entries are small, fixed-size, and sorted by offset, the broker can find the position for an offset with a fast search of the index instead of scanning the whole log file.
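The entry layout above can be sketched in a few lines of Python. This is only an illustration, not Kafka's own code: the helper names `encode_entry` and `decode_entries` are made up for this example, and big-endian byte order is an assumption (it is the default byte order of Java's ByteBuffer).

```python
import struct

# One index entry: 4-byte relative offset followed by 4-byte physical position.
# Big-endian byte order is an assumption of this sketch.
ENTRY = struct.Struct(">ii")

def encode_entry(base_offset, offset, position):
    """Pack one entry, storing the offset relative to the segment's base offset."""
    return ENTRY.pack(offset - base_offset, position)

def decode_entries(base_offset, raw):
    """Unpack raw index bytes into (absolute_offset, position) pairs."""
    return [(base_offset + relative, position)
            for relative, position in ENTRY.iter_unpack(raw)]

base = 5120942793  # base offset, taken from the segment file name
raw = encode_entry(base, base + 10, 4100) + encode_entry(base, base + 25, 8300)
print(len(raw))                   # two entries of 8 bytes each
print(decode_entries(base, raw))
```

Storing the offset relative to the base is what lets a 64-bit offset fit into a 4-byte field.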

Finally, not every message in a log has a corresponding entry in the index. The configuration parameter index.interval.bytes, which is 4096 bytes by default, sets the index interval: how frequently (after how many bytes) an index entry is added.
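To make the effect of the interval concrete, here is a simplified simulation in Python. It is a sketch under assumptions: the function `build_sparse_index` is hypothetical, and it treats every message individually, whereas the real broker makes this decision as data is appended to the log.

```python
INDEX_INTERVAL_BYTES = 4096  # default value of index.interval.bytes

def build_sparse_index(message_sizes, interval=INDEX_INTERVAL_BYTES):
    """Return (relative_offset, position) entries for messages of the given sizes.

    An entry is added only once more than `interval` bytes have been
    written since the previous entry, so most messages get no entry.
    """
    entries = []
    position = 0                # byte position of the current message in the log
    bytes_since_last_entry = 0
    for relative_offset, size in enumerate(message_sizes):
        if bytes_since_last_entry > interval:
            entries.append((relative_offset, position))
            bytes_since_last_entry = 0
        position += size
        bytes_since_last_entry += size
    return entries

# Twenty messages of 1000 bytes each: only every fifth message is indexed.
entries = build_sparse_index([1000] * 20)
print(entries)
```

With 1000-byte messages, five messages have to pass before more than 4096 bytes have accumulated, so the index stays much smaller than the log.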

Regarding the size of the .index file: the configuration parameter segment.index.bytes, which is 10MB by default, sets the size of this file. This space is preallocated and shrinks only after the log rolls.
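As a rough back-of-the-envelope check, the defaults mentioned in this answer give the following numbers. The variable names are made up for this sketch, and it assumes 10MB means 10 * 1024 * 1024 bytes and that index entries are at least index.interval.bytes apart.

```python
ENTRY_SIZE = 8                          # 4-byte relative offset + 4-byte position
SEGMENT_INDEX_BYTES = 10 * 1024 * 1024  # segment.index.bytes default (10MB)
INDEX_INTERVAL_BYTES = 4096             # index.interval.bytes default

# How many entries fit into the preallocated index file.
max_entries = SEGMENT_INDEX_BYTES // ENTRY_SIZE

# Minimum amount of log data a completely full index would describe,
# given that entries are at least one interval apart.
min_log_bytes_covered = max_entries * INDEX_INTERVAL_BYTES

print(max_entries)                      # 1310720
print(min_log_bytes_covered / 2**30)    # 5.0 (GiB)
```

So the preallocated file usually has far more room than a segment needs, which is why it is trimmed once the segment is rolled.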

Question 2

Every log file has a corresponding index file. The purpose of the index file is to translate logical message offsets to physical positions in the data file, as seen here.

EDIT:

From the doc

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log.

In Kafka, a topic partition cannot be split across multiple brokers. When Kafka needs to delete messages from a partition after the retention period is over, it has to scan through the partition's files. This operation would be very slow if there were a single large partition file. To avoid this, Kafka splits each partition into multiple segments.

A new segment file is created when the current one (called the active segment) has reached its size limit (controlled by the log.segment.bytes property). For each segment there is a log file and an index file. Every segment starts at its base offset, which is greater than any offset in the previous segments.
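Because the base offsets of the segments are increasing, finding the segment that holds a given offset is a search over the base offsets. The following Python sketch shows the idea; `find_segment` and `segment_file_names` are hypothetical helpers, and the 20-digit zero padding is taken from file names such as 00000000005120942793.log.

```python
from bisect import bisect_right

def segment_file_names(base_offset):
    """Build the .log and .index file names for a segment's base offset."""
    name = f"{base_offset:020d}"  # zero-padded to 20 digits
    return name + ".log", name + ".index"

def find_segment(base_offsets, target_offset):
    """Return the base offset of the segment that would contain target_offset.

    base_offsets must be sorted in ascending order.
    """
    i = bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise ValueError("offset is older than the first segment")
    return base_offsets[i]

bases = [0, 1000, 2500]            # made-up base offsets of three segments
base = find_segment(bases, 1700)   # offset 1700 lives in the segment starting at 1000
print(segment_file_names(base))
```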

The log file, e.g. 00000000005120942793.log, is where Kafka actually stores the messages along with all their details such as the offset (once a message is pushed into Kafka, it is given a unique sequential number called the offset), timestamp, compression, payload, etc.

The index files, e.g. 00000000005120942793.index, map message offsets to positions in the log. Each entry consists of two parts of 4 bytes each. The first part stores the message offset (relative to the base offset) and the latter stores the position of the message in the log file. Index files are memory mapped, and Kafka uses a binary search to locate the nearest offset less than or equal to the target offset.
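The lookup described above can be sketched as a plain binary search in Python. This is an illustration only: `lookup_position` is a hypothetical function, the entries are held in a list instead of a memory-mapped file, and returning position 0 when no entry qualifies is an assumption of this sketch.

```python
def lookup_position(entries, base_offset, target_offset):
    """Find the log position of the last index entry at or before target_offset.

    entries is a list of (relative_offset, position) pairs sorted by offset.
    Returns 0 (start of the log file) if every entry is past the target.
    """
    relative_target = target_offset - base_offset
    lo, hi = 0, len(entries) - 1
    position = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        relative, pos = entries[mid]
        if relative <= relative_target:
            position = pos   # candidate; keep looking for a later one
            lo = mid + 1
        else:
            hi = mid - 1
    return position

index = [(5, 5000), (10, 10000), (15, 15000)]  # made-up sparse index
print(lookup_position(index, base_offset=100, target_offset=112))
```

The returned position is only a starting point. Since the index is sparse, the reader still scans forward in the log file from there until it reaches the exact offset.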

Source:
http://kafka.apache.org/documentation.html#brokerconfigs
http://supergsego.com/apache/kafka/0.8.2.0/scaladoc/kafka/log/OffsetIndex.html
https://thehoard.blog/how-kafkas-storage-internals-work-3a29b02e026