Question

I have a mid-size collection of records -- about 20 million -- that I need to load into MySQL for use in data analysis. These happen to be records of people visiting places. They are uniquely identified by three data items:

  • place - a unique INT
  • person - a character string, sometimes numeric and sometimes alphanumeric, e.g. AB12345678
  • visit - similar to person

I don't have any control over the person and visit field contents, as these are provided by the different places, and each place does their own thing.

I can find all the records for a person by matching both place and person, and an individual record by matching all three.

I can make this work fine in MySQL by creating a table like this:

CREATE TABLE ENCOUNTER (
  PLACE int(11) NOT NULL,
  PERSON varchar(255) NOT NULL,
  VISIT varchar(255) NOT NULL,
  ARRIVAL_TIME datetime DEFAULT NULL,
  DEPARTURE_TIME datetime DEFAULT NULL,
  EVENT varchar(255) NOT NULL,
  PRIMARY KEY (PLACE,PERSON,VISIT)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;

I went with MyISAM because I don't need ACID transactional integrity on this table; it's used for statistical reporting so if it's one or two rows stale it's no problem.

The table will often get hit with UPDATEs that simply change one of the fields, like DEPARTURE_TIME. These UPDATEs will most likely be about twice as frequent as new row INSERTs. There won't be a need to update place, person, or visit identifiers.
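For concreteness, a typical departure update might look like this (the literal key values here are made up for illustration):

```sql
-- Set the departure time for one specific visit record.
-- The WHERE clause matches the full primary key, so this is
-- a single-row lookup on the (PLACE, PERSON, VISIT) index.
UPDATE ENCOUNTER
   SET DEPARTURE_TIME = NOW()
 WHERE PLACE  = 42
   AND PERSON = 'AB12345678'
   AND VISIT  = 'V0001';
```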

Here's some questions:

Would I be better off performance wise with a single index and key column concatenating the place/person/visit information?

How much of a hit do I take for varchar indexes? Is it worth trying to constrain them to a fixed-length field?

Any other suggestions from the collected wisdom?

Thanks.


Solution

Your indexes are correct. You're not going to be able to do much better than that.

This is a perfect, non-obvious opportunity to use partitions. I have a feeling that all of your analysis is going to be based on place. If so, partition by key on the place column (KEY partitioning hashes the column internally), like so:

ALTER TABLE encounter PARTITION BY KEY(place) PARTITIONS 12;

This will make your queries much faster, since MySQL knows it can skip 11/12ths of the rows when doing an analysis for one place.
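You can verify that a per-place query really prunes down to a single partition (a sketch; `EXPLAIN PARTITIONS` is the MySQL 5.6 syntax, while in 5.7+ plain `EXPLAIN` includes a partitions column):

```sql
-- The "partitions" column of the plan should list only one
-- partition, because the WHERE clause fixes the partitioning key.
EXPLAIN PARTITIONS
SELECT COUNT(*)
  FROM ENCOUNTER
 WHERE PLACE = 42;
```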

OTHER TIPS

I can find all the records for a person by matching both place and person, and an individual record by matching all three.

If you are going to search for all places a person visited, you'll need to make an additional index on (person, place).
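A sketch of that secondary index (the index name is arbitrary):

```sql
-- Supports "all places this person visited". The primary key
-- (PLACE, PERSON, VISIT) cannot serve a lookup that starts
-- with PERSON alone, since PERSON is not its leftmost column.
CREATE INDEX idx_person_place ON ENCOUNTER (PERSON, PLACE);
```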

How much of a hit do I take for varchar indexes? Is it worth trying to constrain them to a fixed-length field?

A key hit takes about the same time for INT and VARCHAR keys.

A key miss is more expensive for VARCHAR fields, since more bytes may need to be compared before the mismatch is detected.
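If your identifiers have a known, modest maximum length, one thing to experiment with is shrinking or fixing the column widths (the lengths below are assumptions; pick them from your actual data):

```sql
-- Shrinking VARCHAR(255) to a realistic maximum, or switching
-- to fixed-length CHAR, keeps index entries small and key
-- comparisons cheap. Only safe if no identifier exceeds 20 chars.
ALTER TABLE ENCOUNTER
  MODIFY PERSON CHAR(20) NOT NULL,
  MODIFY VISIT  CHAR(20) NOT NULL;
```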

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow