Question

Imagine we have a set of entities each of which has its state: free, busy or broken. The state is specified for a day, for example, today on 2011-05-17 an entity E1 is free and tomorrow on 2011-05-18 it is busy.

There is a need to store ~10^5 entities for 1000 days. Which is the best way to do so?

I am thinking about 2 options:

  • represent each day as a character "0", "1" or "2" and store for every entity a string of 1000 characters
  • store each day with entity's state in a row, i.e. 1000 rows for an entity

The most important query for such data is: given start date and end date identify which entities are free.

Performance is of higher priority than storage.

All suggestions and comments are welcome.

Was it helpful?

Solution

Create a single table to hold your data. Create the table with an ID, Date, Entity name and eight boolean fields. SQL Server 2008 gave me the code below for the table:

CREATE TABLE [dbo].[EntityAvailability](
[EA_Id] [int] IDENTITY(1,1) NOT NULL,
[EA_Date] [date] NOT NULL,
[EA_Entity] [nchar](10) NOT NULL,
[EA_IsAvailable] [bit] NOT NULL,
[EA_IsUnAvailable] [bit] NOT NULL,
[EA_IsBroken] [bit] NOT NULL,
[EA_IsLost] [bit] NOT NULL,
[EA_IsSpare1] [bit] NOT NULL,
[EA_IsSpare2] [bit] NOT NULL,
[EA_IsSpare3] [bit] NOT NULL,
[EA_IsActive] [bit] NOT NULL,
 CONSTRAINT [IX_EntityAvailability_Id] UNIQUE NONCLUSTERED 
(
    [EA_Id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
END
GO

IF NOT EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[EntityAvailability]') AND name = N'IXC_EntityAvailability_Date')
CREATE CLUSTERED INDEX [IXC_EntityAvailability_Date] ON [dbo].[EntityAvailability] 
(
    [EA_Date] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

The clustered index on date will perform best for your range searches. Never allow searches without a date range and there will be no need for any index other than the clustered index. The boolean fields allows eight situations using only a single byte. The row size for this table is 35 bytes. 230 rows will fit on a page. You stated you had need to store 10^5 entities for 1000 days which is 100 million. One hundred million rows will occupy 434,782 8K pages or around 3 gig.

Install the table on an SSD and you are set to go.

OTHER TIPS

The best way is to try the simpler and more flexible option first (that is, store each day in its own row) and only devise a sophisticated alternative method if the performance is unsatisfactory. Avoid premature optimization.

10^8 rows isn't such a big deal for your average database on a commodity server nowadays. Put an index on the date, and I would bet that range queries ("given start date and end date...") will work just fine.

The reasons I claim that this is both simpler and more flexible than the idea of storing a string of 1000 characters are:

  • You'll have to process this in code, and that code would not be as straightforward to understand as code that queries DB records that contain date and status.
  • Depending on the database engine, 1000 character strings may be blobs that are stored outside of the record. That makes them less efficient.
  • What happens if you suddenly need 2,000 days instead of 1,000? Start updating all the rows and the code that processes them? That's much more work than just changing your query.
  • What happens when you're next asked to store some additional information per daily record, or need to change the granularity (move from days to hours for example)?

Depending on whether entities are more often free or not just store the dates when an entity is free or not.

Assuming you store the dates when the entity is not free then the search is where start date <= date and end_date >= date and any row matching that means that the entity is not free for that period

It sounds like you might be on the right track and I would suggest because of the sheer number of records and the emphasis on performance that you keep the schema as denormalized as possible. The fewer joins you need to make to determine the free or busy entities the better.

I would broadly go for a Kimball Star Schema (http://en.wikipedia.org/wiki/Star_schema) type structure with three tables (initially)

  • FactEntity (FK kStatus, kDate)
  • DimStatus (PK kStatus)
  • DimDate (PK kDate)

This can be loaded quite simply (Dims first followed by Fact(s)), and queried also very simply. Performance can be optimised by suitable indexing.

A big advantage of this design is that it is very extensible; if you want to increase the date range, or increase the number of valid states it is trivial to extend.

Other dimensions could be sensibly added e.g. DimEntity which could have richer information that gives categorical information that migth be interesting to slice/dice your Entities.

The DimDate is normally enriched by adding DayNo, MonthNo, YearNo, DayOfWeek, WeekendFlag, WeekdayFlag, PublicHolidayFlag. These allow some very interesting analyses to be performed.

As @Elad asks, what would ahppen if you added Time based information, then this also can be inforporated by a DimTime dimension having one record per hour or minute.

Apologies for my naming, as I don't have a good understanding of your data. Given more time I could come up with some better ones!

enter image description here

To get free entities on a date, you may try:

select
      e.EntityName
    , s.StateName
    , x.ValidFrom
from EntityState as x
join Entity      as e on e.EntityId = x.EntityId
join State       as s on s.StateID  = x.StateID
where StateName = 'free'
  and x.ValidFrom = ( select max(z.ValidFrom)
                      from EntityState as z
                      where z.EntityID   = x.EntityID
                        and z.ValidFrom <= your_date_here )
;

Note: Make sure you store only state changes in EntityState table.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top