Question

I have a database that holds real estate MLS (Multiple Listing Service) data. Currently, a single table holds all the listing attributes (price, address, sqft, etc.). There are several property types (residential, commercial, rental, income, land, etc.). They share the majority of the attributes, but each type has a few that are unique to it.

My concern is that there are more than 250 shared attributes, which seems like too many fields for a single table. I could break them out into an EAV (Entity-Attribute-Value) format, but I've read many bad things about that, and it would make queries a real pain, since any of the 250 fields could be searched on. If I went that route, I'd literally have to pull all the data out of the EAV table grouped by listing ID, merge it on the application side, and then run my query against the in-memory object collection. That does not seem very efficient either.

I am looking for some ideas or recommendations on which way to proceed. Perhaps the 250+ field table is the only way to proceed.

Just as a note, I'm using SQL Server 2012, .NET 4.5 with Entity Framework 5, and C#. Data is passed to an ASP.NET web application via a WCF service.

Thanks in advance.


Solution

Let's consider the pros and cons of the alternatives:

One table for all listings + attributes:

  1. Very wide table - hard to view the model and schema definitions and the table data
  2. One query with no joins required to retrieve all data on listing(s)
  3. Requires schema + model change for each new attribute.
  4. Efficient if you always load all the attributes and most items have values for most of the attributes.
  5. Example LINQ query according to attributes:
context.Listings.Where(l => l.PricePerMonthInUsd < 10e3 && l.SquareMeters >= 200)
    .ToList();


One table for all listings, one table for attribute types and one for values, keyed by listing ID + attribute ID (EAV):

  1. Listing table is narrow
  2. Efficient if data is very sparse (most attributes don't have values for most items)
  3. Requires fetching all data from the values table - one additional query (or one join, although that wastes bandwidth, because the basic listing columns are repeated for every attribute value row)
  4. Does not require schema + model changes for new attributes
  5. If you want type safe access to attributes via code, you'll need custom code generation based on attribute types table
  6. Example LINQ query according to attributes:
var listingIds = context.AttributeValues.Where(v =>
                    v.AttributeTypeId == PricePerMonthInUsdId && v.Value < 10e3)
                .Select(v => v.ListingId)
                .Intersect(context.AttributeValues.Where(v =>
                    v.AttributeTypeId == SquareMetersId && v.Value >= 200)
                .Select(v => v.ListingId)).ToList();

or: (compare performance on actual DB)

var listingIds = context.AttributeValues.Where(v =>
                    v.AttributeTypeId == PricePerMonthInUsdId && v.Value < 10e3)
                .Select(v => v.ListingId).ToList();

listingIds = context.AttributeValues.Where(v =>
                listingIds.Contains(v.ListingId)
                && v.AttributeTypeId == SquareMetersId
                && v.Value >= 200)
            .Select(v => v.ListingId).ToList();

and then:

var listings = context.Listings.Where(l => listingIds.Contains(l.ListingId)).ToList();
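
To experiment with the intersect variant without a database, here is a minimal sketch that runs the same logic with LINQ to Objects instead of Entity Framework. The AttributeValue class, the EavSearch helper and the two attribute type IDs are assumptions made up for illustration; a real model would take the IDs from the attribute types table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical EAV row: one value of one attribute for one listing.
public class AttributeValue
{
    public int ListingId { get; set; }
    public int AttributeTypeId { get; set; }
    public double Value { get; set; }
}

public static class EavSearch
{
    // Hypothetical attribute type IDs (normally read from the attribute types table).
    public const int PricePerMonthInUsdId = 1;
    public const int SquareMetersId = 2;

    // Returns the IDs of listings that satisfy both conditions by intersecting
    // the ID sets matched by each single-attribute filter.
    public static List<int> FindListingIds(
        IEnumerable<AttributeValue> values, double maxPrice, double minArea)
    {
        var rows = values.ToList();
        var cheap = rows
            .Where(v => v.AttributeTypeId == PricePerMonthInUsdId && v.Value < maxPrice)
            .Select(v => v.ListingId);
        var large = rows
            .Where(v => v.AttributeTypeId == SquareMetersId && v.Value >= minArea)
            .Select(v => v.ListingId);
        return cheap.Intersect(large).OrderBy(id => id).ToList();
    }
}
```

Each extra search condition adds another filtered set to intersect, which is the main reason EAV queries get verbose as the number of searchable attributes grows.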


Compromise option - one table for all listings and one table per group of attributes including values (assuming you can divide attributes into groups):

  1. Multiple medium width tables
  2. Efficient if data is sparse per group (e.g. garden related attributes are all null for listings without gardens, so you don't add a row to the garden related table for them)
  3. Requires one query with multiple joins (bandwidth not wasted in join, since group tables are 1:0..1 with listing table, not 1:many)
  4. Requires schema + model changes for new attributes
  5. Makes viewing the schema/model simpler - if you can divide attributes into groups of 10, you'll have 25 tables with 11 columns each (the listing ID plus 10 attributes) instead of another 250 columns on the listing table
  6. LINQ query is somewhere between the above two examples.
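
As a rough illustration of what such an in-between query might look like, here is a sketch using LINQ to Objects with one core table and one group table. The class names, the GardenSquareMeters attribute and the GroupedSearch helper are hypothetical; only the idea of a garden-related group table comes from the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical core listing row with attributes that almost every listing has.
public class ListingCore
{
    public int ListingId { get; set; }
    public double PricePerMonthInUsd { get; set; }
}

// Hypothetical group table: a row exists only for listings that have a garden.
public class GardenAttributes
{
    public int ListingId { get; set; } // primary key, also a foreign key to the listing
    public double GardenSquareMeters { get; set; }
}

public static class GroupedSearch
{
    // Inner join: listings without a garden row drop out, which is the desired
    // behaviour when filtering on a garden attribute.
    public static List<int> FindWithGarden(
        IEnumerable<ListingCore> listings,
        IEnumerable<GardenAttributes> gardens,
        double maxPrice,
        double minGardenArea)
    {
        return (from l in listings
                join g in gardens on l.ListingId equals g.ListingId
                where l.PricePerMonthInUsd < maxPrice
                      && g.GardenSquareMeters >= minGardenArea
                orderby l.ListingId
                select l.ListingId).ToList();
    }
}
```

Because the group table is 1:0..1 with the listing table, the join yields at most one row per listing, so only the groups a query actually filters on need to be joined in.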


Weigh the pros and cons against your specific statistics (how sparse the data is) and your requirements and maintainability plan (e.g. how often are attribute types added or changed?), then decide.

OTHER TIPS

What I would probably do:

First, I would create a lookup table for the 250 fields, holding an ID and a FieldName for each, for example:

price   -> 1
address -> 2
sqft    -> 3

This table would also be hard-coded in my code as an enum and used in queries.

Then, in the main table, I would store two fields together: the ID of the field type, taken from the table above, and the value of that field. For example:

Line1: 122(map id), 1 (for price), 100 (the actual price)
Line2: 122(map id), 2 (for address), "where is it"
Line3: 122(map id), 3 (for sqft), 10 (sqft)

The issue here is that you may need at least two value columns, one for numbers and one for strings.

This is just a proposal of course.
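
A minimal sketch of how this proposal might look in C#, assuming one nullable numeric column and one string column per row. The FieldId enum mirrors the three example fields above; FieldValueRow, its property names and the FieldValueSearch helper are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Mirrors the field lookup table; the IDs follow the example above.
public enum FieldId
{
    Price = 1,
    Address = 2,
    Sqft = 3
}

// Hypothetical row of the main table: only one of the two value columns is set.
public class FieldValueRow
{
    public int ListingId { get; set; }
    public FieldId Field { get; set; }
    public decimal? NumberValue { get; set; } // used for numeric fields
    public string StringValue { get; set; }   // used for text fields
}

public static class FieldValueSearch
{
    // Returns the IDs of listings whose numeric field is at most the given maximum.
    public static List<int> ListingsWithNumberAtMost(
        IEnumerable<FieldValueRow> rows, FieldId field, decimal max)
    {
        return rows
            .Where(r => r.Field == field
                        && r.NumberValue.HasValue
                        && r.NumberValue.Value <= max)
            .Select(r => r.ListingId)
            .Distinct()
            .OrderBy(id => id)
            .ToList();
    }
}
```

Keeping the numeric values in their own typed column means range filters compare numbers rather than strings.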

I would create a listing table which contains only the shared attributes. This table would have listingId as the primary key. It would have a column that stores the listing type so you know if it's a residential listing, land listing, etc.

Then, for each of the subtypes, create an extra table. So you would have tables for residential_listing, land_listing, etc. The primary key for all of these tables would also be listingId. This column is also a foreign key to listing.

When you wish to operate on the shared data, you can do this entirely from the listing table. When you are interested in specific data you will join in the specific table. Some queries may be able to run entirely on the specific table if all the data is there.
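
The two access patterns described here could be sketched roughly as follows, again with LINQ to Objects so the example runs on its own. The row classes, the Bedrooms attribute and the SubtypeSearch helper are assumptions for illustration; the listing type values are the property types named in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ListingType { Residential, Commercial, Rental, Income, Land }

// Shared attributes only; ListingId is the primary key.
public class ListingRow
{
    public int ListingId { get; set; }
    public ListingType Type { get; set; }
    public decimal Price { get; set; }
}

// Hypothetical subtype table; ListingId is both primary key and foreign key.
public class ResidentialListingRow
{
    public int ListingId { get; set; }
    public int Bedrooms { get; set; }
}

public static class SubtypeSearch
{
    // Shared data only: answered from the listing table, no join.
    public static int CountCheaperThan(IEnumerable<ListingRow> listings, decimal maxPrice)
    {
        return listings.Count(l => l.Price < maxPrice);
    }

    // Shared plus type-specific data: join the subtype table on the shared key.
    public static List<int> FindResidential(
        IEnumerable<ListingRow> listings,
        IEnumerable<ResidentialListingRow> residential,
        decimal maxPrice,
        int minBedrooms)
    {
        return listings
            .Where(l => l.Type == ListingType.Residential && l.Price < maxPrice)
            .Join(residential.Where(r => r.Bedrooms >= minBedrooms),
                  l => l.ListingId, r => r.ListingId, (l, r) => l.ListingId)
            .OrderBy(id => id)
            .ToList();
    }
}
```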

Licensed under: CC-BY-SA with attribution