ECommerce Storefront Website: Discovering Similar Products Programmatically

Question 1

You ask a lot of questions. I will try and address each one without going into too much detail.

CTEs vs. derived tables is syntactical sugar. It makes no difference performance-wise. The only benefit to using them is that you can reuse them instead of copy/paste/typing-out a derived table again. However, you aren't reusing them in this case, so it's up to you.
Indexed views: Keep in mind, indexes on the views act like indexes on the table(s) with little exception. Picture it like another table is being created for your specific query/view and being stored on disk for faster retrieval. When the underlying data changes, these indexes have to be updated. Yes, this can create a huge resource impact. In general, I'd rather see someone write a query that uses the indexes on the base table, and if they need more indexes for a specific purpose, then look at that in detail rather than holistically on a view with multiple tables. This is far easier to maintain and a lot easier to figure out why your CRUD is taking longer than expected. There is nothing necessarily wrong with an indexed view. But, be very careful with adding this on a application database model like this because of the complexity to tables that are updated/inserted/deleted from. Most of the more appropriate uses for an indexed view is in a reporting data warehouse. Regardless, don't put an index on a view without understand what it will do to the tables for CRUD (create, read, update, delete) operations. And in a CRM or application support type of database, I'd stay away from them for the most part unless there is a static need and it's not really impacting performance.

Read this article: http://technet.microsoft.com/en-us/library/ms187864(v=sql.105).aspx

Note about 3/4 of the way down the page it talks about where NOT to use one and I think your case fits in to 4 / 5 of the scenarios where you should NOT use it.

Regarding saving joins... keep in mind FULL OUTER joins are one of the worst offenders for efficiency. It seems to me the only reason you have it there, is because you aren't including manufacturer in your CTEs. You could just include it in your CTEs and then aggregate the number of matches by cat/tag in the final query which pulls it together to get your score. This way you only have two left outer joins (one to each CTE) and then sum the two counts together and group by same manufacturer (case statement), productId etc.
Finaly... I would consider placing all of this in a de-normalized table or maybe even a cube where it's precalculated. Let's consider several things about your requirement: a. Does the correlation score for products need to be live? If yes, why? This isn't mission critical when new products are added / removed. Anyone who says it needs to be live, probably doesn't really mean it. b. Speed of retrieval. I could rewrite your query using temp tables, make sure the indexes are correct etc. and come up with a reasonably faster query in a stored procedure. But, I am still aggregating data from the DB to be displayed all over my store front every time the page loads. If the data is precalculated and stored in a separate table of productIds and scores for every product and indexed by productId, the retrieval would be very fast. You could trunk and reload the table in an ETL nightly, hourly / whatever and wouldn't have to worry about maintaining the indexes that are rebuilt every time. Of course, if your store front is 24/7/365, you'll need to write some database side code to worry about versioning so that your application never has to wait if the db is in the middle of recalculating.

Also, make sure you at least cache this information on the web/application server if nothing else. One thing is for sure, if you go with your solution above, then you'll need to build something into your site so it doesn't wait for data to return and caches it instead.

Hope all that helps.

Question 2

How about a somewhat different approach?

;WITH ProductFindings (ProductId, NbrTags, NbrCategories)
AS
(
    SELECT  t2.ProductId, COUNT(t2.TagId), 0
    FROM    ProductTagMappings AS t1 
    INNER JOIN
            ProductTagMappings AS t2  ON t1.TagId      = t2.TagId
                                     AND t1.ProductId != t2.ProductId
    WHERE   t1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
    GROUP BY t2.ProductId

  UNION ALL

    SELECT  c2.ProductId, 0, COUNT(c2.CategoryId)
    FROM    ProductCategoryMappings AS c1 
    INNER JOIN
            ProductCategoryMappings AS c2  ON c1.CategoryId = c2.CategoryId
                                          AND c1.ProductId != c2.ProductId
    WHERE c1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
    GROUP BY c2.ProductId

), ProductTally (ProductId, TotTags, TotCategories) as
(
   SELECT ProductID, sum(NbrTags), sum(NbrCategories)
   FROM     ProductFindings
   GROUP BY ProductID
)
SELECT  Tot.ProductId      AS ProductId
       ,Tot.TotTags        AS TagsInCommon
       ,Tot.TotCategories  AS CategoriesInCommon
       ,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId
               THEN 1
               ELSE 0 
         END               as SameManufacturer
       ,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId 
               THEN 1
               ELSE 0
         END + Tot.TotTags + Tot.TotCategories
                           as SimilarityScore
FROM    ProductTally as Tot
INNER JOIN  Products     AS SimilarProduct   ON Tot.ProductID = SimilarProduct.Id 
INNER JOIN  Products     AS SourceProduct    ON SourceProduct.Id = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'