Tools for discovering de facto foreign keys in databases? [closed]

https://stackoverflow.com/questions/7031203

23-12-2020
|

Question

A good way to quickly survey the information in a database is to apply a tool that automatically creates a database diagram of all tables and all relationships between them.

In my experience, such tools use foreign keys as the relationships, which most of the databases I try them do not contain. Sure, they satisfy constraints corresponding to foreign keys, but do not enforce them. And I'll end up with a 'diagram' consisting of a bunch of unrelated tables.

So what I'm looking for is software that can compute "undeclared foreign keys" and either

uses them as table relations in a database diagram, or
generates SQL code for corresponding foreign key declarations

Do you know any tools, free if possible, that can already do this?

Solution

The following products are all claiming to provide foreign-keys discovery abilities:

ERwin http://www.ascent.co.za/products/ca_erwin_data_profiler.html

Informatica https://community.informatica.com/onlinehelp/analyst/961/en/index.htm#page/data-discovery-guide/GUID-33EAF039-ECFC-49FD-96F4-A2C2A4EB857F.1.148.html

and XCaseForI http://xcasefori.com/discovering/index.html

Statistical methodologies able to provide a kind of similarity rank like range distribution and creation time as suggested by Kirk, seems to be the right way. .. I'd need to implement it using SAS EG or any free tool.

OTHER TIPS

Interesting question. You're looking to parse a database schema and data to determine which tables are relevant or should be related to each other, without any strict definition of the relationship. In effect, you're trying to infer a relationship.

I see two ways that you can infer such a relationship. First let me say that your approach might vary depending on the databases you're working with. A number of questions spring to mind (I don't want answers, but they are worth reflecting on)

are these in-house enterprise systems that follow some consistent naming convention or pattern?
or are they 'in-the-wild' databases that you come across anywhere, at any time?
what sort of assumptions are you prepared to make?
would you prefer to get more false positives or false negatives in your result?

Note that this type of inference will almost certainly give false results, and is built on a lot of assumptions.

So I offer two approachs that I'd use in concert.

Inferring a relationship through structure / naming (symbolic analysis)

Common database design is to name a PK column after the table name (e.g. CustomerId on table Customer), or alternatively name the PK column simply Id.

A table with a FK relationship to another often names its related column the same as the related table. In the Order table I'd expect a CustomerId column which refers to the CustomerId / Id column in the Customer table.

This type of analysis would include

inspecting columns across tables for similar phrases / words
looking for columns names that are similar to the names of other tables
checking for column names that contain the name of other column (e.g. FirstCustomerId & SecondCustomerId both refer to the CustomerId column in the Customer table)

Inferring a relationship through data (statistical analysis)

Looking at data, as you suggest you have done in your comments, will allow you to determine 'possible' references. If the CustomerId column in the Order table contains values which don't exist in the Id column of the Customer table then it's reasonable to question that this is a valid relationship (although you never know!)

A simple form of data analysis is using dates and times. Rows that were created with close proximity to one another are more likely to be related to one another. If, for every Order row that was created, there also exist between 2 and 5 Item rows created within a few seconds, then a relationship between the two is likely.

A more detailed analysis might look at the range and distribution of used values.

For example, if your Order table has a St_Id column - you might infer using symbolic analysis that the column is likely to relate to either a State table or a Status table. The St_Id column has 6 discrete values, and 90% of the records are covered by 2 values. The State table has 200 rows, and the Status table has 9 rows. You could quite reasonably infer that the St_Id column relates to the Status table - it gives a more greater coverage of the rows of the table (2/3 of the rows are 'used', whereas only 3% of the rows in the State table would be used).

If you perform data analysis on existing databases to gather 'real life data', I'd expect some patterns that could be used as guides to structure inference. When a table with a large number of records has a column with a small number of values repeated many times (not necessarily in order), it's more likely to this column relates to a table with a correspondingly small number of rows.

In summary

Best of luck. It's an interested problem, I've just thrown some ideas out there but this is very much a trial & error, data gathering and performance tuning situation.

This is a non-trivial exercise in most cases. If you are lucky enough to be analysing a schema for a modern framework, such as Ruby on Rails, or CakePHP or similar, and the developers have been stringent about following column conventions, then you have a reasonable chance of finding many, but not all, of the implied relationships.

I.e. if your tables use columns like user_id to refer to entries in the users tables.

Be aware: some entity names may pluralise irregularly (entity being a good example: entities, not entitys) and these are harder to catch (but still possible). However, keys such as admin_id which the developers join with the users table on user.id can't be inferred. You would need to handle those cases manually.

You didn't specify an RDBMS, but I used MySQL a lot, and I'm currently working on this problem for myself.

The following MySQL script will infer most relationships implied by column names. It then lists any relationships that it could not find table names for, so at least you know which ones you're missing. The inferred parent and child are listed, along with singular and plural names, plus the implied relationship:

-- this DB is where MySQL keeps schema information
use information_schema;

-- change this to the DB you want to analyse
set @db_name = "example_DB";

-- infer relationships 
-- NB: this won't catch names that pluralise irregularly like category -> categories or bus_id -> buses etc.
select  LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 )              as inferred_parent_singular
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")  as inferred_parent_plural 
,       C.TABLE_NAME                                                  as child_table
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME)-3), "s has many ", C.TABLE_NAME) as inferred_relationship
from    COLUMNS C
JOIN    TABLES T on C.TABLE_NAME = T.TABLE_NAME 
        and C.TABLE_SCHEMA = T.TABLE_SCHEMA 
        and T.TABLE_TYPE != "VIEW"  -- filter out views; comment this line if you want to include them
where   COLUMN_NAME like "%_id"     -- look for columns of the form <name>_id
and     C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name 
-- and     C.TABLE_NAME not like "wwp%"  -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables e.g. wordpress tables
-- finally make sure to filter out any inferred names that aren't really tables
and     CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") -- this is the inferred_parent_plural, but can't use column aliases in the where clause sadly
          in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;

This will return results like this: enter image description here

Then you can examine any naming convention exceptions detected with:

-- Now list any inferred parents that weren't real tables to see see why (irregular plurals and columns not named according to convention)
select  LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ) as inferred_parent_singular
,       CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s") as inferred_parent_plural 
,       C.TABLE_NAME as child_table
from    COLUMNS C
JOIN    TABLES T  on  C.TABLE_NAME    = T.TABLE_NAME 
                  and C.TABLE_SCHEMA  = T.TABLE_SCHEMA 
                  and T.TABLE_TYPE   != "VIEW"            -- filter out views, comment this line if you want to include them
where   COLUMN_NAME like "%_id"
and     C.TABLE_SCHEMA = T.TABLE_SCHEMA and T.TABLE_SCHEMA = @db_name 
-- and     C.TABLE_NAME not like "wwp%"  -- uncomment and set a pattern to filter out any tables you DON'T want included, e.g. wordpress tables e.g. wordpress tables
-- this time only include inferred names that aren't real tables
and     CONCAT(LEFT(COLUMN_NAME, CHAR_LENGTH(COLUMN_NAME) - 3 ),"s")
          not in (select TABLE_NAME from TABLES where TABLE_SCHEMA = @db_name)
;

This will return results like this, which you can process manually: enter image description here

You can modify these scripts to spit out whatever is useful to you, include foreign key create statements, if you want to. Here, the final column is a simple 'has many' relationship statement. I use this in a tool I've built called called pidgin, which is rapid modelling tool that draws relationship diagrams on the fly based on relationship statements written a very simple syntax (called 'pidgin'). You can check it out at http://pidgin.gruffdavies.com

I've run the above script on a demo DB to show you the sort of results you can expect:

enter image description here

I haven't catered for the irregular plurals in my script, but I might have a go at that too, at least for the case of entities ending in -y. If you want to have a try at that yourself, I'd recommend writing a stored function that takes <name>_id column names as a parameter, strips the _id part and then applies some heuristics to attempt to pluralise correctly.

Hope that's useful!

I don't know about the softwares which may help in searching what you require, but The following query will help to get you started. It lists all Foreign Key Relationships within the current database.

SELECT
    K_Table = FK.TABLE_NAME,
    FK_Column = CU.COLUMN_NAME,
    PK_Table = PK.TABLE_NAME,
    PK_Column = PT.COLUMN_NAME,
    Constraint_Name = C.CONSTRAINT_NAME
FROM
    INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK
    ON C.CONSTRAINT_NAME = FK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK
    ON C.UNIQUE_CONSTRAINT_NAME = PK.CONSTRAINT_NAME
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU
    ON C.CONSTRAINT_NAME = CU.CONSTRAINT_NAME
INNER JOIN (
            SELECT
                i1.TABLE_NAME,
                i2.COLUMN_NAME
            FROM
                INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1
            INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2
                ON i1.CONSTRAINT_NAME = i2.CONSTRAINT_NAME
            WHERE
                i1.CONSTRAINT_TYPE = 'PRIMARY KEY'
           ) PT
    ON PT.TABLE_NAME = PK.TABLE_NAME

Hope this helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow