How to normalize a table with multiple cells that have multiple values?
16-03-2021
Question
So I'm doing a task where I take a massive list (30,000+) of movies from Wikipedia that has multiple columns (such as the film's name, the genre, the cast, the plot, etc.) and upload it into Elasticsearch. Having done that, I now want to get the table into at least 1NF. I'm not very experienced in database design, and the last time I did anything with normal forms was a few years ago. So I'm looking at this table and wondering how to put it into 1NF. It's easy if only one column has multiple values, but what do you do when several columns have multiple values, as seen below?
Film Name | Director | Cast | Genre | Wiki Page | Plot |
---|---|---|---|---|---|
Chimmie Fadden Out West | Cecil B. DeMile | Victor Moore | Comedy, Western | https://en.wikipedia.org/wiki/Chimme_Fadden_Out_West | Chimmie is sent out west... |
20,000 Leagues Under the Sea | Stuart Paton | Lois Alexander, Curtis Benton, Wallace Clarke, Allen Holubar | Action, Adventure | https://en.wikipedia.org/wiki/20,000_Leagues_Under_the_Sea_(1916_film) | A strange... |
The Cat and the Canary | Paul Leni | Laura La Plante, Forrest Stanley, Creighton Hale | Comedy, Horror, Mystery | https://en.wikipedia.org/wiki/The_Cat_and_the_Canary_(1927_film) | In a... |
Would you just have to do something like this...
Film Name | Director | Cast | Genre | Wiki Page | Plot |
---|---|---|---|---|---|
Chimmie Fadden Out West | Cecil B. DeMile | Victor Moore | Comedy | https://en.wikipedia.org/wiki/Chimme_Fadden_Out_West | Chimmie is sent out west... |
Chimmie Fadden Out West | Cecil B. DeMile | Victor Moore | Western | https://en.wikipedia.org/wiki/Chimme_Fadden_Out_West | Chimmie is sent out west... |
20,000 Leagues Under the Sea | Stuart Paton | Lois Alexander | Action | https://en.wikipedia.org/wiki/20,000_Leagues_Under_the_Sea_(1916_film) | A strange... |
20,000 Leagues Under the Sea | Stuart Paton | Lois Alexander | Adventure | https://en.wikipedia.org/wiki/20,000_Leagues_Under_the_Sea_(1916_film) | A strange... |
20,000 Leagues Under the Sea | Stuart Paton | Curtis Benton | Action | https://en.wikipedia.org/wiki/20,000_Leagues_Under_the_Sea_(1916_film) | A strange... |
20,000 Leagues Under the Sea | Stuart Paton | Curtis Benton | Adventure | https://en.wikipedia.org/wiki/20,000_Leagues_Under_the_Sea_(1916_film) | A strange... |
20,000 Leagues Under the Sea | Stuart Paton | Wallace Clarke | Adventure | https://en.wikipedia.org/wiki/20,000_Leagues_Under_the_Sea_(1916_film) | A strange... |
20,000 Leagues Under the Sea | Stuart Paton | Wallace Clarke | Action | https://en.wikipedia.org/wiki/20,000_Leagues_Under_the_Sea_(1916_film) | A strange... |
...and so on? I'm surely missing something extremely simple when it comes to converting a table in which several columns hold multiple values into 1NF, but I'm not sure what.
Thanks.
Solution
It's actually pretty easy to normalize when multiple fields each hold a varying number of values in a single row. Just follow this rule: any column that holds multiple data points within a single row should become its own table. In your example, that means `Cast` and `Genre`. It's immediately apparent that those two columns represent a many-to-many relationship, precisely because multiple values are stored in a single column of the same row.
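As a rough illustration of that rule (a minimal sketch, not part of the original answer): the dict keys and the helper name `split_multi_valued` are made up for this example, and it assumes the values in a cell are separated by a comma followed by a space. Each multi-valued column is pulled out into its own list of distinct values plus a list of (film, value) pairs.

```python
def split_multi_valued(rows, column, sep=", "):
    """Return (distinct values, (film name, value) pairs) for one multi-valued column."""
    values = []  # distinct values in first-seen order: rows of the new table
    pairs = []   # one (film, value) pair per value: rows of the linking table
    for row in rows:
        for value in row[column].split(sep):
            value = value.strip()
            if value not in values:
                values.append(value)
            pairs.append((row["film_name"], value))
    return values, pairs


# Hypothetical in-memory copy of two rows from the question's table.
films = [
    {"film_name": "Chimmie Fadden Out West",
     "cast": "Victor Moore",
     "genre": "Comedy, Western"},
    {"film_name": "The Cat and the Canary",
     "cast": "Laura La Plante, Forrest Stanley, Creighton Hale",
     "genre": "Comedy, Horror, Mystery"},
]

genres, film_genres = split_multi_valued(films, "genre")
cast, film_cast = split_multi_valued(films, "cast")

print(genres)
print(film_genres)
```

Each column is handled on its own, so the number of rows grows with the number of values in that column (five genre pairs and four cast pairs here) rather than with every cast-by-genre combination as in the question's second table.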
As nbk mentions, you'll need a linking (bridge) table to store that many-to-many relationship. So while your new `Cast` table may have columns like `CastId` (primary key), `FirstName`, and `LastName`, your linking table between `Cast` and `Film` would be named something like `FilmCast`. It would have a `FilmId` field with a foreign key reference to your `Film` table, and a `CastId` field with a foreign key reference to the `Cast` table. Every row in the `FilmCast` linking table then represents a single specific `Cast` person for a single specific `Film`.
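The three tables described above might look roughly like this in SQLite, driven from Python's standard `sqlite3` module. This is only a sketch under assumptions: the `FilmName` column, the column types, and the composite primary key on the linking table are choices made for the example, not something the answer specifies. `Cast` is double-quoted because `CAST` is also an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript('''
    CREATE TABLE Film (
        FilmId   INTEGER PRIMARY KEY,
        FilmName TEXT NOT NULL
    );
    CREATE TABLE "Cast" (
        CastId    INTEGER PRIMARY KEY,
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL
    );
    -- Linking table: one row per (film, cast member) pair.
    CREATE TABLE FilmCast (
        FilmId INTEGER NOT NULL REFERENCES Film(FilmId),
        CastId INTEGER NOT NULL REFERENCES "Cast"(CastId),
        PRIMARY KEY (FilmId, CastId)
    );
''')

film_id = conn.execute(
    "INSERT INTO Film (FilmName) VALUES (?)", ("The Cat and the Canary",)
).lastrowid
for first, last in [("Laura", "La Plante"), ("Forrest", "Stanley"), ("Creighton", "Hale")]:
    cast_id = conn.execute(
        'INSERT INTO "Cast" (FirstName, LastName) VALUES (?, ?)', (first, last)
    ).lastrowid
    conn.execute("INSERT INTO FilmCast (FilmId, CastId) VALUES (?, ?)", (film_id, cast_id))

# Everyone in the cast of one film, found through the linking table.
cast_of_film = conn.execute(
    '''SELECT c.FirstName, c.LastName
       FROM FilmCast fc JOIN "Cast" c ON c.CastId = fc.CastId
       WHERE fc.FilmId = ?
       ORDER BY c.CastId''',
    (film_id,),
).fetchall()
print(cast_of_film)
```

With foreign keys enabled, a `FilmCast` row that points at a film or cast member that does not exist is rejected, which is what keeps the linking table consistent.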
You would repeat this same approach for every other column in your `Film` table that has multiple data points per row. Once you have the appropriate tables for each normalized column, you no longer need to store that data in the main `Film` table and can remove those columns from it.
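To illustrate why the multi-valued column can safely be dropped from the main table, here is a hedged sketch that loads the genre cells from the question into separate tables and then rebuilds the per-film genre list with a join. The `Genre` and `FilmGenre` names simply mirror the `Cast` / `FilmCast` naming above and are assumptions, as are the column names and the `", "` separator.

```python
import sqlite3

# (film name, raw multi-valued genre cell) as in the question's table.
raw_rows = [
    ("Chimmie Fadden Out West", "Comedy, Western"),
    ("20,000 Leagues Under the Sea", "Action, Adventure"),
    ("The Cat and the Canary", "Comedy, Horror, Mystery"),
]

conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE Film (FilmId INTEGER PRIMARY KEY, FilmName TEXT NOT NULL);
    CREATE TABLE Genre (GenreId INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
    CREATE TABLE FilmGenre (
        FilmId  INTEGER NOT NULL REFERENCES Film(FilmId),
        GenreId INTEGER NOT NULL REFERENCES Genre(GenreId),
        PRIMARY KEY (FilmId, GenreId)
    );
''')

for film_name, genre_cell in raw_rows:
    film_id = conn.execute(
        "INSERT INTO Film (FilmName) VALUES (?)", (film_name,)
    ).lastrowid
    for name in genre_cell.split(", "):
        # INSERT OR IGNORE keeps a single row per distinct genre name.
        conn.execute("INSERT OR IGNORE INTO Genre (Name) VALUES (?)", (name,))
        genre_id = conn.execute(
            "SELECT GenreId FROM Genre WHERE Name = ?", (name,)
        ).fetchone()[0]
        conn.execute("INSERT INTO FilmGenre VALUES (?, ?)", (film_id, genre_id))

genre_count = conn.execute("SELECT COUNT(*) FROM Genre").fetchone()[0]

# Film has no genre column, yet the original lists can still be rebuilt.
rebuilt = {
    film: set(genres.split(","))
    for film, genres in conn.execute('''
        SELECT f.FilmName, group_concat(g.Name)
        FROM Film f
        JOIN FilmGenre fg ON fg.FilmId = f.FilmId
        JOIN Genre g ON g.GenreId = fg.GenreId
        GROUP BY f.FilmId
    ''')
}
print(genre_count)
print(rebuilt)
```

The result is compared as sets because `group_concat` does not promise any particular order for the values it joins.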
OTHER TIPS
Normalisation removes information that is repeated many times from your tables, and integer ids are smaller than any text.
You need the bridge tables because you have an m:n relationship between films and users (cast, director, musician...).
Occupation is, in my opinion, an attribute of the relationship between film and user.
Film (idfilm, Titel, plot, Wiki_Page, year, ...)
Film2user (idfilm, iduser, idtype)
type (idtype, occupation)
user (iduser, Name, Birth, ...)
genre (idgenre, name)
Film2genre (idfilm, idgenre)
As you develop further, you can add more attributes or tables if you find more such redundant information.
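A possible reading of the `Film2user` / `type` part of that outline, sketched with Python's `sqlite3`. The column types, the composite primary key, and the sample occupations (`director`, `cast`) are assumptions added for illustration, and only a subset of the listed columns is included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript('''
    CREATE TABLE Film (idfilm INTEGER PRIMARY KEY, Titel TEXT NOT NULL);
    CREATE TABLE user (iduser INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE type (idtype INTEGER PRIMARY KEY, occupation TEXT NOT NULL UNIQUE);
    -- The occupation is stored on the relationship, not on the person.
    CREATE TABLE Film2user (
        idfilm INTEGER NOT NULL REFERENCES Film(idfilm),
        iduser INTEGER NOT NULL REFERENCES user(iduser),
        idtype INTEGER NOT NULL REFERENCES type(idtype),
        PRIMARY KEY (idfilm, iduser, idtype)
    );
''')

conn.execute("INSERT INTO Film VALUES (1, 'The Cat and the Canary')")
conn.executemany("INSERT INTO user VALUES (?, ?)", [(1, "Paul Leni"), (2, "Laura La Plante")])
conn.executemany("INSERT INTO type VALUES (?, ?)", [(1, "director"), (2, "cast")])
conn.executemany("INSERT INTO Film2user VALUES (?, ?, ?)", [(1, 1, 1), (1, 2, 2)])

# Who worked on film 1, and in which role?
film_credits = conn.execute('''
    SELECT u.Name, t.occupation
    FROM Film2user fu
    JOIN user u ON u.iduser = fu.iduser
    JOIN type t ON t.idtype = fu.idtype
    WHERE fu.idfilm = 1
    ORDER BY u.iduser
''').fetchall()
print(film_credits)
```

Because `idtype` is part of the key, the same person can appear on one film in more than one role, for example as both director and cast member.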