Share on

Introduction

Foreign keys describe relationships, and the entity-relationship diagrams (ERDs) introduced in Correctness and Constraints map networks or graphs of those foreign keys. In these examples, there are only a few tables and relationships among them, but a visual layout is still a useful reference when it comes time to make sure every required relationship is accounted for. With larger databases, ERDs are invaluable, full stop. Many database clients have built-in tools to generate diagrams, although manual adjustment is usually required to make them readable.

The first steps toward a database schema design for tracking books and patrons in a library system.

Several ERD notations exist. The full "crow's foot" notation, one of the oldest and most influential, defines symbols for 0 (a ring), 1 (a dash), or many (the eponymous crow's foot, as above) records. Each line represents a relationship between two tables and has not just one but two of these symbols on each end, with each pair establishing the minimum and maximum for that side.

This attention to detail is at least in part an historical artifact from the days when running a database server on a workstation would be unheard of, and in the modern era few ERDs are that formally specified. As with the diagrams here, a symbol for "one at most" and a symbol for "zero to many" are enough to get the gist across, and there's rarely a need for a level between that and sharing the SQL scripts themselves anymore.

Cascading behaviors

Inserting an invalid author_id into books isn't the only way to violate a foreign key constraint: changes to authors can also invalidate existing data in books. Back in Correctness and Constraints, unenforced foreign keys led to a copy of Don Quixote with a spurious author_id. How to resolve the contradiction between Pierre Menard and Miguel Cervantes?

If the Menard record could be deleted from authors, the copy of the Quixote in question would no longer have a valid author_id. The database rejects this, as violations cannot be allowed either from the child or the parent table. To get rid of Pierre Menard, one must first dispose of Don Quixote, either by deleting it or by changing its author_id.

As the web of constrained relationships grows larger, cleaning up such dependent records becomes more and more complicated. Deleting an author entails deleting all their books; deleting a library requires the same, plus removing its patrons -- and any table with a foreign key to books or patrons must be deleted first of all, lest those foreign key constraints be violated in turn.

A DELETE against a parent table is often intended to prune an entire tree of relationships: a library and its books and its patrons, in one fell swoop (sometimes it isn't meant to do this, which makes knowing where your CASCADEs are very important!). Since foreign key constraints reify these relationships, making of them objects to be acted upon, they can also help automate responses to changes in the parent table. A constraint declaring ON DELETE SET NULL will void the first foreign key values only, without traversing the relationship graph any further. ON DELETE CASCADE ensures that a DELETE to authors will automatically delete those authors' books as well, and onward through any foreign key that declares books a parent table.

On occasion, natural primary key values may also change as standards and formats update or as the assumption that the natural key was immutable turns out to have been incorrect. Most RDBMSs support an ON UPDATE CASCADE behavior for this eventuality.

The future is here and everything needs to be destroyed

Even if real libraries or authors are never supposed to be deleted (only deactivated, or "soft-deleted"), both automated and manual tests often require a fresh, empty database up front or even for each individual test. Dropping and recreating the database breaks connections, requires elevated permissions, and is the slowest possible solution to boot.

The usual recourse is a "teardown" function or script which deletes anything previous tests might have inserted into the database, table by table. Without CASCADE directives, these deletions must be arranged carefully in a topological sort order around the relationship graph to avoid ever violating a foreign key constraint. With CASCADEs, teardown mostly takes care of itself once you delete records at the center of each of the database's various relationship graphs.

Key positioning

Both libraries and authors preexist any useful record of the books they lend and write respectively. These cases correspond to the "has-a" relationship type in object-oriented programming, which in database design requires the foreign key to be stored in the dependent table, books.

Other cases aren't so clear. Imagine that some books are themselves on loan to libraries from outside collections, and that their original provenances are tracked separately. Should all books have a provenance_id, or should the provenances table have a book_id column?

Expanding the libraries schema to begin tracking provenance for individual books.

Both solutions will serve the purpose of tracking provenance. However, in the case of books.provenance_id, there's no way to follow the link back to a book from its provenance -- one would have to search books for a matching provenance_id. And since most books have no special provenance, most values for provenance_id will be NULL.

In this situation, the provenances.book_id approach is clearly superior. The book_id link is traceable, columns are used efficiently, and provenances.book_id is even a primary key, since a single book shouldn't have come to a library from more than one place. De Haan and Koppelaars would call provenances a specialization of books, a table which adds supplementary information to records in its parent identified by the same primary key. The connection between books and provenances is a "one-to-one" relationship, since only one of any book_id value can exist in either table.

CREATE TABLE provenances (
book_id INT NOT NULL PRIMARY KEY,
collection TEXT NOT NULL
);

Properly speaking, provenance includes the entire chain of custody of an artifact, not just who had it last. If necessary for our purposes, this complicates the picture somewhat: with multiple records per book in provenances, book_id is no longer a primary key. A provenances table in which records are not (or not only) identified by the foreign key is no longer a specialization, but the "many" side of a "one-to-many" relationship -- or, viewed from the other direction, of a "many-to-one".

CREATE TABLE provenances (
book_id INT NOT NULL REFERENCES books (book_id),
-- a numeric index (most recent, second most recent, third,
-- and so on) is not strictly required, since the duration
-- could form part of the primary key. However, a range in
-- the primary key makes certain queries, like "who last
-- held most of our books?", more difficult to formulate.
custody_index INT NOT NULL DEFAULT 1,
collection TEXT NOT NULL,
duration DATERANGE NOT NULL,
PRIMARY KEY (book_id, custody_index),
-- custody of the same book shouldn't overlap; remember
-- that the btree_gist Postgres extension is required!
EXCLUDE USING GIST (
book_id WITH =,
duration WITH &&
)
);

Many-to-many relationships

Elsewhere in the example schema, patrons have a library_id value. This expresses a very important -- and likely a very wrong -- assumption: any person will patronize one library and one library only. If someone goes to another library, they'll have to enter all their information all over again. This violates another important assumption, namely, that a single record in patrons corresponds to a single person. Both can't be true.

There's a second problem with a similar solution: we aren't yet tracking who's checked a book out. A single patron can borrow many books, while a single book can be checked out many times. Structurally, this is nearly identical to the case of a single library having many patrons, who themselves may borrow from multiple libraries.

Adding junction tables to the model allows the relationships between patrons and books, and patrons and libraries, to be fully represented.

A "many-to-many" relationship has to be represented in a dedicated table, often called a junction or bridge table, among other names. The junction table maintains foreign keys to each table it mediates, making it the "many" side of its relationships to those tables. A primary key across each foreign key prevents duplicates of the same relationship.

library_patrons, a textbook example of a junction table, looks like this:

CREATE TABLE library_patrons (
library_id INT NOT NULL REFERENCES libraries (library_id),
patron_id INT NOT NULL REFERENCES patrons (patron_id),
PRIMARY KEY (library_id, patron_id)
);

checkouts, you might have noticed, doesn't follow the same naming convention as library_patrons -- it's not patron_books or vice versa. That's because it's more than a junction table. Like library_patrons, checkouts maintains foreign keys to the tables it connects in a many-to-many relationship, but it must also include information about each patron-book connection: the date checked out, the date due or returned, whether an extension has been granted. It's also perfectly possible for someone to check out the same book multiple times, so (patron_id, book_id) is not a viable primary key.

Building blocks

Many-to-many relationships are only one possible composition of the two primary relationship types. They're common enough that diagrams often omit library_patrons-style junction tables entirely in favor of representing both ends with a "many" symbol. But every network of connections among tables, no matter how complicated, is reducible into its constituent one-to-one and one-to-many relationships.

Boundaries

A single database may (and often does) contain multiple networks of foreign key relationships. The reverse, however, is usually not true. MySQL and MariaDB alone among the popular relational databases conflate schemas and databases, and therefore allow cross-database foreign keys as long as both databases are hosted on the same server. Others do not.

We'll come back to organizing tables in databases and in schemas within databases later, but it's useful to consider a multi-table relationship graph the indivisible unit of database layout in the same way that the combined attributes of a given concept form the basis for a table layout.

About the Author(s)
Dian Fay

Dian Fay

Dian didn't exactly plan to drop out of college to specialize in SQL and backend development, but that’s how it happened. Fifteen years later, she's designed databases supporting everything from industrial logistics and traceability systems to million-plus user social media games. She is the current maintainer of MassiveJS, an open source data mapper for Node.js focused on using PostgreSQL to the fullest.