Introduction to databases

Glossary of common database terminology

Share on


When dealing with databases, there is a lot of terminology that you must learn in order to understand the technology, how best to use it, and how it relates to other parts of your environment. This glossary aims to collect common terminology used in the database community and provide definitions and context to help you grow your knowledge.

This glossary is a work in progress and a living document. We intend to update it to add new topics and refine the existing entries as time goes on. We have a backlog of terms we hope to add in the near future, but if you have anything you'd like us to talk about, please open a GitHub issue to add your suggestions.


1NF, or first normal form, describes a type of database normalization where each table column only has a single value. A column that has a nested table as a value or multiple values is not in 1NF.

2NF, or second normal form, describes a type of database normalization that: 1) satisfies the requirements of 1NF, 2) has no values that are tied directly to a subset of a candidate key. In other words, a relation is in 2NF if it is in 1NF and all of the non-candidate values are dependent on the composite key in whole, not just a portion of the candidate key. For example, a `book` table that has a candidate key composed of `title` and `author` cannot be in 2NF if it also includes a `dob` field describing the author's date of birth. That column value is dependent only on the value of `author` and could lead to inconsistencies if the values get out of sync.

3NF, or third normal form, describes a type of database normalization that: 1) satisfies the requirements of 2NF, 2) each non-key attribute is not transitively dependent on a key attribute. For example, if a `user` table has a `user_id` column as a primary key, a `user_city` column, and a `user_state` column, it would not be in 3NF because `user_state` is transitively dependent on `user_id` through `user_city` (the city and state should be extracted to their own table and referenced together).

4NF, or fourth normal form, describes a type of database normalization that: 1) satisfies the requirements of BCNF, 2) for every non-trivial multivalued dependency, the determining attribute in the dependency is either a candidate key or a superset of it. In other words, if a field has multiple dependent fields that are independent from one another, it can lead to redundancies that violate 4NF rules.

ACID — an acronym created from the words atomicity, consistency, isolation, and durability — describes a set of characteristics that database transactions are meant to provide. Atomicity guarantees that all operations in a transaction will complete successfully or will be rolled back. Consistency, often considered a property maintained by the application rather than the database, is often achieved through transactions to make sure that all related values are updated at once. Transaction isolation aims to allow simultaneous transactions to execute independently. Durability means that transactions are meant to be stored on non-volatile storage when committed.

Access control list (ACL)
An access control list, often shorted to ACL, is a security policy list that dictates which actions each user or process can perform on which resources. There are many different types of ACLs, but they each describe the permissions and access patterns that are allowed by a system.

Active record ORM
An active record ORM is an object-relational mapper that functions by trying to represent each table in a database as a class in the application. Each record in the table is represented as an instance of the class. Database entries are added and managed by interacting with these representations in the application.

Anti-caching is a strategy that can be used when data is not found in the faster in-memory cache and must be retrieved from slower, persistent storage. The technique involves aborting the transaction and kicking off an asynchronous operation to fetch the data from the slower medium to memory. The transaction can be retried later and the information will be ready to served from memory.

Atomicity is a quality mainly associated with database transactions that means that the operations encapsulated in the transaction are handled in an all-or-nothing fashion. This prevents partial updates from occurring where some operations were performed before an error condition arose, leading to inconsistent data. In the case of transactions, either all of the operations are committed or every operation is rolled back to leave the database in the same state that it was in when the transaction began.

Attributes are characteristics that describe a certain entity in a database. In the ER (entity-relationship) model, attributes are any additional properties that are not relationships that add information about an entity.

Authentication is an action that validates an identity. In computing and databases, authentication is mainly used as a way to prove that the person or process requesting access has the credentials to validate that they can operate with a specific identity. In practical terms, this might include providing an identity (like a username) and associated authentication material (such as a password, a certificate or key file, or a secret generated by a hardware device belonging to the person associated with the identity). Authentication is used in conjunction with authorization to determine if a user has permission to perform actions on a system.

Authorization is an action that determines if a certain user or process should be allowed to perform a certain action. Authorization involves checking the requested action against a set of guidelines that describe who should be allowed perform what actions. Authorization usually relies on a trusted authentication process to take place before the request in order to confirm the subject's identity.

Availability is a property that describes the degree to which a system is running and capable of performing work. In terms of computing systems like databases, for a single machine, availability is synonymous with the uptime of the application on that computer. For distributed systems, availability is subject to rules that dictate in what capacity the system is allowed to continue functioning if a subset of the system is unavailable.

BASE — an acronym created from the words Basically Available, Soft-state, and Eventually consistent — describes a set of characteristics of some NoSQL databases. It is offered as an description for certain databases that do not conform to the properties described by ACID-compliance (atomicity, consistency, isolation, and durability). BASE databases choose to remain available at the expense of strict data consistency in cases of network partitions. The soft-state component refers to the fact that the state of the system can be in flux as the different members negotiate the most correct values in the system. Eventually consistent is another related statement indicating that the system will eventually achieve consistency given enough time and assuming new inconsistencies aren't introduced during that time.

BCNF, or Boyce-Codd normal form, describes a type of database normalization that: 1) satisfies the requirements of 3NF, 2) where the determining attribute in each dependency (the attribute that dictates another attribute's value) is either a _superset_ of the dependent attribute, is a candidate key, or is a superset of a candidate key.

Blue-green deployments
Blue-green deployments are a technique for deploying software updates with little to no downtime by managing active traffic between two identical sets of infrastructure. New releases can be deployed to the inactive infrastructure group and tested independently. To release the new version, a traffic routing mechanism is switched to direct traffic from the current infrastructure to the infrastructure with the new version. The previously-active infrastructure now functions as the target for the next updates. This strategy is helpful in that the routing mechanism can easily switch back and forth to roll backwards or forwards depending on the success of a deployment.

In computing, a bottleneck occurs when the performance or capacity of a system is limited by contention around a single component. In databases, this can be related to the hardware that the database runs on or the network environment that is available. Application usage patterns can also affect which resource is most under contention. To solve bottlenecks, you must first identify the resource limiting your system's performance and then either add additional capacity or take measures to reduce the rate of usage.

CAP theorem
CAP theorem is a statement about distributed databases that states that any system can only provide at most two out of the following three qualities: consistency, availability, and partition tolerance. Generally, it is agreed that partition tolerance must be a feature of any distributed system (as the only way to avoid all network partitions is to have a non-distributed system). Therefore, each distributed system must make a decision as to whether they want to prioritize data consistency (by not accepting new changes in the case of a partition) or system availability (by sacrificing some consistency for the sake of still being able to introduce new changes during the partition).

CRUD — an acronym standing for Create, Read, Update, and Delete — describes the basic operations that one uses to operate on stored data. In SQL, the components of CRUD broadly correspond to the operations `INSERT`, `SELECT`, `UPDATE`, and `DELETE`, but many other operations facilitate more granular actions. More generally, CRUD is also often discussed in the context of user interfaces and APIs as a description of the types of actions that a system may permit.

A cache is a component of a system designed to allow faster retrieval for high value or frequently requested pieces of data. In general, caches function by storing a useful fraction of data on media that is either higher performance or closer to the client than the general use persistent media focused on long term, non-volatile storage. In general, caches tend to be higher performance but tend to have more limited capacity and be more expensive.

Cache invalidation
Cache invalidation is the process of targeting and removing specific items from a cache. Most often, this is performed as part of a routine when updating records so that the data in the cache does not serve stale data to clients.

Canary releases
A canary release describes a release strategy where new versions of software are deployed to a small subset of servers to test new changes in an environment with limited impact. The deployment and resulting behavior of the test group are observed and the team can then decide if they want to roll back the changes or continue to deploy the changes to a wider range of hosts. Canary releases are a way of testing in production while limiting the number of clients impacted by any problems.

Candidate key
A candidate key in a relational database is the term for a minimal superkey. In other words, a candidate key is any column or combination of columns that can be used to uniquely identify each record in a relation without including columns that do not help in specificity. In a `cars` table, a unique `car_id` column would be a candidate key as well as a combination of the `make`, `model`, and `year` columns (assuming that's specific enough to eliminate any duplicates). However, `car_id` and `make` would not be a candidate key since in this instance, `make` does nothing to narrow down the uniqueness of each row.

In relational databases, cascade is an option for how to handle deletes or updates for records that have related entries in other tables. Cascade means that the operation (delete or update) should be applied to the child, dependent rows as well. This helps you avoid orphaned rows in the case of deletes and out of sync values in the case of updates.

Apache Cassandra is a distributed, wide-column NoSQL database focused on operating on and managing large volumes of data. Cassandra scales incredibly well and each node in the cluster can accept reads or writes. Data is stored in rows that are uniquely identifiable and partitioned based on partition key. Each partition key returns a row of data with both column names and values defined internally, meaning each row in the same column family may contain different columns.

Check constraint
A check constraint is perhaps the most flexible table or column constraint that can be added to a relational database. It is defined as a boolean condition that must be met for the proposed data to be accepted by the system. Because of the nature of the condition is fairly open-ended, check constraints can be used to model many different types of requirements to ensure that the data coming into the system conforms to expectations.

In computing, a cluster is a group of computers all dedicated to helping with a shared task. Database clusters are used to increase the capacity, availability, and performance of certain types of actions compared to database deployed on a single computer. There are many different topologies, technologies, and trade-offs that different clustered systems employ to achieve different levels of performance or fault tolerance. Because of the diversity of different implementations, it can be difficult to generalize specific characteristics that apply to all clustered database systems.

Collation in databases refers to the ordering and comparison characteristics of different character systems. Most databases allow you to assign collation settings, which impact how text in the system are sorted, displayed, and compared against one another. Collation is often defined using a set of labels that describe the character set, language context, and different options about sensitivity or insensitivity to capitalization, accents, and other character modifiers.

In document databases, collections are containers that are used to store groups of documents together. The collections may have semantic meaning assigned by the application and database designers, but otherwise are simply a way to partition different sets of documents from one another in the system. Different collections can be assigned different properties and actions can be performed targeting specific collections of documents.

Columns are a component of table-oriented databases that label and potentially define the type of each value stored in that column. In traditional relational databases, the properties of a series of columns are one of the primary ways of defining the properties of the table in general. Each row added to the table must provide values that conform to the requirements associated with the table's columns. In non-relational databases, columns can have many different properties. Generally, however, they are used to label and define the characteristics for values that records choose to store in that column.

Column database
A column database or column-oriented database is a table-oriented database similar to a traditional relational database that stores data in the background by column instead of by record. This means that the data associated with a single column are stored together rather than grouping all of the data associated with a single record. This can provide different performance characteristics depending on usage patterns, but generally doesn't affect how the user interacts with the data in the table on a daily basis. Although often confused in the literature, column databases are not to be confused with wide column databases or column family databases.

Column family
A column family is a database object that stores groups of key-value pairs where each key is a row identifier and each value is a group of column names and values. All together, a column family constructs something that is akin to a table in relational databases. However, each row can define its own columns, meaning that rows are of varying lengths and do not have to match each other in the columns represented or the data types stored.

Command query responsibility segregation
Command query responsibility segregation is a application design pattern that allows you to separate operations based on their impact on the underlying database. In general, this usually means providing different mechanisms for queries that read data versus queries that change data. Separating these two contexts allows you to make infrastructure and system changes to scale each use-case independently, increasing performance.

In the context of databases, committing data is the process whereby you execute and durably store a set of proposed actions. Many databases are configured to automatically commit each statement as it is received by the system, but transactions, for example, are one mechanism through which you can control the commit behavior of the database by grouping multiple statements together and committing them as a group. Committing in database is the action that is actually responsible for performing a permanent action on the system.

Composite key
In relational databases, a composite key is a key composed of two or more columns that can be used to uniquely identify any record in a table. For example, if we have a `shirts` table that only stores a single record for each combination of size and color could have a composite key defined by a combination of the `color` and `size` columns.

Concurrency is the ability of a system to work on multiple tasks at once without affecting the overall result. Concurrency allows systems to execute operations in parallel, increasing the relative performance of the group of tasks.

Consistency is a property of data systems that means that the individual data entities do not conflict and continue to model the information they intend to even as changes are introduced. Each piece of data and change must be validated to ensure that it conforms to the rules imposed on the data structures and care must be taken to balance out any changes that should impact other data (like debiting and crediting different accounts at the same time).

A constraint is a limitation imposed on a specific column or table that impacts the range of values accepted by the system. Constraints are used to define rules that the database system can enforce to ensure that values conform to requirements.

A database cursor is a way for clients to iterate over records and query results in a controlled, precise manner. Cursors are primarily used to page through results that match a query one-by-one by iteratively returning the next row for processing. This can help you operate on an unknown number of records by accessing the results as a queue. Care must be taken when using cursors as they take up resources on the database system, can result in locking, and often result in many more network round trips than would be required otherwise.

In the broadest sense, data are facts or pieces of information. They are measurements or values that contain information about something. In some contexts, data is defined as distinct from information in that information is analyzed or processed data while data consists only of raw values. Practically speaking, however, these terms are often used as synonyms and typically encapsulate any fact along with the relevant context necessary to interpret or contextualize it. Data is an essential component of almost all communication and activity and it can gain meaning and value as it is collected, analyzed, and contextualized.

Data definition language (DDL)
A data definition language, or DDL, is a set of commands or actions that are used to define database structures and objects. They are a key component to relational and other databases and are expressed as a subset of the available commands available to manage data in languages like SQL. Data definition language is the portion of the language dedicated to describing, creating, and modifying structures and the frameworks that will hold data.

Data independence
Data independence is a term used to describe the separation of database clients or applications from the underlying structure responsible for representing and storing the data. Data independence is achieved if the database is able to abstract the structure in a way that allows user applications to continue running even if additional attributes are added to a relation (logical independence) or if the details of the storage medium changes (physical independence), for instance.

Data mapper ORM
A data mapper ORM, or just simply a data mapper, is an application component that acts as a go between to translate between database representations and the data structures present in applications. Data mappers allow your application logic and database data representations to remain independent. The data mapper manages and translates data between these two mediums so that each representation is independent and can be structured intelligently.

Data type
A data type is a category or attribute that expresses a constraint on valid values. For example, an integer type specifies that only whole numbers are appropriate and expected for a variable or field. Data types allow you to specify expectations and requirements about your data during when defining a field or container. The programming language or application can then validate that introduced data meets the necessary criteria. Data types are also help determine the available operations that can be performed on a piece of data.

A database is a structure used to organize, structure, and store data. Databases are often managed by a database management system which provides an interface to manipulate and interact with the database and the data it manages. Databases can be highly structured or allow more flexible data storage patterns and can store many different types of data in a way that allows for querying, recalling, and combining data at the time of retrieval.

Database abstraction layer
A database abstraction layer is a programming interface that attempts to abstract differences between underlying database technologies to provide a unified experience or interface to the application layer. Database abstraction layers are often helpful for developers because they help to normalize the implementation differences between various offerings and can stay stable even as the underlying technology evolves. However, there are some challenges as well, such as leaking abstractions, masking implementation-specific features or optimizations from the user, and creating a dependency that can be difficult to dislodge.

Database administrator (DBA)
A database administrator, or DBA, is a role responsible for configuring, managing, and optimizing database systems and the related ecosystem of software and hardware. Some responsibilities they may be involved with include architecture planning, configuration, schema and change management, migrations, replication and load balancing, sharding, security considerations, managing backup strategies, and more. Database administrators are typically expected to have expertise in database design and theory and be able to help organizations make decisions about database technology selection and implementation. In many modern organizations, the responsibilities traditionally held by DBAs are now distributed between various members of the development and operations teams or have been offloaded to external providers to simplify some of the infrastructure management portions of the job.

Database engine
A database engine is the piece of a database management system responsible for defining how data is stored and retrieved, as well as the actions supported for interacting with the system and data. Some database management systems support multiple database engines that offer different features and designs, while other systems only support a single database engine that has been designed to align with the goals of the software.

Database management system (DBMS)
A database management system, often called a DBMS or even just a "database", is an application responsible for organizing and managing data. DBMSs can follow many different paradigms and prioritize certain goals. Generally, at the very least, they are responsible for persisting data, organizing and categorizing data, and ingesting, manipulating, and querying data. Most often, DBMSs offer a client / server model where the server is responsible for controlling and managing the data while clients, libraries, or APIs can be used to interact with the server to add or query data, change data structures, or manage other aspects of the system.

Database model
A database model is the overall strategy used by a database management system for storing, organizing, and providing access to data. There are many different database models available, but the relational model, which uses highly structured tables to store data in a specific format, is perhaps the most common type. Other types of databases include document databases, wide-column databases, hierarchical databases, key-value stores, and more. Some database systems are designed to be "multi-model", meaning they support databases with different types of models running within the same system.

Database proxy
A database proxy is a software component responsible for managing connections between database clients and database servers. Database proxies are used for a number of reasons including organizing access to a limited number of connections, allowing transparent scaling of the database layer, and redirecting traffic for deployments and similar scenarios. Database proxies are usually designed to be transparent for applications, meaning that the applications can connect to the proxy as if they were connecting directly to the backend database.

A dataset, sometimes spelled data set, is a single collection of data. Typically, this represents a chunk of related data applicable to a certain task, application, or area of concern. Typically, datasets are a combination of the data itself as well as the structure and context necessary to interpret it. They often consist of a combination of quantitative and qualitative values that can act as the raw data for further analysis and interpretation.

Denormalization is a process where the data and structure within a database is "denormalized" or taken out of a normalized state. This can happen accidentally if a data structure that is intended to be normalized is ill defined or mismanaged. However, it is often also performed intentionally in certain scenarios. Denormalization tends to allow faster access to data by storing values redundantly in different places. The drawback of this is that write performance suffers and there is a possibility that data can get out of sync since multiple locations are used to represent the same data.

Dirty read
A dirty read is a specific type of anomaly that can occur where one transaction can read data that hasn't been committed by another transaction. If the second transaction is rolled back instead of committed, the first transaction will be using a value that doesn't reflect the actual state of the database. Dirty reads are possible at certain isolation levels for transactions and represent a risk that can lead to inconsistency when manipulating data in parallel.

Distributed database
A distributed database is a database system that spans multiple physical systems. Data is spread across a number of machines for the sake of performance or availability. While distributed systems can help scale a database to handle more load, they also represent a significant increase in complexity that can lead to consistency and partition challenges as well as certain negative performance impacts like an increase in data writes in some cases.

In the context of document databases, a document is considered a container for information representing a single record or object containing related descriptive data. Documents can have a flexible structure that does not have to match the other documents on the system and can often be nested. Documents are typically represented in a data serialization format like JSON or YAML that can organize the document with labels and metadata.

Document database
A document database is a database model that represents items in individual objects called documents. While documents can be grouped together for organization, they don't have to share the same structure and can be designed to uniquely capture the data required to describe the item in question. Document databases typically don't support robust join operations to link different documents together, but are often praised for their flexibility and quick time-to-productivity due to their flexibility and ease in representing programmatic data structures.

Durability is a quality of data that signifies that it has been captured on persistent storage that will survive in the event of a program crash. Typically, this means flushing the data to a non-volatile storage medium like a hard drive that doesn't require electricity to maintain state.

Encoding is a system that translates between a character system that can represents the components used in written language and a digital representation that the computer can store and operate on. Different encoding systems have been developed with a wide variety of character ranges. Some are targeted at specific languages or language families (like ASCII) while others attempt to provide representation for much larger character sets appropriate for different many languages (like the UTF unicode varieties).

Encrypted transport
Encrypted transport is any type of communication process that encrypts its messages prior to sending them to the recipient. Transport encryption is necessary to ensure privacy (prevent others from seeing sensitive information) as well as avoid tampering (making manipulation of the data obvious). Many different encrypted transport systems can used when deploying databases, including TLS/SSL encryption, VPNs, and private networks.

Ephemerality is a characteristic that indicates that a piece of data or circumstance is not permanent. In many ways, it is the opposite of durability. In databases, certain items, like data you wish to persist, should not be ephemeral. However, other components, like a secret key used to encrypt a connection between a database and client, can benefit from being ephemeral by preventing key leakage from effecting future or past sessions.

Ephemeral storage
Ephemeral storage, also sometimes called volatile or non-durable storage, is any storage medium that persists for a short time, often associated with certain conditions. For instance, in applications, data being stored in memory will only survive while the process is running. Similarly, data stored to a temporary directory is only available until the system reboots. Often, ephemeral storage is useful for temporary data or as a holding area before data can be stored on a more permanent medium.

Eventual consistency
Eventual consistency is a description of a consistency / availability strategy implemented by certain distributed computing or database systems. The CAP theorem of distributed systems states that systems must choose whether prioritize availability or data consistency in the face of a network partition. Eventual consistent systems make the choice to favor availability by continuing to serve requests even if the server's peers are not available to confirm operations. Eventually, when the partition is resolved, a consistency routine will run to decide on the most correct state of any inconsistent data, but there will be a time where the data on different servers are not in agreement.

In the context of caches, eviction is a process where a piece of data is removed from a cache. This can happen because the current value has been invalidated by an operation or it can occur automatically as a result of policies designed to remove the data that is the oldest or least used.

Expand and contract pattern
The expand and contract pattern is a strategy for introducing new changes to a database schema without affecting existing applications. It works by introducing changes in carefully controlled stages by first adding new or changed structures alongside existing structures and then expanding the application logic to use both structures simultaneously. Eventually, after testing, the application can stop writing to original structure and it can be removed.

Extract-transform-load (ETL)
Extract, transform, and load, often abbreviated as ETL, is a process of copying and processing data from a data source to a managed system. First the data is extracted from its current system to make it accessible to the destination system. Next, the data is manipulated and modified to match the requirements and format of the new system. Finally, the reconstructed data is loaded into the new system.

Feature flags
A feature flag, or a feature toggle, is a programming strategy that involves gating functionality behind an external switch or control. The switch is typically first set to indicate that the feature should not be active. When the organization is ready, they can activate the switch and the program will start using its new functionality. This allows new features to be deployed without immediately activating them. It decouples the deployment of new software from the release of the software, offering greater control over how a change is introduced and for greater testing in a production environment.