SQL Data Science Foundations
About Lesson

Introduction

Data is central to how many of today’s applications and websites function. Comments on a viral video, changing scores in a multiplayer game, and the items you left in a shopping cart on your favorite online store are all bits of information stored somewhere in a database.

This Lesson serves as an introduction to numerous database topics. It provides a brief overview of what databases are and highlights a few concepts central to their design and function.

Relational Databases

Since the 1970s, most DBMSs have been designed around the relational model. The most fundamental elements in the relational model are relations, which users and modern relational DBMSs (RDBMSs or relational databases) recognize as tables. A relation is a set of tuples, or rows in a table, with each tuple sharing a set of attributes, or columns:

Diagram example of how relations, tuples, and attributes relate to one another

You can think of each tuple as a unique instance of whatever type of people, objects, events, or associations the table holds. These instances might be things like employees at a company, sales from an online business, or test results in a medical lab. For example, in a table that holds employee records of teachers at a school, the tuples might have attributes like namesubjectsstart_date, and so on.

In the relational model, each table contains at least one column that can be used to uniquely identify each row, called a primary key. Building on the example of a table storing employee records of teachers at a school, the database administrator could create a primary key column named employee_ID whose values automatically increment. This would allow the DBMS to keep track of each record and return them on an ad hoc basis. In turn, it would mean that the records have no defined logical order, and users have the ability to return their data in whatever order or through whatever filters they wish.

If you have two tables that you’d like to associate with one another, one way you can do so is with a foreign key. A foreign key is essentially a copy of one table’s (the “parent” table) primary key inserted into a column in another table (the “child”). The following example highlights the relationship between two tables, one used to record information about employees at a company and another used to track the company’s sales. In this example, the primary key of the EMPLOYEES table is used as the foreign key of the SALES table:Diagram example of how the EMPLOYEE table's primary key acts as the SALES table's foreign key

The relational model’s structural elements help to keep data stored in an organized way, but storing data is only useful if you can retrieve it. To retrieve information from an RDBMS, you can issue a query, or a structured request for a set of information. Most relational databases use a language called Structured Query Language — better known as SQL and informally pronounced like “sequel” — to manage and query data. SQL allows you to filter and manipulate query results with a variety of clauses, predicates, and expressions, giving you fine control over what data will appear in the result set.

There are many open-source RDBMSs available today, including the following:

Non-relational Databases

Today, most applications still use the relation model to store and organize data. However, the relation model cannot meet the needs of every application. For example, it can be difficult to scale relational databases horizontally, and though they’re ideal for storing structured data, they’re less useful for storing unstructured data.

These and other limitations of the relational model have led to the development of alternatives. Collectively, these database models are often referred to as non-relational databases. Because these alternative models typically don’t implement SQL for defining or querying data, they are also sometimes referred to as NoSQL databases. This also means that many NoSQL databases implement a unique syntax to insert and retrieve data.

It can be helpful to think of “NoSQL” and “non-relational” as broad umbrella terms, as there are many database models that are labeled as NoSQL, with significant differences between them. The remainder of this section highlights a few of the more commonly used non-relational database models:

Key-Value Databases

Key-value databases, also known as key-value stores, work by storing and managing associative arrays. An associative array, also known as a dictionary or hash table, consists of a collection of key-value pairs in which a key serves as a unique identifier to retrieve an associated value. Values can be anything from simple objects, like integers or strings, to more complex objects, like JSON structures.

Redis is an example of a popular, open-source key-value store.

Document-Oriented Databases

Document-oriented databases, or document stores, are NoSQL databases that store data in the form of documents. Document stores are a type of key-value store: each document has a unique identifier — its key — and the document itself serves as the value. The difference between these two models is that, in a key-value database, the data is treated as opaque and the database doesn’t know or care about the data held within it; it’s up to the application to understand what data is stored. In a document store, however, each document contains some kind of metadata that provides a degree of structure to the data. Document stores often come with an API or query language that allows users to retrieve documents based on the metadata they contain. They also allow for complex data structures, as you can nest documents within other documents.

MongoDB is a widely used document database. The documents you store in a MongoDB database are written in BSON, which is a binary form of JSON.

Columnar Databases

Columnar databases, sometimes called column-oriented databases, are database systems that store data in columns. This may seem similar to traditional relational databases, but rather than grouping columns together into tables, each column is stored in a separate file or region in the system’s storage. The data stored in a columnar database appears in record order, meaning that the first entry in one column is related to the first entry in other columns. This design allows queries to only read the columns they need, rather than having to read every row in a table and discard unneeded data after it’s been stored in memory.

Apache Cassandra is a widely used open-source column store.

Working with Databases

Most database management systems come installed with a command line tool that allows you to interact with the database installation. Examples include the mysql command line client for MySQL, psql for PostgreSQL, and the MongoDB Shell. There are also third-party command line clients available for many DBMSs. One such example is Redli, which serves as an alternative to Redis’s default redis-cli tool and comes with certain added features.

However, managing data through a command line interface may not be intuitive for every user, which is why there are graphical database administration tools available for many open-source DBMSs. Some, like phpMyAdmin or pgAdmin, are browser-based while others, like MySQL Workbench or MongoDB Compass, are meant to connect to a remote database from a local machine.

As an application continues to operate and grow, the data held within the database will require more and more storage, to the point where it could slow down the entire application. There are several common strategies for dealing with issues like this, the two most common of which are replication and sharding.

Replication is the practice of synchronizing data across multiple separate databases. When working with databases, it’s often useful to have multiple copies of your data. This provides redundancy in case one of the database servers fails and can improve a database’s availability and scalability, as well as reduce read latencies. Many DBMSs include replication as a built-in feature, including MongoDB and MySQL. Some like MySQL even provide multiple replication methods for greater flexibility.

Database sharding is the process of splitting up data records that would normally be held in the same table or collection and distributing them across multiple machines, known as shards. Sharding is especially useful in cases where you’re working with large amounts of data, as it allows you to scale your base horizontally by adding more machines that can function as new shards.

Join the conversation
brian mwirigi 6 months ago
great introduction
Reply
0% Complete