Database tip: Understanding joins and set theory

Sybase expert Mich Talebzadeh provides an excerpt from his book explaining set theory and different types of joins.

Mich Talebzadeh
Site expert Mich Talebzadeh answers your questions about Sybase, the first enterprise DBMS on Linux. The following tip -- which answers a user's question about different types of joins -- is excerpted, with permission of the author, from Talebzadeh's book Sybase Transact SQL Guidelines Best Practices (ISBN: 0975969307), co-authored by Ryan Thomas Putnam. -- Editor

To understand joins, one needs to understand the set theory. Here is some explanation:

Set theory is a branch of mathematics which defines sets (thought of a grouping of things) and operations on them. In a relational database, a dataset is composed of the data in a table and its columns.

Example:

CREATE
 TABLE test_table(
       column01 INT NOT NULL,
       column02 INT NOT NULL,
 CONSTRAINT ix01_test_table
    PRIMARY KEY NONCLUSTERED ( column01 )
)
go

INSERT
  INTO test_table(
       column01,
       column02
)
SELECT 1,
       1
go

The test_table dataset:

SELECT column01,
       column02
  FROM test_table
go
1         1

Let's give an overview of some basic set properties. Let's define three sets A, B, and C as follows:

JOINT SETS

Joint sets are sets that share elements. Joint sets can be visualized as follows:

DISJOINT SETS

Disjoint sets are sets that have no shared elements. Disjoint sets can be visualized as follows:

The following SET operations are discussed in more detail in the section on Joins. First, though, we will review how basic set operations are carried out for datasets.

SET Unions

A set union of datasets is the dataset which arises from the combining of all datasets. In addition, a set union can include the overlaps once or twice from a pair of datasets which are joint. These types of unions are typically carried out by outer joins, a union, or a union all.

SET Intersections

A set intersection of datasets is their overlap: it consists of those elements that exist in all of the datasets. Or simply put, an element which exists in one dataset must exist in the other dataset. Intersections are typically carried out by equijoins (aka natural joins or even inner joins), an IN, and/or an EXISTS.

SET Subtractions

A set subtraction is the removal of one dataset from a second dataset. Subtractions are typically carried out by a NOT IN and/or a NOT EXISTS.

SET Multiplications

A notion of set multiplication for datasets is the cartesian product. The cartesian product of a given collection of datasets "explodes" to a dataset consisting of all possible combinations of elements from all datasets. They are typically creating by forming joins with no conditions.

Joins

A join is an operation which combines elements of two or more datasets by specifying which combinations (within the cartesian product) are valid. In Sybase Adaptive Server Enterprise (ASE), we do this by specifying conditions on the columns or fields of the datasets row by row, and linking conditions between datasets row by row. A join then displays the combined results as a new dataset.

Note: In ASE, the tables or views specified in the join can be in the same database or in different databases.

Example:

SET ROWCOUNT 1

SELECT title, 
       pub_name, 
       advance 
  FROM titles, 
       publishers 
 WHERE titles.pub_id = publishers.pub_id 

SET ROWCOUNT 0
go
You Can Combat Computer Stress!     New Age Books       10,125.00 

A join may be thought of as the product of two or more sets, pivoted about some key or column and a comparison operation. The results that are returned or that are linked are those that match the search criteria specified. The results are those rows or records that are duplicated or joint on the join criteria.

You can embed a join in a select, update, insert, delete, or subquery. Other search conditions and clauses may follow the join conditions.

Joins use the following syntax:

start of SELECT, UPDATE, INSERT, DELETE, or subquery 
  FROM {table_list | view_list} 
 WHERE [NOT] 
    [table_name. | view_name.]column_name 
                   join_operator 
    [table_name. | view_name.]column_name
[{AND | OR} [NOT]
    [table_name.|view_name.]column_name
                  join_operator 
    [table_name.|view_name.]column_name]...
End of SELECT, UPDATE, INSERT, DELETE, or subquery

Joins and the relational model

The join operation is the hallmark of the relational model of database management. More than any other feature, the join distinguishes relational database management systems from other types of database management systems.

More on this topic

Continue reading the second part of this excerpt, "All about subqueries"

Ask Mich your Sybase question

In a relational database management system, relationships among data values are left unstated in the definition of a database. They become explicit when the data is manipulated when you query the database, not when you create it. You can ask any question that comes to mind about the data stored in the database, regardless of what was intended when the database was created.

According to the rules of good database design, called normalization rules, each table should describe one kind of entity – a person, place, event, or thing. That is why, when you want to compare information about two or more kinds of entities, you need the join operation. Relationships among data stored in different tables are discovered by joining them.

A corollary of this rule is that the join operation gives you unlimited flexibility in adding new kinds of data to your database. You can always create a new table that contains data about a different kind of entity. If the new table has a field with values similar to those in some field of an existing table or tables, it can be linked to those other tables by joining.

How joins are processed

Conceptually speaking, the first step in processing a join is to form the Cartesian product of the tables with all the possible combinations of the rows from each of the tables. The number of rows in a Cartesian product of two tables is equal to the number of rows in the first table, multiplied by the number of rows in the second table.

To illustrate joins, we'll consider the sample database included with ASE. The Cartesian product of the authors table and the publishers table is 69 (23 authors multiplied by 3 publishers). You can have a look at a Cartesian product with any query that includes columns from more than one table in the SELECT list, more than one table in the FROM clause, and no WHERE clause. For example, if you omit the WHERE clause on a join between the authors and publishers table, ASE combines each of the 23 authors with each of the 3 publishers, and returns all 69 rows.

This Cartesian product does not contain any particularly useful information. In fact, it is downright misleading, because it implies that every author in the database has a relationship with every publisher in the database, which is not true at all.

That is why you must include a WHERE clause in the join, which specifies the columns to be matched and the basis on which to match them. It may also include other restrictions. Once ASE forms the Cartesian product, it eliminates the rows that do not satisfy the join by using the conditions in the WHERE clause.

Equijoins and natural joins

Joins based on equality (=) are called equijoins. Equijoins compare the values in the columns being joined for equality and then includes all the columns in the tables being joined in the results.

Example:

 
SELECT * 
  FROM authors, 
       publishers 
WHERE authors.city = publishers.city 

In the results of that statement, the city column appears twice. By definition, the results of an equijoin contain two identical columns. Because there is usually no point in repeating the same information, one of these columns can be eliminated by restating the query. The result is called a natural join. The query that results in the natural join of publishers and authors on the city column is:

SELECT publishers.pub_id, 
       publishers.pub_name,
       publishers.state,
       authors.* 
  FROM publishers, 
       authors 
 WHERE publishers.city = authors.city 

The column publishers.city does not appear in the results.

Another example of a natural join is:

SELECT au_fname, 
       au_lname, 
       pub_name 
  FROM authors, 
       publishers 
 WHERE authors.city = publishers.city 

You can use more than one join operator to join more than two tables or to join more than two pairs of columns. These "join expressions" are usually connected with "AND", although "OR" is also legal.

The following are two examples of joins connected by "AND". The first lists information about books (type of book, author, and title), ordered by book type. Books with more than one author have multiple listings, one for each author.

SELECT type, 
       au_lname, 
       au_fname, 
       title 
  FROM authors, 
       titles, 
       titleauthor 
 WHERE authors.au_id = titleauthor.au_id 
   AND titles.title_id = titleauthor.title_id 
 ORDER 
    BY type 

The second finds the names of authors and publishers that are located in the same city and state:

SELECT au_fname, 
       au_lname, 
       pub_name 
  FROM authors, 
       publishers 
 WHERE authors.city = publishers.city 
   AND authors.state = publishers.state 

Joins with additional conditions

The "where" clause of a join query can include selection criteria as well as the join condition. For example, to retrieve the names and publishers of all the books for which advances of more than $7500 were paid (that is $ as in currency), the statement is:

SELECT title, 
       pub_name, 
       advance 
  FROM titles, 
       publishers 
 WHERE titles.pub_id = publishers.pub_id 
   AND advance > $7500
go
You Can Combat Computer Stress!   New Age Books           10,125.00 
The Gourmet Microwave             Binnet & Hardley        15,000.00 
Secrets of Silicon Valley         Algodata Infosystems     8,000.00 
Sushi, Anyone?                    Binnet & Hardley         8,000.00 
 

The columns being joined (pub_id from titles and publishers) do not need to appear in the select list and, therefore, do not show up in the results.

You can include as many selection criteria as you want in a join statement. The order of the selection criteria and the join condition is not important.

Much of the previous material on joins was taken from Chapter 4 of the ASE 12.5 Transact-SQL User's Guide.

Troubleshooting joins

Joining N tables will result in N-1 Join clauses

  • Join columns should have matching data types.
  • Join columns should have clustered indexes or non-clustered indexes on them with Index Page Cluster Ratio (IPCR) > 0.9. This value can be obtained through the function DERIVED_STAT('table_name', 'index_name','ipcr')
  • Provide a valid SARG on the outer or driving table, otherwise expect a table scan!
  • Join queries 8 tables at the time. Use #tables for intermediate results. This makes queries more consistent and your code more readable.

Mich Talebzadeh is a database expert with large DBA and architecture experience with special interests in Sybase and Oracle. He specializes in creating database architectures for large global trading systems involving heterogeneous databases. Mich is based in London and serves as Chairman of the Sybase Future Forum. As resident Sybase expert on SearchEnterpriseLinux.com, he can answer any of your Sybase questions.
This was first published in August 2005

Dig deeper on Open source databases

0 comments

Oldest 

Forgot Password?

No problem! Submit your e-mail address below. We'll send you an email containing your password.

Your password has been sent to:

-ADS BY GOOGLE

SearchDataCenter

SearchServerVirtualization

SearchCloudComputing

SearchEnterpriseDesktop

Close