There is a school of thought that one programming language could be used for all purposes in developing a solution. For example, if I am a java programmer, I must be able to do not only my business logic and presentation in Java but also my persistence in Java as well. Good goal, certainly, to have. But the practicality has a wicked sense of humor and shows up at the table as SQL.
With sleeves drawn up, we, the java programmers, start to tackle the first citizen of the SQL called select. Very quickly we meet these quirky things called joins. Our original suspicion that we should be sticking to java is quickly reinforced as the syntax and oddities of joins put us on the defense and suddenly seem to show Java in a shining light. But the task at hand is unwavering and your boss is quite certain that it can't be all that hard to write a few SQLs.
The story of one such journey from a simple select to a practical, every-day-select is presented here in a set of 11 principles. Let my reader be in no puzzlement that the total number of notes came out to be 11 and there is no significance to this number. I am sure there would be more if I were to spend some more time on that endeavor.
I am hoping this article will be a good companion to non-relational programming languages such as Java, C#, and Perl. The provided examples use SQL Server syntax but should be applicable to all databases.
Database tables got to be some of the simplest yet practical data structures that computer science invented for the general public. They are no different from their paper cousins where young and old keep track of their list of things. When they are viewed as such, databases are simple to understand. This is an important insight in learning. When you see things for what they really are (As the saying goes: Keeping it real) the element of fear is taken out of the problem and then we begin to learn.
I digress, but I want to tell you an interesting insight about this idea of knowing things for what they are. During the last passage rites of a person in my part of the world the dear ones of the departed carry with them to the site a pair of mud pots in a wooden sling to carry offerings. Once the rites are complete these things get abandoned at the site at times. There is a large sense of fear attached to these two objects when you see them in any context, auspicious or not, as death is equated to the pair. Discounting this, a famous balladeer pronounces "Listen, the sling is just a piece of wood, and the pots of mud, if you are able to see perceive the truth".
Databases are quite easy to understand if you see them as primarily a place to keep your data in tabulated structures. The fact that there is a whole relational calculus behind it is not as relevant to the immediate learning process.
When information is kept in more than one table, and when those two tables express an idea about a common item, it is a common practice to join the two tables to retrieve information about that item. The join is necessary because we need information about that item, but the information is kept in two tables.
Following that idea, you can join the following six tables: t1,t2,t3,t4,t5,t6
Select *
From t1
,t2
,t3
,t4
,t5
,t6
Where (any clause involving t1 through t6)
ex: t1.id=10 and t2.id=10 and t3.id=10 and t4.id=10 and t5.id=10 and t6.id=10
This is called an equi-join. There is equal weight for every table. If the joined tables don't have a matching row those rows from both tables will be eliminated. When the rows match, the rows from both tables are combined and returned. This is the simplest of joins. There is no weird ansi syntax where you are indicating if this is an outer join or left outer join or right outer join etc.
The above statement is saying that there is an item called 10 in all of the tables. Just retrieve all of the information pertaining to the item 10 which exists in all of the tables. But the behavior of this default join is somewhat unintuitive. If t6 doesn't' have an item called 10, then the join won't give out any information at all from the other tables as well. You have to resort to something called an outer join where you tell the join that, after all, "t6" is not that important.
The six tables: t1,t2,t3,t4,t5,t6
Assume:
T1
t2
t3
T4
T5
T6
Where t2 and t3 are independent of each other but both are dependent on T1 and will give T1 more weight. Meaning if there is a row in T1, give it out irrespective of T2 and T3. This is represented in SQL Server as
Select *
From t1
left outer join t2 on t2.f1=t1.f1 and t2.f2 = t1.f2
left outer join t3 on t3.f3=t1.f3
,t4
,t5
,t6
Where (any clause involving t1 through t6)
This piece of code is demonstrative of a couple of things. You can get the basics of the syntax for left outer joins. For example "t1" is left outer joined with "t2" and "t3". You can also see how you can combine some left outer joins and some regular joins in the same select statement. For example t1 (having already outer joined) is equi joined with t4, t5, and t6.
The previous example treated "t2" and "t3" at par and individually outer joined to "t1". What if you want to narrow the rows in "t2" first based on "t3" and then outer join the result to "t1".
The six tables: t1,t2,t3,t4,t5,t6
Assume:
T1
t2
t3
T4
T5
T6
That would necessitate that "t2" and "t3" are equi joined first. The result is then outer joined to T1. Subsequently T1 is equi joined to T4, T5, and T6. This can be done in SQL Server as follows:
Select *
From t1
left outer join
(select *
from t2,t3
where (any where clause involving t1, t2, t3)
) t2_t3_composite_table on t1_t2_composite_table.f1 = t1.f1
,t4
,t5
,t6
Where (any clause involving t1, t2_te_composite_table, t4, t5, t6)
Let us think about the above statement for a second. If t1 and t2 are to be outer joined, we can say that rows from t1 will be displayed irrespective of t2. To this if we are to equi-join t3, (as t3 is known to depend on t2), then t1 will not display any rows even though we mentioned that t2 is not important. This is because in the next line we said that t3 is important. So the crux is how can "t3" be important when "t3" depends on "t2" and "t2" is not important. This situation led us to the inner select where we hashed out "t2" and "t3" dependencies first and then joined the results to "t1".
The issue with the above is that the syntax is a bit clumsy as you now have to invent an intermediate table and an intermediate select not to mention the side affects on the optimizers (may be positive or negative). But on some thought we can avoid the inner select by telling "t1" that both "t2" and "t3" are unimportant. This is because if "t2" is unimportant then "t3" is automatically unimportant as it is dependent on "t2".
This is similar to an algebraic operation such as -(a+b) equals -a and -b. So we can rewrite the above SQL as follows
Select *
From t1
left outer join t2 on t2.f1=t1.f1 and t2.f2 = t1.f2
left outer join t3 on t3.f3=t1.f3
,t4
,t5
,t6
Where (any clause involving t1 through t6)
Notice that this is similar to the solution of principle 2.
The above principle of "propagating unimportance" will lead us to this principle here as well. This principle guards against some obvious errors in select statements that involve outer joins. Once you outer join a table, then the field comparisons on these fields with other tables need to be outer joined as well.
Otherwise, Your intention of outer joining (or less emphasizing) the table will be lost. Because when that table returns a null, the subsequent equi-joins on that field will fail and will result in removing the rows from your primary table.
Following principle 4 these inner select joins can be accomplished by listing those tables directly outside. As mentioned, as -(a+b), becomes -a-b, you can flatten the join structure. This is based on a quick intuitive conclusion. I might be wrong but my suspicion is that I may not be that far off. What I am not so sure is what are the implications to optimization. i.e., Whether an inner select or an outer select is better from a performance perspective.
For example, you can use a select to retrieve orders. You can use a select to retrieve parts. But you rarely go to retrieve orders, parts, and invoices at the same time. Even when you do there is always a primary entity you are focusing on. When you retrieve orders, you may retrieve parts that belong to each order. But still the focus is an order. Similarly when you retrieve parts, you have an order id associated with each part, but still the focus is a part in that case.
This entity of focus in a select statement could be called the primary table.
You can do joins on other tables to provide additional columns. Typically these other tables will have a 1 to 1 relationship with the primary table. When they have a 1 to many relationship with the primary entity you may decide to use "derived 1 to 1" relationships. For example, an order has many parts. That is a 1 to many relationship. When you are retrieving a list of orders it doesn't make sense to join with a parts table. But when you do you may want to know some aggregate properties of parts belonging to an order. For example the following are "derived 1 to 1 relationships":
1. The total number of parts in a given order 2. The total cost of parts in a given order 3. The 1 and 2 are examples of aggregate functions 4. The first part in an order 5. The last part in an order
In the following example a function is used to add a derived additional property to the primary entity
Select
col1 as col1,
my_function(col1,col2,col3) as my_derived_column
From
table1
Inside the function you can use complex logic, including many selects, joins etc to arrive at a value which gets returned. Function are quite useful when the total number of rows returned are in 10's of rows as opposed to 100s of rows. Because a function gets called for every row returned. When your result set is large you are better of doing joins where you can. When your result set is small, you may be able to minimize joins using functions.
Select
pt.column1,
case when pt.conditional_column is null
then st1.column1
else st2.column1
end as column2,
pt.column2
From
primary_table pt,
secondary_table1 st1,
secondary_table2 st2
Where 1=1
and (additional where clause)
This piece of code demonstrates a couple of things. The syntax for "case" in sql server. It also demonstrates a "derived" column could come from two tables based on a condition. This is useful when you manage information in two tables based on a certain attribute. This may not be a good data model example, but it will happen as your database progresses through its lifetime.
Select
pt.column1,
(select col1 from secondary_table where col2=pt.column2) as derived_column2
pt.column3
From
primary_table pt
Where 1=1
and (additional where clause)
Following the functional approach for columns (where applicable), you can also use a simple sub-select to retrieve column. The draw back is that you can only return one column for this inner select. Technically there is no reason why you can't imagine such a thing. For example one can do this in an appserver quite easily. Where possible such an aggregation will save multiple inner select calls for each outer row.
Instead you can also join the additional tables to get more columns. This option is always there. But the whole intent of using functions or sub-selects is to minimize the pressures on joins when the rows retrieved are small.
1. SQLServer Code Samples Knowledge Folder (kf: /akc/satya/cs-sqlserver). This knowledge folder documents some helpful information on SQL server and databases.
2. A list of all my knowledge folders (kf: /akc/satya/public folders).