- Natural or Surrogate Keys
- A New Database
- Which Datatype Should Be Used for Surrogate Keys?
- Pros and Cons of GUIDs as Primary Keys
- The Tests
- The Results
- A Few Tweaks
- Other Approaches
- Conclusion
Pros and Cons of GUIDs as Primary Keys
Let's say that we plan to use GUIDs for the primary keys. What are the advantages and disadvantages when compared to INTEGERs? We should always start with the positive, so here goes:
We get a more object IDlike datatype because each primary key typically will be unique for the complete database. This gives good effects, for example, when designing a custom solution for pessimistic locking (but that's another long story).
@@IDENTITY has been a source of problems when INSERT triggers have INSERTed rows into other tables with IDENTITYs. (You can use SCOPE_IDENTITY() in SQL Server 2000 to avoid this problem.)
You can move the creation processing of the values from the data tier to the business or consumer tiers. For example, the consumer can create complete structures of orders and details that can be sent all the way through the business tier to the data tier in one go, which can reduce the number of round trips. A more basic way of saying this is that you can get the ID value before you use it, not just afterward, as is the case with IDENTITYs.
The span of values for GUIDs is enough for most scenarios. INTEGERs are not big enough for some scenarios. (By the way, with INTEGERs, you don't have to start the seed on 1. You can just as easily start it on the smallest negative INTEGER and thereby double the span of values.)
You can prepare for merge replication. When merge replication is to be used, each table to be replicated must have a GUID column (called ROWGUID in SQL Server). You also don't have to give each server in a merge replication scenario a unique set of values to use for the keys. The risk of duplicate GUIDs is so small that this shouldn't be a problem.
IDENTITY can't be used for tables in distributed partitioned views.
A GUID will not mean anything to the users. An INTEGER can be read, understood, and remembered. It's common to see INTEGER keys in the UI, and that might cause problems later.
You know what a nightmare it can be when doing a manual merge between two tables with INTEGERs as primary keys. Not only do you have to create a new sequence for the union of both tables, but you also must change all the foreign keys for all the dependent tables. If GUIDs are used, this sort of problem does not arise.
NOTE
The algorithm for generating GUIDs doesn't use the MAC address of the network cards in recent Windows versions anymore. Instead, it just creates a random value. In theory, this presents a risk of getting duplicate GUIDs, but, in practice, it shouldn't be a problem.
The reason for excluding the use of the MAC address is that it not only couples users to GUIDs, but some network cards don't use unique MAC addresses.
I will discuss this further when we reach the section about the COMBs. Stay tuned.
NOTE
In the case of .NET and ADO.NET, it's a big advantage to know the ID value at the time rows are added to the DataSet. Otherwise, for example, autogenerated INSERTs can't be used for master-detail relationships within the DataSet.
I'm not a salesman, so here are a few drawbacks:
There is a huge overhead in size. As I said, a GUID is four times larger than an INT. This is very important when it comes to indexes.
Portability can be a problem because not every database has GUID as a native datatype.
You can't use GUIDs with aggregates such as MIN. This is unfortunate when it comes to using WHILE plus an aggregate function, such as MIN or MAX for looping instead of a CURSOR. (Well, a workaround is to use a CONVERT or a CAST to CHAR(36) or BINARY(16) of the GUID before it's used in the aggregate, but watch out for performance problems with large tables.)
It's detrimental to developer-friendliness because it's really hard to edit GUIDs manually, to remember them when you browse the database, and so on. You have to get used to navigating the database by using SQL scripts, but this is not usually that much of a problem.
So, the overhead related to size is the main problem. Most developers seem to have an idea of how big the overhead is, but few have tested it. It's time that this was changed, so let's take a look at a few performance tests and get an idea of how large the overhead is.