Designing Databases: Picking The Right Data Types

« »

When I started writing this blog post, I had titled it “Tips for Designing Databases” and I planned to talk about various database design techniques. However, as I did more and more research, it dawned on me that one of the most crucial, and most overlooked, components of database development, is the selection of data types for columns.

Much of the information presented in this article was taken from presentations by Jay Pipes and a talk by Ronald Bradford. The talks are The Top 20 Design Tips For MySQL Enterprise Data Architects, Join-fu: The Art of SQL Tuning and SQL Query Tuning: The Legend Of Drunken Query Master.

MySQL supports a large number of data types (with Postgres supporting even more). For example, MySQL supports some 10 different numeric data types (INTEGER, TINYINT, SMALLINT, MEDIUMINT, BIGINT, DECIMAL, NUMERIC, FLOAT, REAL, and DOUBLE PRECISION), meaning that database designers need to know and understand how to use each one of them properly. Using them improperly adds stress to the database and generally reflects bad database design.

Since it would be impossible to discuss every data type in a blog post, I will instead discuss some of the most common MySQL (and this applies to other database platforms as well) mistakes, as highlighted by the presentations and blog posts I will cite.

Understanding how disk I/O affects databases
Databases are typically stored on disk (with the exception of some, like MEMORY databases, which are stored in memory). This means that in order for the database to fetch information for you, it must read that information off the disk and turn it into a results set that you can use. Disk I/O is extremely slow, especially in comparison to other forms of data storage (like memory).

When your database grows to be large, the read time begins to take longer and longer. This is a natural occurrence as the engine must read over more and more of the disk in order to find the information you have requested. Poorly designed databases exacerbate this problem by allocating more space on the disk than they need; this means that the database occupies space on the disk that is being used inefficiently.

Picking the right data types can help by ensuring that the data we are storing makes the database as small as possible. We do this by selecting only the data types we need, rather than choosing data types willy-nilly. This helps reduce the size of our rows, and by extension, our database, making reads and writes faster and more efficient.

Picking the right numeric type
The type of integer you select affects the amount of space that integer occupies on the disk – regardless of the value of the number you actually store in it.

For example, a BIGINT occupies 8 bytes of space, while a TINYINT occupies 1 byte of space. While the 8 byte integer gives you the ability to store huge numbers, it also means that you must store all eight bytes every time you store a record into that table. If you’re storing numbers like 2,000 or 5,000, you’re wasting lots and lots of bytes. This will inevitably make your reads slower, because the database must read over multiple sectors of the disk.

Also, many people assume that they must pick a larger integer size because the integer size they picked might not allow for enough values. For example, a SMALLINT will allow you to store up to 32,767 – if you leave it signed. Leaving it signed leaves the very last bit as a determination of whether or not the value is a positive or negative value. But if you declare the value to be UNSIGNED, you make that bit available for use, allowing a SMALLINT to store up to 65,535 – twice what it can store as a signed integer. Your primary keys should always be unsigned, especially if they are auto-incremented; MySQL will never assign a negative value as a primary key.

Refer to the documentation when picking an integer type, and use it to determine how big an integer you need.

Storing text in a database
Obviously a database is less useful if we only store numbers in it. MySQL, as well as other databases, allow us to store text values (strings or full text values) as a way of making our databases more useful. However, when doing so we must be cautious.

Pop quiz: what’s the difference between CHAR(4) and VARCHAR(4)? If you said that CHAR(4) will always be 4 bytes while VARCHAR(4) will be a variable number of bytes, you’re right! That’s a central difference between the two.

CHAR will right-pad any value that does not fit the width of the column defined, making all values the same number of bytes, while VARCHAR allows for variable length values. However, this variable nature comes with a cost – VARCHAR stores additional information, meaning that the overall bytes are higher than the CHAR for the same data.

For the most part, CHAR should be used when the length of the data is known and fixed (for example, as a MD5 hash, which is always 32 characters long). VARCHAR should be used when the length of the data is not fixed. You should also be careful to ensure that the VARCHAR is not wider than it has to be; for example, if you allow for usernames up to 12 characters, your VARCHAR should not be 255 characters wide.

Another favorite (and bad) development practice is to use TEXT columns to store large blocks of text (for example, this blog uses TEXT columns to store the contents). TEXT columns are extremely inefficient, however. When using these types of columns, it is a good idea to consider abstracting the TEXT columns to a different table, especially if fast lookup is required on the primary table. And if you are defining indexes on the TEXT columns (for example, for full-text searching), this is doubly important (indexes slow down write time, and the more data you store, the slower writing will be).

Oh, and as for BLOB columns – use the file system for what it was designed to do: store files. If you’re storing them in the database, to paraphrase, “ur doin it rong.” This does, of course, have room for interpretation; however, for the most part, you should use these columns extremely sparingly.

Data types you should avoid
Lots of new developers love the day they discover ENUM and SET – and if their bosses are smart, they teach them never to use them ever again.

ENUM and SET make sense to developers for a number of different reasons: with ENUM you can enforce certain types of data being stored (since you can only store data that’s in the enumerated list). With SET you can store multiple values from a specified list, which makes the SET seem almost like an array.

The problem here is that databases aren’t designed to work like this. If you ever decide you have to add something else to the ENUM or SET declaration, MySQL must rebuild the entire table which, if you have 3 million records, could take you a week or two (hyperbole). This is clearly not optimal.

This is clearly a case where having the programming language do the work is preferable to having the database do it. Sure, it’s easy to validate whether or not the value is allowed by having the database enforce a constraint, but ultimately you will cause yourself more trouble than it’s worth.

Wait, isn’t this micro optimization?
Yes and no.

The ways in which this is similar to micro optimization are that the performance boost most sites will get is negligible. It will take a fair bit of time to make these changes, thus it might qualify as micro optimization.

But where this is clearly not micro optimization is that this is an adoption of a best practice. Choosing the proper data type is a crucial component of good database design, not simply an optimization technique. While this will have major performance benefits in high traffic databases, it is still the given best practice for all databases, regardless of traffic.

Brandon Savage is the author of Mastering Object Oriented PHP and Practical Design Patterns in PHP

Posted on 11/20/2009 at 1:00 am
Categories: Best Practices, Databases, Technology
Tags: , ,

Dominik (@blam4c) wrote at 11/20/2009 3:57 am:

One thing that I don’t understand – yet – is, why there are no “hexdata” or “uuid” fields. For example you mention md5 hashes and say that they are always CHAR(32). Sadly, that’s true, because it’s the only existing way to store them at this moment.

However actually md5 is a sequence of 16 “byte size” hex numbers, which could be stored in a 16 byte long “BSOB” (as in “Binary SMALL Object” :) – or “BIN” or “HEXDATA” or whatever you might want to call it). Same applies to UUIDs (which are also 16 bytes, just having some dashes in the hex representation).

Why is there no data type to put those in?

FractalizeR wrote at 11/20/2009 4:36 am:

>If you ever decide you have to add something else to the ENUM or SET declaration, MySQL must rebuild the entire table which, if you have 3 million records, could take you a week or two (hyperbole). This is clearly not optimal.

This is quite incorrect. If you add new values TO THE END of the ENUM or SET declaration, table rebuild is not needed: http://dev.mysql.com/doc/refman/5.1/en/alter-table.html (see fast table alterations).

Ben wrote at 11/20/2009 5:39 am:

Personally, I don’t see the problem with using enum – if you have a case where a field can be one of two or three values it keeps your data readable without having to cross reference your code in order to understand exactly what is in a particular row.

If you must forbid the use of enums, the best alternative is to normalise the table, and extract the enum field into a seperate table – that way you can add an extra value easily – and it removes dependency between your database and your code.

Regardless, I dissagree with you on the inconvenience of adding an extra enum option at a later stage. It is a single operation, which, even if it takes several minutes, you can make the deployment at a time when the server isn’t busy – if you must deploy to a live database. Surely long term maintainability is more of a consideration than the length of time it will take to perform a single deployment?

Brandon Savage (@brandonsavage) wrote at 11/20/2009 7:16 am:

Ben and FractalizeR, I’m sorry, but when it’s Jay Pipes (who works for MySQL) against you, I’m going to go with him every time. See slide 13 of the Drunken Query Master talk where he talks about having to rebuild the table.

It’s poor schema design, because you’re expecting the database to do what PHP should have done for you.

FractalizeR wrote at 11/20/2009 7:25 am:

Dear Brandon, with all respect, I think blindly follow respected persons is a bad practice. Those slides you mention date back to 2008. And they are correct, in MySQL 5.0, there WAS a problem with ENUM and SET alerations leading to table restructures. Fast ALTER TABLE, as I understand, was implemented in 5.1 (in 5.0 manual there is no such entry, it appeared in 5.1 only).

Ben wrote at 11/20/2009 7:30 am:

I don’t claim to be any kind of MySQL or PHP authority, however, removing dependencies between database and application is a basic design principle – surely normalising the enum into a separate table is a better alternative to – presumably – using PHP constants to achieve the same result?

With regards to rebuilding the table – even if this is the case, it’s a single operation, and not something that will affect the overall performance of the database.

Jon wrote at 11/20/2009 10:44 am:

I think the bigger problem here is, Brandon, you’re trying to tell people how to design DB’s. No offense, but you aren’t a DBA, and DBA’s should be the ones to make those decisions. Not developers.

When developers design tables and databases, they design them to suit their needs for their application. This is very bad philosophy to follow. But if you must create your own tables because you have not a DBA, I would highly recommend that you learn Normalization rules, and get some books on MySQL best practices.

Some of this article clears up some confusion that intermediate developers will encounter. Some of the document makes assumptions based on “well..he said this so it must be true”. I think that unless you fully understand the issue and the causes, you shouldn’t be writing articles as a “matter of fact” when it’s a “matter of hearsay”.

Ryan (@popthestack) wrote at 11/20/2009 11:19 am:

Database-enforced constraints (like enum values) are, quite frankly, safer. When the database itself does the constraining, you (or another programmer) won’t inadvertently screw up data with bad code.

An acceptable compromise, I think, is to enforce enum values in a db layer. If the db layer knows it’s an ‘enum’ and what the accepted values are then no one can mess up the data*.

(Doctrine ORM defaults to simulated enums with varchar columns as not every DBMS supports enums).

* So then you run into possible issues when/if you do any raw SQL on your database (from CLI, scripts, whatever).

Michael Crumm (@mcrumm) wrote at 11/20/2009 3:51 pm:

Brandon,

Thanks for the post – I’ve often wondered the difference between certain MySQL data types, and this writeup did a nice job of explaining them.

Dave Rowe (@dwrowe) wrote at 11/20/2009 4:15 pm:

Interesting points. Though, in most cases, schema changes aren’t necessarily organic, thus a design may need to handle the ‘what ifs’ and potential growth issues. Deciding on SMALLINT versus BIGINT because of a few bytes isn’t really a concern, because the difference is only really seen in _large_ databases, at which point, this point is moot, since you’ll need BIGINTs to handle those millions of rows.

Design ideas themselves, should also be organic. As was demonstrated with your recommendation of avoiding ENUMs because of table rebuilds. Advancements in the technology allow for changes without the re-builds as referenced by FractalizeR, so the resulting point should be, make the best decision given what the underlying technology supports. But, be willing to be flexible on old rules.

Good post though! Keep ’em coming.

Jesper Wisborg Krogh wrote at 11/20/2009 8:28 pm:

One thing also to be aware of is that how data types are handled varies between storage engines. InnoDB effective stores both CHAR and VARCHAR the same way. Also InnoDB at least in MySQL 5.0 uses four bytes to store a medium int (and I think small int as well) in which case I believe it is better to choose an int.

Regarding enum and sets, then I prefer not to use them and rather use a reference table (and for sets a join table) as it is my impression those data types are MySQL specific.

Herman Radtke (@hermanradtke) wrote at 11/21/2009 12:30 am:

No, one of the most crucial, and most overlooked, components of database development, is the use of natural keys. The use of auto_increment and uuid/guid should be minimal.

The information on TEXT data types is just plain wrong. Look at http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html for reference. The only real difference between VARCHAR and TEXT is that VARCHAR has a length constraint.

Hard drives are cheap. RAID, proper indexes and caching are proper solutions to I/O bottlenecks.

For a discussion on what really matters in SQL design, I would suggest checking out Joe Celko’s book on SQL Style: http://www.amazon.com/exec/obidos/ASIN/0120887975

Hodicska Gergely (@felhobacsi) wrote at 11/21/2009 6:36 pm:

“if you allow for usernames up to 12 characters, your VARCHAR should not be 255 characters wide.”
This is not true: varchar(12) and varchar(255) is the same as the length is stored on one byte in both cases. The change is here when you pass 255. It is always a good practice to use a bigger (under 255) value with varchar if you are not really sure that you wont need more to avoid unnecessary schema change.

“if you have 3 million records, could take you a week or two (hyperbole).”
This is very-very hyperbolist ;), it takes usually few minutes, of course this depends on the given setup.

“It’s poor schema design, because you’re expecting the database to do what PHP should have done for you.”
I disagree with you: you should defense you data model even on database level, while it is possible that not your code is the only one which interacts with the database. Of course you should consider other factors too like the cost of a possible schema modification, but for example with an MMM setup you can do this even with a running system without an outge.

“See slide 13 of the Drunken Query Master talk where he talks about having to rebuild the table”
I am a big fun of Jay Pipe, but you should believe the manual too. ;)

Brice Burgess wrote at 11/23/2009 7:36 pm:

Fine introduction — although I side on an enum happy approach as it does a good job in intrinsically documenting the schema — but I would LOVE to hear more about approaches to handling DATES. Specifically dates pre 1970.

Keep up the good posts :)

Purencool (@purencool) wrote at 11/23/2009 8:33 pm:

Thanks for the great article. I have always wondered about the topic you discussed and what happens in large data sets

« »

Copyright © 2023 by Brandon Savage. All rights reserved.