Data Formatting: It IS Our Job

It’s happened to each and every one of us: we fill out a long form, complete with username and password. We double and triple check everything, because we want to make sure the submission works. We verify our email address, our date of birth, and maybe even retype our password, just to make sure both entries are right and that they match. And then we fill out the CAPTCHA, with so much care (passing those things is still random, whether you’re a human or not). And then we hit submit.

And we wait. Breathless.

What happens next? Well, our form pops up back in our face, with a bunch of red over it (if we’re lucky) saying that we’ve done something wrong. Seems that we entered our phone number as “1234567890” instead of “123-456-7890” or we entered our date of birth as “6/8/1995” instead of “1995-06-08”.

Oh, and our password is gone (because that’s how password fields work) and the CAPTCHA we beat…we have to beat it again.

Why on earth has this happened? The simple answer is that whoever designed the form decided to place the validation of the data, and its massaging into the proper format, onto the end user. But there’s a more complicated issue at hand here: the fact that the developer either felt it wasn’t his responsibility to do the data formatting, or didn’t realize that not everyone would think to place dashes or format dates the way he does.

The sad thing is that data formatting is both easy and often overlooked. Developers in a hurry will place the data formatting obligations onto the end user, rather than writing code to do it themselves, when really the code is so very simple. Take, for example, a small function that formats a phone number properly:

function formatPhoneNumber($number)
{
    // FILTER_SANITIZE_NUMBER_INT keeps only digits, plus signs and minus signs.
    $phone = filter_var($number, FILTER_SANITIZE_NUMBER_INT);
    $phone = str_replace('-', '', $phone); // We want to put our own separators in the right place
    $phone = str_replace('+', '', $phone); // The filter keeps +, but it doesn't belong in a ten-digit number

    // It's possible that this number is a foreign number; hand it back untouched.
    if (strlen($phone) != 10)
    {
        return $number;
    }

    // Work on the sanitized digits, not the raw input.
    $areaCode = substr($phone, 0, 3);
    $prefix = substr($phone, 3, 3);
    $lastFour = substr($phone, 6);

    return $areaCode . '.' . $prefix . '.' . $lastFour;
}

Admittedly, the above function isn’t perfect; it certainly doesn’t validate the format for each and every number and in fact doesn’t test to ensure that the phone number contains anything more than a single number. But it is a step in the right direction. With a little bit more work we could make it fully functional.
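As a rough idea of what "a little bit more work" might look like, here is a minimal sketch. The name normalizeUsPhone() is hypothetical, and the sketch assumes PHP 7.1 or later (for the type declarations) and North American ten-digit numbers only. It strips every non-digit, tolerates a leading country code of 1, and returns null rather than guessing when the result is not ten digits.

```php
<?php
// Hypothetical helper (not from the original article): normalizes a North
// American number to 123-456-7890, or returns null when it cannot.
function normalizeUsPhone(string $input): ?string
{
    // Drop everything that is not a digit: spaces, dots, dashes, parentheses, +.
    $digits = preg_replace('/\D+/', '', $input) ?? '';

    // Accept an optional leading country code "1" on 11-digit input.
    if (strlen($digits) === 11 && $digits[0] === '1') {
        $digits = substr($digits, 1);
    }

    // Anything else may be a foreign number; let the caller decide what to do.
    if (strlen($digits) !== 10) {
        return null;
    }

    return substr($digits, 0, 3) . '-' . substr($digits, 3, 3) . '-' . substr($digits, 6);
}

var_dump(normalizeUsPhone('(123) 456.7890'));  // string(12) "123-456-7890"
var_dump(normalizeUsPhone('+1 123 456 7890')); // string(12) "123-456-7890"
var_dump(normalizeUsPhone('020 7946 0000'));   // NULL
```

Returning null instead of the raw input is a design choice: it lets the calling code decide whether to store the number as typed or to ask the user about it.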

PHP also includes a number of date options, one of my favorites being the strtotime() function. strtotime() takes almost any English-language date string and converts it into a Unix timestamp. It even takes arguments like “today” and “tomorrow” and “next Wednesday” and gives a Unix timestamp for those values. PHP 5.3 has improved DateTime object handling, meaning that any value you give to strtotime() can be given to the constructor of the DateTime object, and you can then manipulate that object to format the date in any way you see fit.
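A minimal sketch of that idea follows. The helper name normalizeDate() is hypothetical, and the type declarations assume PHP 7.1 or later. It hands the user's input to the DateTime constructor and formats the result as YYYY-MM-DD, returning null when PHP cannot parse the string. Note that PHP reads slash-separated dates as month/day/year, so this sketch assumes American-style input.

```php
<?php
// Hypothetical helper (not from the original article): turns free-form date
// input into YYYY-MM-DD, or returns null when PHP cannot parse it.
function normalizeDate(string $input): ?string
{
    $input = trim($input);
    if ($input === '') {
        return null; // An empty string would otherwise be parsed as "now".
    }

    try {
        $date = new DateTime($input);
    } catch (Exception $e) {
        return null; // Unparseable input: this is the moment to ask the user.
    }

    return $date->format('Y-m-d');
}

echo normalizeDate('6/8/1995'), "\n";     // 1995-06-08 (slashes are read as month/day/year)
echo normalizeDate('June 8, 1995'), "\n"; // 1995-06-08
var_dump(normalizeDate('not a date'));    // NULL
```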

Each of these solutions would prevent the user from having to format the data for us, unless the user failed to enter the appropriate data. But in 90% of the cases, the form would submit properly on the first try if it was filled out correctly and the users would be much happier for it.
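One way to picture this "format first, ask only when stuck" flow is the sketch below. Everything in it is illustrative: processForm(), the field names, and the two sample normalizers are assumptions rather than code from any real application, and it assumes PHP 7.1 or later. Each normalizer either returns a cleaned value or null, and only the null results turn into feedback for the user.

```php
<?php
// Hypothetical form handler: each field has a normalizer that returns the
// cleaned value, or null when the input cannot be interpreted at all.
function processForm(array $input, array $normalizers): array
{
    $clean = [];
    $errors = [];

    foreach ($normalizers as $field => $normalize) {
        $value = $normalize(trim((string) ($input[$field] ?? '')));
        if ($value === null) {
            // Only now do we hand the problem back to the user.
            $errors[$field] = "We could not understand the $field you entered.";
        } else {
            $clean[$field] = $value;
        }
    }

    return [$clean, $errors];
}

// Sample normalizers, deliberately simple.
$normalizers = [
    'email' => function (string $v): ?string {
        $email = filter_var($v, FILTER_VALIDATE_EMAIL);
        return $email === false ? null : $email;
    },
    'phone' => function (string $v): ?string {
        $digits = preg_replace('/\D+/', '', $v) ?? '';
        return strlen($digits) === 10 ? $digits : null;
    },
];

[$clean, $errors] = processForm(
    ['email' => ' user@example.com ', 'phone' => '123.456.7890'],
    $normalizers
);
var_dump($clean, $errors);
```

With the input shown, both fields normalize cleanly and the error list stays empty, so the user never sees the form a second time.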

The bottom line here is that data formatting is our job as developers. We have an obligation to make the forms we create as easy to use as we can; it is not the job of the end user to send us data that matches the format of what we want. We should be prepared to format the data ourselves, and ask the user for help only when the data they submit is incompatible with what we’re trying to accomplish. This is the point of feedback; use it wisely.

Thanks to Marco Tabini for reminding me that this was something I had needed to write about for a long time.

Brandon Savage is the author of Mastering Object Oriented PHP and Practical Design Patterns in PHP

Posted on 12/7/2009 at 1:00 am
Categories: Technology, Usability, Best Practices

christian wrote at 12/7/2009 1:49 am:

When I booked a flight recently and entered my credit card number, the system complained about the spaces I put in.

This led me to the worrying question: Should I trust my life to a company which doesn’t even manage to filter out spaces from a form field?

Rob... (@akrabat) wrote at 12/7/2009 2:36 am:

This annoys me too and most form validation issues I come across seem to be a direct result of programmer laziness.

There are certain classes of data that we can’t format correctly automatically. Dates are a good example. “3/2/2009” requires the additional input of knowing if the user is from the UK or the USA before you can parse it. In those situations, it’s far better to provide an input system where the ambiguity is removed.

As a minor point, the + does belong in an international phone number and some UK numbers are 10 digits long :)

Regards,

Rob…

David wrote at 12/7/2009 2:53 am:

Agree with Rob. Dates are always a pain unless you supply a date-picker. And phone numbers are worse, as you can’t rely on built-in validators. If someone has a good phone number validator, let me know.

Maarten wrote at 12/7/2009 4:00 am:

Adding to Rob’s comments, please be very wary of international address, naming and other ‘conventions’.
Some cultures and countries are very different, making it next to impossible to do any formatting *and* checking on data other than it being numeric or it not having illegal data like html.

Brandon Savage (@brandonsavage) wrote at 12/7/2009 7:37 am:

Rob, I’d probably still drop the + symbol and format it appropriately after that. Reason being that while it is necessary to have it for international telephone numbers, I don’t want them in the wrong place. But you are right.

Christian, I wouldn’t worry about spaces being required. It’s probably not a security issue, but a lazy developer.

jon wrote at 12/7/2009 10:24 am:

@this article: duh.

@brandon’s response to rob: I disagree. There are many regular expressions out there that properly validate the # the user is inputting (both in the US, and foreign numbers). We should use those regular expressions to validate the code format. There is no reason to “remove” a symbol, validate, then put it back in. That defeats the purpose of a proper validation, and I would think that you would understand this.

Use the regular expressions that exist for this sort of thing. They exist for a reason…

Les wrote at 12/7/2009 12:12 pm:

> … some UK numbers are 10 digits long …

Which ones?

UK telephone numbers are 11 characters long, even those 0845 et al numbers; there hasn’t been a 10 character telephone number in the UK for years.

Also if there are spaces left in the input (as in the example given today) that may not be down to a developer being lazy but that they are just following the rules enforced upon them.

Jakefolio (@jakefolio) wrote at 12/7/2009 1:13 pm:

@jon: I agree. I use regex for my phone # validation. Also, remember Brandon stated the code example was a step in the right direction, not complete code.

@Brandon: Once again good article pointing out “lazy programming” at its best. I believe the developer (frontend) should give an example for each input on the format they “hope” the user will follow, but once again……NEVER TRUST THE USER!

Robin (@schuilr) wrote at 12/7/2009 5:52 pm:

Great article, Brandon. I really enjoyed reading this.

Chris Hughes wrote at 12/8/2009 4:44 am:

@Les – incorrect. There are still 10-digit numbers in some regions of the UK, as well as some numbers with an 0500 prefix. I’m not sure on the full list of areas still on 10-digit numbers, but I know Lancaster is definitely one.

Martijn Dijksterhuis (@mdijksterhuis) wrote at 12/8/2009 5:40 am:

Proper input validation should in an ideal world be done both at the client and the server. Preferably by the same code (AJAX would be nice for this).

Data format filters on the form fields would be the correct solution. (dd/mm/yyyy)

By the time the user has completed the form he should already have been forced to correct any input errors thus sidestepping the “retype the CAPTCHA” problem.

And because we are paranoid, we check again on submission at the server side.

I had great hopes for XForms, but that hasn’t really materialized (and it’s not going to be part of HTML 5). dBase III~ / Oracle / MS Access have had this for decades now.

A script should however not try to determine what the user “meant”. That is not its place.

If you allow for flexible input, just use a plain text field with no restrictions.

I prefer this solution if a human (not an automated script) interprets the data. If there is anything unclear they can always fire off a quick e-mail asking the customer to verify things.

And I can fill in my out-of-state-,out-of-country telephone number without being restricted to a UK or USA input field.

Jason Lotito wrote at 12/8/2009 9:06 am:

Spent a lot of time with phone numbers. The best way I found to deal with it was to spend time studying phone numbers from most of the popular regions, and then start figuring out what we needed and what we didn’t. Our system automated calls out to these people (long before you had services do this for you). I settled on making sure all numbers were callable by our sales staff as presented. This meant I would change the phone number. Calling Europe from the states meant we’d have to adjust the phone number with a particular prefix depending on the country.

Found that this worked the best, and enabled us to automate calls out easily. We’d get people entering in all variations of phone numbers using all sorts of symbols between parts of the number. We adjusted for all of them. It takes some time, thought, and care, but it worked amazingly well once all was said and done.

Keep in mind, the idea was to allow the user to type in whatever, and we’d figure out what they meant. If their phone number and country didn’t match, we could correct things. More importantly, their country allowed us to figure out country codes, etc.

Without this system, people outside North America would have suffered, having to enter a phone number that we could automatically call from the US.

Les wrote at 12/8/2009 11:10 am:

> There are still 10-digit numbers in some regions of the UK,…

Are you sure Chris? I don’t validate for 10 characters anymore so if you’re sure then it really must be a rare occasion indeed!

> … without being restricted to a UK or USA input field.

:agree: I only check for numerics and a single space to separate code and number up to a certain length. Just like Postal codes; way too much variation to make a fair comparison to cover all.

Chris Hughes wrote at 12/10/2009 4:35 am:

Les – yup, certain – see the contact number at the top of Lancaster University’s website for one such example – http://www.lancs.ac.uk/. There are other towns also still on a ten digit system – unable to find a comprehensive list, but they do exist. Better get that regex checked!

Chris Henry (@chrishnry) wrote at 12/12/2009 3:42 pm:

Phone number validation is a really sore spot for me. Unless you’re feeding a list of phone numbers to some sort of automated dialing machine (which I hope no one is), a valid phone number can be formatted in a plethora of irritating ways. People use periods or slashes to separate out the area code and exchange, some like to surround the area code with parentheses.

It got so bad that on certain sites I’ve worked with, I’ve wound up just doing the bare minimum checks, and let users enter their phone number as they please.

sjm wrote at 1/18/2010 9:41 pm:

Most UK telephone numbers can have either 9 or 10 digits (NSN) after the 0 trunk prefix.

The initial 0 is omitted when calling from abroad.

01 and 02 area codes should have parentheses around them if the local number part does not begin with a 0 or 1.

01 and 02 area codes do not have parentheses around them if the local number part begins with a 0 or 1. These are National Dialling Only ranges.

All other area codes do not have parentheses around them as the area code is required for all calls.

Number formats are expressed as:

2+8 to represent (02x) xxxx xxxx [in 5 areas] or 05x xxxx xxxx or 070 xxxx xxxx.

3+7 to represent (011x) xxx xxxx [in 6 areas] or (01x1) xxx xxxx [in 6 areas] or 03xx xxx xxxx or 08xx xxx xxxx or 0800 xxx xxxx or 09xx xxx xxxx.

3+6 to represent 0500 xxxxxx or 0800 xxxxxx.

4+6 to represent (01xxx) xxxxxx [in 580 areas] or 07xxx xxxxxx.

4+5 to represent (01xxx) xxxxx [in 41 areas].

5+5 to represent (01xx xx) xxxxx [in 12 areas].

5+4 to represent (01xx xx) xxxx [in 1 area].

Valid formats include:

(011x) – 3+7.
(01x1) – 3+7.
(01xxx) – 4+6 or 4+5.
(01xx xx) – 5+5 or 5+4.
(02x) – 2+8.
03xx – 3+7.
05x – 2+8.
0500 – 3+6.
07xxx – 4+6.
070 – 2+8.
08xx – 3+7.
0800 – 3+7 or 3+6.
09xx – 3+7.

There are a small number of exceptions such as 0800 1111 and 0845 4647.

The UK system is quite complex!

ellen la rue wrote at 10/15/2011 5:09 pm:

To add to the information above, here’s a more detailed list for the UK:

7 digit NSNs:

0800 1111
0845 46 47

9 digit NSNs:

(016977) 2xxx
(016977) 3xxx
(01xxx) xxxxx
0500 xxxxxx
0800 xxxxxx

10 digit NSNs:

(013873) xxxxx
(015242) xxxxx
(015394) xxxxx
(015395) xxxxx
(015396) xxxxx
(016973) xxxxx
(016974) xxxxx
(016977) xxxxx
(017683) xxxxx
(017684) xxxxx
(017687) xxxxx
(019467) xxxxx
(011x) xxx xxxx
(01x1) xxx xxxx
(01xxx) xxxxxx
(02x) xxxx xxxx
03xx xxx xxxx
055 xxxx xxxx
056 xxxx xxxx
070 xxxx xxxx
07624 xxxxxx
076 xxxx xxxx
07xxx xxxxxx
08xx xxx xxxx
09xx xxx xxxx

Valid formats include 2+8, 3+7, 4+6, 4+5, 5+5 and 5+4 (and 0+10 for NDO numbers).

The international format adds +44 and a space before the NSN digits.

The national format adds the 0 trunk code before the NSN. For 01 and 02 numbers the area code should be in parentheses, except for NDO numbers (NDO numbers are those where subscriber number begins 0 or 1).

Robbie wrote at 11/7/2011 3:36 pm:

To add to the information above, here’s a more detailed list for the UK:

7 digit NSNs:

0800 1111
0845 46 4x

9 digit NSNs:

(016977) 2xxx
(016977) 3xxx
(01xxx) xxxxx
0500 xxxxxx
0800 xxxxxx

10 digit NSNs:

(013873) xxxxx
(015242) xxxxx
(015394) xxxxx
(015395) xxxxx
(015396) xxxxx
(016973) xxxxx
(016974) xxxxx
(016977) xxxxx
(017683) xxxxx
(017684) xxxxx
(017687) xxxxx
(019467) xxxxx
(011x) xxx xxxx
(01x1) xxx xxxx
(01xxx) xxxxxx
(02x) xxxx xxxx
03xx xxx xxxx
055 xxxx xxxx
056 xxxx xxxx
070 xxxx xxxx
07624 xxxxxx
076 xxxx xxxx
07xxx xxxxxx
08xx xxx xxxx
09xx xxx xxxx

Valid formats for geographic numbers include 2+8, 3+7, 4+6, 4+5, 5+5 and 5+4 (and 0+10 for NDO numbers).

Non-geographic numbers mostly use 0+10 format, but some 0800 numbers and all 0500 numbers use 0+9 format.

The international format adds +44 and a space before the NSN digits.

The national format adds the 0 trunk code before the NSN. For 01 and 02 numbers the area code should be in parentheses, except for NDO numbers (NDO numbers are those where subscriber number begins 0 or 1).

Copyright © 2023 by Brandon Savage. All rights reserved.