Ever bigger

One of the things that came out of the Netezza conference is the number of large databases (not just on that platform) out there. Ten terabytes plus is common; I was speaking to one person with a 100 TB system whose daily batch is the same size as some people's entire data warehouses: 600 million rows loaded per day, which sounds a phenomenal amount of data. To be truthful, this increase in size is well known; the Winter Corporation has regularly documented it in its survey of large databases. In fact, I would guess that the survey after next will have a DW leader in the multi-petabyte range, maybe even the next survey.
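To put that daily batch into perspective, a quick back-of-envelope calculation helps. This is only a sketch: the 200-byte average row size is my own assumption for illustration, not a figure from the conference.

```python
# Back-of-envelope: what does a 600 million row daily load imply?
ROWS_PER_DAY = 600_000_000
ASSUMED_ROW_BYTES = 200          # hypothetical average row size (assumption)
SECONDS_PER_DAY = 24 * 60 * 60

rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY
bytes_per_day = ROWS_PER_DAY * ASSUMED_ROW_BYTES
mb_per_second = bytes_per_day / SECONDS_PER_DAY / 1_000_000

print(f"{rows_per_second:,.0f} rows/s sustained around the clock")
print(f"{bytes_per_day / 1e9:,.0f} GB per day at {ASSUMED_ROW_BYTES} bytes/row")
print(f"{mb_per_second:.1f} MB/s of raw data, before indexes or staging copies")
```

Roughly 7,000 rows a second, every second of the day, and that is before any indexing, staging copies or aggregation work on top of the raw load.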

But the striking thing here is the level of detail needed by the data users - we are not looking at aggregated sales data to provide some overview of yesterday's trading; we are instead carrying out sophisticated analysis on raw data. Telephone companies load the whole of the call data records from their switches so that the accountants can identify flaws in the revenue collection process, security services scan the same kinds of records to find call patterns of interest, and market researchers analyse millions of shopping baskets. The volume of low-level data to be tracked is astounding: RFID data from logistics systems, especially for goods that need rapid handling; EPOS data from supermarket chains; weblog and email records from Internet Service Providers; and vehicle movement data from license plate recognition systems for road toll charging (and other uses).

This leads to two challenges: loading the data in a timely fashion without disrupting the analysis process, and doing the analysis itself! Both of these problems run up against a constraint imposed by the laws of physics - how fast we can move data around. Disk drives are mechanical, so there is a limit to their speed, and there is also a limit to how fast data, once off the disk, can move around the computer. To use a plumbing analogy, the flow of data is limited by the diameter of the pipe and the rate of flow through it. Ultimately the rate of flow is bounded by the speed of light, so if we need ever shorter processing times the only thing we can do is shorten the distance the data has to move. Or is it?
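Before going there, it is worth making the plumbing analogy a little more concrete with some rough numbers. The 10 TB table size and the 100 MB/s per drive figure below are illustrative assumptions of mine, not measurements of any particular system; the point is the shape of the result, not the exact values.

```python
# Rough illustration of the "pipe diameter" problem: time to scan a large
# table as a function of aggregate disk bandwidth.
# All figures are illustrative assumptions, not benchmarks.
TABLE_BYTES = 10 * 10**12        # a hypothetical 10 TB table
MB_PER_SECOND_PER_DRIVE = 100    # assumed sequential rate of one drive

for drives in (1, 10, 100, 1000):
    aggregate_bytes_per_s = drives * MB_PER_SECOND_PER_DRIVE * 10**6
    hours = TABLE_BYTES / aggregate_bytes_per_s / 3600
    print(f"{drives:>5} drives -> full scan in {hours:8.2f} hours")
```

One drive would take over a day to read the table once; only by adding more pipes in parallel, or by moving less data in the first place, does the scan time come down to something an analyst can live with.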

to be continued...