Thoughts on extremely large databases and searching the unstructured
Nuno Souto posts an interesting set of thoughts on Extremely Large Databases. As usual, it is a well-thought-through post from someone who is probably scarred for life from actually working with large databases. In a data warehouse (or even a very large transactional system) context, the reader is led to the inevitability of partitioning (distributing) the data so that techniques such as indexing or table scans can succeed in a usably short time frame.
It can be argued that, coupled with a sensible partitioning scheme, we can minimise IO in a data warehouse (of any size) by building pre-aggregated summary tables; the pain of the build query is paid just once instead of every time the aggregate is needed, and by getting the partition granularity right we can confine aggregation work to just the partitions that need rebuilding.
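As a rough illustration of that idea (a minimal sketch only, not any particular product's refresh mechanism; the table contents, partition keys and "stale partition" bookkeeping are invented for the example), the summary is only ever rebuilt for the partitions whose base data has actually changed:

```python
from collections import defaultdict

# Hypothetical fact rows: (partition_key, product, amount).
# Partitioned by month, the granularity at which new data arrives.
fact_rows = [
    ("2007-01", "widget", 10.0),
    ("2007-01", "gadget", 25.0),
    ("2007-02", "widget", 7.5),
]

summary = {}                    # (month, product) -> pre-aggregated total
stale_partitions = {"2007-02"}  # only the partitions that received new rows

def rebuild_summary(partitions):
    """Re-aggregate only the named partitions, leaving the rest untouched."""
    for month in partitions:
        # Drop the old aggregate rows for this partition alone.
        for key in [k for k in summary if k[0] == month]:
            del summary[key]
        # Rebuild from the base data of this partition alone.
        totals = defaultdict(float)
        for part, product, amount in fact_rows:
            if part == month:
                totals[(part, product)] += amount
        summary.update(totals)

rebuild_summary(stale_partitions)
print(summary)
```

The point of the sketch is simply that the expensive full scan is limited to the changed partitions; everything already summarised is left alone.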
But what happens when we stray from the domain of structured DSS-type queries? That is, away from those queries where (even for ad-hoc requests) we can still use our wisely chosen aggregates or exploit the partitioning scheme of the base data. Suppose we look at data mining, where the statistical relationships between data items are being determined, or we search unstructured data for patterns and relationships, or just index its content. And nowadays unstructured data is not just text (if it ever was): we have speech and speaker recognition, and image recognition ranging from the (now) trivial OCR, through fingerprints and facial recognition, to the more complex ability to search libraries of image components to identify photographic locations. These require inventive indexing techniques, but they also require fast access to the underlying data, and for big datasets is that going to be possible?
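To make "just index its content" concrete (a toy sketch, assuming plain text documents; real content indexes also deal with stemming, ranking and far larger postings lists), even the simplest inverted index needs a full pass over the underlying data to build, which is exactly where fast access to big datasets starts to bite:

```python
from collections import defaultdict

# Toy documents standing in for unstructured content (text here, but the
# same build-time cost argument applies to audio or image features).
documents = {
    1: "large databases need partitioning",
    2: "federated databases keep computation near the disk",
}

def build_inverted_index(docs):
    """Map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_index(documents)
print(sorted(index["databases"]))   # -> [1, 2]
```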
Maybe ELDBs are not the way to go for unstructured data, or for data that needs extensive analysis. Perhaps the answer here is database federation: keep the computation close to the disk and coordinate the outputs to produce the end result.
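A minimal sketch of that federation idea (the node names, data and simple sum are invented; a real federated query also has to cope with node failure, skew and data movement): each node aggregates its own local data, and only the small per-node results travel to the coordinator.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical federation: each "node" owns its own slice of the data
# and does its aggregation locally, close to its own disks.
nodes = {
    "node_a": [10.0, 25.0, 3.5],
    "node_b": [7.5, 12.0],
    "node_c": [99.0],
}

def local_aggregate(name, rows):
    """Runs on (or near) the node that stores the rows: the scan stays local."""
    return name, sum(rows)

def federated_total(nodes):
    """Coordinator: scatter the work, gather only the tiny partial results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda item: local_aggregate(*item), nodes.items())
        return sum(total for _, total in partials)

print(federated_total(nodes))   # -> 157.0
```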