Token databases

Mark Rittman recently posted on the subject of column-oriented databases such as Sybase IQ and the various SAND Technology Inc products. One aspect Mark mentioned was the use of tokens to store attribute data.

In a previous role (before moving back into the mainstream world of Oracle data warehouses) I worked on 'free-text' information systems, in particular those produced by the Memex company. The basic idea behind these databases is that each unique word in the database is assigned an integer token, and instead of the actual text only a stream of integers is stored. On data input each word is checked against a dictionary - if the word exists it is substituted with its token; if it is a new word, the next available integer is assigned as its token and the word is added to the dictionary. For example, if the first phrase ever to be entered into the database was 'The cat sat on the mat' then: 'the' will be assigned 1, 'cat' = 2, 'sat' = 3, 'on' = 4, the second 'the' is already 1, and 'mat' will be 5. Of course this is a gross simplification, but you can see how it works. Data output requires each token to be decoded and the words built back into a text stream.
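A minimal sketch of that dictionary-encoding idea, in Python rather than anything Memex actually shipped - a real engine would also deal with punctuation, case folding, persistence and so on:

```python
class TokenDictionary:
    def __init__(self):
        self.word_to_token = {}   # encoding lookup: word -> integer token
        self.token_to_word = {}   # decoding lookup: token -> word
        self.next_token = 1

    def encode(self, text):
        """Replace each word with its token, allocating new tokens as needed."""
        stream = []
        for word in text.lower().split():
            if word not in self.word_to_token:
                self.word_to_token[word] = self.next_token
                self.token_to_word[self.next_token] = word
                self.next_token += 1
            stream.append(self.word_to_token[word])
        return stream

    def decode(self, stream):
        """Rebuild the text from a stream of tokens."""
        return ' '.join(self.token_to_word[t] for t in stream)


d = TokenDictionary()
print(d.encode("The cat sat on the mat"))   # [1, 2, 3, 4, 1, 5]
print(d.decode([1, 2, 3, 4, 1, 5]))         # "the cat sat on the mat"
```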

Of course this word-dictionary-based approach has its own set of problems. We are storing each unique character string as a token - bad typing or spelling can create multiple tokens for the 'same' thing, yet it is not always appropriate to 'correct' the input text to a standard form. Likewise the use of synonyms can be problematic - if the text says 'car', is this the same as 'automobile' or 'motor vehicle'? Is 'colour' the same as 'color'? And then there are parts of speech - 'drive', 'driven' and 'drove' are distinct words, but people may need to search for any of them. The approach adopted by Memex is to store the 'as entered' text (tokenized) in the database but to provide tools that change the search to use, in effect, an in-list of tokens for the alternate words.
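Sketching that 'in-list of tokens' idea against the toy dictionary above - the synonym and variant lists here are assumptions purely for illustration, not anything from Memex's actual thesaurus:

```python
# Query-side word expansion: the stored text keeps its as-entered tokens,
# and the search expands a word into the set of tokens for its alternates.
SYNONYMS = {
    'car': ['car', 'automobile', 'motor vehicle'],
    'colour': ['colour', 'color'],
    'drive': ['drive', 'driven', 'drove'],
}

def search_tokens(word, dictionary):
    """Return every token the search should match for this word."""
    alternates = SYNONYMS.get(word, [word])
    # Only alternates that actually appear in the dictionary have tokens at all.
    return {dictionary.word_to_token[w]
            for w in alternates
            if w in dictionary.word_to_token}
```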
Searching for the use of a single word in the database is fast - if the word is not in the dictionary then it is not in the database; otherwise the database is scanned for a bitwise match with the token. The search can be enhanced by looking for multiple words, phrase searching and proximity searching (that is, word 1 appearing within n words of word 2). Further enhancements allow documents to be tagged in much the same way as XML, so that data between specific tags takes on a particular semantic meaning (or, if you like, structure). Tagged data is especially useful when coupled with specialised data mining and data relationship graphing tools.
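The searches themselves can be pictured as simple scans over the stored token stream - these are naive, unindexed sketches for illustration only:

```python
def word_search(stream, token):
    """Single-word search: is the token anywhere in the document?"""
    return token in stream

def phrase_search(stream, phrase_tokens):
    """Phrase search: do the tokens appear consecutively, in order?"""
    n = len(phrase_tokens)
    return any(stream[i:i + n] == phrase_tokens
               for i in range(len(stream) - n + 1))

def proximity_search(stream, token1, token2, n):
    """Proximity search: does token1 appear within n words of token2?"""
    positions1 = [i for i, t in enumerate(stream) if t == token1]
    positions2 = [i for i, t in enumerate(stream) if t == token2]
    return any(abs(p1 - p2) <= n for p1 in positions1 for p2 in positions2)
```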

Back to Mark's piece - is this a useful technique in BI? And can something similar be implemented in Oracle? Firstly, the Memex approach is non-relational and as such does not fit readily into the conventional relational-model DW approaches (Kimball or Inmon). The target users of these free-text databases are typically looking for a single record (or very few) based on a search; this is perhaps closer to the Electronic Content Management community (interMedia?) than to BI, where users are looking for quantitative results and trends based on multiple records.

But can any part of this technique be used in a conventional Oracle data warehouse context? It is certainly possible to build a token dictionary, perhaps as an IOT of word-token pairs; this will need to be indexed on both token and word to permit efficient coding and decoding. But where could we use these tokens in a data warehouse? We could adopt a model where the dimensional attributes are stored only in one or more 'reference' data tables and not in the 'fact' tables at all (well, apart from the join key, which is, in a way, a token!). If tokens are to save space in the database (because the token takes less storage than the original word - which may not be true anyway) we will need to use each token at least twice in the reference data tables to recover the extra storage needed for the word-token dictionary. For a 3NF DW we will have a hierarchy of reference data tables, each consisting of unique records; this reduces the potential for duplicating attribute tokens. It does not eradicate it though, as items can easily share attributes - for example, in a fashion DW a tee-shirt and a dress could both carry the attribute 'blue'. In a fully denormalized DW there is far more scope for attribute sharing.

But is it worthwhile? The reference tables themselves are short - how many customers or products do you have? Probably fewer than 1,000,000 rows. Add the need to join every query to the token dictionary table, once to create the selection criteria and again to decode the tokens to make the output human friendly, and tokens seem to add a performance overhead that may not be worth living with.
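That break-even point can be sketched with a little back-of-the-envelope arithmetic. The byte sizes below are assumptions chosen only to make the example concrete, not measurements of any real Oracle storage format:

```python
word_bytes = 12        # assumed average stored width of an attribute string
token_bytes = 4        # assumed storage for an integer token
dict_row_bytes = word_bytes + token_bytes   # one dictionary entry (word + token)

def net_saving(uses_per_word):
    """Bytes saved per distinct word if it is referenced 'uses_per_word' times."""
    saved = uses_per_word * (word_bytes - token_bytes)
    return saved - dict_row_bytes

print(net_saving(1))   # -8 : a word used only once costs more with a dictionary
print(net_saving(2))   #  0 : break-even at two uses (with these assumed sizes)
print(net_saving(5))   # 24 : only well-shared attributes pay their way
```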