Oracle Endeca (OEID) - Exploring Unstructured Data - Twitter & Blog Analytics - Part 1
Recently, the entire BI landscape - how we do reporting and how the data is consumed - seems to be changing constantly. Until a year ago, the whole world of BI (at least from an Oracle perspective) revolved around building data warehouses (and OLAP/Essbase cubes) and then modeling those the old-school way (if I may call it that) in the RPD. The majority of BI is still done this way. But I have always wondered whether traditional BI tools are capable of handling the massive amounts of data generated around the internet and social media these days. Not only would modeling all that data take an insane amount of time, the result would still be "model" dependent.
In addition, BI tools have always needed the underlying sources to be structured. Last year, as part of an engagement, I took a look at both SAP BusinessObjects and IBM Cognos. I was particularly impressed by one specific tool, BOBJ Explorer. Unlike traditional BI, where we pick and choose columns, this tool was based on attribute key-value pair reporting: you could run a search across all the data in all columns, and it would then surface the key metrics related to that search. BI EE simply did not have a tool that matched what BOBJ Explorer did, and I wondered how Oracle was going to plug that gap in the product set, since unstructured data search and reporting is a key requirement these days. Then the Endeca acquisition happened and the rest, as they say, is history. Mark did a series of postings on how Endeca works here.

The Endeca acquisition opens up a lot of possibilities that were considered out of reach from a BI EE perspective. Take, for example, Twitter feeds or blog posts. This data is all unstructured, but there is so much information we can glean from it. By just monitoring Twitter feeds, one can find out the common issues with a recently launched product (say, the Apple iPhone 4 reception issue). Or, before going to the movies, you might look at the related Twitter feeds to find out whether the film is worth watching in the first place. Sentiment analysis, in short, matters a great deal in the context of social networks. BOBJ Explorer, though it did search well, just could not handle unstructured data well (at least when I looked at it some time back). So the Endeca acquisition was interesting to me personally, as it opened up an opportunity for Oracle to do unstructured data reporting. Also, to me, Endeca + Data Mining makes a lot more sense than BI EE + Data Mining: Oracle Data Mining is so good when it comes to text analytics/mining that I am intrigued by the opportunities this has now opened up.
In today's post, we will be looking at how Endeca handles unstructured data. I will showcase two examples over two blog posts: loading and reporting on data from Twitter, and loading and reporting on data from Rittman Mead blog posts (basically the last 10 posts). Before we start, there are two things we need to understand:
- The anatomy of Endeca ETL, done using Latitude Data Integrator (basically an extension of CloverETL)
- The anatomy of reporting using Latitude Studio
Endeca comes with its own in-memory data engine called MDEX. This engine stores data as attributes, i.e. name-value pairs. It is not a traditional relational store, so the process of loading data into the engine requires an understanding of how MDEX works. Also, MDEX does not have a standard C or Java API; instead, all data loads and extracts happen through web service calls. MDEX is exposed through a set of standard web services that have to be explicitly called as part of the loads. Since it is all web-service based, theoretically any ETL tool (ODI, OWB and so on) can connect to and load it. In practice, though, without proper adapters it becomes very difficult to formulate the requests for the web service calls - hence we have to use LDI (Latitude Data Integrator), or a pre-built adapter for other tools such as Informatica. I would imagine it should not take Oracle long to come out with an MDEX adapter for ODI (I will try to cover in a later post how ODI in its current form can be used to load data into Endeca).
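To make the web service interaction concrete, here is a minimal sketch of what a batch load request can look like. Treat it as illustrative only: the endpoint path, XML namespace and operation name are hypothetical placeholders, and the real contract should be taken from the WSDL exposed by your own MDEX install.

```python
# Minimal sketch of an MDEX web service load. The endpoint URL, namespace and
# operation name below are hypothetical placeholders -- consult the WSDL of
# your MDEX/Endeca installation for the real contract.
import requests

MDEX_INGEST_URL = "http://localhost:7770/ws/ingest/twitter"  # hypothetical data store

# One record per <record> element; every child element simply becomes an
# attribute (name-value pair) in the engine -- no target schema is declared.
soap_envelope = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ingestChanges xmlns="http://www.endeca.com/MDEX/ingest/1/0">
      <addRecords>
        <record>
          <tweet_id>123456789</tweet_id>
          <tweet_user>some_user</tweet_user>
          <tweet_text>exalytics demo at openworld was impressive</tweet_text>
        </record>
      </addRecords>
    </ingestChanges>
  </soap:Body>
</soap:Envelope>"""

resp = requests.post(
    MDEX_INGEST_URL,
    data=soap_envelope.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8"},
)
resp.raise_for_status()  # the service responds with a per-batch status message
```

Assembling envelopes like this correctly is exactly the plumbing that LDI hides from you.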
Since MDEX is loaded and extracted via web service calls, bulk loads have to be done in batches. In addition, a core feature of the MDEX engine is that it is largely immune to structural changes in the underlying data sources. In a traditional ETL tool, we start off by reverse engineering the metadata of the sources and targets, then map how the source data flows into the target through a series of transformations. If we have to add a new attribute to a dimension, we have to reverse engineer the new columns, change the mappings and then do a reload. With MDEX it is a bit different: we have to define the structure of the sources, as in any other ETL tool, but the target does not need to have a structure at all. Every source column loaded into the MDEX engine is automatically treated as an attribute. To illustrate, we will take the example of a Twitter feed extract (latest 100 tweets only) for the search term "exalytics", as shown below.
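As a hedged sketch of the source side, this is roughly how such an extract could be pulled at the time of writing, using the original unauthenticated Twitter Search API (since retired; today's endpoints require authentication). The field names are those the old API returned:

```python
# Sketch of the Twitter extract (latest 100 tweets for "exalytics") using the
# original, since-retired Twitter Search API -- modern endpoints need OAuth.
import requests

resp = requests.get(
    "http://search.twitter.com/search.json",
    params={"q": "exalytics", "rpp": 100},  # rpp = results per page, max 100
)
resp.raise_for_status()

records = []
for tweet in resp.json()["results"]:
    # Each tweet becomes one flat record; on load, every key we keep here
    # turns into an MDEX attribute automatically.
    records.append({
        "tweet_id": tweet["id_str"],
        "tweet_user": tweet["from_user"],
        "tweet_text": tweet["text"],
        "created_at": tweet["created_at"],
    })

print(len(records), "records ready for the MDEX load")
```

Once an extract like this is loaded, the auto-created attributes pick up the MDEX system defaults: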
- All attributes are grouped under the Other group.
- All attributes have a sorting value of Record Count.
- All attributes have Single search capability.
In addition, all the attributes we loaded have a default data type of string. So, in general, when we do a load into the MDEX engine the attributes get created automatically, and all their properties take default values - the system default settings for MDEX. These defaults can be changed before the data load; that is handled through custom MDEX properties, which have to be pre-loaded as an input to the MDEX web services. These MDEX properties are generally defined as part of PDR (Property Description Record) definitions. A sample PDR definition, along with the default settings, is given below.
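As a hedged sketch, a PDR record for one of our Twitter attributes would look something like the following. The attribute names follow the documented mdex-property_* / system-navigation_* convention, but the precise names and values may differ between releases, so verify them against your own install:

```python
# Illustrative PDR (Property Description Record) sketch. The attribute names
# follow the documented mdex-property_* / system-navigation_* convention but
# may vary by release -- verify against your own MDEX configuration service.
pdr_record = """
<record>
  <mdex-property_Key>tweet_text</mdex-property_Key>
  <mdex-property_DisplayName>Tweet Text</mdex-property_DisplayName>
  <mdex-property_Type>mdex:string</mdex-property_Type>
  <mdex-property_IsSingleAssign>true</mdex-property_IsSingleAssign>
  <mdex-property_IsTextSearchable>true</mdex-property_IsTextSearchable>
  <system-navigation_Select>single</system-navigation_Select>
  <system-navigation_ShowRecordCounts>true</system-navigation_ShowRecordCounts>
  <system-navigation_Sorting>record-count</system-navigation_Sorting>
</record>
"""
# Pushing records like this through the configuration web service *before*
# the data load is what overrides the system defaults listed above.
```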