Detecting & Masking Personal Identifiable Information: Using OCI Language Model & Oracle Analytics
Why is personal information identification a big deal?
Identifying Personal Identifiable Information (PII) is crucial for maintaining privacy, complying with regulations, mitigating risks, ensuring data quality, and building trust between organisations and their customers.
In recent years, numerous organisations worldwide have encountered legal issues due to breaches of PII. Notable incidents include a major data breach at a credit reporting agency in 2017, affecting over 147 million people and resulting in lawsuits and a settlement of up to $700 million. Another significant case was the 2018 Cambridge Analytica scandal, where the exposure of millions of users' data led to significant legal challenges for a social media giant and a $5 billion fine from the Federal Trade Commission (FTC).
Organisations striving to comply with data regulations, such as the EU General Data Protection Regulation (GDPR), need to identify private information in their data so that it can be obfuscated before publishing. Oracle’s pre-trained AI language models and Oracle Analytics can help with exactly that. Here is Rittman Mead’s exploration of the capabilities.
The OCI Language Models
OCI Language is a serverless, multi-tenant service accessible through REST API calls. With pre-trained and custom models, you can process unstructured text and extract insights without data science expertise.
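To give a feel for what such a REST call carries, here is a minimal sketch that only builds the JSON body for a PII detection request (the PII model is one of the pre-trained models described in this section). The field names (`compartmentId`, `documents`, `masking`, `maskingCharacter`, `leaveCharactersUnmasked`, `isUnmaskedFromEnd`) reflect my reading of the OCI Language API reference and should be checked against it before use. The helper name `build_pii_request` and the compartment OCID are hypothetical placeholders. Sending the request also requires OCI request signing, which is not shown here.

```python
import json


def build_pii_request(texts, compartment_id, mask_char="*",
                      leave_unmasked=0, from_end=False):
    """Build a JSON-serialisable body for a batch PII detection request.

    Hypothetical helper: the field names are assumptions based on the
    OCI Language API reference and should be verified before use.
    """
    documents = [
        {"key": str(index), "text": text, "languageCode": "en"}
        for index, text in enumerate(texts, start=1)
    ]
    return {
        "compartmentId": compartment_id,
        "documents": documents,
        # "ALL" is assumed to apply the same masking rule to every entity type.
        "masking": {
            "ALL": {
                "mode": "MASK",
                "maskingCharacter": mask_char,
                "leaveCharactersUnmasked": leave_unmasked,
                "isUnmaskedFromEnd": from_end,
            }
        },
    }


# Placeholder OCID, not a real compartment.
body = build_pii_request(
    ["My SSN is 123-45-6789."],
    compartment_id="ocid1.compartment.oc1..example",
)
print(json.dumps(body, indent=2))
```

The sketch stops at the request body on purpose: authentication and the regional endpoint depend on your tenancy.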
Depending on your requirements, you can choose from the 7 pre-trained models of OCI Language. Take a look at the options here.
- Language Detection: Detects the language of the text provided and returns a confidence score.
- Text Classification: Identifies the document category and subcategory of the text.
- Named Entity Recognition: Identifies common entities such as people, places, locations and email addresses.
- Key Phrase Extraction: Extracts an important set of phrases from a block of text.
- Sentiment Analysis: Identifies particular features from the provided text and classifies the sentiments as positive, negative, or neutral.
- Text Translation: Translates text into the language of your choice.
- Personal Identifiable Information: The latest addition. PII detection identifies, classifies, and de-identifies private information in unstructured text.
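To make "identifies, classifies, and de-identifies" concrete, the toy sketch below performs the same three steps with two regular expressions. It is only a stand-in: the OCI model is a pre-trained language model, not a list of regexes, and the `Entity` fields, the pattern labels and both helper functions are hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical, deliberately simple patterns used as a stand-in for the model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class Entity:
    type: str    # classification label
    text: str    # the matched text
    offset: int  # start position in the input
    length: int  # number of characters matched


def detect(text):
    """Identify and classify candidate PII spans, ordered by position."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(
                Entity(label, match.group(), match.start(), len(match.group()))
            )
    return sorted(entities, key=lambda entity: entity.offset)


def de_identify(text, entities, mask_char="*"):
    """Replace every detected span with the mask character."""
    chars = list(text)
    for entity in entities:
        end = entity.offset + entity.length
        chars[entity.offset:end] = mask_char * entity.length
    return "".join(chars)


sample = "Contact jane@example.com, SSN 123-45-6789."
found = detect(sample)
for entity in found:
    print(entity.type, entity.offset, entity.length)
print(de_identify(sample, found))
```

Keeping detection (offsets and labels) separate from masking means the same detections could be reported, filtered by entity type, or masked differently.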
To get a feel for how the models function, head to your OCI tenancy and, from the navigation panel, select Analytics & AI > AI Services > Language.
Then, select Text Analytics under Pretrained Models as highlighted below.
On the Text Analytics page, enter the text to be analysed by the pre-trained models, choose the language and click the Analyze button.
In less than a second, the results are available, grouped by model, with the identified information marked up as shown below.
Now, this was just a high-level taster of the power and efficiency of the OCI Language models. Let us now see how the PII model can be accessed via Oracle Analytics.
Pre-trained PII Model in Oracle Analytics Cloud
To get the OCI Language pre-trained models onto Oracle Analytics, first create a data connection to the OCI tenancy where the language models are running.
From your OAC instance, select Create > Connection > OCI Resource.
Provide the relevant details of the OCI tenancy where the language models are running.
Then, under Public API Key, click Generate and then Copy to take a copy of the generated key.
Before hitting Save, go to Identity & Security > Identity > Users in the OCI console, select the required user and open the API Keys option at the bottom left of the page.
Click on the Add API Key button:
In the Add API key window, select the Paste a public key option, enter the public key generated and click on the Add button.
Now hit the Save button in the Create Connection window where we previously added details to connect to the OCI Language Resource.
Once notified that the connection was created successfully, you can verify it by navigating to Data > Connection.
Model Registration
To register the pre-trained language models in OAC, click on the kebab menu option > Register Model/Function > OCI Language Models.
In the Register a Language Model window that opens, select the AI-LanguageModels connection created earlier.
Select the right compartment, and the available models will be listed for you to choose from.
In our case, we can see all the OCI Language Service models listed. Let's select our Pretrained PII Identification model.
Make sure to add the right staging bucket under the selected Staging compartment, then click Register.
And that is it: you have now registered the Pretrained PII Identification model in Oracle Analytics Cloud. The model is now available for data flow operations, which are covered in the next section.
PII Detection & Obfuscation in Oracle Analytics Cloud Datasets
To apply the PII model to the datasets in Oracle Analytics Cloud, click on Create > Data Flow from the Oracle Analytics home page.
First, add the required dataset which needs to be analysed for personal information. In the Data Flow editor, click Add a step and choose Apply AI Model.
Now select the pre-trained PII Identification model that was registered previously in your Oracle Analytics Cloud environment.
In the Apply AI Model editor, you can see that two new outputs are selected by default; these will be added to the result dataset with the relevant information.
Under Parameters, select the Input Column to Obfuscate, which in our case is SSN.
The Character for Masking is the asterisk (*) by default and can be changed as preferred.
You can either mask all the characters by leaving Number of characters to be unmasked at its default value (0), or leave a specific number of characters unmasked by setting that value.
Unmask Options lets you choose whether the unmasked characters are counted From the start or From the end.
PII entities to mask has all the PII entities selected by default. You can deselect any entities that your requirement does not target.
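How the masking parameters interact can be pictured with a small sketch. It is an illustration under stated assumptions, not the model's actual implementation: the function and argument names are hypothetical, and it treats every character of the value (separators included) the same way, which the real model may or may not do.

```python
def mask_value(value, mask_char="*", unmasked=0, unmask_from="end"):
    """Mask a text value, optionally leaving some characters readable.

    Hypothetical helper mirroring the Apply AI Model parameters:
      mask_char   -> Character for Masking
      unmasked    -> Number of characters to be unmasked (0 masks everything)
      unmask_from -> Unmask Options: "start" or "end"
    """
    if unmasked < 0:
        raise ValueError("unmasked must be zero or positive")
    if unmask_from not in ("start", "end"):
        raise ValueError("unmask_from must be 'start' or 'end'")

    keep = min(unmasked, len(value))
    hidden = mask_char * (len(value) - keep)
    if unmask_from == "start":
        return value[:keep] + hidden
    # Slice from an explicit index so that keep == 0 yields an empty suffix.
    return hidden + value[len(value) - keep:]


print(mask_value("123-45-6789"))                   # ***********
print(mask_value("123-45-6789", unmasked=4))       # *******6789
print(mask_value("123-45-6789", "#", 3, "start"))  # 123########
```

With the defaults, the whole value is masked, which matches the fully masked SSN column produced by the data flow in this walkthrough.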
To complete the data flow, add the Save Data step, define a name for the dataset and adjust the output column aggregations and attribute/measure assignments accordingly.
I saved the data flow as PII_Test_DF and executed it; the completion notification arrived in less than a minute.
Once the data flow execution completes, the results can be previewed on the same page. As you can see here, the SSN values are now fully masked by asterisks.
While trying different configurations in the Apply AI Model step, I observed that the data flow fails for any non-text input column, with the error ‘Input column has to be a text-based column’.
Any date or number datatype column can be transformed to text using the ‘Convert to text’ function in the dataset editor or at the data source step in the data flow editor. I applied the text conversion to the Birthdate column, which was a date datatype in the data source.
The input column for the PII model was then changed to Birthdate and the entity selection was restricted to just Date & Time. When the data flow was executed again with this configuration, the non-text column error no longer occurred and masking was applied successfully.
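The same workaround can be sketched outside OAC. The snippet below is a hypothetical illustration only: `mask_column` imitates the text-only restriction by raising the error message quoted above, and `convert_to_text` plays the role of the ‘Convert to text’ step. Neither function is part of Oracle Analytics, and masking the whole value is an assumption made for simplicity.

```python
from datetime import date

ERROR = "Input column has to be a text-based column"


def mask_column(values, mask_char="*"):
    """Mask every value in a column, accepting text values only."""
    masked = []
    for value in values:
        if not isinstance(value, str):
            raise TypeError(ERROR)
        masked.append(mask_char * len(value))
    return masked


def convert_to_text(values):
    """Turn dates (ISO format) and other values into plain text."""
    return [
        value.isoformat() if isinstance(value, date) else str(value)
        for value in values
    ]


birthdates = [date(1990, 5, 17), date(1985, 12, 3)]

try:
    mask_column(birthdates)  # date values are rejected
except TypeError as exc:
    print("Rejected:", exc)

print(mask_column(convert_to_text(birthdates)))  # text values are accepted
```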
The resulting dataset created by the data flow can be verified under Data > Datasets in the navigation menu.
Create a Workbook from the new dataset to verify and analyse the results visually.
Conclusion
Overall, the experience connecting to OCI Language models, registering the pre-trained model in OAC and using it within a dataflow to detect and apply masking to the selected data source was seamless and fast.
I would like future releases of OAC to let users select multiple columns so that masking can be applied to all of them in one go.
If you are looking into Oracle Analytics and want to find out more, please get in touch or DM us on Twitter @rittmanmead. Rittman Mead can assist your analytics journey and help you with a product demo or training.