How To Access Cloud Storage from Your On-Premises Oracle Database


Did you know it’s possible to access data stored in cloud object storage directly from your on-premises Oracle database?

You may be wondering why you would want to do this. Apart from the fact that object storage is extremely economical at $0.01 USD per GB per month, there are plenty of good reasons: perhaps you wish to run machine learning algorithms in your database against data stored in Avro or Parquet files staged in the cloud, perhaps you want to access data stored in a third-party cloud using Swift APIs, or perhaps you want to use Oracle Hybrid Partitioning to hold your archived data in partitions mapped to the cloud. These are just some examples, and I’m sure you can think of other applications too.

In this post I will show you how to set up a Hybrid Partitioned Table, but the other approaches are just as straightforward.

At a high level this blog post is broken into the following 5 steps:

  1. Install the DBMS_CLOUD package in the database
  2. Create an Object Storage Bucket in the cloud, and upload sample data into the bucket
  3. Create a credential to access the object storage bucket
  4. Create an External Table that maps to the bucket
  5. Create a Hybrid-Partitioned table with partitions stored in the bucket

First off, there are a few requirements for this to work:

  1. You must be running Oracle Database version 19.9 or later.
  2. It is only supported in a Container Database (Multitenant architecture); a quick way to check both is shown right after this list.
  3. You need to install the DBMS_CLOUD package.
  4. You will need a C##CLOUD$SERVICE schema to hold the package.
  5. The database will require an SSL wallet and certificates for accessing the object storage and HTTPS URIs.
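
A quick way to confirm the first two requirements is to ask the database itself. This is just a sketch; the exact banner text will vary with your release, and the CDB column should come back as YES:

select banner_full from v$version ;

select cdb from v$database ;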

1. Install the DBMS_CLOUD package in the database

In order to access Cloud Object Storage, the database requires the DBMS_CLOUD package. The package provides a nice set of functions for performing storage operations, and if you’ve used the Oracle Autonomous Database, you will already be familiar with what it can do for you.

Installing DBMS_CLOUD is fairly straightforward and is documented here:
Oracle Support Document 2748362.1 – How To Setup And Use DBMS_CLOUD Package.

The support document provides code that you can simply cut and paste into your favorite SQL editor. It even includes a script to create the common user called C##CLOUD$SERVICE, which will own the package. Don’t worry: the schema is locked, so you will not be left with a new account in your database to secure.
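
Before moving on to the security setup, it’s worth confirming the package compiled cleanly. A simple dictionary query (a sketch) does the trick; every row returned should show a status of VALID:

select owner, object_name, object_type, status
from   dba_objects
where  object_name = 'DBMS_CLOUD'
and    owner = 'C##CLOUD$SERVICE' ;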

After creating the DBMS_CLOUD package, the next step is to set up the security components needed to connect to your cloud object store. This includes creating a wallet to store your cloud certificates, downloading the certificates, and defining Access Control Entries (ACEs) so your database can communicate over HTTPS. The good news is that all of these steps are also included in the support document and can be run with a simple cut and paste.
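
For context, an Access Control Entry grant looks something like the sketch below. Treat this purely as an illustration of the mechanism; the host name, port range, and principal here are assumptions, and the support document’s scripts are the authoritative version:

BEGIN
   DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(
      host       => 'objectstorage.us-ashburn-1.oraclecloud.com',
      lower_port => 443,
      upper_port => 443,
      ace        => xs$ace_type(
                       privilege_list => xs$name_list('http'),
                       principal_name => 'C##CLOUD$SERVICE',
                       principal_type => xs_acl.ptype_db) );
END;
/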

Once you’ve installed the package and set up your cloud connection, you’re ready to start using DBMS_CLOUD to work with your cloud object storage bucket.

Note: Credit for the support document and great instructions goes to Hermann Baer, Oracle Product Manager.

2. Create an Object Storage Bucket and upload sample data into the bucket

Before we get started, we will create a bucket to store the data we will be mapping to our database. Rather than guide you step by step, I have provided a short 30-second video to demonstrate how easy this is to do.

The following steps are being performed in the viewlet:

  1. Navigate to Object Storage
  2. Create a bucket
  3. Copy 4 files from my local disk to the bucket
  4. Get the object-handle for one of the files and paste it into a notepad

We now have some sample data to work with.

3. Create a credential to access the object storage bucket

Next, we need a way to authenticate our database to our cloud object store. This is done using a Database Credential. In our cloud console we can create an Auth Token to permit the connection. The following animation demonstrates how to do this.

Creating a credential to authenticate to the Oracle Cloud

You should have noticed that the very last step was to “copy” the token to my clipboard. It is important to do this, as the token can never be displayed again. Though as you’ve seen, creating a new token is a piece of cake.

Back to the database…

Once you have copied your token, you are ready to create a credential:

BEGIN
   DBMS_CLOUD.CREATE_CREDENTIAL(
       credential_name => 'DBCLOUD_TOKEN',
       username => '<your cloud username>',
       password => '<paste your auth token string here>' );
END;
/

The parameters required here are:

  • credential_name – The name the credential will be stored under in the database. In my case it matches the name I gave the Auth Token in the cloud console.
  • username – Our cloud account user name. In my case the cloud username includes a prefix: “oracleidentitycloudservice/email.address@email.com”.
  • password – The Auth Token string that I copied from the cloud console.

By the way, to drop (and re-create) your credential inside the database you can run the following command:

exec dbms_cloud.drop_credential('DBCLOUD_TOKEN') ;

You can see the credential by running the following query:

select  *  from USER_CREDENTIALS ;
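
Auth Tokens can be rotated in the cloud console, and if you generate a new one you don’t have to drop and re-create the credential. Assuming your DBMS_CLOUD installation includes the UPDATE_CREDENTIAL procedure, a sketch of swapping in a new token looks like this:

BEGIN
   DBMS_CLOUD.UPDATE_CREDENTIAL(
       credential_name => 'DBCLOUD_TOKEN',
       attribute       => 'PASSWORD',
       value           => '<paste your new auth token string here>' );
END;
/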

4. Create an External Table that maps to the object storage bucket

Now that we’ve set up our connection for our cloud account, the next step is to check that we can see the contents of our Cloud Object Storage Bucket. We can use the following query to check this:

select *
from   dbms_cloud.list_objects(
       'DBCLOUD_TOKEN',
       'https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxxxxy/b/MyBucket/o/' )
/

The URI used in this query is the one we copied from our object storage bucket in Step 2 (minus the file name), and the query returns a listing of the objects in the bucket.
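
LIST_OBJECTS is just one of the utility routines in the package. As another illustration, DBMS_CLOUD.GET_OBJECT can copy a file from the bucket down into a database directory. The sketch below assumes a directory object named DATA_PUMP_DIR exists and that your database user can write to it:

BEGIN
   DBMS_CLOUD.GET_OBJECT(
       credential_name => 'DBCLOUD_TOKEN',
       object_uri      => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxxxxy/b/MyBucket/o/countygrowth.csv',
       directory_name  => 'DATA_PUMP_DIR' );
END;
/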

In my first example I am creating an External Table that is mapped to a file stored in my Object Storage bucket. This step is not required to use Hybrid Partitioned Tables, but it provides a nice way to demonstrate how easy this is to use.

For this example, the file I am using is relatively small: just 100 rows and 5 columns. It contains the population changes for all 100 counties of North Carolina from 2010 to 2020.

For reference, here’s a sample of what my .csv file looks like (displayed in Excel and a text editor):

Creating the External Table is very straightforward: you specify the format of the file using the json_object function. In my case it’s ‘csv’, but this could just as easily have been ‘avro’, ‘orc’, or ‘parquet’.

Additional information on the formatting operations is available in the online documentation for the Autonomous Database: https://docs.oracle.com/en/cloud/paas/autonomous-database/adbsa/format-options.html

BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
table_name => 'COUNTY_GROWTH',
credential_name => 'DBCLOUD_TOKEN',
file_uri_list => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxxxxy/b/MyBucket/o/countygrowth.csv',
format => json_object('type' value 'csv', 'skipheaders' value '1'),
field_list => 'COUNTY,
  CY2010,
  CY2020,
  CHANGE,
  PERCENTAGE',
column_list => 'COUNTY VARCHAR2(20),
  CY2010 NUMBER,
  CY2020 NUMBER,
  CHANGE NUMBER,
  PERCENTAGE NUMBER' );
END;
/
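
At this point nothing has been copied into the database; the table is simply a dictionary definition that points at the file in the bucket. A quick count makes a handy smoke test before running the formal validation below (a sketch; in my case it should return the 100 rows in the file):

select count(*) from COUNTY_GROWTH ;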

It’s possible to verify that the external table maps to the file as expected with the following procedure call:

BEGIN
   DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE(
      table_name => 'COUNTY_GROWTH',
      schema_name => 'SCOTT'
);
END;
/

The output of this procedure is placed into two tables named VALIDATE$1_LOG and VALIDATE$1_BAD. Querying the VALIDATE$1_LOG table returns output similar to this:

NOTE: The number placeholder in the table name will increment every time you call DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE.


SQL> select * from VALIDATE$1_LOG ;

RECORD
-----------------------------------------------------------------------

 LOG file opened at 03/12/21 08:55:23

Bad File: VALIDATE$1_2205.bad

Field Definitions for table COUNTY_GROWTH
  Record format DELIMITED BY
  Data in file has same endianness as the platform
  Rows with all null fields are accepted

  Fields in Data Source:

    COUNTY			    CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    CY2010			    CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    CY2020			    CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    CHANGE			    CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    PERCENTAGE			    CHAR (255)
      Terminated by ","
      Enclosed by """ and """


 LOG file opened at 03/12/21 08:55:23

Total Number of Files=1

Data File: https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxy/b/MyBucket/o/countygrowth.csv

Log File: VALIDATE$1_9160.log

38 rows selected.
 

In my case the External Table was created without any issues, so querying the VALIDATE$1_BAD table returns no rows:

SQL> select * from VALIDATE$1_BAD ;

no rows selected
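
Because the numeric suffix changes with every validation run, the log and bad tables can accumulate over time. A quick dictionary query (a sketch) will show which ones you have lying around:

select table_name
from   user_tables
where  table_name like 'VALIDATE$%'
order  by table_name ;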

As I mentioned earlier, the file I used for this demo is relatively small: just 100 rows with 5 columns of information. So it’s possible to run a query against the entire table (or flat file) to see its contents. But where’s the fun in that? Instead, I can use a SQL query to be more selective with the data I am returning; this time I’m looking for the counties with at least 100,000 residents in 2020 that experienced a drop in population from 2010 to 2020.


SQL> select *
  2  from   COUNTY_GROWTH
  3  where  CY2020 <= CY2010
  4  and    CY2020 >= 100000
  5  /

COUNTY			 CY2010     CY2020     CHANGE PERCENTAGE
-------------------- ---------- ---------- ---------- ----------
Craven			 103498     103016	 -482	     -.5
Robeson 		 134229     131238	-2991	    -2.2

But I am guessing this is not a very exciting demonstration. So let’s shift gears and take a look at using cloud object storage to host portions of a larger data-set.

5. Create a Hybrid-Partitioned table with partitions stored in the Object Storage bucket

For this example I will create a Hybrid Partitioned Table. The data set spans 17 years, 2005 to 2021, and I am placing the oldest data into 3 separate CSV files, one for each five-year grouping. I have uploaded the files to my object storage bucket and will create range partitions that transparently map to them, with the most recent partitions remaining locally in my on-premises database.

Here’s the statement I am using to create the new table. You will notice that there is a specific procedure for defining a Hybrid Partitioned Table:

BEGIN
DBMS_CLOUD.CREATE_HYBRID_PART_TABLE (
   table_name => 'RESTAURANT_INSPECTIONS',
   credential_name => 'DBCLOUD_TOKEN',
   format => json_object('type' value 'CSV', 'skipheaders' value '1'),
   column_list => 'id number, inspect_date date, restaurant_name varchar2(100), address varchar2(200), city varchar2(40), state varchar2(2), zip_code varchar2(12), lat number, lon number',
   partitioning_clause => 'partition by range (INSPECT_DATE) (
   partition inspections_arch_2005 values less than (TO_DATE(''01-01-2006'',''DD-MM-YYYY''))
      external location (''https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxxxxy/b/MyBucket/o/inspections_arch_2005.csv''),
   partition inspections_arch_2010 values less than (TO_DATE(''01-01-2011'',''DD-MM-YYYY''))
      external location (''https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxxxxy/b/MyBucket/o/inspections_arch_2010.csv''),
   partition inspections_arch_2015 values less than (TO_DATE(''01-01-2016'',''DD-MM-YYYY''))
      external location (''https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxxxxy/b/MyBucket/o/inspections_arch_2015.csv''),
   partition inspections_2020 values less than (TO_DATE(''01-01-2021'',''DD-MM-YYYY'')),
   partition inspections_2021 values less than (TO_DATE(''01-01-2022'',''DD-MM-YYYY''))  
   )'
 );
END ;
/

After creating the table, it’s good practice to verify that the data has been mapped without issue. The good news is that DBMS_CLOUD can do this for you:

BEGIN
   DBMS_CLOUD.VALIDATE_HYBRID_PART_TABLE (table_name => 'RESTAURANT_INSPECTIONS') ;
END ;
/

Executing the procedure simply produces the following output:

PL/SQL procedure successfully completed.

But it does create two tables with the results: VALIDATE$n_LOG and VALIDATE$n_BAD. (The “n” is a number that increments every time you call the procedure.)

When I query the VALIDATE$2_LOG table, I see a nice report of the validation, as follows:

SQL> select * from VALIDATE$2_LOG ;

RECORD
-------------------------------------------------------------------

 LOG file opened at 03/15/21 12:50:27

Bad File: VALIDATE$2_2205.bad

Field Definitions for table RESTAURANT_INSPECTIONS
  Record format DELIMITED BY
  Data in file has same endianness as the platform
  Rows with all null fields are accepted

  Fields in Data Source:

    ID                              CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    INSPECT_DATE                    CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    RESTAURANT_NAME                 CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    ADDRESS                         CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    CITY                            CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    STATE                           CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    ZIP_CODE                        CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    LAT                             CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    LON                             CHAR (255)
      Terminated by ","
      Enclosed by """ and """

Date Cache Statistics for table RESTAURANT_INSPECTIONS
  Max Size:      1000
  Entries :       204
  Hits    :       672
  Misses  :         0


 LOG file opened at 03/15/21 12:50:27

Total Number of Files=3

Data File: https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxxxxr/b/MyBucket/o/inspections_arch_2005.csv
Data File: https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxxxxr/b/MyBucket/o/inspections_arch_2010.csv
Data File: https://objectstorage.us-ashburn-1.oraclecloud.com/n/ixxxxxxxxxxr/b/MyBucket/o/inspections_arch_2015.csv

Log File: VALIDATE$2_9160.log

 LOG file opened at 03/15/21 12:50:27

Bad File: VALIDATE$2_2207.bad

Field Definitions for table RESTAURANT_INSPECTIONS
  Record format DELIMITED BY
  Data in file has same endianness as the platform
  Rows with all null fields are accepted

  Fields in Data Source:

    ID                              CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    INSPECT_DATE                    CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    RESTAURANT_NAME                 CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    ADDRESS                         CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    CITY                            CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    STATE                           CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    ZIP_CODE                        CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    LAT                             CHAR (255)
      Terminated by ","
      Enclosed by """ and """
    LON                             CHAR (255)
      Terminated by ","
      Enclosed by """ and """

Date Cache Statistics for table RESTAURANT_INSPECTIONS
  Max Size:      1000
  Entries :       477
  Hits    :      1798
  Misses  :         0

SQL>

But when I query VALIDATE$2_BAD I see no rows returned, which is a good thing:

SQL> select * from VALIDATE$2_BAD ;

no rows selected

Let’s go ahead and try a quick query of the table before inserting any data into it:


SQL> select to_char(inspect_date,'YYYY') YEAR, count(*)
  2  from   restaurant_inspections
  3  group  by to_char(inspect_date, 'YYYY')               
  4  order  by 1 ;


YEAR	       COUNT(*)
---- ------------------
2005		    876
2010		  1,086
2015		  1,189

You can see that the external data is already mapped into the table and visible to query.

A quick query of the user_tables shows me the table is a Hybrid Partitioned Table:


SQL> select table_name, hybrid                 
  2  from   user_tables
  3  where  table_name = 'RESTAURANT_INSPECTIONS' ;

TABLE_NAME		  HYB
------------------------- ---
RESTAURANT_INSPECTIONS	  YES

I can also query USER_TAB_PARTITIONS:

col table_name form A25
col partition_name form A25

select table_name,
       partition_name,
       tablespace_name,
       segment_created,
       indexing,
       read_only
from   user_tab_partitions
where  table_name = 'RESTAURANT_INSPECTIONS' ;



TABLE_NAME		  PARTITION_NAME	    TABLESPACE_NAME		   SEGM INDE READ
------------------------- ------------------------- ------------------------------ ---- ---- ----
RESTAURANT_INSPECTIONS	  INSPECTIONS_2020	     USERS			    NO	ON   NO
RESTAURANT_INSPECTIONS	  INSPECTIONS_2021	     USERS			    NO	ON   NO
RESTAURANT_INSPECTIONS	  INSPECTIONS_ARCH_2005      SYSTEM			    YES	OFF  YES
RESTAURANT_INSPECTIONS	  INSPECTIONS_ARCH_2010      SYSTEM			    YES	OFF  YES
RESTAURANT_INSPECTIONS	  INSPECTIONS_ARCH_2015      SYSTEM			    YES	OFF  YES

SQL> 

You will see that the 3 partitions residing in object storage are read-only and are listed under the SYSTEM tablespace.
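
A practical consequence of those read-only external partitions is that DML targeting them is rejected. Here’s a hedged sketch; the row values are invented purely for illustration, and the statement is expected to fail with a read-only partition error:

-- This row is dated 2005, so it maps to the external (read-only) partition
insert into RESTAURANT_INSPECTIONS (id, inspect_date, restaurant_name)
values (999999, to_date('15-06-2005','DD-MM-YYYY'), 'Test Diner') ;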

I’m now ready to insert data into the remaining (in-database) partitions. I happen to have a table called “STAGE_TBL” with the more recent sample data that I will use for this purpose; it contains entries for every year after 2015. So it’s very straightforward to populate my table:

SQL> insert into RESTAURANT_INSPECTIONS
  2      select * from STAGE_TBL ;

1206 rows created.

…and if we rerun our query, we can see the additional rows of data, with no indication that some of the data is stored in a completely different location:

SQL> select to_char(inspect_date,'YYYY') YEAR, count(*)
  2  from   restaurant_inspections
  3  group  by to_char(inspect_date, 'YYYY')
  4* order  by 1 ; 


YEAR	       COUNT(*)
---- ------------------
2005		    876
2010		  1,086
2015		  1,189
2016		    219
2017		    213
2018		    234
2019		    249
2020		    274
2021		     17

9 rows selected.

The last thing I want to demonstrate is what this looks like to the optimizer. To keep things simple, I enabled AUTOTRACE in my session and reran my query:

set autotrace on

Notice the line TABLE ACCESS HYBRID PART FULL in the explain plan.
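
If you prefer not to use AUTOTRACE, the same information is available through EXPLAIN PLAN and DBMS_XPLAN. A quick sketch (the plan details will vary by environment):

explain plan for
  select to_char(inspect_date,'YYYY'), count(*)
  from   restaurant_inspections
  group  by to_char(inspect_date,'YYYY') ;

select * from table(dbms_xplan.display()) ;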

Wrapping things up

Hopefully you found this post useful. I also hope you can see how straightforward it is to set up your on-premises database to access data hosted in Oracle Cloud Object Storage. The same procedures can also be applied to other cloud storage platforms such as Amazon S3 and Azure Blob Storage. Obviously the credentials would need to be defined to work with those platforms and the URI string would follow the Swift protocol for the file mapping, but other than that, the steps remain the same.
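
To give a flavor of what that looks like, here is a hypothetical sketch of a credential for Amazon S3, where the access key ID and secret access key take the place of the cloud username and Auth Token (the URI formats supported depend on your DBMS_CLOUD release, so check the documentation):

BEGIN
   DBMS_CLOUD.CREATE_CREDENTIAL(
       credential_name => 'S3_CRED',
       username        => '<your AWS access key ID>',
       password        => '<your AWS secret access key>' );
END;
/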