Quantcast
Channel: Blog dbi services
Viewing all articles
Browse latest Browse all 1431

Oracle 18c – select from a flat file

$
0
0

By Franck Pachot

.
This post is the first one from a series of small examples on recent Oracle features. My goal is to present them to people outside of Oracle and relational databases usage, maybe some NoSQL players. And this is why the title is “select from a flat-file” rather than “Inline External Tables”. In my opinion, the names of the features of Oracle Database are invented by the architects and developers, sometimes renamed by Marketing or CTO, and all that is very far from what the users are looking for. In order to understand “Inline External Table” you need to know all the history behind: there were tables, then external tables, and there were queries, and inlined queries, and… But imagine a junior who just wants to query a file, he will never find this feature. He has a file, it is not a table, it is not external, and it is not inline. What is external to him is this SQL language and what we want to show him is that this language can query his file.

I’m running this in the Oracle 20c preview in the Oracle Cloud.

In this post, my goal is to load a small fact and dimension table for the next posts about some recent features that are interesting in data warehouses. It is the occasion to show that with Oracle we can easily select from a .csv file, without the need to run SQL*Loader or create an external table.
I’m running everything from SQLcl and then I use the host command to call curl:


host curl -L http://opendata.ecdc.europa.eu/covid19/casedistribution/csv/ | dos2unix | sort -r > /tmp/covid-19.csv

This gets the latest number of COVID-19 cases per day and per country.

It looks like this:


SQL> host head  /tmp/covid-19.csv
dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
31/12/2019,31,12,2019,27,0,China,CN,CHN,1392730000,Asia
31/12/2019,31,12,2019,0,0,Vietnam,VN,VNM,95540395,Asia
31/12/2019,31,12,2019,0,0,United_States_of_America,US,USA,327167434,America
31/12/2019,31,12,2019,0,0,United_Kingdom,UK,GBR,66488991,Europe
31/12/2019,31,12,2019,0,0,United_Arab_Emirates,AE,ARE,9630959,Asia
31/12/2019,31,12,2019,0,0,Thailand,TH,THA,69428524,Asia
31/12/2019,31,12,2019,0,0,Taiwan,TW,TWN,23780452,Asia
31/12/2019,31,12,2019,0,0,Switzerland,CH,CHE,8516543,Europe
31/12/2019,31,12,2019,0,0,Sweden,SE,SWE,10183175,Europe

I sorted them on date on purpose (next posts may talk about data clustering).

I need a directory object to access the file:


SQL> create or replace directory "/tmp" as '/tmp';

Directory created.

You don’t have to use quoted identifiers if you don’t like it. I find it convenient here.

I can directly select from the file, the EXTERNAL clause mentioning what we had to put in an external table before 18c:


SQL> select *
   from external (
    (
     dateRep                    varchar2(10)
     ,day                       number
     ,month                     number
     ,year                      number
     ,cases                     number
     ,deaths                    number
     ,countriesAndTerritories   varchar2(60)
     ,geoId                     varchar2(30)
     ,countryterritoryCode      varchar2(3)
     ,popData2018               number
     ,continentExp              varchar2(30)
    )
    default directory "/tmp"
    access parameters (
     records delimited by newline skip 1 -- skip header
     logfile 'covid-19.log'
     badfile 'covid-19.bad'
     fields terminated by "," optionally enclosed by '"'
    )
    location ('covid-19.csv')
    reject limit 0
   )
 .

SQL> /
      DATEREP    DAY    MONTH    YEAR    CASES    DEATHS                       COUNTRIESANDTERRITORIES       GEOID    COUNTRYTERRITORYCODE    POPDATA2018    CONTINENTEXP
_____________ ______ ________ _______ ________ _________ _____________________________________________ ___________ _______________________ ______________ _______________
01/01/2020         1        1    2020        0         0 Algeria                                       DZ          DZA                           42228429 Africa
01/01/2020         1        1    2020        0         0 Armenia                                       AM          ARM                            2951776 Europe
01/01/2020         1        1    2020        0         0 Australia                                     AU          AUS                           24992369 Oceania
01/01/2020         1        1    2020        0         0 Austria                                       AT          AUT                            8847037 Europe
01/01/2020         1        1    2020        0         0 Azerbaijan                                    AZ          AZE                            9942334 Europe
01/01/2020         1        1    2020        0         0 Bahrain                                       BH          BHR                            1569439 Asia
ORA-01013: user requested cancel of current operation

SQL>

I cancelled it as that’s too long to display here.

As the query is still in the buffer, I just add a CREATE TABLE in front of it:


SQL> 1
  1* select *
SQL> c/select/create table covid as select/
   create table covid as select *
  2   from external (
  3    (
  4     dateRep                    varchar2(10)
  5     ,day                       number
...

SQL> /

Table created.

SQL>

While I’m there I’ll quickly create a fact table and a dimension hierarchy:


SQL> create table continents as select rownum continent_id, continentexp continent_name from (select distinct continentexp from covid where continentexp!='Other');

Table created.

SQL> create table countries as select country_id,country_code,country_name,continent_id from (select distinct geoid country_id,countryterritorycode country_code,countriesandterritories country_name,continentexp continent_name from covid where continentexp!='Other') left join continents using(continent_name);

Table created.

SQL> create table cases as select daterep, geoid country_id,cases from covid where continentexp!='Other';

Table created.

SQL> alter table continents add primary key (continent_id);

Table altered.

SQL> alter table countries add foreign key (continent_id) references continents;

Table altered.

SQL> alter table countries add primary key (country_id);

Table altered.

SQL> alter table cases add foreign key (country_id) references countries;

Table altered.

SQL> alter table cases add primary key (country_id,daterep);

Table altered.

SQL>

This create a CASES fact table with only one measure (covid-19 cases) and two dimensions. To get it simple, the date dimension here is just a date column (you usually have a foreign key to a calendar dimension). The geographical dimension is a foreign key to the COUNTRIES table which itself has a foreign key referencing the CONTINENTS table.

12c Top-N queries

In 12c we have a nice syntax for Top-N queries with the FETCH FIRST clause of the ORDER BY:


SQL> select continent_name,country_code,max(cases) from cases join countries using(country_id) join continents using(continent_id) group by continent_name,country_code order by max(cases) desc fetch first 10 rows only;

CONTINENT_NAME                 COU MAX(CASES)
------------------------------ --- ----------
America                        USA      48529
America                        BRA      33274
Europe                         RUS      17898
Asia                           CHN      15141
America                        ECU      11536
Asia                           IND       9304
Europe                         ESP       9181
America                        PER       8875
Europe                         GBR       8719
Europe                         FRA       7578

10 rows selected.

This returns the 10 countries which had the maximum covid-19 cases per day.

20c WINDOW clauses

If I want to show the date with the maximum value, I can use analytic functions and in 20c I don’t have to repeat the window several times:


SQL> select continent_name,country_code,top_date,top_cases from (
  2   select continent_name,country_code,daterep,cases
  3    ,first_value(daterep)over(w) top_date
  4    ,first_value(cases)over(w) top_cases
  5    ,row_number()over(w) r
  6    from cases join countries using(country_id) join continents using(continent_id)
  7    window w as (partition by continent_id order by cases desc)
  8   )
  9   where r=1 -- this to get the rows with the highes value only
 10   order by top_cases desc fetch first 10 rows only;

CONTINENT_NAME                 COU TOP_DATE    TOP_CASES
------------------------------ --- ---------- ----------
America                        USA 26/04/2020      48529
Europe                         RUS 02/06/2020      17898
Asia                           CHN 13/02/2020      15141
Africa                         ZAF 30/05/2020       1837
Oceania                        AUS 23/03/2020        611

The same can be done before 20c but you have to write the (partition by continent_id order by cases desc) for each projection.

In the next post I’ll show a very nice feature. Keeping the 3 tables normalized data model but, because storage is cheap, materializing some pre-computed joins. If you are a fan of NoSQL because “storage is cheap” and “joins are expensive”, then you will see what we can do with SQL in this area…

Cet article Oracle 18c – select from a flat file est apparu en premier sur Blog dbi services.


Viewing all articles
Browse latest Browse all 1431

Trending Articles