Configuring Parsers with gsqlcmd
Contents:
Parsing HTML
gsqlcmd allows extracting data from HTML tables.
You can extract the required table data using the /rootPath option.
For example, you can download dividends from the nasdaq.com website like
gsqlcmd download http://www.nasdaq.com/symbol/aapl/dividend-history aapl.htm
Then, get available tables using the command:
gsqlcmd show-html-tables aapl.htm tables.htm /openOutput
The resulting tables.htm file contains data like this:
You can see that the dividends are in table 3.
So, use the command like this to convert the downloaded file:
gsqlcmd convert aapl.htm dividends.csv /rootPath=3 /quoteChar=
or like this to extact data from the web page:
gsqlcmd convert http://www.nasdaq.com/symbol/aapl/dividend-history dividends.csv /rootPath=3 /quoteChar=
The resulting dividends.csv contains data like the following:
ExEffDate;Type;CashAmount;DeclarationDate;RecordDate;PaymentDate 2019-02-08;Cash;0.73;2019-01-29;2019-02-11;2019-02-14 2018-11-08;Cash;0.73;2018-11-01;2018-11-12;2018-11-15 2018-08-10;Cash;0.73;2018-07-31;2018-08-13;2018-08-16 2018-05-11;Cash;0.73;2018-05-01;2018-05-14;2018-05-17 2018-02-09;Cash;0.63;2018-02-01;2018-02-12;2018-02-15
gsqlcmd detects the correct table in most cases automatically.
So, you can try a simpler command first:
gsqlcmd convert aapl.htm dividends.csv
If the required page contains several tables with the required data, apply the convert mode for every table.
If the tables have the same structure, specify multiple indexes of the tables in the /rootPath option separated by commas like /rootPath=20,21,22.
If the table does not contain headers, use /noSourceHeaders option. In this case, gsqlcmd names columns like F1, F2, and so on.
Parsing XML
gsqlcmd allows extracting table data from hierarchical XML files.
In most cases, gsqlcmd detects the desired data automatically.
However, you can use the /rootPath and other parser options to customize the parser.
For example, test.xml contains the following contents:
<?xml version="1.0" encoding="utf-8"?> <root> <parent> <row id="1"><f1>data11</f1><f2>data12</f2></row> <row id="2"><f1>data21</f1><f2>data22</f2></row> <row id="3"><f1>data31</f1><f2>data32</f2></row> </parent> </root>
The basic convert command looks like this:
gsqlcmd convert test.xml test.csv
The resulting test.csv will have the following content:
id;"f1";"f2" 1;"data11";"data12" 2;"data21";"data22" 3;"data31";"data32"
gsqlcmd selects the "best" root of the data.
This command is equivalent to
gsqlcmd convert test.xml test.csv /rootPath=root.parent.row
You can change the root path to get data from the specified node.
For example:
gsqlcmd convert test.xml test.csv /rootPath=root.parent
The result is the following:
row_id;"row_f1";"row_f2" 1;"data11";"data12" 2;"data21";"data22" 3;"data31";"data32"
You can use the /quoteChar option to remove quotes.
For example:
gsqlcmd convert test.xml test.csv /rootPath=root.parent /quoteChar= row_id;row_f1;row_f2 1;data11;data12 2;data21;data22 3;data31;data32
Use the show-roots mode to list applicable roots.
For example:
gsqlcmd show-roots test.xml root root.parent root.parent.row root.parent.row.id root.parent.row.f1 root.parent.row.f2
Take a look at the following options that allows customizing the parser:
/attributedColumns=<col>.<attr>[;...] /collapsedNodes=<node>[;...] /ignoredTags=<tag>[;...] /keptNodes=<node>[;...] /requiredColumns=<name>[;...] /skippedNodes=<node>[;...]
For example, you can skip required nodes using /skippedNodes option like:
gsqlcmd convert test.xml test.csv /quoteChar= /skippedNodes=f2 id;f1 1;data11 2;data21 3;data31
Parsing JSON
Parsing JSON is similar to the parsing XML as both document formats have hierarchical structures.
For example, test.json contains the following contents:
{"root": {"parent":[ {"id":1,"f1":"data11","f2":"data12"}, {"id":2,"f1":"data21","f2":"data22"}, {"id":3,"f1":"data31","f2":"data32"} ] }
The basic convert command returns the following:
gsqlcmd convert test.json test.csv Id;"F1";"F2" 1;"data11";"data12" 2;"data21";"data22" 3;"data31";"data32"
The JSON parser capitalizes the first letters of column names.
Use the show-roots mode to list applicable roots.
For example:
gsqlcmd show-roots test.json root root.parent root.parent.row root.parent.row.id root.parent.row.f1 root.parent.row.f2
You can select the required data using the /rootPath option.
For example:
gsqlcmd convert test.json test.csv /rootPath=root /quoteChar= Parent_id;Parent_f1;Parent_f2 1;data11;data12 2;data21;data22 3;data31;data32
The JSON parser support multiple roots. For example:
gsqlcmd convert test.json test.csv /rootPath=root.parent.f1,root.parent.f2 /quoteChar= F1;F2 data11;data12 data21;data22 data31;data32
You can use * instead of node names. For example:
gsqlcmd convert test.json test.csv /rootPath=*.*.f1,*.*.f2 /quoteChar=
You can find a lot of ready-to-use samples in the Downloading and converting to CSV folder of the downloaded gsqlcmd package.
Also, take a look at the following options that allows customizing the parser:
/keptNodes=<node>[;...] /requiredColumns=<name>[;...] /skippedNodes=<node>[;...]
Parsing Plain Text
gsqlcmd allows getting data from plain text files if the text contains detected columns like these:
id f1 f2 1 data11 data12 2 data21 data22 3 data31 data32
Use the convert command to extract data:
gsqlcmd convert test.txt test.csv
The resulting test.csv file contains the following:
id;"f1";"f2" 1;"data11";"data12" 2;"data21";"data22" 3;"data31";"data32"
Note you can also use Microsoft ODBC and OLE DB text drivers to select and convert plain text data.
In the initial step, update the schema.ini file using the make-ini command:
gsqlcmd make-ini test.txt schema.ini
Then use the exec command to select data:
gsqlcmd exec test.txt "SELECT * FROM test.txt" test.csv
You can use SQL-supported features like WHERE or GROUP in such commands.
Parsing CSV
gsqlcmd allows extracting data from CSV files and converting to any supported output format.
Take a look at the useful formatting options:
/add=<header>=<value | function>[;...] /addRowNum /dateFormat=<format> /dateTimeFormat=<format> /escapeChar=<char> /noBOM /outputCodepage=<codepage> /outputCulture=<name> /[outputFormat=]asText | asTxt | asCsv | asHtm | asHtml | asXml | asJson /[[output]Separator](formatting-options.htm#option-outputSeparator)=<separator> | Tab /quoteChar=<char> /rowNumBase=<integer> /timeFormat=<format>
For example, you can use the following command to add new columns, like Symbol and Date, and convert the output file to the UTF-8 encoding.
gsqlcmd convert test.csv test-65001.csv /outputEncoding=65001 /add=Symbol=AAPL;Date=Date()
Note you can also use Microsoft ODBC and OLE DB text drivers to select and convert CSV data.
In the initial step, update the schema.ini file using the make-ini command:
gsqlcmd make-ini test.csv schema.ini
Then, use the exec command to select data:
gsqlcmd exec test.csv "SELECT 'AAPL' AS Symbol, :date AS [Date], * FROM test.csv" output.csv /set=date=Date()
You can use SQL-supported features like WHERE or GROUP in such commands.