Quickstart¶
Getting started with giraffez begins with ensuring that the giraffez package is imported:
>>> import giraffez
Executing statements¶
Executing statements is a fundamental part of interacting with Teradata, and giraffez makes use of the Teradata CLIv2 to provide this critical functionality. This includes any tasks that are normally executed through official Teradata tools like SQL Assistant or BTEQ.
The first step to executing a statement is setting up a giraffez.Cmd object using the with context. Using with is important because it automatically closes the connection when the context ends, ensuring that open connections aren't littered all over the Teradata server. All examples going forward use and expect this context, so let's set it up:
with giraffez.Cmd() as cmd:
    # Connection is established
    # Any code using the connection will go here at this indent level
    # since the 'with' context maintains the connection for us
# Connection is closed when leaving context
Next, let's perform one of the most common database tasks: querying a table or view:
with giraffez.Cmd() as cmd:
    results = cmd.execute("select * from dbc.dbcinfo")
This returns a Cursor for the provided query, and printing it shows the various options that are set for execute:
>>> print(results)
Cursor(statements=1, multi_statement=False, prepare=False, coerce_floats=True, parse_dates=False)
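These options can be changed per call; as a sketch, assuming the names shown in the repr above (such as coerce_floats) are accepted directly as keyword arguments to execute:

with giraffez.Cmd() as cmd:
    # Assumption: coerce_floats mirrors the Cursor option shown above;
    # disabling it would keep DECIMAL values from being coerced to float
    results = cmd.execute("select * from dbc.dbcinfo", coerce_floats=False)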
Here we can see that it contains only one query statement and that multiple parallel statement mode is disabled. Reading from the cursor is straightforward:
>>> for row in results:  # the same as doing `for row in results.values()`
...     print(row)
...
{'infodata': '15.00.05.01c', 'infokey': 'VERSION'}
{'infodata': '15.00.05.01', 'infokey': 'RELEASE'}
{'infodata': 'Standard', 'infokey': 'LANGUAGE SUPPORT MODE'}
It is just as easy to execute multiple statements sequentially:
with giraffez.Cmd() as cmd:
    results = cmd.execute("""
        select * from dbc.dbcinfo;
        select count(*) as total from dbc.dbcinfo;
    """)
Multiple parallel statement mode is helpful when you need to run insert statements in parallel or need to execute multiple queries at the same time (similar to a union, but the results are returned as different logical result sets). Taking advantage of this only requires giving execute multiple statements separated by semicolons and setting the multi_statement keyword argument:
with giraffez.Cmd() as cmd:
    results = cmd.execute("""
        select * from dbc.dbcinfo;
        select count(*) as total from dbc.dbcinfo;
    """, multi_statement=True)
The two statements will execute as part of the same request, and yield from the same results iterator:
>>> for row in results:
...     print(row)
...
{'infodata': '15.00.05.01c', 'infokey': 'VERSION'}
{'infodata': '15.00.05.01', 'infokey': 'RELEASE'}
{'infodata': 'Standard', 'infokey': 'LANGUAGE SUPPORT MODE'}
{'total': 3}
Working with cursors¶
Executing statements returns a cursor-like object that enables traversal over the executing statements. This works in a similar manner to other database client software that uses a cursor, and just as with libraries like pyodbc, one must read completely from the cursor to execute the statements and exhaust the records returned by Teradata:
>>> for row in results:
...     pass
...
In cases where one does not need the data but needs to make sure all statements execute, the readall method can be used:
>>> results.readall()
3
The only value returned is the number of rows that were read.
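For example, running a statement where only completion matters (the table name here is illustrative):

with giraffez.Cmd() as cmd:
    # hypothetical staging table; we only care that the delete completes
    results = cmd.execute("delete from database.staging_table")
    results.readall()  # drain the cursor so the statement fully executes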
Accessing statement metadata¶
Consider the following set of statements:
with giraffez.Cmd() as cmd:
    results = cmd.execute("""
        select * dbc.dbcinfo;
        select * from dbc.dbcinfo;
        select count(*) as total from dbc.dbcinfo;
    """, panic=False)
The first statement will fail due to a syntax error, and with panic turned off, execute will move on to the next statement without raising an exception. The results end up looking like this:
>>> for row in results:
...     print(row)
...
{'infodata': '15.00.05.01c', 'infokey': 'VERSION'}
{'infodata': '15.00.05.01', 'infokey': 'RELEASE'}
{'infodata': 'Standard', 'infokey': 'LANGUAGE SUPPORT MODE'}
{'total': 3}
The column information can be accessed easily via the Columns object:
>>> results.columns
Column(total integer(4) N -(10)9)
However, when more than one statement is passed to execute, these will be the columns for the last statement to execute successfully. Luckily, getting the columns and other useful metadata (like whether an error occurred) for each statement is easy:
>>> for i, statement in enumerate(results.statements, 1):
...     print("Statement[{}]:".format(i))
...     print(repr(statement.columns))
...
Statement[1]:
None
Statement[2]:
Column(infokey varchar(60) N X(30))
Column(infodata varchar(32768) Y X(16384))
Statement[3]:
Column(total integer(4) N -(10)9)
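The Columns object can also be inspected programmatically; a minimal sketch, assuming it is iterable and that each Column exposes a name attribute:

>>> for column in results.columns:
...     print(column.name)
...
total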
Other useful metadata is available as well, such as the statement errors:
>>> for i, statement in enumerate(results.statements, 1):
...     print("Statement[{}]: {!r}".format(i, statement.error))
...
Statement[1]: TeradataError('3706: Syntax error: SELECT * must have a FROM clause.')
Statement[2]: None
Statement[3]: None
Additionally, the error is the actual error instance, so it can be raised:
>>> for i, statement in enumerate(results.statements, 1):
...     if statement.error:
...         raise statement.error
Traceback (most recent call last):
File "./example.py", line 20, in <module>
raise statement.error
File "../giraffez/cmd.py", line 91, in _execute
self.conn.execute(str(statement), prepare_only=self.prepare_only)
giraffez.TeradataError: 3706: Syntax error: SELECT * must have a FROM clause.
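Putting this together, a typical pattern with panic=False is to drain the cursor and then report any failures; this sketch uses only the readall, statements, and error attributes shown above (batch_sql is a stand-in for your semicolon-separated statements):

with giraffez.Cmd() as cmd:
    results = cmd.execute(batch_sql, panic=False)  # batch_sql: placeholder for your SQL
    results.readall()  # execute every statement, discarding rows
    for i, statement in enumerate(results.statements, 1):
        if statement.error:
            print("Statement[{}] failed: {!r}".format(i, statement.error))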
Loading data into a table¶
giraffez.Cmd has the method insert, which attempts to provide a simple interface for loading data when performance is not necessarily critical. It does so by generating insert statements and submitting them in parallel execution mode. Here is inserting data from a file:
with giraffez.Cmd() as cmd:
    stats = cmd.insert('database.table_name', 'my_data.txt')
This requires a delimited header to be provided as the first line of the file, and it returns a dict with the number of rows inserted and the number of errors encountered:
>>> print(stats)
{'count': 3, 'errors': 0}
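insert can also load rows passed directly from Python, which Advanced Usage covers in more depth; the signature sketched here (a list of rows plus a fields keyword naming the target columns) is an assumption:

with giraffez.Cmd() as cmd:
    rows = [
        ("Hemingway", "Ernest"),
        ("Chekhov", "Anton"),
    ]
    # assumed signature: insert(table_name, rows, fields=[...])
    stats = cmd.insert('database.table_name', rows, fields=['last_name', 'first_name'])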
A more involved example that loads rows individually using insert can be found in Advanced Usage.
Exporting large amounts of data¶
When exporting a large amount of data from Teradata (many millions of rows), giraffez.BulkExport should be used. It makes use of the bulk driver provided by the Teradata Parallel Transporter API to retrieve data from Teradata efficiently. This powerful export can be set up without needing to create complex fastexport scripts:
with giraffez.BulkExport('dbc.dbcinfo') as export:
    for row in export.to_list():
        print(row)
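Since to_list yields plain Python rows, writing an export to a delimited file needs nothing beyond standard file I/O (the output path and pipe delimiter below are illustrative):

with giraffez.BulkExport('dbc.dbcinfo') as export:
    with open('dbcinfo.txt', 'w') as f:
        for row in export.to_list():
            # join the row's values with a pipe delimiter, one row per line
            f.write('|'.join(str(value) for value in row) + '\n')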
More options are detailed in the API reference for giraffez.BulkExport.
Loading large amounts of data¶
giraffez.BulkLoad is best utilized for loading large (> ~100k rows) datasets. It makes use of the Teradata Parallel Transporter API bulk update driver, which is generally faster than something like insert (which just generates insert statements), and it has less of a performance impact on the Teradata server when dealing with larger datasets. While giraffez.BulkLoad uses the bulk update driver, it is not meant to be a direct wrapper of the mload functionality; rather, it is a friendlier abstraction with mload performance.
Just like with insert, the fields and delimiter for the input data are inferred from the header (or from a JSON stream):
with giraffez.BulkLoad("database.table_name") as load:
    load.from_file("input.txt")
Another important feature is loading rows individually when dealing with information that isn’t being read from a file:
with giraffez.BulkLoad("database.table_name") as load:
    load.columns = ["last_name", "first_name"]
    rows = [
        ("Hemingway", "Ernest"),
        ("Chekhov", "Anton"),
        ("Kafka", "Franz")
    ]
    for row in rows:
        load.put(row)
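The same pattern works for any iterable source; for instance, feeding put from a CSV file with Python's csv module (the file name and columns are illustrative):

import csv

with giraffez.BulkLoad("database.table_name") as load:
    load.columns = ["last_name", "first_name"]
    with open("authors.csv") as f:
        for row in csv.reader(f):
            # each CSV row is a list of strings matching the columns above
            load.put(tuple(row))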
The Teradata Parallel Transporter API returns an exit code for every job using the bulk update driver. This exit code can be 0, 2, 4, 8, or 12. This is intrinsic to how the driver works; any exit code other than 0 indicates some kind of failure to complete, and it is the same code returned when running a MultiLoad job with Teradata's official mload command-line tool. To remove unnecessary boilerplate, this exit code is implicitly 0 when the job succeeds, and an exception carrying the exit code is raised should the job be unsuccessful. While we try very hard to abstract away the rough edges of the MultiLoad protocol, it is sometimes not very clear how the job errored. In these cases, passing the print_error_table keyword allows for convenient access to the error table upon exit:
with giraffez.BulkLoad("database.table_name", print_error_table=True) as load:
This is something you might use while troubleshooting a new script and most likely remove once the script is working correctly. It is important to note that this data is in the error table regardless of whether this option is set, and will remain there until the job tables are cleaned up.
Configuring giraffez¶
The giraffez.Config class is a convenient wrapper for accessing and modifying settings and credentials in your .girafferc file. While most of the time these settings will be set and updated on the command line, there may be a need to access them via Python:
with giraffez.Config() as config:
    default_connection = config.get_value("connections.default")
    print(default_connection)
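The same dotted-path lookup works for any key in the file; a sketch, assuming your .girafferc contains a connections section keyed by connection name, as written by the giraffez command-line tool:

with giraffez.Config() as config:
    # 'connections.prod.host' is a hypothetical key for a connection named 'prod'
    host = config.get_value("connections.prod.host")
    print(host)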
However, a more likely scenario is that giraffez will also be used to store other sensitive information in the .girafferc file. giraffez.Secret can be used to handle the storage and retrieval of this information, allowing it to be encrypted at rest while remaining easily accessible:
with giraffez.Secret(mode='w') as secret:
    secret.set("sso.username", "username")
    secret.set("sso.password", "password")
And using them in a script looks something like:
import requests

with giraffez.Secret() as secret:
    sso_username, sso_password = secret("sso.username, sso.password")

proxy_url = "https://{0}:{1}@proxy.example.com".format(sso_username, sso_password)
r = requests.get("https://www.google.com", proxies={"http": proxy_url, "https": proxy_url})
print(r.text)