Data Sources

Configuring Data Connections

Datamode supports loading/saving with the following datastores:

  • Postgres, MySQL, local SQLite
  • Local csv, json, jsonl files
  • S3, GCloud, Azure buckets (planned)
  • An existing DataFrame in your Python script

You can add datastore connections to datamode.yml, one connection per entry. See User Profiles for instructions. When Datamode needs a connection, it will use the default logic below, or you can override the connection with the conn parameter to any Source/Sink transform. You can also override individual options like username, password, etc with conn_config. If Datamode needs a connection and doesn’t find one, it will error.

Datamode will use connections in this order:

  • a connection specified in a transform (e.g. SourceSql(conn='abc'))
  • a connection marked with default: true in the .yml
  • a connection if it’s the only one in the .yml.

You don’t need datamode.yml if you are sending/outputting a DataFrame directly with SourceDF, using SourceFile, or specifying the connection information manually via conn_config to any Source/Sink.

Data Context

Datamode can work with multiple tables and data connections. By default, Datamode will use the default connection. Source/Sink transforms can override the connection with the conn parameter, and/or override individual values like username/password/etc in conn_config.

SourceTable('abc')  # Use default connection; current table is 'abc'
SourceTable('def', conn='my_conn')  # Use 'my_conn', set current table to 'def'
DropColumns('some_column')  # Operate on table 'def'
SetWorkingTable(table_name='abc')  # Change table context back to 'abc'.
ProcessNones('another_column')  # Operate on table 'abc'
SinkData('output')  # Write results back to default conn

For any transform that operates on a single table, Datamode keeps track of the current working table. This can also be changed with the table_name parameter to SetWorkingTable.

Datasource Transforms

SourceData

These transforms wrap SourceData and will enforce certain constraints. Any parameter that SourceData accepts can be passed to other Source transforms and will be used if appropriate.

  • SourceTable(table)
    • Loads table into Datamode.
  • SourceQueryFile(queryfile)
    • Runs the query in queryfile into Datamode.
  • SourceInline(query)
    • Runs a sql query, for example select * from abc and loads the results into Datamode.
  • SourceSql(table=None, query=None, queryfile=None)
    • Loads the results of a table, query or queryfile into Datamode. If none is specified, Datamode will error. SourceTable, SourceQueryFile, and SourceInline are just convenience wrappers for SourceSql.
  • SourceFile(path, file_type=None, sample_ratio=42, sample_seed=42)
    • Loads a file from path.
    • If a URL to a file is passed as a path it must begin with http or https.
    • Options:
      • file_type can be explicitly set or it will be inferred from the extension in the path.
      • sample_ratio takes a random sample of your data, to let you work with larger datasets.
      • sample_seed lets you fix the random sample deterministically. Pass None to get a random sample.
  • SourceS3(bucket, file_key, file_type=None, sample_ratio=42, sample_seed=42)
    • Uses bucket bucket to access file file_key.
    • The options available for SourceFile as also available for files accessed through SourceS3.
  • SourceDF(df, dfname='dataframe')
    • Use this when you want to use an existing DataFrame. dfname can be overridden.
  • SourceData(dfname=None, table=None, query=None, queryfile=None,path=None, conn=None, conn_config={})
    • A generic transform to load data named dfname and set the current context. The parameters are the same as the convenience classes above. Any of these parameters (e.g. conn or conn_config can be overridden in the convenience classes directly. conn_config is a dict of options that will override any connection values in datamode.yaml.

SinkData

Sink transforms wrap SinkData. Again, any parameter accepted by SinkData can be used in the Sink transforms if appropriate.

  • SinkTable(table)
    • Writes table to the current sql connection.
  • SinkFile(path, file_type=None)
    • Writes file path with the appropriate output.
    • file_type either set explicitly or inferred from path.
  • SinkData(dfname, table=None, path=None, filetype=None, conn=None, conn_config={}, if_exists='replace')
    • A generic transform to save data dfname to a connection or a file, again using the current context.

Note that to prevent data loss, Datamode will error if you try to write back to any source with the same table or filename that has been sourced. Otherwise, is_exists will default to replace, and you can also use append.

SetWorkingTable

  • SetWorkingTable(dfname)
    • Changes the dfname.
    • When using multitable transforms, the tables should be specified explicitly.

Datamode will use the working table on any single-table transforms. The working table is also automatically set with any Source transform. For multi-table transforms, the tables have to be specified explicitly.