Using Data Sources

Loading in data can sometimes be a hassle. We wanted to make it easier for you to get started with all types of common data sources.

Sourcing Data

Sourcing data is synonymous with loading data. The parent class is SourceData(), but users should use the helper classes listed below. Everything the Source helpers below can do is possible through SourceData() directly, but every parameter must then be explicitly defined.

Sourcing data from SQL

Any interaction with a SQL database requires connection information to be defined. This can be done in your datamode.yml (see Setting Profile Configurations).

Let’s assume we have the following datamode.yml configured:

connections:
  mysql_dev:
    type: mysql
    host: internal-mysql.host.name.com
    port: 3306
    user: dev_user
    password: mypass!!
    dbname: science
    schema:
    default: True
  mysql_prod:
    type: mysql
    host: prod-mysql.host.name.com
    port: 3306
    user: prod_user
    password:
    dbname: science
    schema:

If we wanted to source data from mysql_dev, we can simply call SourceTable('mydevtable'). The connection profile mysql_dev is used by default since it was marked default: True in the yaml.

If we wanted to source data from mysql_prod, we would have to reference it explicitly using conn. This would look like SourceTable('myprodtable', conn='mysql_prod'). Notice that password is left empty in the yaml; you can supplement an existing connection configuration from a yaml at the time of calling.

Finally, you can also pass a conn_config directly to the class without the need for a yaml. conn_config accepts a Python dictionary with the keys type, host, port, user, password, and dbname and their corresponding values.
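To make the override behavior concrete, here is an illustrative sketch in plain Python (not datamode's actual internals): a conn_config dict supplied at call time fills in or overrides values loaded from the yaml profile.

```python
# Profile as it would be loaded from the datamode.yml above.
yaml_profile = {
    'type': 'mysql',
    'host': 'prod-mysql.host.name.com',
    'port': 3306,
    'user': 'prod_user',
    'password': None,          # left empty in datamode.yml
    'dbname': 'science',
}

# Supplied at call time via conn_config.
conn_config = {'password': 'mypr0dp4s5'}

# Call-time keys win; everything else falls through from the yaml.
effective = {**yaml_profile, **conn_config}

assert effective['password'] == 'mypr0dp4s5'
assert effective['host'] == 'prod-mysql.host.name.com'
```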

Putting it all together the three examples are as follows:

SourceTable('mydevtable'),
SourceTable('myprodtable', conn='mysql_prod', conn_config={'password': 'mypr0dp4s5'}),
SourceTable('anothertable', conn_config={'type': 'mysql', 'host': 'localhost', 'port': 3306, 'user': 'localuser', 'password': 'mylocalpass', 'dbname': 'localdb'})

SourceTable

SourceTable() loads a complete SQL table. See the examples above.

SourceQueryFile

Datamode supports using a query stored in a separate file. Assume we have a file named my_query.sql in the same directory that looks like:

SELECT *
FROM mytable
LIMIT 1000;

We use this query directly from the file (dfname simply names the object so it’s referenceable):

SourceQueryFile('my_query.sql', dfname='mytable_sample')

SourceInline

Users can also inject SQL directly into Source classes using SourceInline():

SourceInline('SELECT * FROM mytable LIMIT 1000', dfname='mytable_sample')

SourceFile

Datamode supports loading CSV and JSONLINES files. Simply pass the file path to SourceFile(). The file type is inferred from the extension, but this can be overridden by passing file_type to the class. The dfname defaults to the filename without the extension.

Users can also load in JSON or CSV objects directly from a URL.

SourceFile('myfile.csv'), # Assumes a CSV
SourceFile('data.logs', file_type='json'), # Assumes a JSON
SourceFile('foo.csv', file_type='json'), # Assumes a JSON
SourceFile('https://raw.githubusercontent.com/datamode/datasets/master/movies.csv'), # Loads a CSV
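The inference rules above can be sketched in plain Python (a hypothetical illustration, not the library's actual code): file_type comes from the extension unless overridden, and dfname defaults to the filename without the extension.

```python
from pathlib import Path

def infer_source(path, file_type=None):
    """Return (file_type, dfname) the way the rules above describe."""
    p = Path(path)
    # An explicit file_type wins; otherwise use the extension.
    inferred = file_type or p.suffix.lstrip('.').lower()
    # dfname defaults to the filename without its extension.
    return inferred, p.stem

assert infer_source('myfile.csv') == ('csv', 'myfile')
assert infer_source('data.logs', file_type='json') == ('json', 'data')
assert infer_source('foo.csv', file_type='json') == ('json', 'foo')
```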

Sinking Data

Sinking data is simply saving it to a location. A rule baked into the Sink classes is that the user cannot save back to a table or file that was used as a Source. This prevents accidental data loss.
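
The safety rule can be sketched as follows (assumed behavior in plain Python, not the library's actual implementation): any sink target that already appeared as a source is rejected.

```python
def check_sink_target(target, sources):
    """Refuse to sink to any table or file already used as a Source."""
    if target in sources:
        raise ValueError(f"refusing to sink to source {target!r}")
    return target

sources = {'mydevtable', 'input.csv'}

# A fresh target is fine.
assert check_sink_target('outputtable', sources) == 'outputtable'

# Writing back to a source is blocked.
try:
    check_sink_target('input.csv', sources)
    blocked = False
except ValueError:
    blocked = True
assert blocked
```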

SinkTable

SinkTable() uses the same mechanics for configuring SQL connections as any of the SQL related Sources. You can Source a table from one database and Sink to another like:

SourceTable('mydevtable'), # mysql_dev was set as the default in the YAML,
SinkTable('outputtable', conn='mysql_prod', conn_config={'password': 'mypr0dp4s5'}),

SinkFile

SinkFile() is similar to SourceFile() in that it infers the file type from the file path you pass it; this can be overridden by passing file_type.

SinkFile('output.csv'), # Saves as CSV
SinkFile('output.logs', file_type='json'), # Saves as JSON

Extracting a DataFrame

The run_transforms() method returns a TransformContext() object that contains the last table sourced, sinked, or transformed as a DataFrame. It can be retrieved via the df_current property:

TRANSFORMS = [
  SourceFile('input.csv'),
  SinkFile('output.csv'),
]

tcon = run_transforms(TRANSFORMS)

df = tcon.df_current

SetWorkingTable

Datamode applies transforms to the last table sourced, sinked, or transformed. When working with many tables, you may wish to switch back and forth between them for transforms. This can be done with SetWorkingTable().

SourceFile('input.csv'), # dfname defaults to filename without extension
SourceTable('mytable'),
DropDuplicates('unique_ids'), # Drops duplicates in the unique_ids column in the SQL table 'mytable'
SetWorkingTable('input'),
DropDuplicates('dates'), # Drops duplicates in the dates column from the CSV 'input'
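
The working-table mechanism can be sketched in plain Python (a hypothetical model, not datamode's internals): each source makes its table the current one, and SetWorkingTable moves the pointer back to a named table.

```python
class Context:
    """Minimal model of a transform context with a working-table pointer."""
    def __init__(self):
        self.tables = {}       # dfname -> table data
        self.current = None    # dfname of the working table

    def source(self, name, data):
        self.tables[name] = data
        self.current = name    # sourcing a table makes it the working table

    def set_working_table(self, name):
        self.current = name    # explicit switch, like SetWorkingTable()

ctx = Context()
ctx.source('input', ['2019-01-01', '2019-01-01'])
ctx.source('mytable', [1, 1, 2])
assert ctx.current == 'mytable'   # transforms would now apply to 'mytable'
ctx.set_working_table('input')
assert ctx.current == 'input'     # transforms now apply to 'input' again
```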