Using Transforms

Datamode’s set of Transforms makes data reshaping simpler and the resulting pipelines easier to maintain.

SelectColumns

Let’s use the following data table named ex_data as an example:

# ex_data
uid     first_name     last_name     city            state    country
1       john           smith         san francisco   ca       usa
2       sarah          jenkins       seattle         wa       usa
3       sheila         carey         san diego       ca       usa

Running SelectColumns(['first_name', 'last_name']) on ex_data would return:

# ex_data
first_name     last_name
john           smith
sarah          jenkins
sheila         carey
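
Conceptually this is plain column selection. For reference, here is a rough pandas sketch of the same operation (the DataFrame construction is just for illustration; this is not Datamode’s actual implementation):

import pandas as pd

ex_data = pd.DataFrame({
    'uid': [1, 2, 3],
    'first_name': ['john', 'sarah', 'sheila'],
    'last_name': ['smith', 'jenkins', 'carey'],
})

# Illustrative pandas equivalent of SelectColumns(['first_name', 'last_name']):
# keep only the named columns, in the order given.
selected = ex_data[['first_name', 'last_name']]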

DropColumns

Running DropColumns(['first_name', 'last_name']) on ex_data would return:

# ex_data
uid     city            state    country
1       san francisco   ca       usa
2       seattle         wa       usa
3       san diego       ca       usa
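
A rough pandas equivalent, again as an illustrative sketch rather than Datamode’s code:

import pandas as pd

ex_data = pd.DataFrame({
    'uid': [1, 2, 3],
    'city': ['san francisco', 'seattle', 'san diego'],
    'first_name': ['john', 'sarah', 'sheila'],
    'last_name': ['smith', 'jenkins', 'carey'],
})

# Illustrative pandas equivalent of DropColumns(['first_name', 'last_name']):
# remove the named columns, keep everything else.
dropped = ex_data.drop(columns=['first_name', 'last_name'])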

BlanksToNones

Many tools do not recognize empty or whitespace-only strings as proper NULL values. That’s where BlanksToNones() comes in.

Imagine the following data where the first row’s city field is actually an empty string:

# ex_data
uid     city            state    country
1                       ca       usa
2       seattle         wa       usa
3       san diego       ca       usa

Applying BlanksToNones('city') would convert that empty string into a valid NULL, like so:

# ex_data
uid     city            state    country
1       NULL            ca       usa
2       seattle         wa       usa
3       san diego       ca       usa
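
If you want the same effect in plain pandas, one common idiom (a sketch, assuming NaN as the null marker) is:

import numpy as np
import pandas as pd

ex_data = pd.DataFrame({
    'uid': [1, 2, 3],
    'city': ['', 'seattle', 'san diego'],
})

# Illustrative equivalent of BlanksToNones('city'): replace empty or
# whitespace-only strings with NaN, pandas' null marker.
ex_data['city'] = ex_data['city'].replace(r'^\s*$', np.nan, regex=True)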

ProcessNones

ProcessNones() takes in the column(s) you want to process and the action you wish to perform on those None values.

Using the following dataset, let’s run a few variations of ProcessNones().

# ex_data
uid     name            price    date
1       tom             NaN      01-03-2018 12:01
2       sally           17.00    01-03-2018 12:01
3       NULL            15.95    01-03-2018 12:01

ProcessNones(['name', 'price'], action='drop') would drop every row containing a NULL, None, or NaN in either column and yield:

# ex_data
uid     name            price    date
2       sally           17.00    01-03-2018 12:01

ProcessNones('price', action='zero') would replace any NaN in the price column with zero and yield:

# ex_data
uid     name            price    date
1       tom             0        01-03-2018 12:01
2       sally           17.00    01-03-2018 12:01
3       NULL            15.95    01-03-2018 12:01
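
For reference, rough pandas equivalents of both actions (sketches only; Datamode’s internals may differ):

import numpy as np
import pandas as pd

ex_data = pd.DataFrame({
    'uid': [1, 2, 3],
    'name': ['tom', 'sally', None],
    'price': [np.nan, 17.00, 15.95],
})

# Illustrative equivalent of action='drop': remove rows with a missing
# value in any of the listed columns.
dropped = ex_data.dropna(subset=['name', 'price'])

# Illustrative equivalent of action='zero': replace missing values in the
# column with 0.
zeroed = ex_data.assign(price=ex_data['price'].fillna(0))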

StringToNumber

Sometimes columns containing numerical values are incorrectly cast as strings. Running StringToNumber() on the column converts its values to proper numbers; any unparseable string (such as ‘foo’) becomes NULL.
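
A minimal pandas sketch of the same idea (assuming NaN stands in for NULL):

import pandas as pd

s = pd.Series(['12', '7.5', 'foo'])

# Illustrative equivalent of StringToNumber(): parse numeric strings;
# anything unparseable (like 'foo') becomes NaN.
numbers = pd.to_numeric(s, errors='coerce')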

DropNumericalOutliers

DropNumericalOutliers() takes a column and drops any value more than drop_threshold_mult standard deviations from the mean. This is best illustrated in a Jupyter notebook:

We have a dataset random.csv with a column foo containing a mostly uniformly distributed set of numbers, except for a few outliers, as seen in the visualization below:

[Figure: distribution of foo before dropping outliers]

Running DropNumericalOutliers('foo', drop_threshold_mult=4) will drop all values more than 4 standard deviations from the mean, and the resulting distribution would look like:

[Figure: distribution of foo after dropping outliers]
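
For reference, a rough pandas sketch of the same filter (the sample data here is invented for illustration):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'foo': rng.uniform(0, 100, 1000)})
df.loc[:4, 'foo'] = 10_000  # inject a few outliers

# Illustrative equivalent of DropNumericalOutliers('foo', drop_threshold_mult=4):
# keep only rows within drop_threshold_mult standard deviations of the mean.
drop_threshold_mult = 4
mean, std = df['foo'].mean(), df['foo'].std()
df = df[(df['foo'] - mean).abs() <= drop_threshold_mult * std]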

DropDuplicates

DropDuplicates() can only be applied to a single column at this time. Please contact us at feedback@datamode.com if you’d like to be able to use more than one column. By default all None/NULL values will be collapsed into a single record. Set ignore_nones=True to preserve all None records.

# ex_data
uid     name            price    date
1       tom             0        01-03-2018 12:01
2       tom             17.00    01-03-2018 12:01
3       NULL            15.95    01-03-2018 12:01
4       NULL            15.95    01-03-2018 12:01

DropDuplicates('name') would yield:

# ex_data
uid     name            price    date
1       tom             0        01-03-2018 12:01
3       NULL            15.95    01-03-2018 12:01

While DropDuplicates('name', ignore_nones=True) would yield:

# ex_data
uid     name            price    date
1       tom             0        01-03-2018 12:01
3       NULL            15.95    01-03-2018 12:01
4       NULL            15.95    01-03-2018 12:01
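
Rough pandas sketches of both behaviors (for illustration only; note that pandas’ drop_duplicates already treats NaNs as equal, which matches the default collapsing of None records):

import pandas as pd

ex_data = pd.DataFrame({
    'uid': [1, 2, 3, 4],
    'name': ['tom', 'tom', None, None],
    'price': [0, 17.00, 15.95, 15.95],
})

# Illustrative equivalent of DropDuplicates('name'): NaNs compare equal,
# so the two None rows collapse into one.
deduped = ex_data.drop_duplicates(subset=['name'])

# Illustrative equivalent of ignore_nones=True: dedupe only non-null
# names, keep every None row.
mask = ex_data['name'].isna()
deduped_keep_nones = pd.concat(
    [ex_data[~mask].drop_duplicates(subset=['name']), ex_data[mask]]
).sort_index()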

CanonicalizeDate

CanonicalizeDate() will convert strings into Python datetime objects. This is often applied to data with inconsistent date formatting such as:

# ex_data
uid     name            price    date
1       tom             0        01-03-2018 12:01
2       tom             17.00    12/31/2017
3       NULL            15.95    01-03-2016

CanonicalizeDate('date') will normalize all dates into timezone-naive datetime objects:

# ex_data
uid     name            price    date
1       tom             0        01-03-2018 12:01 PM
2       tom             17.00    12-31-2017 4:00 PM
3       NULL            15.95    01-03-2016 4:00 PM
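
A rough sketch of per-row parsing with dateutil (for illustration; Datamode’s actual parsing and timezone handling may differ):

import pandas as pd
from dateutil import parser

dates = pd.Series(['01-03-2018 12:01', '12/31/2017', '01-03-2016'])

# Illustrative equivalent of CanonicalizeDate: parse each string
# individually so mixed formats are handled; each result is a
# timezone-naive Python datetime.
parsed = dates.apply(parser.parse)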

CombineDateAndTimeFragments

CombineDateAndTimeFragments() requires the following parameters to be defined:

  • col_date - The original date column.
  • col_time - The original time column.
  • col_new - The destination column for the newly created datetime object.
  • consume_originals - Setting this value to True will remove the original date and time columns.

Consider the following example dataset:

# ex_data
uid     date          time
1       22/09/2011    14:20:11
2       30/11/2011    10:34:24
3       28/10/2011    14:20:13

CombineDateAndTimeFragments(col_date='date', col_time='time', col_new='new_datetime', consume_originals=False) would return:

# ex_data
uid     date          time        new_datetime
1       22/09/2011    14:20:11    Sep 22, 2011 7:20 AM
2       30/11/2011    10:34:24    Nov 30, 2011 3:34 AM
3       28/10/2011    14:20:13    Oct 28, 2011 7:20 AM
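
A rough pandas sketch of the same combination (illustration only; dayfirst=True is assumed here because these dates are DD/MM/YYYY):

import pandas as pd

ex_data = pd.DataFrame({
    'uid': [1, 2, 3],
    'date': ['22/09/2011', '30/11/2011', '28/10/2011'],
    'time': ['14:20:11', '10:34:24', '14:20:13'],
})

# Illustrative equivalent of CombineDateAndTimeFragments: concatenate the
# fragments and parse them as a single datetime.
ex_data['new_datetime'] = pd.to_datetime(
    ex_data['date'] + ' ' + ex_data['time'], dayfirst=True
)

# Rough analogue of consume_originals=True:
# ex_data = ex_data.drop(columns=['date', 'time'])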

CombineTables

CombineTables() is a way to merge two tables. This transform takes the following parameters:

  • merged_table - the destination table name.
  • tables - must be a list containing two tables (using dfname) that you want merged.
  • relations - the column you wish to merge the two tables on (this must also be a list).

# table1
uid     first_name
1       john
2       sarah
3       sheila

# table2
uid     last_name
1       smith
2       jenkins
3       carey

CombineTables('full_name_table', ['table1', 'table2'], ['uid']) would merge table1 and table2 on uid and return:

# full_name_table
uid     first_name     last_name     _merge
1       john           smith         both
2       sarah          jenkins       both
3       sheila         carey         both
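
The _merge column above matches the indicator column pandas adds during a merge, so a reasonable sketch of the equivalent operation (not necessarily Datamode’s implementation) is:

import pandas as pd

table1 = pd.DataFrame({'uid': [1, 2, 3],
                       'first_name': ['john', 'sarah', 'sheila']})
table2 = pd.DataFrame({'uid': [1, 2, 3],
                       'last_name': ['smith', 'jenkins', 'carey']})

# Illustrative equivalent of CombineTables('full_name_table', ...):
# indicator=True adds the _merge column ('both', 'left_only', 'right_only').
full_name_table = pd.merge(table1, table2, on='uid', indicator=True)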

NormalizeJson

Columns containing JSON objects can be flattened using NormalizeJson(); new columns are created from the nested data.

Consider the following data:

# json_data
uid     json_data
1       {"city": "San Francisco", "state": "CA"}
2       {"city": "Seattle", "state": "WA"}
3       {"city": "Albany", "state": "NY"}

The transform NormalizeJson('json_data') would yield:

# json_data
uid     json_data                                  json_data.city     json_data.state
1       {"city": "San Francisco", "state": "CA"}   San Francisco      CA
2       {"city": "Seattle", "state": "WA"}         Seattle            WA
3       {"city": "Albany", "state": "NY"}          Albany             NY

Note that this will not work with JSON objects nested inside a list, or with Python dicts; the column must contain individual JSON object strings.
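
For reference, a rough pandas sketch of the same flattening (illustration only):

import json
import pandas as pd

json_data = pd.DataFrame({
    'uid': [1, 2, 3],
    'json_data': [
        '{"city": "San Francisco", "state": "CA"}',
        '{"city": "Seattle", "state": "WA"}',
        '{"city": "Albany", "state": "NY"}',
    ],
})

# Illustrative equivalent of NormalizeJson('json_data'): parse each JSON
# string, flatten it into columns, and prefix the new column names with
# the source column, as in the output above.
parsed = json_data['json_data'].apply(json.loads)
flat = pd.json_normalize(list(parsed)).add_prefix('json_data.')
json_data = json_data.join(flat)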