Expression provides a powerful, simple query language for data manipulation, similar to SQL. You can create/replace columns or drop rows, which can use primitives or existing columns in your dataset, including cross-table filters and aggregations.
What can you do with Expressions?¶
- Arbitrarily complex math operations between columns or for a single column. Example
- Create new columns, e.g. set a boolean for any users that are between 25 and 39. Example
- Aggregate data from multiple tables - e.g. count all the orders for each user. Example
- Easily delete rows that have invalid data, such as user_age < 0. Example
- Use a lookup table, e.g. find the income for each user’s zipcode from the zipcode table. Example
- Add string columns together. Example
How to use Expressions¶
Expression is a transform. Pass a single triplequoted string directive to it. Datamode will parse and execute your directive on the dataset.
Expression(''' your_expression_here '''),
Full expression: <assignment> | <droprows> [where ...] [join ...] assignment: <tablecolumn> = <formula> [default <primitive>] droprows: drop rows from <[table.]column> formula: <arbitrary nested infix PEDMAS formula including aggregation functions> where: where <tablecolumn> <comparison-operator> (<tablecolumn> | <primitive>) join: join <tablecolumn> <comparison-operator> <tablecolumn> tablecolumn: [tablename.]columnname aggregations: funcname(<formula>) primitive: integer, bool, string, float
Some things to keep in mind:
- String literals must be single-quoted.
- Both the table name and column name can separately be optionally double-quoted to allow spaces, e.g.
- All keywords are case insensitive (e.g. aND, WHERE, is null, True, fAlSe, DEFAULT etc).
- You don’t strictly need triplequoted strings but it makes writing longer queries easier.
- For now, aggregation functions can only take a single column, nothing else (and only works with joins)
- Don’t forget the comma at the end, since
Expressionis just another Transform.
You can create or overwrite a single named column by assigning an arbitrary formula to it, which follows standard PEDMAS order. The formula can have any number of nested math expressions, including functions like sum, average and count for cross-table aggregations. Example:
Expression(''' mytable.mycolumn = foo / bar + (baz * 3) ** 2 '''),
One restriction is that you can’t create an infix formula within an aggregation function. This is for performance and complexity reasons.
However, if you need this logic, just use two
Expressions to create an interim column. That will also make your logic and data easier to debug.
You can use the single
drop rows from <table> directive. You’ll probably want to specify a where clause though, or all rows will be dropped.
If you use a join, only rows in the specific drop table will be dropped.
Expression(''' drop rows from users where user_age < 0 '''),
This is similar to SQL - you can compare a column to a primitive, two columns in the same table, or two columns cross-table. Cross-table columns require an explicit join.
Don’t forget you need to use
= for assignment,
== (or another operator) for comparison.
# Compare to a scalar Expression(''' foo = bar + 1 where mycolumn > 5 '''), # Compare to string constant Expression(''' foo = bar + 1 where mycolumn == 'somevalue' '''), # You can use less-than/greater-than since Python does the right thing Expression(''' foo = bar + 1 where mycolumn <= 'def' '''),
Datamode currently only supports a single inner join per expression. You can join any two tables on any one column from each.
The columns don’t need to have the same name. If you need to join more than one table, you should use multiple
That will help you separate and debug your logic as well.
# Cross-table comparison. Requires a where clause though. Expression(''' foo = bar + 1 join users.user_id == orders.order_id '''), # Joining two different columns which have similar values Expression(''' foo = bar + 1 join users.valueA == zipcodes.valueB '''), # Cross-table where comparison Expression(''' foo = bar + 1 where table_1.foo > table_2.bar join table_1.my_id == table_2.my_id '''),
You can also use this as a lookup engine - see the example below on zipcodes.
For source columns¶
You can always specify the table, but you don’t need to. If you don’t, Datamode will lookup the table from the column. If it doesn’t exist, or there are multiple columns with the same name, Datamode will raise an error.
For destination columns¶
Again, you can always specify the table. Just like source columns, Datamode will attempt to lookup the column. If it doesn’t exist, it will be created in the table used by the source columns. If the table doesn’t exist, it will be created as well.
Simple math operations¶
# Simple division Expression(''' tax = price / 10 '''), # Exponents and square roots Expression(''' price_sqrt = price ** 0.5 '''), # Multicolumn operation Expression(''' roi = revenue / budget '''),
Nested math operations with aggregation functions¶
You can create an arbitrarily nested expression, but it’s often better to break up your
Expression into multiple
Expressions, which helps with debugging.
# Arbitrarily complex PEDMAS order Expression(''' newcol = (foo ** bar + baz / (baz * (3 - 2)) ) ** 2 '''),
Create multiple columns¶
Note the comma separating the clauses. The actions share any
# Two assignments, with a cross-column where clause. Expression(''' column1 = foo + bar, column2 = bar + baz where foo > bar '''),
Sum and count total orders per user¶
Expression can aggregate columns across multiple tables, just like SQL.
In this case we’re summing and counting the orders for each user, joining the orders and users table.
You have to source multiple tables in your transforms, however, e.g. with multiple calls to
Expression(''' users.orders_total = sum(orders.total), users.orders_count = count(orders.total) join users.user_id == orders.user_id '''),
Time between signup and first order¶
We want to find out how long it took the user to make their first order.
To do that, we can get the
first order and subtract the user’s signupDate.
If any row data is null, the result for that row will be null too.
Expression(''' users.time_to_first_order = first(orders.date) - users.signupDate join users.user_id == orders.user_id '''),
Lookup a value from another table (one to many)¶
Say you’ve collected zipcodes for each user. You have another table with income for that zipcode. You can use Expression to do a lookup on the zipcodes table and store the result in the users table.
Expression(''' users.zipcode_income = zipcodes.income join users.zipcode == zipcodes.zipcode '''),
You can remove invalid rows, or really any rows that meet the where expression.
You can also join a table to use a cross-table where clause, but rows will only be dropped from the named
drop rows from <table> table.
# Drop obviously invalid rows Expression(''' drop rows from users where users.age <= 0 '''), # Drop rows with a join # Will only drop rows from users, not zipcodes Expression(''' drop rows from users where zipcode.income < 10000 join users.zipcode == zipcodes.zipcode '''),
Use spaces or other chars in column or table names¶
Double quotes are only necessary around table or column names that have spaces or dots.
Expression(''' "my new column" = "my table"."my old column" * 2 '''),
Create boolean columns from existing columns¶
With this query, any project valued over 100 will get high priority = true, false otherwise. This can be especially helpful when trying to create dependent variables.
Expression(''' projects.high_priority = true default false where projects.value > 100 '''),
Simple string manipulation¶
You can use
+ for simple string concatenation. It won’t work with non-strings yet though.
Expression(''' address = street_line + ', ' + city + ', ' + state + ' ' + zipcode '''),
For better readability with multiple lines, use Python’s triplequoted strings in your expression:
Expression(''' mycolumn1 = mycolumn2 where mycolumn3 > 1 '''),
What’s not supported yet¶
- Date to string comparisons (coming soon)
- Sorting, including for aggregations. (coming soon)
- Groupbys, although every join has an implicit groupby on the joined id. (coming soon)
- Operating on strings and non-strings in conjunction
- OR-conjunctions in where clauses
- Python-style string or date formatting
- Arbitrary formulas inside aggregation functions
- Scalar functions like logarithms, etc.
- Intervals like week, date, day, hour, minute etc
- Time windows
If you’d like any of the above to exist, please email us at email@example.com.