3. Method 2: Using data painter
Explore a range of pre-built, ready-to-use functions available in our data painter tool, and learn how and when to use them to create synthetic data generation rules. This section includes Exercises 5-10.
The ‘Data Painter’ enables users to create their own rules using a selection of pre-built functions that can be modified using VB.Net in-built functions. The rules are in-line VB.Net functions.
This section provides a collection of ready-to-use functions designed for common data generation scenarios, designed to help users get started quickly. For specific data generation needs, users can use the ‘Function Editor’ to write custom logic using VB.Net.
'Step 4 - Data References', allows for parameterisation of data using local and global variables, supporting referential integrity and dynamic data generation.
Step 1 - Data painter synthetic functions
The Curiosity Platform has an extensive library of synthetic data generation functions that can be used to create the data and use the VB.net syntax to modify.
These can be found in the ‘Functions’ section and you can apply a filter to help you find the function you would like to use. Most of the functions have parameters which control the output of the function.
For instance, the ‘RandomHelper.GetRandomDateTime()’ function takes two parameters for the minimum date and maximum date that can be generated, like so: ‘RandomHelper.GetRandomDateTime("2021-01-01", "2012-01-01")’
Exercise 5
- For one of the columns in your rule set, choose a function that will populate the column with realistic data
Step 2 – Data list wizards
There are also functions that return values from assets in the Curiosity Platform such as data lists and sequences. Each of these has a wizard that is designed to help the user configure the correct function to return the desired data. The functions that return data from a list all use the ‘Resolve List’ wizard, whereas the sequence functions all use the ‘Select Sequence’ wizard.
The Curiosity Platform is able to store data and make this easily accessible and editable to the user in a Data List. Further information can be found in our knowledge base: Data List Information
The difference between the data list functions is in the data types they return, for instance one will return a list of strings whilst another will return just a single string. Make sure you review the name of the function and the data type to ensure the data returned is what you expect.
- Clicking on the desired list function opens the ‘Resolve List’ wizard. The first step is to choose the list to return data from.
- Click ‘OK’ to progress the wizard to the next screen, where the inputs to the wizard configure the function created.
These options include:
- Where – Select the operator either “AND”, “OR” and then select the field you want this 'where' condition to be against. You can then choose an operator such as equals or not equals. The '>_' button allows you to run these conditions against the value of a column or data variable at runtime.
- Grouped Columns – This allows you to choose the columns that the data will return, if you leave blank it will select all the columns
- Column to Use – This is the column that dictates the order for either random or sequential
- Distinct On – The columns to ensure unique values are returned
- Order By – The column to order the data returned by
- Limit – The number of rows returned
- Start Offset – The number of rows to skip and start returning data by
- Click 'OK' to complete the wizard.
Exercise 6
- Use one of the data list wizards to select an address from one of the standard data lists included in the product into a user-defined variable
- Try one of the ‘where’ clauses as well to only bring in an address for a certain post code
Step 3 - Sequence wizard
With data generation, there are times where we’ll need consecutive files or row values to be taken from a sequence that is handled within the Curiosity Platform. There is a wizard that helps users get one or many values from a sequence, to use for data generation.
Further information can be found in our knowledge base: Sequences
- Click on one of the functions to start the wizard
-
The ‘Select Sequence’ dialogue box will show. This is where you can choose the name of the sequence. In the ‘Advanced’ section you can choose how many of the values to return from the sequence.
Clicking ‘OK’ will finish the wizard and populate the rule with the desired sequence function. It will now appear in data painter, and you can click on the preview button to see an example result.
Exercise 7
- Select one of the columns in the rule set to use a sequence as a data generation rule
Step 4 - Data references
When creating rule sets, it can be useful to either use the user-defined variables (explained in Section 2 - Step 3) or reference a value created in the same or a related table in the generation rule set.
This allows us to keep referential integrity across parent-child relationships (which are automatically filled out if the scan has been done).
You can find them to the right of the functions screen in the data painter:
Step 4a - User-defined parameters
You can use the values from the user-defined variables that either you or the tool have created. If these are a value that easily fits inside a column – such as a string or an integer – then you can click on the user-defined variable that you wish to select.
In some cases, there may be a list where it makes sense to select one value to insert into the column. In this case, icons are shown next to the parameter so you can select either a random or sequential value for the column. The folder sign also allows you to view the variable to help you visualise the function you are creating.
Exercise 8
- Use a user-defined parameter to populate one of your data generation columns in a rule set
Step 4b - Linked table references
When creating synthetic data rules, it is often required to use data that has been generated for another column, either in the same table or a different table. For instance if an email field is being generated using the values from a ‘first name’ and a ‘last name’ column in a table, to create an email that follows the pattern ‘firstname.lastname@company.com’.
Additionally, when creating tables with a parent-child relationship, for example a header table called ‘orders’ and a detail table called ‘order_items’, linked by a foreign key on ‘order_id’, it is important to maintain referential integrity by ensuring consistent ‘order_id’ values when generating data for both tables.
Linked table references can help in both scenarios. By navigating to the ‘Data References’ section in data painter, you can select an object associated with columns in the same table as you are generating into, as well as a table that has been generated beforehand. For the email example, this means that a linked table reference can be used to the first name and last name columns in different tables, to create an email address in the Email column.
In the scenario where a value from a table generated beforehand is required, the linked table references can be used again. The only difference here is that you need to select the column from a different table. For example, to generate data into the ‘order items’ table, the ‘order id’ from the ‘orders’ table can be used to make sure the data generated is referentially integral.
It is important to note that if the data profiling and discovery activity has been completed, the Curiosity Platform will automatically insert the necessary linked table reference to keep data referentially integral for foreign key relationships. This will also happen for soft keys if these have also been identified.
Exercise 9
- Use a linked table reference to populate one of the columns in the rule set
Step 5 – Data reference variables
There are several variables that are commonly used in synthetic data generation rule sets that are made available to the user for use in the creation of synthetic data generation rules. These are the most used global or local variables found in the VIP flow.
-
Object variable
-
The object represents the current row of data being generated
-
Data is stored in an object named var_tableName
-
To access a specific column's value within this row, use the syntax: var_tableName.columnName
-
Example: var_Employees.FirstName
-
-
-
Local iteration variable
-
Refers to the count of rows generated within data painter, not the original data source
-
Syntax: parCount_tableName
-
- Number of rows variable
- Represents the total number of rows you intend to generate
- Syntax: parItem
- This variable can be used in custom logic to determine the data set size dynamically
- List variable
- Refers to the list containing all object variables generated
- Syntax: lst_tableName
- Current index
- Indicates the current row number being generated within the list
- Syntax: _tableName_
When clicking on the relevant Data Reference Variable the relevant object is inserted into the data painter screen. In the example below, we have used the reference variable ‘current_index’ which will give a current indicator for the ‘order_items’ table, and will therefore act as ‘Item ID’.
Exercise 10
- Select a column in the rule set and use one of the data reference variables to either populate, or be part of a function that populates it
Need help or want to check your work? Check the solution videos here.
Data painter - Additional information
1. Preview
To work out whether the function you have typed out returns the data you want, the 'Preview' button can be used.
If the value is not a function, but instead a parameter or variable that needs information to be inputted to be resolved, the preview requires a value to be inputted.
2. Analysis
It is often helpful to have the meta data about the column you are generating into available in data painter. This information is available in the analysis section and contains all the information captured in the deep scanning conducted in Module 1 – Data profiling and discovery.
Note: this section only appears if you have run a profiling job.