Semi-Structured To Structured Data Conversion Using Talend In Snowflake
Author: Sanket Nalawade
Talend Data Studio:
Talend Open Studio is a free open source ETL tool for Data Integration and Big Data. It is an Eclipse-based developer tool and job designer. You just need to Drag and Drop components and connect them to create and run ETL or ETL Jobs. The tool will create the Java code for the job automatically, and you need not write a single line of code.
In this blog, we are going to convert json format data into rows and columns. Consider the following json example:
Our aim is to convert this data into rows and columns as shown below:
The data used is collected by Zomato API Analysis. The data is an analysis of restaurants, their food, cuisine, value for money, and customer satisfaction around the world. It is sourced from https://www.kaggle.com/shrutimehta/zomato-restaurants-data
Type of data:
TASK: Finding cheap and best restaurants around the world
Source file formats: csv, json
The objective is to find cheap and best restaurants around the world.
Step 1) Creating metadata for the source files and Snowflake output connection for the target.
1. Creating metadata for the csv file
2. Right-click on File delimited and click on Create file delimited option under Metadata option under Repository section on Talend UI.
3. Give any name to the metadata file and click on Next.
4. Browse the filename and choose the required file and click on Next. The File Viewer pane shows the data of the file selected.
5. Choose the Encoding, Field Separator, Escape Char, Header, Text Enclosure, and other options as shown below. After choosing, click on the Refresh Preview button to see the sample output. Once satisfied, click on Next.
6. Choose the schema structure, key, and click Finish.
7. In the same way, create a metadata for the json file.
8. Right-click on Snowflake and click on the Snowflake connection option under the Metadata option under the Repository section on Talend UI.
9. Enter the required parameters to connect to Snowflake, test the connection and click Finish.
Step 2) Creating a job.
1. Right-click on Job Designs and click on Create job under Repository section on Talend UI.
2. Give a name to the job and click on Finish.
Step 3) Double-click on the created job to open it. Drag and drop the metadata connection created for csv; a suggestion for tFileInputDelimited would pop, double click on it. Similarly, do the same with json connection; a suggestion for tFileInputJson would pop, double click it to bring it on the job pane.
Step 4) In the job pane, type tMap, a suggestion box will open up; double-click on the tMap component to bring it on the job pane.
Step 5) Right-click on the tFileInputDelimited and click on Row, select main and connect to the tMap component as the main input.
Step 6) Do the process in Step 6 and connect it to tMap as a lookup.
Step 7) Configuring tMap component, double click on it:
Setting up inner join:
- Drag and drop Restaurant_ID of the main component towards the Restaurant_ID of lookup component to enable join on this column.
- Click on tMap settings and change the Join Model to Inner Join.
Configuring output in tMap:
- Click on Add Output table button and give a name to the output to be obtained. Drag and drop the required columns to the matched_output.
Adding Expressions and filter:
1. Click on the three horizontal dots in the Expression column of the Country column and add the required expression.
2. After adding the expression, click on Test! Button to verify.
3. Click on the Enable expression filter to add expressions in order to filter rows.
4. Click Apply and then Ok button in tMap to save all the configurations made.
Step 8) Drag and drop Snowflake connection created, a suggestion for tDBOutput (Snowflake) would pop, double click on it to add it to the job pane.
Step 9) Click on the tDBOutput (Snowflake) and configure it by editing the parameters in the Component tab below. Click on sync columns to sync schema, add custom object name as Table if the table doesn’t exist.
After the above steps, the job pane would look like this:
Step 10) Click on the Run button present below the job pane to run the job.
The job runs successfully, giving a summarised detail of the rows populated.
We can verify the above over target: