{"id":11791,"date":"2020-10-14T07:10:00","date_gmt":"2020-10-14T07:10:00","guid":{"rendered":"https:\/\/viewmyprojects.com\/winwirewp\/?p=11791"},"modified":"2023-11-29T12:17:04","modified_gmt":"2023-11-29T12:17:04","slug":"merge-data-into-a-single-table-using-azure-data-factory","status":"publish","type":"post","link":"https:\/\/viewmyprojects.com\/winwirewp\/blog\/merge-data-into-a-single-table-using-azure-data-factory\/","title":{"rendered":"Merge Data into a Single Table using Azure Data Factory"},"content":{"rendered":"\n<p>I recently came across a scenario where the source files in Azure blob storage container had slightly different schemas, i.e., one set of files had an extra column compared to another set of files. The challenge was to identify this diverse set of files and load it into a single target table on the fly.<\/p>\n\n\n\n<p>How to handle files with a slightly different structure (coming from the same source\/repository) and transfer\/merge the data into a single table using&nbsp;Azure Data Factory?<\/p>\n\n\n\n<p>For example, File1 has columns \u2013 Name, Age, Department.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Name<\/strong><\/td><td><strong>Age<\/strong><\/td><td><strong>Department<\/strong><\/td><\/tr><tr><td>John<\/td><td>34<\/td><td>IT<\/td><\/tr><tr><td>Mathew<\/td><td>45<\/td><td>Finance<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>File2 has columns \u2013 Id, Name, Age, Department<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>ID<\/strong><\/td><td><strong>Name<\/strong><\/td><td><strong>Age<\/strong><\/td><td><strong>Department<\/strong><\/td><\/tr><tr><td>12<\/td><td>Steve<\/td><td>40<\/td><td>HR<\/td><\/tr><tr><td>17<\/td><td>Fleming<\/td><td>44<\/td><td>Finance<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Data in the target table (say EmployeeDetails) should look like<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>ID<\/strong><\/td><td><strong>Name<\/strong><\/td><td><strong>Age<\/strong><\/td><td><strong>Department<\/strong><\/td><\/tr><tr><td>12<\/td><td>Steve<\/td><td>40<\/td><td>HR<\/td><\/tr><tr><td>17<\/td><td>Fleming<\/td><td>44<\/td><td>Finance<\/td><\/tr><tr><td><\/td><td>John<\/td><td>34<\/td><td>IT<\/td><\/tr><tr><td><\/td><td>Mathew<\/td><td>45<\/td><td>Finance<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>Solution Summary<\/strong><\/h1>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Count the columns in the source file<\/li>\n\n\n\n<li>Check the value of the first row and first column of a given source file.<\/li>\n\n\n\n<li>Add 2 copy data activities for each of the file types, map the source and target columns explicitly.<\/li>\n\n\n\n<li>Based on #1 and #2 checks, decide which copy data activity to select for a give file type.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Detailed Explanation of steps involved in the pipeline<\/strong><\/h2>\n\n\n\n<p>All of the files have extension as&nbsp;<strong>.prd&nbsp;<\/strong>and the ask is to load these into Azure DWH or synapse.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Load the files from amazon s3 to azure blob using copy data activity. No special settings in the source and target files here. Just select the encoding as&nbsp;<strong>UTF-8.<\/strong><\/li>\n\n\n\n<li>There will be 3 tables to assist in loading.\n<ul class=\"wp-block-list\">\n<li>TableFileName \u2013 It will hold the name of files and the value of first column and first row present in blob storage.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<p>Example if the file (titled File1) has 1, Lokesh, 35, Health as the first row. The table will have values<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>File1<\/td><td>1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<ul class=\"blog-detail-list wp-block-list\">\n<li>TableErrorFileName \u2013 It will hold the name of files which have failed to load.<\/li>\n\n\n\n<li>StagingTable \u2013 This will be the actual target table with the data from the source files.<\/li>\n<\/ul>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\">\n<li>Before loading clear or truncate all these tables.<\/li>\n\n\n\n<li>Child Pipeline<\/li>\n<\/ol>\n\n\n\n<p>a. First step in this pipeline is GET METADATA activity which will have the dataset mapped to blob storage where source files are kept, and field list will be set to Child Items.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture1-4.webp\" alt=\"\" class=\"wp-image-11792\"\/><\/figure><\/div>\n\n\n<p>b. Second step will be a for each loop container.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture2-2-1.webp\" alt=\"\" class=\"wp-image-11793\"\/><\/figure><\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture3-4.webp\" alt=\"\" class=\"wp-image-11794\"\/><\/figure><\/div>\n\n\n<p>It further has these activities<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture4-1-2.webp\" alt=\"\" class=\"wp-image-11795\"\/><\/figure><\/div>\n\n\n<ul class=\"blog-detail-list wp-block-list\">\n<li>\u201cGet Metadata\u2026.2\u201d \u2013 This is set to parameterized dataset which points to a file from the blob storage. This is done to get the column count from a given source file.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture5-2-1.webp\" alt=\"\" class=\"wp-image-11796\"\/><\/figure><\/div>\n\n\n<ul class=\"blog-detail-list wp-block-list\">\n<li>Lookup First&nbsp;Row and Column \u2013 Points to a parameterized dataset. This activity is added to get the first column and first row value of the source file and additionally to prevent a no or blank file name being passed to the subsequent step. A no file name will be hard to handle therefore inserted a dummy file called test.txt which will prevent the operation from failing.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture6-2-1.webp\" alt=\"\" class=\"wp-image-11797\"\/><\/figure><\/div>\n\n\n<p><em>Expression \u2013 @if(greaterOrEquals(activity(\u2018Get MetadataFileDetails2\u2032).output.columnCount,2),item().name,\u2019test.txt\u2019)<\/em><\/p>\n\n\n\n<ul class=\"blog-detail-list wp-block-list\">\n<li>\u201cIf First Row n Col is \u2026\u201d \u2013 Next activity is if activity, its job is to find out the type of file based on its first column value.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture7-2-1.webp\" alt=\"\" class=\"wp-image-11798\"\/><\/figure><\/div>\n\n\n<p><em>Expression \u2013 @bool(startswith(activity(\u2018Lookup 1st Row n Col\u2019).output.firstRow.Prop_0,\u20199\u2032))<\/em><\/p>\n\n\n\n<ul class=\"blog-detail-list wp-block-list\">\n<li>True condition \u2013 Insert the file name and the first column and row value in TableFileName (which is already been created, please refer to #1).<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture8-2-1.webp\" alt=\"\" class=\"wp-image-11799\"\/><\/figure><\/div>\n\n\n<ul class=\"blog-detail-list wp-block-list\">\n<li>False condition \u2013 Insert the file name and hard coded flag or value in TableFileName. This will help identifying or distinguishing between the 2 different types of file.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture9-1-3.webp\" alt=\"\" class=\"wp-image-11800\"\/><\/figure><\/div>\n\n\n<ul class=\"blog-detail-list wp-block-list\">\n<li>If the \u201cif activity\u201d fails, then insert the file name into the same table with first column value concatenated to failure.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture10-1-2.webp\" alt=\"\" class=\"wp-image-11801\"\/><\/figure><\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture11-1-1.webp\" alt=\"\" class=\"wp-image-11802\"\/><\/figure><\/div>\n\n\n<p>5. (Return to main pipeline) \u201cLookup_G\u2026Files\u201d. This lookup queries for one type of file out of the 2 possible ones by querying data inside the table TableFileName.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture12-1-1.webp\" alt=\"\" class=\"wp-image-11803\"\/><\/figure><\/div>\n\n\n<p>6. Other lookup (\u201cLookup_I\u2026Files\u201d) queries the second type of the file. As the lookups return multiple rows the \u201cfirst row\u201d property is not set.<\/p>\n\n\n\n<p>7. Each of these lookups then feed the output to for each loop container. These containers have a copy data activity to transfer the data into the target table.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture13-1-1.webp\" alt=\"\" class=\"wp-image-11804\"\/><\/figure><\/div>\n\n\n<p>8. Eventually data is inserted into the respective tables with the file names of the failed files are inserted into error table TableErrorFileName. One of the important settings here is that the mapping must be explicitly specified.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture14-2-1.webp\" alt=\"\" class=\"wp-image-11805\"\/><\/figure><\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture15-1.webp\" alt=\"\" class=\"wp-image-11806\"\/><\/figure><\/div>\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/11\/Picture16-2.webp\" alt=\"\" class=\"wp-image-11807\"\/><\/figure><\/div>\n\n\n<p>Get Metadata activity in ADF will provide important meta data parameters about the file\/folder of files which can be leveraged to make decision on the actual loading of the data. For instance, in this case we made use of \u201cColumn Count\u201d parameter. I hope this gave you an end-to-end look at how to Merge Data into a Single Table using Azure Data Factory and saved you from looking through 10 different bits of documentation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently came across a scenario where the source files in Azure blob storage container had slightly different schemas, i.e., one set of files had an extra column compared to another set of files. The challenge was to identify this diverse set of files and load it into a single target table on the fly.&hellip; <a class=\"more-link\" href=\"https:\/\/viewmyprojects.com\/winwirewp\/blog\/merge-data-into-a-single-table-using-azure-data-factory\/\">Continue reading <span class=\"screen-reader-text\">Merge Data into a Single Table using Azure Data Factory<\/span><\/a><\/p>\n","protected":false},"author":53,"featured_media":16412,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_eb_attr":"","_uag_custom_page_level_css":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-11791","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","entry"],"acf":[],"featured_image_src":"https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/10\/merge-data.webp","author_info":{"display_name":"Lokesh","author_link":"https:\/\/viewmyprojects.com\/winwirewp\/author\/lokesh\/"},"views":4167,"uagb_featured_image_src":{"full":["https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/10\/merge-data.webp",800,440,false],"thumbnail":["https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/10\/merge-data-150x150.webp",150,150,true],"medium":["https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/10\/merge-data-300x165.webp",300,165,true],"medium_large":["https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/10\/merge-data-768x422.webp",750,412,true],"large":["https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/10\/merge-data.webp",750,413,false],"1536x1536":["https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/10\/merge-data.webp",800,440,false],"2048x2048":["https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/10\/merge-data.webp",800,440,false],"post-thumbnail":["https:\/\/viewmyprojects.com\/winwirewp\/wp-content\/uploads\/2023\/10\/merge-data.webp",800,440,false]},"uagb_author_info":{"display_name":"Lokesh","author_link":"https:\/\/viewmyprojects.com\/winwirewp\/author\/lokesh\/"},"uagb_comment_info":0,"uagb_excerpt":"I recently came across a scenario where the source files in Azure blob storage container had slightly different schemas, i.e., one set of files had an extra column compared to another set of files. The challenge was to identify this diverse set of files and load it into a single target table on the fly.&hellip;&hellip;","_links":{"self":[{"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/posts\/11791","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/users\/53"}],"replies":[{"embeddable":true,"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/comments?post=11791"}],"version-history":[{"count":2,"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/posts\/11791\/revisions"}],"predecessor-version":[{"id":18102,"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/posts\/11791\/revisions\/18102"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/media\/16412"}],"wp:attachment":[{"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/media?parent=11791"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/categories?post=11791"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/viewmyprojects.com\/winwirewp\/wp-json\/wp\/v2\/tags?post=11791"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}