Bob The (Proto) Builder

I have been building a prototype application over the past week, using Power Query to take the principal data management sheet and create a form of normalised data that I could load into a SQL Server database, on top of which I would then build my application code.

I have done a similar thing a few times now, and it occurred to me that I was repeating the same steps each time: identify the repeated lookup (dimensional) data; put that data into separate tables, each with its own index; modify the original source to hold the indexes of these lookup tables; and finally, change the headings to conform with the database column names. There are other steps that are particular to each prototype I build, based upon the actual data encountered and how much cleansing it might need, but all of them start as described above.

As a developer, I look for every opportunity to automate my work, and this seemed a perfect candidate, so I set about building a tool that could manage this task. This post describes that tool, a Power Query Prototype Data Builder. The Power Query that I deploy in this tool is not advanced or clever in any way; it is more the overall technique and package that I thought would be worth sharing.

In essence, the tool consists of a number of steps to get to the end goal, namely:

  • create a clean dataset from the source work
  • create a clean dataset of just the data columns to be loaded into dimension tables
  • create all of the dimension tables
  • create the output values table from the cleansed data, changing all lookup values to their equivalent indexes and renaming all headings.

This is the sort of thing I used to do in Excel, with some complex(ish) list formulas, and a lot of VLOOKUPs. All of these steps can easily be achieved with Power Query, and a lot more elegantly.

As well as building very similar queries in each prototype, in all of my previous efforts many of the queries were repetitive. This is something to be avoided if we are building a generic tool. Another goal is to avoid hard-coding wherever possible. To address both of these issues, I built three Power Query functions:

  • fnGetParameters – this looks up variable data in a table on an Excel worksheet, and passes the result to our query.
  • fnBuildDimension – this function strips the targeted column name from the dimension dataset described above, removes duplicate values, and adds an index column. The input is the name of the column to be dimensioned; the output is a dimension table ready to load into our datamodel.
  • fnDimensionID – this function adds a column with the index number for a targeted column in the cleansed source data, and then removes the original column. This would be called several times, once for each of the lookup values in the cleansed source data.

Let’s look at the code. I will take as my example a simple sales table (I think I got this data from Bill Jelen originally, I have just extended it over time). I use this data in a few of my talks; it is a flat Excel table that I want to break down. The table looks like this:

Original Table

The first query we will look at is SourceMasterData. It is quite simple: it opens the source workbook and prepares the data, cleansing as needed. Any application-specific cleansing that is needed I would do in a separate query or in the final query, CreateMasterTable.

   Source = Excel.Workbook(File.Contents(fnGetParameter("Filepath") & fnGetParameter("Filename"))),
   Source.Sheet = Source{[Item=fnGetParameter("Sheet"), Kind="Sheet"]}[Data],
   Promote.ToHeader = Table.PromoteHeaders(Source.Sheet),
   List.Headings = Table.ToRows(Excel.CurrentWorkbook(){[Name="Headings"]}[Content]),
   Renamed.Headings = Table.RenameColumns(Promote.ToHeader, List.Headings)

As you can see, this query references the fnGetParameter function three times, to get the file path, the file name, and the name of the worksheet to be sourced and cleansed. The name of the parameter is passed as an input, and the function returns its associated value. This function is based upon a previous post of mine, Excel is for Parameters, but extended to a fully parameterised table as described by Ken Puls in his post Building a Parameter Table for Power Query.

The code for fnGetParameter is

(ParameterName as text) =>

   Parameter.Source = Excel.CurrentWorkbook(){[Name="Parameters"]}[Content],
   Parameter.Data = Table.SelectRows(Parameter.Source, each ([Parameter] = ParameterName)),
   Parameter.Value =
      if Table.IsEmpty(Parameter.Data) then
         null
      else
         Parameter.Data{0}[Value]

In our example, the parameter table would look like this:

Parameter Table

I won’t detail the function as it is all covered in Ken’s Post.
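For readers more at home outside Power Query, the idea behind fnGetParameter can be sketched in plain Python: the parameter table is just a list of parameter/value rows, and the function filters it on the name. The sample rows below are illustrative, not taken from the workbook.

```python
# Illustrative stand-in for the Parameters worksheet table; the sample
# rows are made up, not taken from the workbook.
parameters = [
    {"Parameter": "Filepath", "Value": "C:\\Data\\"},
    {"Parameter": "Filename", "Value": "Sales.xlsx"},
    {"Parameter": "Sheet", "Value": "SalesData"},
]

def get_parameter(name):
    # Mirror fnGetParameter: filter the table on the parameter name and
    # return the matching value, or None when no row matches.
    matches = [row["Value"] for row in parameters if row["Parameter"] == name]
    return matches[0] if matches else None
```

The None branch corresponds to the empty-table test in the M version.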

There are two other steps in SourceMasterData that need explanation: List.Headings and Renamed.Headings. The purpose of these is to change all of the headings in the source file to more meaningful values, values that align with the database columns. Here again I am using an Excel table to hold the original and replacement values. This table is pulled into the query and converted to a list of lists using Table.ToRows. This list is then used in Table.RenameColumns to change the headings.

In my example, this table, Headings, looks like this:

Prototype Headings
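The shape that Table.ToRows produces is just a list of [old, new] pairs, and the rename step is then a simple mapping. A plain-Python sketch of the same idea, with illustrative heading names:

```python
# The Headings table becomes a list of [old, new] pairs (what
# Table.ToRows produces); renaming is then a simple mapping.
# The heading names here are illustrative.
headings = [["Reg", "Region"], ["Div", "Division"], ["Yr", "FiscalYear"]]

def rename_columns(columns, pairs):
    mapping = {old: new for old, new in pairs}
    # Columns without a replacement keep their original name.
    return [mapping.get(col, col) for col in columns]
```
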

The next query I will cover is SourceDimensionData, which takes the cleansed source data and removes all but the dimension columns. It’s another very simple query that creates a source for the building of each dimension.

   Source = SourceMasterData,
   Remove.NonTargetColumns = Table.SelectColumns(Source,{"Region", "Division", "State", "City", "Store", "Type"})

The function that builds a dimension is fnBuildDimension:

(Dimension) =>

   Source = SourceDimensionData,
   Remove.NonTargetColumns = Table.SelectColumns(Source,{Dimension}),
   Remove.Duplicates = Table.Distinct(Remove.NonTargetColumns),
   Sort.Rows = Table.Sort(Remove.Duplicates,{{Dimension, Order.Ascending}}),
   Add.IndexColumn = Table.AddIndexColumn(Sort.Rows, Dimension&"ID", 1, 1),
   Move.IndexColumn = Table.ReorderColumns(Add.IndexColumn,{Dimension&"ID", Dimension}) 

No hard-coding here; it will work for any nominated column. It is used in a number of separate queries, which I call Dimxxxx, to build the dimension tables. The code to build one of the dimensions would look like:

    Source = fnBuildDimension("Region")

It would be just as simple, and as relevant, to add that call to the final query that outputs the master table; I just prefer to abstract it. If the dimension has any particular attributes to be built, that can be done within its own query, and it is simple to extend – another reason to put it in its own query.
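The logic of fnBuildDimension – distinct values of one column, sorted, paired with a 1-based index – can be sketched in plain Python. The rows here are illustrative stand-ins for SourceDimensionData.

```python
def build_dimension(rows, dimension):
    # Distinct values of the nominated column, sorted ascending,
    # each paired with a 1-based index - the fnBuildDimension recipe.
    values = sorted({row[dimension] for row in rows})
    return [{dimension + "ID": i, dimension: v}
            for i, v in enumerate(values, start=1)]

# Illustrative rows standing in for SourceDimensionData:
rows = [{"Region": "South"}, {"Region": "North"}, {"Region": "South"}]
```

As in the M version, nothing here is hard-coded to a particular column.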

The final function is fnDimensionID, which looks up a row value in the nominated source query, inserts a column of index values, then deletes the original lookup column.

(DataSource, Dimension, JoinColumn) =>

   Source = DataSource,
   Get.Dimension = Table.NestedJoin(Source,{JoinColumn},Dimension,{JoinColumn},JoinColumn&"ID"),
   Expand.JoinColumn = Table.ExpandTableColumn(Get.Dimension, JoinColumn&"ID", {JoinColumn&"ID"}, {JoinColumn&"ID"}),
   Delete.JoinColumn = Table.RemoveColumns(Expand.JoinColumn,{JoinColumn})
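At heart, fnDimensionID is a join on the text value followed by dropping the original column. A plain-Python sketch of the same substitution, with illustrative data:

```python
def dimension_id(rows, dimension_table, join_column):
    # Join each source row to the dimension on the text value, keep the
    # ID, and drop the original column - the fnDimensionID recipe.
    id_column = join_column + "ID"
    lookup = {d[join_column]: d[id_column] for d in dimension_table}
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != join_column}
        new_row[id_column] = lookup[row[join_column]]
        result.append(new_row)
    return result

# Illustrative data:
dim_region = [{"RegionID": 1, "Region": "North"}, {"RegionID": 2, "Region": "South"}]
sales = [{"Region": "South", "Sales": 100}]
```
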

This function is used in the final query, CreateMasterTable, which calls it once for each lookup value to add all of the index values, and then moves those columns to the front.

   Source = SourceMasterData,
   // replace all lookup values with indexes here
   Index.Region = fnDimensionID(Source, DimRegion, "Region"),
   Index.Division = fnDimensionID(Index.Region, DimDivision, "Division"),
   Index.State = fnDimensionID(Index.Division, DimState, "State"),
   Index.City = fnDimensionID(Index.State, DimCity, "City"),
   Index.Store = fnDimensionID(Index.City, DimStore, "Store"),
   Index.Type = fnDimensionID(Index.Store, DimType, "Type"),
   // add a master index if required
   Index.Master = Table.AddIndexColumn(Index.Type, "SalesID", 1, 1),
   // reorder columns to place indexes at front
   Move.IndexColumns = Table.ReorderColumns(Index.Master, {"SalesID", "RegionID", "DivisionID", "StateID", "CityID", "StoreID", "FiscalYear", "TypeID", "Costs", "Sales", "Profit"})
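The chained calls above amount to threading the rows through one substitution per lookup column and then adding a running master index. A compact, self-contained Python sketch of that flow, on toy data with illustrative column names:

```python
# Toy source rows; column names are illustrative.
rows = [
    {"Region": "South", "Type": "Kiosk", "Sales": 100},
    {"Region": "North", "Type": "Store", "Sales": 250},
]

# One substitution per lookup column: build the ID mapping from the
# distinct sorted values, then swap each text value for its ID.
for column in ("Region", "Type"):
    ids = {v: i for i, v in enumerate(sorted({r[column] for r in rows}), 1)}
    rows = [dict({k: v for k, v in r.items() if k != column},
                 **{column + "ID": ids[r[column]]}) for r in rows]

# Finally, add a 1-based master index, like the SalesID step.
rows = [dict({"SalesID": i}, **r) for i, r in enumerate(rows, start=1)]
```
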

The functions should be usable without any modification.

The two Source queries might need to be extended with any of your application-specific processing, as might the CreateMasterTable query. The queries that create the dimension tables could include further processing for that dimension.

That is it. The code is finished. All that is needed now is to load it into our datamodel and use it.

The only thing I can think of to automate it further would be to somehow parameterise the dimension attributes (table name and source column) and build them all in one query. Another day maybe.

The data workbook can be downloaded from here, and the builder workbook here.

This entry was posted in Power Query. Bookmark the permalink.
