Power BI Audit Log Analytics Solution

As Power BI adoption in your organization grows, it becomes increasingly important to be able to track activity in the environment.

When you start to think about deploying a repeatable Power BI audit log solution, there are a few challenges you will face:

  • Going to the O365 Audit Logs portal each time you want to extract log events is a manual process and doesn’t scale
  • When automating this process through the API or PowerShell, there is a limit to how much data you can pull in a single request, so the examples that are currently available also don't scale very well
  • The AuditData field is JSON by default, and although Power BI can parse JSON beautifully, doing this over several thousand record entries may result in data load errors

Based on these factors, I have put together a PowerShell script that can be scheduled to run nightly and that can iterate through MORE than 5000 records so that no data is lost. Also, the screenshot below shows an initial template that you can use as a starting point to analyze the audit logs for your organization.

TL;DR


  • The required files can be found on my GitHub
  • Update the PowerShell script with a UserID and Password that has O365 audit log privileges
  • Use Task Scheduler to schedule the PowerShell script to run each night at midnight (run as admin).
  • At the end of the script, specify the directory you would like the script to generate CSV files in
  • In the PBIX file, it was challenging to get a parameter to work for the file location that the CSVs are in, so in the Query Editor the script for the AuditLog table needs to be manually modified to include your file path.
  • Enjoy

Quick look at the PowerShell

First, there is a PowerShell script.

 

  • Notice that you need to enter O365 audit log privileged credentials at the top so that this can be run automatically. If you have a more clever way to pass these credentials in so they are not exposed in the file, by all means do that
  • The Do/Until loop handles the case where there are more than 5000 records in the result set, which would easily happen for a large Power BI community
  • The foreach loop extracts the AuditData column JSON and creates an individual record for each entry. This keeps the Query Editor work in Power BI less complex and makes it easier to retrieve several hundred thousand records without import errors
  • Finally, we create a CSV for the data with the date of the file entries (yesterday's info if this is run at midnight every day). This dumps each file in c:\PBIAuditLogs. You can obviously change this file location to wherever you want to store your CSV extracts. A simplified sketch of the script is shown below
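The overall pattern looks something like the sketch below, assuming the Exchange Online remote PowerShell endpoint and the Search-UnifiedAuditLog cmdlet; the credential handling, paths, and variable names are illustrative rather than the exact contents of the script on GitHub.

# Minimal sketch: nightly Power BI audit log extract via Exchange Online remote PowerShell
$UserCredential = Get-Credential   # the scheduled script stores credentials instead of prompting

$Session = New-PSSession -ConfigurationName Microsoft.Exchange `
    -ConnectionUri https://outlook.office365.com/powershell-liveid/ `
    -Credential $UserCredential -Authentication Basic -AllowRedirection
Import-PSSession $Session

$StartDate = (Get-Date).AddDays(-1).Date      # yesterday at midnight
$EndDate   = (Get-Date).Date
$SessionId = [guid]::NewGuid().ToString()
$Results   = @()

# Do/Until loop: keep paging until no more rows come back (handles more than 5000 records)
Do {
    $Batch = Search-UnifiedAuditLog -StartDate $StartDate -EndDate $EndDate `
        -RecordType PowerBIAudit -SessionId $SessionId `
        -SessionCommand ReturnLargeSet -ResultSize 5000
    $Results += $Batch
} Until (-not $Batch)

# Flatten the AuditData JSON so each audit event becomes one flat record
$Flat = foreach ($Row in $Results) { $Row.AuditData | ConvertFrom-Json }

# One CSV per day of data; change the folder to wherever you want the extracts
$Flat | Export-Csv -Path ("c:\PBIAuditLogs\PBIAuditLog_{0:yyyyMMdd}.csv" -f $StartDate) -NoTypeInformation

Remove-PSSession $Session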

You can use Task Scheduler to run the above PowerShell script every night at midnight.

The PBIX file

In the Power BI file, we are connecting to the content of the entire folder shown above. I went ahead and included the PBIX file WITH the sample data so you could get an idea of what your data may look like.

This is where I have to admit that I tried to use a parameter for this but ran into some Query Editor challenges with how Power BI creates a Sample File transform to import multiple files from a single folder. If you can see what I did wrong here I would love your feedback, but for now, ignore the file directory parameter in the Query Editor; instead, go to “Advanced Editor” on the “AuditLog” query and modify the file location to be the location where the PowerShell script is dumping files.

Change the below file location as needed.
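For reference, a minimal sketch of what the first step of that query can look like in the Advanced Editor is below; the folder path is just an example and should match the output directory used in the PowerShell script.

let
    Source = Folder.Files("C:\PBIAuditLogs")
    // ...the remaining generated steps combine and expand the CSV files
in
    Source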

Once you have made this change, you should be able to “Close and Apply” and your data will now be populated in this basic audit log analytics view.

Using the file

I created a couple of basic pages to get this blog post shipped and so you can start taking advantage of the solution, but it is nowhere near as complete as you can eventually make it. There is a simple overview page (shown in the screenshot above) that can help you determine the number of active users, reports, dashboards, and datasets being used for any time period your audit log data covers.

The second page is a user activity view I created from a calculated table. It helps you determine which users in the system may be inactive so you can re-assign Power BI licenses, and it also shows detailed activity for an individual user that you can select from the slicer.

Other things you can mine from this data:

  • Who has signed up for Free Power BI trials
  • What “Apps” are being created
  • What embed tokens are being generated and used
  • Many other possibilities

The PowerShell script and the PBIX file are located on my GitHub here.


Regular Expressions will save your life!

I am closing out 2017 with a refreshing project that has led me away from Power BI for a bit. However, even for the Power BI community, I think the below information is valuable because at some point, you are going to run into a file that even the M language (Power BI Query Editor) is going to really have a hard time parsing.

For many of you, it's still a flat-file world where much of your data is dropped via an FTP server and then a process parses it and puts it in your data store. I was recently working with a file format that I have no idea why someone thought was a good idea, but nonetheless, I am forced to parse the data. It looks like this:

This data simulates truck weigh-in station data. There is a lot of “header” information followed by some “line” items.

Just think for a moment about how you would approach parsing this data. Even in Power BI, this would be extremely brittle if we are counting spaces and making assumptions about field names.

What if a system upgrade affecting the fields in the file is rolled out to the truck weigh stations over the course of several months? Slight changes to format, spacing, field names, etc… could all break your process.

Regex to the Rescue

In my career as a developer, I never bothered to understand the value of regular expressions (regex). With this formatted file I now see that they can save my life (well, that may be dramatic, but they can at least save me from a very brittle pre-processing implementation).

For anyone unfamiliar with regex, a regular expression is simply a special text string for describing a search pattern. The problem is, they are extremely cryptic and scary looking and you want to immediately run away from them and find a more understandable way to solve your text string problem. For instance, a regular expression that would find a number (integer or decimal) in a long string of characters would be defined as

\d+(\.\d*)?

What the heck is that?

The “\d” represents any decimal digit in Unicode character category [Nd]. If you are only dealing with ASCII characters, “[0-9]” would be the same thing. The “+” requires at least one instance of the preceding pattern. It is followed by (\.\d*), which identifies an explicit dot “.” followed by another \d, this time with a “*” indicating 0 to n instances of a digit (unlike the first section, which required at least one). Therefore this should match both 18 and 18.12345. However, without the “?” at the end, the group (\.\d*) is required, so the regex would NOT recognize 18 as a number; it would expect a decimal of some sort. Because we have included the “?”, the decimal portion becomes optional, making 18 a valid number.
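As a quick sanity check of that description, the pattern can be tried with Python's re module (the same language used for the parser later in this post):

import re

# The number pattern described above: digits, optionally followed by a dot and more digits
pattern = re.compile(r"\d+(\.\d*)?")

print(pattern.fullmatch("18"))        # match: the decimal portion is optional
print(pattern.fullmatch("18.12345"))  # match: integer part plus decimal part
print(pattern.fullmatch("abc"))       # None: no digits at all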

So, the regex was 11 characters and it took me a large paragraph to explain what it was doing. This is why they are underutilized for string processing. But if you look at it the other way, it only took 11 characters to represent this very descriptive pattern. Way cool in my book!

Regex language consistency

My example above was from Python. As I am a “data guy”, I find Python to have the most potential for meeting my needs. I grew up on C# and Java, however, and regex support does have slight variations between languages. Some interesting links on this are below:

Stack Overflow: https://stackoverflow.com/questions/12739633/regex-standards-across-languages

Language comparison on Wikipedia: https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines

Building a Parser using Regex

This file has all kinds of problems. Notice the value formats below:

Temperature: 81 F
Length: 56 Feet
Datetime: 00:02 09-22-2017
Time: 00:03:45
Speed: 33/18 MPH

In addition to text and numeric values, we also have to deal with these additional formats that should be treated as either numeric or datetime values.

I am going to use Python to parse this file and will use a “tokenizer” pattern discussed in the core Python documentation for the re (regex) library: https://docs.python.org/3.4/library/re.html

This pattern will allow us to assign a “type” to each pattern that is matched so we do not have to count spaces and try to look for explicitly named values which could break with any slight modifications to the file.

Below is a function that returns a named tuple with values for the type, the value, the line, and the column it was found in the string.

In my list of token specifications, I have included the most restrictive matches first. This is so that a value like “56.0 Feet” won't be mistaken for “56.0 F”, which would have it identified as a TEMP instead of a LENGTH. (I should also be accounting for Celsius and meters, but I am being lazy.)
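A sketch of what that function can look like, modeled on the tokenizer example in the Python re documentation linked above, is below; the token list here is trimmed (for example, the ID_W_NBR token that appears in the combined expression later is omitted):

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s, line_num=1):
    # Most restrictive patterns first so "56.0 Feet" is a LENGTH, not a TEMP
    token_specification = [
        ('SPEED_IN_OUT', r'\d+(\.\d*)?/\d+(\.\d*)?\s{1}MPH'),      # ex. 33/18 MPH
        ('SPEED',        r'\d+(\.\d*)?\s{1}MPH'),                  # ex. 38 MPH
        ('LENGTH',       r'\d+(\.\d*)?\s{1}Feet'),                 # ex. 56.0 Feet
        ('TEMP',         r'\d+(\.\d*)?\s{1}[F]'),                  # ex. 81 F
        ('DATETIME',     r'(\d+:(\d+(:\d)*)*)+\s+(\d+-\d+-\d+)'),  # ex. 00:02  09-22-2017
        ('TIME',         r'(\d+:(\d+(:\d)*)*)+'),                  # ex. 00:03:45
        ('NUMBER',       r'\d+(\.\d*)?'),                          # integer or decimal
        ('ID',           r'([/\w]+\s?)+'),                         # names like "Slow Speed"
        ('ASSIGN',       r': '),                                   # assignment operator
        ('NEWLINE',      r'\n'),
        ('SKIP',         r'[ \t]+'),                               # whitespace to ignore
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    for mo in re.finditer(tok_regex, s):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind in ('NEWLINE', 'SKIP'):
            continue
        yield Token(kind, value.strip(), line_num, mo.start())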

Let’s look a bit closer at a couple more of these regex.

        ('ASSIGN',          r': '),           # Assignment operator

The assign operator is very important as we are going to use each instance of it to identify a rule that the NEXT token value should be ASSIGNED to the previous token value. The “little r” before the string means a “raw string literal”. Regex are heavy with “\” characters; using this notation avoids having to add an escape character for every one of them.

        ('DATETIME',        r'(\d+:(\d+(:\d)*)*)+\s+(\d+-\d+-\d+)'),  # Datetime value (ex. 00:00:00  12-12-2017)

Datetime takes the numeric pattern I explained in detail above but changes the “.” to a “:”. In my file, I want both 00:00 and 00:00:00 to match the time portion of the pattern, so I use a nested “*” (remember, that means 0 to n occurrences). The “+” at the end of the first section means at least one occurrence of the time portion, so a plain date field will not match this datetime regex. The “\s+” then matches one or more spaces between the time and the date. The last section for the date takes integer values with two dashes (“-”) in between, which means 2017-01-01, 01-01-2017, or even 2017-2017-2017 would match the Datetime date section. This may be something I should clean up later 🙂

    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

 

I wanted to quickly point out how cool it is that Python then allows you to take the list of regex specifications and separate them with a “|” by using the '|'.join() notation. This results in the crazy-looking regex below:

'(?P<SPEED_IN_OUT>(\\d+(\\.\\d*)?/\\d+(\\.\\d*)?\\s{1}MPH))|(?P<SPEED>(\\d+(\\.\\d*)?\\s{1}MPH))|(?P<LENGTH>(\\d+(\\.\\d*)?\\s{1}Feet))|(?P<TEMP>(\\d+(\\.\\d*)?\\s{1}[F]))|(?P<DATETIME>(\\d+:(\\d+(:\\d)*)*)+\\s+(\\d+-\\d+-\\d+))|(?P<TIME>(\\d+:(\\d+(:\\d)*)*)+)|(?P<ID_W_NBR>(\\d+(\\.\\d*)?\\s([/\\w]+\\s?)+))|(?P<NUMBER>\\d+(\\.\\d*)?)|(?P<ID>([/\\w]+\\s?)+)|(?P<ASSIGN>: )|(?P<NEWLINE>\\n)|(?P<SKIP>[ \\t]+)'

Two important things were done here. We gave each specification the ?P<name> notation, which allows us to reference a match group by name later in our code. Also, each token specification was wrapped in parentheses and separated with “|”. The bar acts like an OR operator, and the regex is evaluated from left to right to determine a match; this is why I wanted to put the most restrictive patterns first in my list.

The rest of the code iterates through the line (or string) that was given to find matches using the tok_regex expression and yields a token that includes the kind (or type) of the match found and the value (value.strip() removes the whitespace from the beginning and end).

Evaluating the Output

Now that our parser is defined, let's process the formatted file above. We add some conditional logic to skip the first line and any lines that have a length of zero. We also stop processing whenever we no longer encounter lines with “:”. This effectively processes all of the headers; we will save the line-item processing for another task.
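A sketch of that driver loop, under the assumption that the header lines are read from a text file and collected into the ls list described later, looks like this (tokenize is the generator sketched earlier):

ls = []  # will hold one dictionary of id/value pairs per header line

with open('weigh_station.txt') as f:
    for line_number, line in enumerate(f):
        if line_number == 0:          # skip the first line
            continue
        line = line.rstrip('\n')
        if len(line) == 0:            # skip empty lines
            continue
        if ':' not in line:           # stop once the header section ends
            break
        print(line)
        for token in tokenize(line):
            print(token)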

The first few lines processed will result in the following output from the print statements (first the line, then each token in that line)

Site Name   : Chicago IL                  Seq Number     : 111
Token(typ='ID', value='Site Name', line=1, column=0)
Token(typ='ASSIGN', value=':', line=1, column=12)
Token(typ='ID', value='Chicago IL', line=1, column=14)
Token(typ='ID', value='Seq Number', line=1, column=42)
Token(typ='ASSIGN', value=':', line=1, column=57)
Token(typ='NUMBER', value='111', line=1, column=59)
Mile Mrkr   : 304.40                      DB Index #     : 171
Token(typ='ID', value='Mile Mrkr', line=1, column=0)
Token(typ='ASSIGN', value=':', line=1, column=12)
Token(typ='NUMBER', value='304.40', line=1, column=14)
Token(typ='ID', value='DB Index', line=1, column=42)
Token(typ='ASSIGN', value=':', line=1, column=57)
Token(typ='NUMBER', value='171', line=1, column=59)
Direction   : South                       Arrival        : 00:02  09-22-2017
Token(typ='ID', value='Direction', line=1, column=0)
Token(typ='ASSIGN', value=':', line=1, column=12)
Token(typ='ID', value='South', line=1, column=14)
Token(typ='ID', value='Arrival', line=1, column=42)
Token(typ='ASSIGN', value=':', line=1, column=57)
Token(typ='DATETIME', value='00:02  09-22-2017', line=1, column=59)
Speed In/Out: 33/18 MPH                   Departure      : 00:03:45
Token(typ='ID', value='Speed In/Out', line=1, column=0)
Token(typ='ASSIGN', value=':', line=1, column=12)
Token(typ='SPEED_IN_OUT', value='33/18 MPH', line=1, column=14)
Token(typ='ID', value='Departure', line=1, column=42)
Token(typ='ASSIGN', value=':', line=1, column=57)
Token(typ='TIME', value='00:03:45', line=1, column=59)
Slow Speed  : 38 MPH                      Approach Speed : 0 MPH
Token(typ='ID', value='Slow Speed', line=1, column=0)
Token(typ='ASSIGN', value=':', line=1, column=12)
Token(typ='SPEED', value='38 MPH', line=1, column=14)
Token(typ='ID', value='Approach Speed', line=1, column=42)
Token(typ='ASSIGN', value=':', line=1, column=57)
Token(typ='SPEED', value='0 MPH', line=1, column=59)
Approach Length: ~0.0 Feet
Token(typ='ID', value='Approach Length', line=1, column=42)
Token(typ='ASSIGN', value=':', line=1, column=57)
Token(typ='LENGTH', value='0.0 Feet', line=1, column=60)

Notice how everything is parsed beautifully without having to do any counting of spaces or finding explicit header names. Being able to identify SPEED, TIME, and LENGTH also means we can write a function to convert these to the proper type format and add a unit-of-measure column if needed.

The only assumptions we are going to make to process this header information are as follows:

1. Skip the first line
2. End processing when a non-empty line no longer has an assignment operator of “:”
3. The pattern expected for each line is 0 to n occurrences of ID ASSIGN any_type

To handle #3 above, we add the below code to the end of the for loop shown above:

If you follow the logic, we are just taking the string (each line of the file) and recording the value of the first token as the id, finding the assign operator “:”, and then recording the following token value as the value of a dictionary object. It then appends that dictionary to the “ls” list that was initialized in the first code snippet.
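Sketched as a small helper (the function name and structure are illustrative; it relies on the tokenize sketch from earlier):

def line_to_dict(line):
    # Pair each ID token with the value token that follows its ASSIGN token
    d = {}
    key = None
    expecting_value = False
    for token in tokenize(line):
        if token.typ == 'ASSIGN':
            expecting_value = True        # the next token is the value for `key`
        elif expecting_value:
            d[key] = token.value
            expecting_value = False
        else:
            key = token.value             # remember the most recent ID token
    return d

# inside the driver loop shown earlier:
#     ls.append(line_to_dict(line))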

We could then format it as JSON by adding the below line of code after the for loop
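That is presumably just a call to the standard json module, something like:

import json

print(json.dumps(ls, indent=2))  # serialize the list of header dictionaries as JSON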

See the output below. Some additional formatting work needs to be done with this, as well as pre-processing the numbers and datetimes so they are not represented as strings, but that is not the focus of this blog post.

Now What?

I hope to do a continuation of this blog post and explore a server-less architecture of taking the file from the FTP server, immediately running this pre-processing, and dumping the JSON out to a stream ingestion engine. From there, we can do all sorts of cool things like publish real time data directly to Power BI, or into a Big Data store. This follows principles of “Kappa Architecture”, a simplification of “Lambda Architecture” where everything starts from a stream and the batch processing layer goes away.

There are multiple ways to implement this, but with cloud computing, we have an opportunity to do this entire chain of events in a “server-less” environment, meaning no virtual machines or even container scripts have to be maintained. So, let's cover this next time.

Conclusion

Regex are super powerful. I ignored them for years and now I feel super smart and clever for finding a better solution to file processing than I would have originally implemented without regex.

The full Python code from above as well as the formatted file can be found on my GitHub here.


Data Driven Subscriptions in Power BI

What was that? Did I just say you could create a data driven subscription in Power BI? Like old-school SSRS reporting, where I can determine that these 5 sales people get an email with the information only relevant to them while these other 5 people over here get information only relevant to them?

Yep, that is exactly what I said…

Really, this is pretty basic stuff, and although you will see that this technique is a bit limited in options (for instance, we can ONLY send a hyperlink to the report, not a PDF or attachment), it works nonetheless.

Microsoft Flow

I have chosen to do this via Microsoft Flow, but this technique could be achieved in several ways. You can use the scheduler of your choice. Flow is a business-friendly workflow engine, and if you are already an Office 365 shop, it is likely you can use it as part of your current licensing. In this article we are going to use Flow to iterate through a subscription table stored in SQL Server twice a day and send emails with a hyperlink to the properly filtered report.

To learn more about Flow and how to evaluate it for FREE, visit https://preview.flow.microsoft.com/en-us/

The Power BI Report

In this blog post we will use the AdventureWorksDW database and key off of the “DimSalesTerritory” table to determine which sales country should be filtered in the subscription notification. The report has two pages showing sales information. The PBIX file is available on my GitHub repository if you don’t want to create your own.

When setting up the report, add DimSalesTerritory.SalesTerritoryCountry to the Report Level Filters. Adding to the Report Level will ensure that EVERY page of the report adheres to the filtered value that comes from the subscription (described below).


Enabling Data Driven Subscriptions

To achieve this we will use URL query string parameters as described here when putting the hyperlink in the email notification.

To create a custom hyperlink for each email address (to make it truly data driven), I am using a SQL Server table to store the Sales Territory each employee should see in the report. Below is the DDL for the “subscriptions” table that I am adding to the AdventureWorksDW database. However, this could easily be a SharePoint list or some other data store that can be accessed from Microsoft Flow to drive the email subscriptions.
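An illustrative sketch of such a table is below; the column names and sizes are assumptions rather than the exact schema in the repository:

CREATE TABLE dbo.subscriptions (
    SubscriptionId        INT IDENTITY(1,1) PRIMARY KEY,
    EmailAddress          NVARCHAR(256) NOT NULL,
    GroupId               NVARCHAR(50)  NOT NULL,  -- app/workspace id portion of the URL
    ReportId              NVARCHAR(50)  NOT NULL,  -- report id portion of the URL
    SalesTerritoryCountry NVARCHAR(50)  NOT NULL   -- value used in the ?filter= query string
);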


URL Formats in Power BI

The table above will contain the email address of the recipient as well as critical sections of the Power BI URL: the group id and the report id.

There are 3 different formats of the base URL; for this example, you will need to determine how your report is made available to users in Power BI.

If the report is shared directly in your personal workspace, the format is below
https://app.powerbi.com/groups/me/reports/{reportid}/ReportSection

Notice that in this scenario, the group identifier is “me”. This indicates the report is available in your personal workspace.

If the report is shared with you from an App Workspace (formerly known as Group Workspace), the format is below
https://app.powerbi.com/groups/{groupid}/reports/{reportid}/ReportSection

As described in my previous blog post Power BI Content Workflow – with Apps, the best practice, however, is to distribute content via an “App”, and therefore the format should render as below:
https://app.powerbi.com/groups/me/apps/{groupid}/reports/{reportid}/ReportSection

In this scenario, we are using the groupid value from the SQL Server data source as the App, and you can see that the group is still specified as “me”.

Query String

In the guidance referenced above for including a query string, all of the above formats would have, for instance, ?filter=DimSalesTerritory/SalesTerritoryCountry eq 'United States' appended to set the report filter to only show sales related to the United States.
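Putting the base URL and the query string together (with the group and report id placeholders that come from the subscriptions table), a complete data-driven link would look something like this:

https://app.powerbi.com/groups/me/apps/{groupid}/reports/{reportid}/ReportSection?filter=DimSalesTerritory/SalesTerritoryCountry eq 'United States'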

SQL Server Entries

In this example, there are two entries in my subscriptions table, one for steve@angryanalytics.com that will be sent a link via email for ‘United States’ sales and another for bob@angryanalytics.com that will be sent a link for ‘France’ sales.

Setting up Microsoft Flow

In Microsoft Flow, there are two types of activities: Actions and Triggers. We will create a “Schedule” trigger in a blank flow

From the Schedule (Recurrence) trigger, select the frequency to be Daily and the proper time zone from the advanced options. Then you can select multiple hours at which you want the subscription to be run.

To add an additional action, use the New Step button

After that select the “SQL Server – Get Rows” action and find your “subscription” table. Note: You will need to connect to your database first if you have never connected before via Microsoft Flow.

 

Magic

And this is where the magic happens if you have never used flow before… add the “Send an email” action. Flow detects the previous action and sees that there is a “Get Rows” that has been performed. It will first create a loop (Apply to each) for every row in the table and then it will show the columns available from the selected rows so that they can be used in the “Send an email” action.

The email address can be used for the “To” location. I created a static subject for “Your Power BI report is ready!”. Click “Show advanced options” to expand the selection and make sure to set “Is HTML” to “Yes”.

Finally, in the “Body” use an HTML tag <a href="…">Click Me!</a> or something more elegant of course. In the href attribute, string together the base URL (in my case, “https://app.powerbi.com/groups/me/apps/”) and append the rest of the variable data from the subscriptions table to make a dynamic, data-driven link.

Now Save your flow and you can click “Run Now” to test it (this will run it now regardless of the trigger schedule you created).

Result

As expected I receive an email for both steve@angryanalytics.com and bob@angryanalytics.com

Because we used a Report Level filter when we set up the Power BI report, both pages of the report will automatically filter to the expected sales territory.

 

Now, every day at the times selected for the schedule, hundreds of people can get the filtered report link that shows them the most up-to-date information after data refreshes have occurred.

Other Actions

In this example I used a simple email; however, you could have selected Twilio for text messaging notifications or something like SendGrid to create highly polished email content that looks much better than a simple URL link. More complex workflows could also be performed by creating conditional logic or more complex filters.

Conclusion

As said, this is pretty basic stuff… but very powerful.


Power BI Routing Visual with Two Lines of R

Objective

Although the out of box Bing and ESRI maps in Power BI can visualize most business requirements, trying to visualize routes or shipping lanes for the transportation industry can be challenging. Fortunately in Power BI we can expand our visualization palette by using either custom visuals or R visuals.

In this blog post we will look at publicly available Flight Data and determine routes that have the highest likelihood of cancellation. This could easily be translated to shipping or transportation scenarios.

You can get the final solution from my GitHub repository, or you can download the “Airports” data set from http://openflights.org/data.html and the “Flights” data set from Enigma.IO.

NOTE: Enigma.IO is a repository for public datasets. It is free but requires you to create a login to use it. I enjoy working with the enigma-us.gov.dot.rita.trans-stats.on-time-performance.2009 data set as it is rather large at 2.4 GB.

Although the above visual only requires two lines of R code to be written, there are two hurdles to get over first: ensuring R and the required libraries are installed on your desktop, and doing the data prep in the Query Editor to create the R data frame in the format that is expected.

Installing R and ggmap

There is already well documented guidance on installing R to be used with Power BI on the Power BI Documentation site. Once this installation has been complete, we need to get ggmap and other supporting R libraries installed as well.

I prefer going to the RGui command line (just search for “RGui”) and performing the following command:
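The command in question is the standard package install, along the lines of:

install.packages("ggmap")  # installs ggmap and pulls in dependencies such as ggplot2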

Doing this in the R console will automatically download any dependent packages as well. If you performed this line in a Power BI visual directly, it would not install the other required packages and you would get an error when you run the solution.

Data Prep

In Power BI, let's first bring in the airports data CSV file we downloaded from http://openflights.org/data.html. The important columns in this data set are the 3-letter airport code and the latitude and longitude of the airport. You can include the other fields for more detail as I am showing below; however, they are not necessary for us to achieve the R visual above.

Next, import the flight data that was downloaded from Enigma.IO for 2009. This data is extremely wide and a lot of interesting data points exist; however, we can simply remove a large portion of the columns that we will not be using. Scroll to the right, Shift+click, and right-click to Remove Columns that start with “div”. Alternatively, you can use the “Choose Columns” dialog to un-select them.

To reduce the number of rows we will work with, filter on the flightdate column to only retrieve the last 3 months.

Shaping the Data

We now need to shape the data for the R visual. To only require 2 lines of R, the data has to be in the following format:

index  Direction    Latitude  Longitude
1      Origin       41.97     -87.9
1      Destination  33.63     -84.42
2      Origin       41.73     -71.43
2      Destination  33.63     -84.42
3      Origin       35.21     -80.94
3      Destination  33.63     -84.42

We will include other columns; however, the format of alternating rows for the origination of the route and then the destination, with the latitude and longitude for each, is required. Having an Index column that keeps the origin and destination values ordered appropriately will also keep Power BI from making adjustments that you don't want.

The Double Merge

Latitude and Longitude are not included in the flight data so we need to do a MERGE from the airport data set that we have.

NOTE: Finding lat/long values for zip codes, airport codes, city names, etc… is generally the hardest part of using ggmap and is why most of the time the use of a second reference data set is required to get this data

Perform a “Merge Queries” against the “origin” column of the flights data and the “Code” column (this is the 3-letter airport code from the airports data set).

Rename the newly created column as “Origin” and then click the “Expand” button to select the Latitude and Longitude columns ONLY to be merged.

Now repeat the above Merge steps a second time, but instead of using the “origin” column, use the “dest” column from the flights data and merge against the same 3-letter code in the airports data. Call the new column “Dest” before expanding the Latitude and Longitude.

Once finished your dataset should look like this:

We have successfully acquired the latitude and longitude which ggmap needs to plot the path of the line. However, we need the Lat/Long values for origin and destination to be on separate rows as shown above. This is where we get to use my favorite M/Power Query function: “Unpivot”.

Using Unpivot

To ensure later that the order of our rows is not out of sync when creating the data frame for the R visual, add an index column via the “Index Column” button on the “Add Column” tab. I start at 1.

We need to have two columns to unpivot on to create two rows for each single row currently in the data set. I achieve this most simply by adding two custom columns via the “Custom Column” button on the “Add Column” tab. Just fill in the expression with the string “Origin” and then “Destination” for each new column as shown below

The data should now look like this:

Select both of the new custom columns (mine are called Direction 1 and Direction 2), right-click, and select “Unpivot Columns”.

Now each row has been duplicated so that we can get the Origin and Destination latitude and longitude on separate rows.

The newly created “Attribute” column should be removed and in my example I have renamed “Value” to “Direction”.

NOTE: In this example I am using unpivot to manipulate my single row into two rows. A more meaningful use of unpivot would be if you have revenue data by month and each month is represented as a column (Jan, Feb, March, etc…) you could select all the month columns and “Unpivot” and you would now have a row for each month as the attribute and the sales amount as the value.

Conditional Columns

Once the unpivot has been completed, add two conditional columns for “Latitude” and “Longitude” via the “Conditional Column” button in the “Add Column” tab to get the values for Origin and Destination into a single column for each. Use the “Direction” column as the conditional and when it equals “Origin” select the “Origin.Latitude” column. Otherwise, select the “Dest.Latitude” column.

See below example for Latitude:

Be sure to change the type of the two circled buttons from Value to Column.

Repeat the above for Longitude.

Change the type of these new columns to decimal numbers.

Now remove the 4 columns of “Origin.Latitude”, “Origin.Longitude”, “Dest.Latitude”, “Dest.Longitude”.

The last 4 columns of the Flights data set should now look like this:

Data prep is complete. We can now close and Apply.

The data load will take a long time if you used the 2.4 GB file from Enigma.IO… Time for coffee 🙂

Creating the Viz

As we work with this data set, remember that we now have two rows representing each flight. So if you want to get any counts or summarizations, always divide by 2 in your measures.
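The measures might look something like the sketch below (the Flights table and cancelled column follow this post's data set, but the exact names and logic are assumptions):

Number of Flights = COUNTROWS ( Flights ) / 2

Number of Cancelled Flights =
CALCULATE ( COUNTROWS ( Flights ), Flights[cancelled] = 1 ) / 2

Cancellation Rate =
DIVIDE ( [Number of Cancelled Flights], [Number of Flights] )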

With these 3 measures, we can start working on visualizations. First, simply a few card visuals and a matrix by origin/destination

We have 1.6 million flights in our data set. Creating an R visual to represent all of these routes will not be very productive and will probably crash your memory anyway. Let's set up a filter by origincityname and only plot the routes for the cancelled flights.

For the above visual, we first should add origincityname as a slicer and select “Atlanta, GA”. Then add a slicer for cancelled = 1.

To create the R visual, select the “R” visualizations icon

Pull in the following values from the “Flights” data

This automatically generates some R code in the R script editor for the Visual

NOTE: This is really important, as we have eliminated potentially hundreds of lines of R code to “prep” the data to make the data frame look like we need it to be entered into the ggmap function. This was all done via the Query Editor transformations we made above.

Now we simply add the 2 lines of R code underneath the generated R script section
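A sketch of the kind of code involved is below; dataset is the data frame Power BI passes to the R visual, the column names assume the Index/Direction/Latitude/Longitude shape built in the Query Editor, and the map location and styling are illustrative:

library(ggmap)

# 1) Grab a base map roughly centered on the area of interest
map <- get_map(location = 'united states', zoom = 4)

# 2) Draw each route as a path, grouping the origin/destination row pairs by Index
ggmap(map) +
  geom_path(data = dataset, aes(x = Longitude, y = Latitude, group = Index), colour = 'red')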

NOTE: The comments make it longer than two lines, but they help describe what is happening.

The geom_path function is part of the ggplot2 library and is further explained here: http://docs.ggplot2.org/current/geom_path.html

Hopefully from this example you can see that the R code is fairly minimal in filling the requirement of routing visualization.

Measures vs Columns for this R Visual

One limitation that currently exists in Power BI Desktop is that the measures we defined earlier are not really providing value to the visualization because we need to include the “Index” column to keep the dataset ordered as expected for ggmap to plot the route from the alternating rows of data.

Because the index column is required to keep the sort order, the filter context applied to the DAX measure of “Number of Cancelled Flights” will always equal 1. This does not allow us to do much “Business Intelligence” of the data set.

EARLIER function

To this day, I am still not fully aware of why they call this function EARLIER, but what we need to do to introduce some actual “Business Intelligence” into this R visual is to create a column with the total number of cancelled flights for a given route. This column's data will be repeated over and over, so beware of how you utilize it. However, it will give us a great way to ONLY retrieve the data that we want.

For the “Flights” data set, add the following column to create a unique value for each Route that exists between airports:
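A sketch of the idea (the origin and dest column names follow the earlier merge steps; the exact expression is an assumption):

Route = Flights[origin] & " - " & Flights[dest]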

Once that value is added, the EARLIER function can be applied to get the total number of cancellations:
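Again as a sketch, the EARLIER pattern for such a calculated column could look like the following; it counts cancelled flights across all rows sharing the current row's Route value, divided by 2 because every flight appears on two rows in the shaped data:

Cancelled Flights On Route =
COUNTROWS (
    FILTER (
        Flights,
        Flights[Route] = EARLIER ( Flights[Route] )
            && Flights[cancelled] = 1
    )
) / 2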

The above value is repeated in every row, so don’t use it to be summarized… use it as a page level filter or a slicer to only retrieve data that meets that requirement (example: Only show routes that have more than 25 cancellations)

Make sure your slicers are cleared and your new plot should look something like this:

Now the solution is starting to become usable for gaining some insights into the data. You can download my finished solution from the GitHub repository and see that I have duplicated the “Airports” data set and have one for Origin and one for Destination that I can use as a slicer to more quickly find routes that have frequent cancellations from/to each city or state.

Conclusion

This is just an example of the many ways Power BI can be extended to help solve business problems through visualizations. One note to make is that ggmap is not yet supported by PowerBI.com. This specific R visual is only available in the desktop but many other R visuals are supported in the service as well.

And for my next article, we will see if I am brave enough to post my real opinions on the internet about Big Data and data warehousing architectures. I have lots to “rant” about, but we will see if I actually post anything 🙂

Map Visual – How to deal with large datasets

 

The Power BI bubble map is very useful when plotting geography points rather than shapes or areas. I'd say that in 90% of cases the out-of-the-box functionality should be fine for how you should be representing your data. Occasionally, however, you will run into scenarios where you must plot more than a few thousand data points, in which case you will exceed the upper bound and get the little information icon shown below.



What is frustrating about this when it comes to maps is that it does not take a representative sample of the data and instead will just “cut off” the data points. Other visualizations will show an information icon with this message instead:


There are 3 techniques that I have used to deal with this issue.

  • Creating a representative sample via DAX
  • Using R script with ggmap
  • Using a location hierarchy

All of these have pros and cons.

Getting a Representative Sample via DAX

The best way to illustrate the native map behavior and how we solve for it is by using an embedded Power BI report. I am going to use some anonymized data from a logistics company that represents cell phone signal strength across one of their routes. The route in this example starts in the Chicago, IL area (United States). See the first tab in the report below.

Note: Use the “full screen” button in the lower right to expand

In the 4 visuals above, notice that the 1st and 3rd only plot a tiny portion of the latitude/longitude. This is confirmed if you hover over the highest point: the latitude is much less than the max latitude shown in the multi-line card visual in the upper right. You can also just “zoom out” on these and see that no points are plotted anywhere near Chicago, which is where this route starts.

I included the 2nd visual as I did find it interesting that when I used a value in the SIZE field that may be greater than 1, it seemed to plot more evenly across the entire span of the latitude, however I still would not consider this “representative”.

The 4th visual is simply a scatter plot of the latitude/longitude. Note when hovering over the “I” icon in the upper left it states that it is a representative sample. This is what we want to achieve in the bubble map that is not done natively. In this scenario where we have a continuous latitude/longitude plot, this is a good solution.

 

Now navigate to the second page of the report. Here you will find the map visual always giving the representative sample regardless of how many dates you select in the slicer. I have solved the real business problem here by adding the cellphone signal strength as the legend across the route.

Compare the min and max latitudes in the card visual with hovering over the highest and lowest plotted point on the map. You should notice them to be consistent and the points in between are filled nicely.

“If there is an easier way to do this than what I am about to show, please share with me”

I worked through several iterations before arriving at the conclusion that using RAND() (or RANDBETWEEN(), actually) was the magic I needed. This is the random number generator expression, for those of you that don't know. And for those of you that do know, you are probably questioning my knowledge a little bit at this point 🙂 but this was truly the only way I got this to work without having to make too many assumptions about the data.

Let’s get into the DAX.

It is important to know how you plan to slice the data in your map visual. In my case, I want to be able to select a date or range of dates to review the cellphone signal strength and how it may vary over time. We can simply get the count of rows in the table with a formula like this:
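It is essentially a plain row count, something like the sketch below (the Signals table name is an assumption):

Total Signals (WRONG) = COUNTROWS ( Signals )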

I have appended “WRONG” to the measure name because we run into an issue of CONTEXT on the map and we have to do a context transition (as described by Alberto Ferrari here). In the correct formula below, we need to include ALL rows of the table, except for the rows we are filtering out with the date slicer.
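A sketch of what that corrected measure could look like is below; it keeps the date slicer selection but ignores the filter context coming from the individual points on the map (table and column names are assumptions):

Total Signals =
CALCULATE (
    COUNTROWS ( Signals ),
    ALLEXCEPT ( Signals, Signals[Date] )
)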

Notice on the second page of my embedded report above, I have listed a couple of the measures in my card visual that are the wrong ones. Apply a date filter and then, in the map, select a single point. Notice how the “Total Signals (WRONG)” value changes to 1. This is because you have just applied filter context to the measure when you selected that single point on the map. We have to use this measure in getting a representative sample, but it has to ignore the map's filter context, which is why in the RIGHT measure above we have done the context transition to get ALL of the rows except for those being filtered by the date.

Now we need to apply a sample rate to the selected rows. The sample rate will be dynamic depending on how many rows are selected. We start to exceed the bounds of the map visual at around 3000 points, so I play it safe below and use 2500 as my denominator. This is because the later calculation will not exactly match this and we may end up with more than 2500 rows.
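Sketched as a measure (names are assumptions):

Sample Rate = CEILING ( [Total Signals] / 2500, 1 )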

The CEILING function just rounds whatever the decimal number is up to the nearest integer, as I have specified “1” as the second argument.

On page 2 in the report above, you can see how the sample rate changes as the total signals increases with multiple dates selected in the slicer

As the next building block in the representative sample formula, we will pull out the random number generator to give a random number between 1 and the dynamic sample rate that was calculated above:
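Sketched as a measure, that is essentially:

Random Number = RANDBETWEEN ( 1, [Sample Rate] )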

We will use the number generated from this measure and compare it to our Sample Rate. When they are equal we will plot the point.


Before you call me crazy and question my judgement… yes, I know that RAND or RANDBETWEEN does not guarantee me a consistent sampling of data. If my sample rate is 100, it may take me 150 or even 200 tries before my measure above equals the sample rate. But it may also only take 5 tries to equal it. I look at the use of RAND() as a “better than the alternative” approach, as it gives me the opportunity to get a representative sample of my data versus getting a cut-off and unusable data set.

In a prior attempt, I used the MOD function and some other DAX acrobatics to try to ensure a consistent representation of the data, but in each attempt I was defeated by the filter context issue of the lat/long on the map. This is where I would humbly welcome feedback from any DAX ninjas if there is a better way.

The reason that using the RANDBETWEEN() function works is that it re-evaluates for every single point on the map trying to be plotted.

Below is the measure for the representative sample. Use this in the SIZE field.

As it is possible I could have multiple rows being counted for the same lat/long position, I add some additional DAX to ensure I only have a size of 1 or BLANK().
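Condensed into a single sketch (the real report splits this across the two measures described here), the value used in the SIZE field boils down to plotting a point only when the random draw hits the sample rate:

Representative Sample =
IF ( [Random Number] = [Sample Rate], 1, BLANK () )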

The result below is a nice smooth representative sample without too much DAX needing to be written.


Pros for using this approach:

  • Easy… low difficulty level
  • Native Bing map means it is zoom-able… in my use case, it is important to be able to zoom in on the areas that have bad signal coverage
  • Great for big datasets… I believe all the DAX used above is available when doing direct query

Cons

  • Using RAND does not ensure a consistent sampling
  • May not be well suited for geographically disperse data (see the hierarchical approach below)

Using R script with ggmap

Another way to approach the sampling problem is with an R visual. We can create an R Visual that uses ggmap to plot the data points for our route.

I drag in the LATITUDE, LONGITUDE, Max_Signal_Rate, and the Total Signals measure (to avoid pre-sampling the data, reset this measure to just use the COUNTROWS function)

Here is the R script that then needs to be put in the R Script Editor
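A sketch of the same idea is below; dataset is the data frame Power BI passes to the R visual, the column names match the fields dragged in above, and the map location and zoom are illustrative:

library(ggmap)

# Base map around the area the route covers (Chicago in this example data)
map <- get_map(location = 'Chicago', zoom = 8)

# Plot every point, letting the signal strength drive the default color scale
ggmap(map) +
  geom_point(data = dataset,
             aes(x = LONGITUDE, y = LATITUDE, colour = Max_Signal_Rate),
             size = 1)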

To use the above code, you will need to be sure you have installed R on your machine and that you have also installed the ggmap and ggplot2 packages.

The above code will produce a plot shown below and can still be filtered with a date slicer.


As I am still learning R, I didn't take the time to assign a specific coloring scheme for the plotted points, so the above just took the default.

There are a couple of obvious CONS to this approach

  1. You have to know R. It is not nearly as easy as just using native Power BI visuals and a little bit of DAX
  2. The map is not zoom-able. It is completely static which does not allow me to achieve solving my business problem which is to zoom in on the spots that my signal strength is low. For my use case, this does not work, however in another use case, the zoom may not be an issue.

A couple of PROS for this approach

  1. I can plot more data points with ggmap. It appears there is still a data limit but it is larger than the out of box map. I have not yet determined what that limit is. I believe it is more to do with Power BI R visuals however and not with the R ggmap functionality.
  2. With the ability to “code your own visual”, there is much more control and more opportunities to create a custom experience. Instead of using the geom_point function, we could use geom_tile to see a more “blocky” heat-map-style plot as shown in this tutorial, or even use geom_path, which would allow you to do more of a point A to point B type of map if you didn't really need to plot information along the way as shown in my example.

Using a Location Hierarchy

This to me seems like an obvious approach when dealing with location data that has a natural hierarchy such as Country->State/Province or Country->City.

So instead of having something like this


by using a hierarchy in the “Location” field you can right click and “drill down” on a parent geography to see the details



This is a very clean way to keep your data points under the threshold of around 3000 without having to write any DAX or R. The biggest problem with this, however, is that it is fully dependent on the location hierarchy providing the necessary split to keep the lowest level of detail from going over the threshold.

 

These are 3 different ways of dealing with too much data trying to be plotted on your maps. Again, for the representative sample pattern I showed, I am interested in any thoughts on better approaches. Leave them in the comments or ping me on Twitter at @angryanalytics.

Dear Mr. CISO

“Dear Mr. CISO,

 As our Chief Information Security Officer I know you are exhausted trying to keep us safe from security breaches and you are ultimately responsible for any that may occur in our organization. I have started working with Power BI Desktop to get a more modern and effective view of the data that is running our business. My manager told me though that we can’t upload any of our data to the cloud. I asked her why and she just told me it was because we haven’t gone through the necessary steps to ensure our data would be safe and that it is too much of a risk. When I pushed for more information, she said it is ultimately your team’s decision and when you approve it then we can use products like Power BI that have components that live in the cloud.”


In so many organizations that are starting to see the benefits of using a Software as a Service (SaaS) solution like Power BI, this is the elephant in the room. Individual departments start to use the FREE version of Power BI. At some point the organization wants to limit the use until a security review has been completed by the security team. This blog post will discuss 5 main security areas your CISO will want to know about:

  1. Compliance
  2. Authentication
  3. Authorization
  4. Data Security
  5. Auditability

Power BI is a self-service solution. It is intended to enable business users to “do more” with their data. In the data security section of this post I will talk about how data can remain on premises and only be accessed via a private pipe through Power BI, but resist the urge to restrict all data from being uploaded to the cloud. It will greatly reduce your ability to truly enable a data culture in your organization. Create policies that define which data is High Business Impact (HBI) and which data is Low Business Impact (LBI). To start with, you may choose to keep your HBI data on premises only while allowing your LBI data to be published to the cloud. Over time, as your comfort grows with a cloud based solution, those policies can be modified.

Prior to discussing these 5 areas of security, I want to reference two white papers that have been extremely valuable to me. The Power BI Security White Paper is publicly available and describes in detail how Power BI as an application is architected and how data is stored in the cloud. There is also a great governance white paper published by Melissa Coates (@SQLChick) and Javier Guillén (@javiguillen) that will help you understand approaches to deploying Power BI.

Compliance

As of the time of this writing in July 2016, a number of compliance certifications are held/adhered to by Power BI. An updated list should be available in the Microsoft Trust Center.


Most security conversations should start here. Hundreds of pages of security requirements are documented in these compliance standards (I will leave it to you to bing or google them)… A security conversation should not start from ground zero. It should start from the coverage of the above compliances and ask “beyond these requirements, what are your concerns?” If Power BI meets these security requirements, why rehash them? Start with the “above and beyond” and reduce a week-long security review effort to a couple of hours.

One compliance worth pointing out is ISO 27018. This is the compliance for Personally Identifiable Information (PII) in the public cloud. This is often the data that is most scrutinized for its security in organizations that hold it: full name, birth date, SSN… no one wants to see this data leaked in a security breach. Power BI attests to holding this standard to keep this data safe.

Lastly on this topic, Microsoft strives to hold as many of the compliances that organizations frequently request. So check the Trust Center link above every few months for other compliances that may be important to you if they are not yet there.

Authentication

The front door… This is where arguably 80% or more of security breaches happen. Weak username/password security.

Have you heard of phishing attacks? Someone sends you an email that looks like a legitimate request with your company’s logo to reset or verify your credentials by clicking on a link. You go to the bogus page (that has your company’s logo on it) and you enter your username/password and it thanks you for your verification… it happens all the time… and sadly, it works so often…

Multi Factor Authentication (or MFA) to the rescue…

Power BI can be set up to ensure that any time someone tries to log in from outside the organization's network, a second (or even third) method of identity verification is required before allowing the user access. Often the Power BI user enters their credentials and is then required to accept a phone call (or use an app) to verify their identity prior to being allowed into the application.

So, back to that little phishing attack above: when the hacker tries to use the credentials that were naively shared with him, the real user would get a call on their cellphone asking them to verify their identity to access the system, and warning them to contact their IT administrator if they did not just try to access the system themselves. It could even require them to enter a PIN known only to them, in case the hacker also knew the person and took their phone.

This single capability of multi-factor authentication can in most ways be just as powerful an authentication method as the badge you swipe every day to get into the building you work in.

Authorization

Once a user has validated who they are and has successfully logged in to Power BI, it then becomes a question of what data are they “authorized” to see.


I previously did a blog post on how to distribute content in Power BI. This covered how to give the appropriate people access to the appropriate content. However, if you want to distribute a single report to multiple users but only allow each user to see the data they have access to, this can be achieved through the Row Level Security (RLS) feature that has already been well documented.

Again, it is important to remember that Power BI is a self-service solution that allows the content author to have control on how data should be shared. Additional features will be added over time to restrict some capabilities and to provide further governance for enterprise solutions, but don’t get too caught up in what “is not restricted” currently as people generally do not go out of their way to find additional work.

Data Security

So now your data is outside the four walls of your data center… it is in the cloud… is it safe?

This is where I refer to the security white paper for in-depth detail on how Power BI secures data. First of all, Power BI as an application follows general 3-tier architecture principles to ensure that the web front end communicates through services to the back end. The “Data Storage and Movement” section of the white paper discusses how nearly all of the data is encrypted at rest in Power BI, with the remainder planned to be encrypted in Q3 of 2016.

If you are not quite ready to let your HBI data be uploaded to the cloud, Power BI does offer the capability to keep your data on premises so that Power BI basically becomes a web front end ONLY, just as your own data center's DMZ would be. This is a PRO feature via the use of the Enterprise Gateway and requires building reports in Direct Query mode.


Another feature of Power BI is the use of Express Route. This will send Power BI traffic over an MPLS line through your network carrier that is completely private. So instead of the data movement over HTTPS going across the public internet, it can go through this private tunnel the same way your data center likely already communicates with its disaster recovery location that is hundreds of miles away. Think of it as an extension of your own data center. For more information about Express Route and Power BI, please see the link below:
https://powerbi.microsoft.com/en-us/documentation/powerbi-admin-power-bi-expressroute/

With the data security controls in place and the option to keep specific data sets on premises if necessary, you need to ask the question “what is it that you don’t trust?” These are the same controls you would put in place for your most stringent security requirements for on premises applications.

An area where Microsoft likely goes beyond what is being done in your internal data center is the use of “Red Teaming”. This blog and white paper discusses this technique in detail but it is basically a group of security experts that try to penetrate the Microsoft cloud environment so that the threat detection and response measures can be constantly improved.

Auditability

<<UPDATE as of 8/26/2016, auditing feature is now available, see this video blog on this feature>>

Your CISO will want to know how user activity can be tracked/monitored in the Power BI environment even though it is managed by Microsoft and external to your data center.

It is likely that there are mission critical internal systems used in your organization that do not have the level of auditability that your CISO will want from a software as a service (SaaS) application. That may be understandable as it is the “fear of the unknown” as your organization does not control the environment in which Power BI is hosted.

Activities such as monitoring user logins to the system can be achieved through the O365 logging activities discussed here.

At the time of this writing there are no additional activities that can be logged for Power BI usage, such as report views and data set usage; however, this looks to be coming very soon. See the link below on the ideas site; it has been placed in the “Started” status, which means it should be delivered in the coming months.
https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/13279371-better-auditing-of-user-activity-datasets-report

It is easy to forget, with all the capabilities that Power BI has to offer, that it has been generally available for less than a year (July 24th is the anniversary date). So it is somewhat understandable that these deeper administrative functions such as audit logging are just now being delivered. It will be up to your security team whether they want to allow the use of Power BI given all of the security features I have mentioned above, or choose to wait until the additional auditing capability is in place. IMHO this is really one of the last steps to making Power BI a solution that your CISO should feel very comfortable with.


API Strategies with Power BI

API (Application Programming Interface). You surely have heard the term from your enterprise architects telling you that all data will be accessed through “the API” and you will no longer access data from the source. They will build all the interfaces for you and you won’t need that source connectivity anymore.


Enterprise architects are really smart (I used to be one), ha. In all of the wisdom we share as architects, the confession is that making data accessible to our business analysts was never our top priority… All of the design documentation that we wrote started with “end user experience” and had a lot of content around “3 Tier Architecture”. Reporting and Business intelligence usually ended up as the last few paragraphs after a few dozen sections of content on how the application would work and interface with all the back-end systems.

I am writing this to give the perspective that, more often than not, data connectivity is not at the forefront of most application designs. I do believe that is starting to change as data becomes a financial currency to most organizations, but nonetheless, these “pesky” APIs still stand between you and the data you need to access.

So, you have been given a nice long URL that starts with “https://” to access the data you need. Now what?

In this article, I will show how to connect to your APIs through Power BI in a few different ways. I will start with the OData protocol, then dive deep into APIs that are secured by the popular OAuth protocol, and end with how well Power BI traverses the nested JSON documents that get returned.

Connecting to OData Services

The API strategy best suited to connecting business intelligence tools is the OData protocol. When your organization’s APIs are presented as an OData feed, they follow a standard that Power BI understands natively. Unlike the older SOAP style of web service, OData is built on REST, so the feed is consumed over ordinary HTTP requests.

There is a public ODATA feed for the Northwind sample data set here:
http://services.odata.org/northwind/northwind.svc/

To connect to OData feeds, simply go to “Get Data” -> “OData Feed” and enter the URL.

Get Data ODATA

As this is a standard format for APIs, the benefit is that you are presented with a table list from the service just like you would be if you were connecting directly to a database.

ODATA Table Navigator

 

Most APIs will have additional request parameters that can be added to get more specific data. The below URL connects to the same service as above but returns only a single record:
http://services.odata.org/Northwind/Northwind.svc/Orders(10643)

ODATA to table

You have just successfully connected to your first API. From this point forward, OData behaves the same way as other data sources and you can start shaping data in the “Edit Queries” window.
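
For reference, here is a minimal M sketch of the kind of query this produces; the generated steps in your own Query Editor may look slightly different, and the column selection below is just an illustration:

let
    // Point OData.Feed directly at the Orders entity set of the public Northwind service.
    // Using the service root URL instead would return a navigation table of all entity sets.
    Orders = OData.Feed("http://services.odata.org/Northwind/Northwind.svc/Orders"),
    // Trim to a few columns to illustrate shaping the data in the Query Editor
    Trimmed = Table.SelectColumns(Orders, {"OrderID", "CustomerID", "OrderDate"})
in
    Trimmed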

If you can influence your organization to adopt this standard for APIs, your life will be much easier when consuming them. However, the purpose of this blog site is to help you use technologies such as Power BI in an enterprise scenario, and it is most likely that your APIs are not OData compliant… So let’s get on with the FUN stuff.

Non-OData APIs

After you ask your architect if the API you are trying to connect to is OData compliant and she tells you NO, the next alternative is to use the Web data source.

connect to web

When connecting to a web URL for your internal (or external) API, it is likely going to be secured and will have a prefix of “https” instead of “http”. In our OData example above, I did not connect to a secured service, but had I, the below dialog would have been presented.

connect to web 2

There are 5 different authentication methods that are quite easy to connect with. It is possible that internal APIs use Basic or Windows credentials. Someone will have to provide you a user/password combination for connecting with Basic authentication or, if using Windows credentials, verify that your own Windows account has been granted access.

If you are connecting with one of these five authentication types, it is pretty straightforward, and you can skip ahead to the end of this post where I show how to traverse the JSON documents that will likely be returned as a result.

Connecting to OAuth Services

Many organizations are moving towards the OAuth authentication protocol. Notice in the screenshot above that this protocol is not natively supported in Power BI. It has an extra layer of complexity because it requires two services: one service is called with an API key and an API secret and returns a token, which is then used with all subsequent requests to the resource URL.

We will get into the weeds rather quickly in the rest of this blog post, so before the half of you that will stop reading at this point do so, please go to the Power BI Ideas site and VOTE for the OAuth protocol to be added as a first-class authentication provider. Here is the link: https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/13195278-oauth

Also, the below content is available in a Power BI Template file here. Read the content below but feel free to download the template to save a lot of the grunt work.

Setting up a Twitter App for our OAuth API Example

First of all, Twitter connectivity via Power Query has already been blogged about a few times. The most useful post I found on it was by Chris Koester here. The script I am using later in this article is nearly identical to what I learned from his post. I will talk through the necessary script that has to be written, but also focus on how you can traverse the JSON data that gets returned from most modern APIs you will encounter.

Twitter uses OAuth2 for authentication, which will require you to set up a free developer account if you have not already.

Go to https://dev.twitter.com and sign in with your Twitter account. You will then need to create a Twitter application, which will give us the API Key and API Secret that will be used to authenticate with the Twitter Search API.

In the new Twitter UI, the applications section is a little hard to find. At the bottom of https://dev.twitter.com are some links; on the right-hand side you will see “Manage Your Apps”.

Twitter UI

You will see a “Create New App” button that will take you through a dialog to set up your application.

Callback URL issue with OAuth based APIs

In the application creation dialog, you will notice a Callback URL entry that we will leave blank for this example.

callback url issue

In a traditional OAuth flow that has a UI associated with it, this is the URL that the flow will return user interaction back to after the user agrees to allow the application to act on their behalf. This blog is not meant to go into that detail but you can review this link for more information.

Because with OAuth we have to get an authentication token from the first service we call, there is an issue with some APIs: they only pass this token back to the Callback URL. Power BI cannot read a response returned to a callback URL, so if the API you are using requires that “authentication flow” then it can’t be used with Power BI (well, that is not entirely true, but to prevent this blog from being 50 pages long, we will say it can’t). For instance, the LinkedIn API requires this flow, so it is not something we can connect to from Power BI easily.

As long as the token is returned in the response from the first authentication service call (as the Twitter API does), we will be able to integrate it with Power BI.

Connecting to the Twitter Search API

Once you have entered the data to create an application, you will land on your “App details” screen. Navigate to the “Keys and Access Tokens” tab and copy the API Key and Secret so they can be used later.

Note: These should be treated like passwords and not made public

twitter access tokens

Back in Power BI, go back to the Web data source and type in an arbitrary web URL, as we will be replacing all of this with a script. I used http://twitter.com and was presented with the navigator dialog to select a table from that page. Click “Edit” to open the Query Editor.

connect to arbitrary table

This is a good time to take advantage of a new feature that was added in the April Power BI Desktop update: Parameters. For deeper insight on this feature, check out the Power BI blog entry.

manage parameters

Create 4 new parameters as follows:
API KEY: <<your API Key>>
API Secret: <<your API Secret>>
Token URL: https://api.twitter.com/oauth2/token
Search URL: https://api.twitter.com/1.1/search/tweets.json?q=Analytics&count=100

Notice the Search URL has a couple of query string parameters. In my example we will be searching on the word “Analytics”

Once the parameters are entered you will see them in the queries list with their values along with the arbitrary table you pulled in from the Web data source

query editor parameters

 

Advanced Editor

Now that we have our parameters in place, navigate back to the Table Query and click on “Advanced Editor”

advanced editor

Once in the advanced editor, replace the script with the following:
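
A minimal sketch of that script is shown below. It is modeled on the pattern from Chris Koester’s post and uses the four parameters created above; other OAuth services may need different headers or grant types, so treat this as a starting point rather than a drop-in implementation.

let
    // Build the Basic authorization header from the API KEY and API Secret parameters
    AuthKey = "Basic " & Binary.ToText(
        Text.ToBinary(Uri.EscapeDataString(#"API KEY") & ":" & Uri.EscapeDataString(#"API Secret")),
        BinaryEncoding.Base64),

    // First Web.Contents call: POST to the token service to obtain a bearer token
    TokenResponse = Json.Document(Web.Contents(#"Token URL",
        [Headers = [#"Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8",
                    #"Authorization" = AuthKey],
         Content = Text.ToBinary("grant_type=client_credentials")])),
    AccessToken = TokenResponse[access_token],

    // Second Web.Contents call: GET the search results using the bearer token
    TwitterSearch = Json.Document(Web.Contents(#"Search URL",
        [Headers = [#"Authorization" = "Bearer " & AccessToken]]))
in
    TwitterSearch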

In the above code, there are two calls to the Web.Contents function. The first calls the token service to get a token, which is then used in the second call to actually get results.

OAuth is a preferred authentication method because the token returned from the first service is only valid for a finite period of time. So if you were to manually generate this token and just call the search service with it, it would no longer be valid the next time you tried to refresh the data.

That is why the script above is so powerful as it handles the token generation so that you have a repeatable automated process. Kudos to @cjkoester for putting this script together.

A couple of other things to note in the script: our parameters are referenced with the “#” prefix, and not all OAuth services will follow the same signature as above, but it can be used as an example and ideally your development team can tell you what changes you need to make to the pattern for it to work.

I renamed the query to “TwitterSearch”. The result should now look similar to this.

query editor twitter search start

 

Working with the JSON response

JSON (JavaScript Object Notation) is the most popular format used for transmitting data in modern APIs. XML is also still widely used, and if your APIs return XML it is likely many of these same steps will apply.

I am really impressed with the JSON parser that is built into the query editor of Power BI. For instance, when you look at the results of the JSON that gets returned in the Twitter API that we just connected to, it looks like this:

JSON via Fiddler

I can count 6 levels of nesting in the small portion of the JSON response I am showing above (via Fiddler). The Power BI Query Editor handles this with no issues.

We will create two separate queries from this response, one for the statuses and the other for the hashtags.

Taking the TwitterSearch result, the two top level entities are “statuses” and “search_metadata”. Click the “List” value for “statuses” and this will expand the record set for “statuses”.

query editor to table

Notice the “To Table” option at the top. By converting this list to a table, we will be able to expand each Record into all of its values. Click “OK” in the To Table dialog and you will see the magic expand button pop up in the column header.

query editor expand button

Click the button and then uncheck “Use original column name as prefix” to avoid “Column1” showing up as a prefix to every column created.

query editor choose columns 1

Click Ok.

Congratulations! You now have the last 100 statuses for the search term “Analytics” that we used in the search URL string we entered as the parameter above: https://api.twitter.com/1.1/search/tweets.json?q=Analytics&count=100
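
If you prefer to build the statuses table as its own query that references TwitterSearch (rather than modifying TwitterSearch in place as we just did), the generated M looks roughly like the sketch below; the expanded field list is just a sampling, so pick whichever columns matter to you.

let
    // Grab the list of status records from the parsed JSON response
    Source = TwitterSearch[statuses],
    // "To Table" turns the list into a single-column table of records
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    // Expand a few of the fields from each status record (no column name prefix)
    Statuses = Table.ExpandRecordColumn(#"Converted to Table", "Column1",
        {"created_at", "text", "user", "entities"}, {"created_at", "text", "user", "entities"})
in
    Statuses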

Let’s use this same result to create an additional query for hashtags. Simply right click on the “TwitterSearch” query and select “Duplicate”

query editor duplicate

Rename the query “Hashtags”.

Find the “Entities” column and click the magic expand button again.

query editor expand entities

On the column expansion dialog simply click “Ok” again to see all the results of the nested JSON document.

query editor expand hashtags

Click the magic expand button again on “hashtags”.

This will bring in the null values and the “List” links will become “Record” links.

Finally click the magic expand button one last time to be greeted by the hashtag records we wanted.

query editor hashtags expanded

Right click “text.1” and rename it “hashtag”.

Now we can use the “Choose Columns” dialog to only select a few of the columns to reduce our query to just the information important to the hashtags.

query editor choose columns
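
For reference, the equivalent M for the Hashtags query looks roughly like the sketch below. It assumes the statuses table from the earlier sketch was kept as its own query named Statuses, and the chosen columns are just examples.

let
    // Start from the statuses table built in the earlier sketch
    Source = Statuses,
    // Expand the nested entities record to get at its hashtags list
    #"Expanded entities" = Table.ExpandRecordColumn(Source, "entities", {"hashtags"}, {"hashtags"}),
    // Each status can have multiple hashtags, so expand the list into one row per hashtag
    #"Expanded hashtags" = Table.ExpandListColumn(#"Expanded entities", "hashtags"),
    // Expand the hashtag record and rename its "text" field to "hashtag"
    #"Expanded hashtag records" = Table.ExpandRecordColumn(#"Expanded hashtags", "hashtags", {"text"}, {"hashtag"}),
    // Keep only the columns we care about for cross-filtering
    Hashtags = Table.SelectColumns(#"Expanded hashtag records", {"created_at", "hashtag"})
in
    Hashtags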

From here you can “Close and Apply” your Query Editor window and you will have two queries: one for the statuses of your Twitter search and the other for the hashtags that you can cross-filter against.

A simple viz may look like this

viz

Power BI Template

Another new feature in the April update of Power BI is the Power BI Template file. I have saved all of the above work as a template file, created by clicking the Export option from the File menu, and am sharing it here.

template screen shot

This retains all of the work that has been completed above but doesn’t retain any of the data. So when you open this template file, it will prompt you for the parameters that were discussed above.

Here is the template file link again. Angry Analytics Twitter API Example

Conclusion

In closing, this article was intended to show you multiple ways to connect to APIs, as it is likely your organization has set standards around data connectivity that no longer allow connecting directly to the source.

Be an advocate for a Data Culture. Putting data in the hands of more people as quickly as possible will result in the ability to achieve more. Challenge your development teams to make data accessible. Data is currency!

Power BI Content Workflow

I work with many enterprise-level organizations in the Midwestern United States. There is a lot of excitement around Power BI; however, I often run into challenges regarding features that are expected of a business intelligence solution that can meet the needs of an enterprise, such as governance, administration, and lifecycle management.

pendulum

In Business Intelligence, there is a pendulum that swings between self-service and governance that wreaks havoc on most organizations of reputable size. I can confidently say there is little doubt that, from an end user feature set, Power BI competes well with any solution on the market. With a FREE and a PRO offering at only $9.99 a month per user, the price point is hard to beat. But what about total cost of ownership (TCO)? What about enterprise deployments of Power BI? I intend to use this blog to help answer these questions regarding Power BI, SSAS Tabular models, and an overall agile approach to business intelligence within the enterprise.

Let’s start by discussing lifecycle management and taking reports through development, testing, and production, similar to how you would deploy a codebase for an application.

power bi cheapskate

First of all, we need to ensure that we are working with the PRO version of Power BI. Yes, if you want Enterprise level features, you are going to have to pay for it. Two PRO features are being used below: Power BI Group Workspaces and Content Packs.

Group Workspaces in Power BI allow for multiple users to “co-author” reports and share ownership over the Power BI content. You can easily create a group workspace from the left hand navigation bar.

power bi groups

Once a group has been created, you can assign users either “Edit” or “View Only” access. This is an important distinction: the users with “Edit” access will be “co-authoring” content, while the “View Only” users will be testing and doing quality assurance.

power bi groups add user

 

Once a group has been established with distinction between the content developers and the testers, the below diagram can be followed to achieve a content management workflow.

power bi content mgmt workflow

This diagram is also provided in pdf form below:
Power BI Content Workflow

In this workflow, OneDrive for Business (ODFB) is used as a collaboration space for PBIX files. Alternatively, this could also be a source control environment or a local file share. A benefit of O365 deployments for Power BI is that when a group is created, an ODFB directory is also established for sharing.

The group workspace is used for collaboration between content authors and testers. Once development and testing iterations are complete, a Content Pack should be produced to distribute the production version of the dashboard, reports, and datasets to the rest of the organization. Content Packs are created by going to “Get Data” -> “My Organization” -> “Create Content Pack”.

power bi content pack create

IMPORTANT: It is critical in an enterprise scenario that personal workspaces are not used to publish content packs. Doing so creates a dependency on a single author having full control over the content, and scenarios will consistently pop up where that author is not available when updates are needed. Enterprise deployment scenarios can’t be achieved when content is tied to a personal workspace. Establish a group workspace structure that makes sense for your organization, as pointed out in the Pre-Requisite text on the workflow diagram.

Making Revisions

As revisions occur to the content in the group workspace, a notification appears in the upper right reminding you that there is a content pack associated with the dashboard or report being modified and that the content pack needs to be updated to publish the changes.

power bi content pack changes

I have talked with some organizations that originally considered bypassing content packs and just using group workspaces, with “View Only” users receiving the content. This would not allow for the option to iterate on changes before they are published to end users.

Once the iterations have been completed and the next revision is ready to publish, anyone with “Edit” privileges in the group can go to Settings (the gear icon in upper right) and “View Content Packs”.

power bi content pack update

A warning icon is displayed next to the content pack name indicating there are pending changes. Once you click “Edit” you will have the option to “Update” the content pack, at which point the changes will be distributed to those that have access to it.

power bi content pack update button

Connection Strings and Separation of Duties (SoD)

The above workflow still leaves a couple of desired features without answers.

With group workspaces, we have separated the ability to develop and test report changes from publishing them to the rest of the organization, but how do we transition from a DEV/TEST data source to a PROD data source? This has to be achieved within the Power BI Desktop file (as of the time this blog post was written). If it is a new calculation or set of tables being tested, collaboration will have to occur between the development team and the testers as to when to change the connection strings from DEV/TEST to PROD sources and republish the PBIX file. One way to reduce that friction is to drive the connection from query parameters, as sketched below.
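
A minimal sketch of that approach, assuming a SQL Server source and two hypothetical parameters named ServerName and DatabaseName (swap in whatever sources and names your environment actually uses), so that moving between environments is a parameter change rather than a script edit:

let
    // ServerName and DatabaseName are Power BI parameters (hypothetical names),
    // e.g. ServerName = "sql-dev01" for DEV/TEST and "sql-prod01" for PROD
    Source = Sql.Database(ServerName, DatabaseName),
    // Navigate to a table; the schema and table names here are placeholders
    Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data]
in
    Sales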

In my experience, often on the reporting and analytics side of things, production connectivity is required by the “power users” in each department that are doing the quality assurance as they can only tell if new report features will pass the “sniff test” when they are connected to data that they know. DEV/TEST environments often grow stale and don’t hold as much value on the reporting/analytics side as they do for application development. Connection string management will be enhanced in Power BI as the product matures, but even then, limited value will be achieved without the production data to verify changes.

Lastly, when we look at the fact that a group workspace only has an “Edit” and a “View” role and lacks a “Publish” role, it may be of concern that a content author can push changes to content packs without separation of duty (which might require someone else to actually publish the change). If this is very critical to your business, then separation of duty can be achieved by adding another group workspace to the mix that only publishers have access to, but I would suggest this is often overkill. Because reporting cannot “alter” any underlying data, it does not usually require controls as strict as those protecting a transactional application from malicious code being deployed.

And remember, a significant portion of any enterprise organization is still running its business from Excel spreadsheets produced and managed by the “power users” in each department. They simply extract data from source systems and create Excel reports that often go all the way to the CFO or CEO of the organization. So as we come back to the initial comment in this blog regarding the pendulum of self-service vs governance, remember that these concerns probably already exist in your organization today with much less visibility than they would have when using the proposed workflow mentioned above.

 

Feel free to leave any comments on this article or suggestions for future articles that touch on the use of Microsoft BI solutions such as Power BI for enterprise deployment.