I really like some of the advancements that have been made in Power BI scatter plots over the last few months. I wanted to point out some capabilities you may not be using that maybe you should be.
Data Sampling Improvements
In the September 2017 release, you can now be confident that all of your outliers are being shown. No one can visually look at a plot and interpret several thousand data points at once, but you can interpret which of those points may be outliers. I decided to test this out myself between a Python scatter plot of 50k data points and Power BI.
In the test, I used a randomly generated normal distribution of 50k data points to ensure I had some outliers.
#Create a random dataset that has a normal distribution and then sort it (in this case, 50000 data points)
x = np.random.normal(50,25,50000)
x = np.sort(x)
#Create another dataset to put on the y axis of the scatter plot
y = np.random.normal(50,25,50000)
#plot the dataset
(You can see the Python notebook on my GitHub here).
Here it is in Python:
Here it is in Power BI (September desktop release)
Notice that all the outliers have been preserved. Note that in previous releases, the Power BI rendering of this would have been shown as below.
This is a great improvement. To learn more about this update, check out the official blog post on high density sampling: https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-high-density-scatter-charts/
Working with Outliers (Grouping)
Now that we know the dense sampling is preserving our outliers, we can perform some analysis on them. Power BI makes it easy to CTRL+click on multiple outliers and then right-click and add new Group
This will create a new field in your fields list for this group of outliers and will automatically include a Group for “Other” (the other 49.993 data points that weren’t selected). Note that I renamed my field to “High Performers”
As this is a random dataset with integers for x,y values there are no dimensions here that may be interesting to compare, but consider now we can always come back to this grouping for further analysis such as the bar chart below:
You can also use “…” in the upper right of the scatter chart to automatically detect clusters. Our example again is a bit uninteresting due to it being a random normal distribution but gives you an idea of how you can cluster data that is more meaningful.
Symmetry Shading and Ratio Lines
These gems were released in the August 2017 desktop release and really helps visualize the skew of your data.
Both of these can be turned on from the analytics tab.
Instead of using our sample dataset above I will use the dataset from my last blog post on Scorecards and Heatmaps.
In the below plot I took the SalesAmount field and plotted on the y axis against the SalesAmountQuota field on the x axis. From Symmetry shading we can observe that none of our sales people are meeting their quota. From the ratio line we can see the few individuals that have a positive variance to the ratio while most are flat or below the ratio.
You can read more about these two features in the August Desktop blog post: https://powerbi.microsoft.com/en-us/blog/power-bi-desktop-august-2017-feature-summary/#symmetryShading
These are just a few of the recently released features that I think have made the native scatter chart in Power BI a very useful visual. I have posted the PBIX file for the normal distribution data on my GitHub if you would like to download: https://github.com/realAngryAnalytics/angryanalyticsblog/tree/master/20171002-scatterplots