Regular Expressions will save your life!

I am closing out 2017 with a refreshing project that has led me away from Power BI for a bit. However, even for the Power BI community, I think the below information is valuable because at some point, you are going to run into a file that even the M language (Power BI Query Editor) is going to really have a hard time parsing.

For many of you, its still a flat file world where much of your data is being dropped via an FTP server and then you have a process that parses it and puts it in your data store. I recently was working with a file format that I have no idea why someone thought it was a good idea, but nonetheless, i am forced to parse the data. It looks like this:

This data simulates truck weigh-in station data. There is a lot of “header” information followed by some “line” items.

Just think for a moment how you would approach parsing this data? Even in Power BI, this would be extremely brittle if we are counting spaces and making assumptions on field names.

What if a system upgrade effecting the fields in the file is rolled out to the truck weigh stations over the course of several months? Slight changes to format, spacing, field names, etc… could all break your process.

 Regex to the Rescue

In my career as a developer, I never bothered to understand the value of regular expressions (regex). With this formatted file I now see that they can save my life (well, that may be dramatic, but they can at least save me from a very brittle pre-processing implementation)

For anyone unfamiliar with regex, a regular expression is simply a special text string for describing a search pattern. The problem is, they are extremely cryptic and scary looking and you want to immediately run away from them and find a more understandable way to solve your text string problem. For instance, a regular expression that would find a number (integer or decimal) in a long string of characters would be defined as

\d+(\.\d*)?

What the heck is that?

The “\d” represents any decimal digit in Unicode character category [Nd]. If you are only dealing with ASCII characters “[0-9]” would be the same thing. The “+” represents at least one instance of this pattern followed by (\.\d*) which identifies an explicit dot “.” followed by another \d but this time with a “*” indicating that 0 to n instances of this section of the pattern unlike the first section that required at least 1 instance of a digit. Therefore this should result in true for both 18 as well as 18.12345. However, regex are greedy by default, meaning it expects the full pattern to be matched. So without adding the “?” to the end of the string, the above regex would NOT recognize 18 as a number. It would expect a decimal of some sort. Because we have included the “?” it will end the match pattern as long as the first part of the match was satisfied, therefore making 18 a valid number.

So, the regex was 11 characters and it took me a large paragraph to explain what is was doing. This is why they are under utilized for string processing. But if you are looking at it the other way, it only took 11 characters to represent this very descriptive pattern. Way cool in my book!

Regex language consistency

My example above was from Python. As i am a “data guy”, i find Python to have the most potential for meeting my needs. I grew up on C# and Java however so understanding regex may have some slight variations between languages. Some interesting links on this are below:

Stack Overflow: https://stackoverflow.com/questions/12739633/regex-standards-across-languages

language comparison on Wikipedia: https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines

Building a Parser using Regex

This file has all kinds of problems. Notice the value formats below:

Temperature: 81 F
Length: 56 Feet
Datetime: 00:02 09-22-2017
Time: 00:03:45
Speed: 33/18 MPH

In addition to text and numeric values, we also have to deal with these additional formats that should be treated as either numeric or datetime values.

I am going to use Python to parse this file and will use a “tokenizer” pattern discussed in the core Python documentation for the re (regex) library: https://docs.python.org/3.4/library/re.html

This pattern will allow us to assign a “type” to each pattern that is matched so we do not have to count spaces and try to look for explicitly named values which could break with any slight modifications to the file.

Below is a function that returns a named tuple with values for the type, the value, the line, and the column it was found in the string.

In my list of token specifications, i have included the most restrictive matches first. This is so that my value for “56.0 Feet” won’t be mistaken for “56.0 F” which would have it identified as a TEMP instead of LENGTH. (I should also be accounting for Celsius and Meters too but i am being lazy)

Let’s look a bit closer at a couple more of these regex.

        (‘ASSIGN’r‘: ‘),           # Assignment operator

The assign operator is very important as we are going to use each instance of this to identify a rule that the NEXT token value should be ASSIGNED to the previous token value. The “little r” before the string means a “raw string literal”. Regex are heavy with “\” characters, using this notation avoids having to do an escape character for everyone of them.

        (‘DATETIME’,        r(\d+:(\d+(:\d)*)*)+\s+(\d+-\d+-\d+)),  # Datetime value (ex. 00:00:00  12-12-2017)

Datetime is taking the numeric pattern I explained in detail above but slightly changing the “.” to a “:”. In my file, i want both 00:00 and 00:00:00 to match the time portion of the pattern, so therefore I use a nested “*” (remember that means 0 to n occurrences). The + at the end of the first section means at least 1 occurrence of the time portion, therefore simply a date field will not match this datetime regex. Then the “\s” represents single or multiple line spaces (remember that regex is greedy and will keep taking spaces unless ended with “?”). Then the last section for the date will take any integer values with two dashes (“-“) in between. This means 2017-01-01 or 01-01-2017 or even 2017-2017-2017 would match the Datetime date section. This may be something I should clean up later 🙂

    tok_regex = ‘|’.join(‘(?P<%s>%s)’ % pair for pair in token_specification)

 

I wanted to just quickly point out how cool it is that Python then allows you to take the list of regex specifications and separate them with a “|” by doing the “|”.join() notation. This will result in the crazy looking regex below:

‘(?P<SPEED_IN_OUT>(\\d+(\\.\\d*)?/\\d+(\\.\\d*)?\\s{1}MPH))|(?P<SPEED>(\\d+(\\.\\d*)?\\s{1}MPH))|(?P<LENGTH>(\\d+(\\.\\d*)?\\s{1}Feet))|(?P<TEMP>(\\d+(\\.\\d*)?\\s{1}[F]))|(?P<DATETIME>(\\d+:(\\d+(:\\d)*)*)+\\s+(\\d+-\\d+-\\d+))|(?P<TIME>(\\d+:(\\d+(:\\d)*)*)+)|(?P<ID_W_NBR>(\\d+(\\.\\d*)?\\s([/\\w]+\\s?)+))|(?P<NUMBER>\\d+(\\.\\d*)?)|(?P<ID>([/\\w]+\\s?)+)|(?P<ASSIGN>: )|(?P<NEWLINE>\\n)|(?P<SKIP>[ \\t]+)’

Two important things were done here. We gave each specification the ?P<name> notation which allows us to reference a match group by name later in our code. Also, each token specification was wrapped with parenthesis and separated with “|”. The bar is like an OR operator and evaluates the regex from left to right to determine match, this is why i wanted to put the most restrictive patterns first in my list.

The rest of the code iterates through the line (or string) that was given to find matches in using the tok_regex expression and yields the token value that includes the kind (or type) of the match found and the value (represented as value.strip() to remove the whitespaces from beginning and end).

Evaluating the Output

Now that our parser is defined, lets process the formatted file above. We add some conditional logic to skip the first line and any lines that have a length of zero. We also stop processing whenever we no longer encounter lines with “:”. This effectively is processing all headers and we will save the line processing for another task.

The first few lines processed will result in the following output from the print statements (first the line, then each token in that line)

Site Name   : Chicago IL                  Seq Number     : 111
Token(typ=’ID’, value=’Site Name’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’ID’, value=’Chicago IL’, line=1, column=14)
Token(typ=’ID’, value=’Seq Number’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’NUMBER’, value=’111′, line=1, column=59)
Mile Mrkr   : 304.40                      DB Index #     : 171
Token(typ=’ID’, value=’Mile Mrkr’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’NUMBER’, value=’304.40′, line=1, column=14)
Token(typ=’ID’, value=’DB Index’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’NUMBER’, value=’171′, line=1, column=59)
Direction   : South                       Arrival        : 00:02  09-22-2017
Token(typ=’ID’, value=’Direction’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’ID’, value=’South’, line=1, column=14)
Token(typ=’ID’, value=’Arrival’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’DATETIME’, value=’00:02  09-22-2017′, line=1, column=59)
Speed In/Out: 33/18 MPH                   Departure      : 00:03:45
Token(typ=’ID’, value=’Speed In/Out’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’SPEED_IN_OUT’, value=’33/18 MPH’, line=1, column=14)
Token(typ=’ID’, value=’Departure’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’TIME’, value=’00:03:45′, line=1, column=59)
Slow Speed  : 38 MPH                      Approach Speed : 0 MPH
Token(typ=’ID’, value=’Slow Speed’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’SPEED’, value=’38 MPH’, line=1, column=14)
Token(typ=’ID’, value=’Approach Speed’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’SPEED’, value=’0 MPH’, line=1, column=59)
Approach Length: ~0.0 Feet
Token(typ=’ID’, value=’Approach Length’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’LENGTH’, value=’0.0 Feet’, line=1, column=60)

Notice how everything is being parsed beautifully without having to do any counting of spaces or finding explicit header names. With being able to identify “SPEED”, “TIME”, “LENGTH”, we will also be able to write a function to change these to the proper type format and add a unit of measure column if needed.

The only assumptions we are going to make to process this header information are as below:

1. skip the first line
2. end processing when a non-empty line no longer has an assignment operator of “:”
3. pattern expected for each line is 0 to n occurrences of ID ASSIGN any_type

To handle #3 above, we add the below code to the end of the for loop shown above:

If you follow the logic, we are just taking the string (each line of the file) and recording the value of the first token as the id, finding the assign operator “:”, and then recording the following token value as the value of a dictionary object. It then appends that dictionary to the “ls” list that was initialized in the first code snippet.

We could then format it as JSON by adding the below line of code after the for loop

See output below, some additional formatting work needs to be done with this as well as pre-processing my numbers and date times to not be represented as strings, but that is not the focus of this blog post.

Now What?

I hope to do a continuation of this blog post and explore a server-less architecture of taking the file from the FTP server, immediately running this pre-processing, and dumping the JSON out to a stream ingestion engine. From there, we can do all sorts of cool things like publish real time data directly to Power BI, or into a Big Data store. This follows principles of “Kappa Architecture”, a simplification of “Lambda Architecture” where everything starts from a stream and the batch processing layer goes away.

There are multiple ways to implement this, but with Cloud computing, we have an opportunity to do this entire chain of events in a “server-less” environment meaning no virtual machines or even container scripts have to be maintained. So, lets cover this next time

Conclusion

Regex are super powerful. I ignored them for years and now I feel super smart and clever for finding a better solution to file processing than i would have originally implemented without regex.

The full Python code from above as well as the formatted file can be found on my GitHub here

 

 

 

Working with Scatter Plots in Power BI

I really like some of the advancements that have been made in Power BI scatter plots over the last few months. I wanted to point out some capabilities you may not be using that maybe you should be.

Data Sampling Improvements

In the September 2017 release, you can now be confident that all of your outliers are being shown. No one can visually look at a plot and interpret several thousand data points at once, but you can interpret which of those points may be outliers. I decided to test this out myself between a Python scatter plot of 50k data points and Power BI.

In the test, I used a randomly generated normal distribution of 50k data points to ensure I had some outliers.

(You can see the Python notebook on my GitHub here).

Here it is in Python:

Here it is in Power BI (September desktop release)

Notice that all the outliers have been preserved. Note that in previous releases, the Power BI rendering of this would have been shown as below.

This is a great improvement. To learn more about this update, check out the official blog post on high density sampling: https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-high-density-scatter-charts/

Working with Outliers (Grouping)

Now that we know the dense sampling is preserving our outliers, we can perform some analysis on them. Power BI makes it easy to CTRL+click on multiple outliers and then right-click and add new Group

This will create a new field in your fields list for this group of outliers and will automatically include a Group for “Other” (the other 49.993 data points that weren’t selected). Note that I renamed my field to “High Performers”

As this is a random dataset with integers for x,y values there are no dimensions here that may be interesting to compare, but consider now we can always come back to this grouping for further analysis such as the bar chart below:

Clustering

You can also use “…” in the upper right of the scatter chart to automatically detect clusters. Our example again is a bit uninteresting due to it being a random normal distribution but gives you an idea of how you can cluster data that is more meaningful.

Symmetry Shading and Ratio Lines

These gems were released in the August 2017 desktop release and really helps visualize the skew of your data.

Both of these can be turned on from the analytics tab.

Instead of using our sample dataset above I will use the dataset from my last blog post on Scorecards and Heatmaps.

In the below plot I took the SalesAmount field and plotted on the y axis against the SalesAmountQuota field on the x axis. From Symmetry shading we can observe that none of our sales people are meeting their quota. From the ratio line we can see the few individuals that have a positive variance to the ratio while most are flat or below the ratio.

You can read more about these two features in the August Desktop blog post: https://powerbi.microsoft.com/en-us/blog/power-bi-desktop-august-2017-feature-summary/#symmetryShading

Conclusion

These are just a few of the recently released features that I think have made the native scatter chart in Power BI a very useful visual. I have posted the PBIX file for the normal distribution data on my GitHub if you would like to download: https://github.com/realAngryAnalytics/angryanalyticsblog/tree/master/20171002-scatterplots

 

 

Deep Learning Toolkit considerations for emerging data scientists

Overview

Disclaimer: This blog is my own opinion and not that of my employer, however it should be noted that I am a Microsoft employee and this may reflect that perspective.

Update: New version of these benchmarks is being worked on and can be tracked here: http://dlbench.comp.hkbu.edu.hk/

This post is a departure from my usual focus on Power BI. Enterprise deployment scenarios for Power BI have been a great subject for me. However, In my day job, I do work on a variety of data platform related subjects. These are my findings on deep learning toolkits and what you should know before getting too deep into them, especially pay attention to my section on Keras

Deep learning is popular for image processing (computer vision, facial recognition, emotion detection), natural language processing (sentiment analysis, translation), and even starting to find its way into areas such as customer churn. Neural networks with many layers are used to increase precision of a prediction as opposed to more statistical type algorithms such as linear regression.

There are several popular open source deep learning toolkits including Caffe, Torch, TensorFlow, CNTK (now Cognitive toolkit), and mxnet

This post will mostly reference TensorFlow and CNTK for reasons established in the section on Keras.

Python vs R

This debate will rage on for probably another decade similar to how I remember the Java vs C# debate as a developer in the early 2000’s. From what I have seen, Python appears to have more support in the area of deep learning than R. All but Torch support Python integration while only TensorFlow and mxnet support R directly.

Toolkit Performance

One of the most important aspects of a deep learning toolkit is performance.

Lets consider a couple of scenarios:

In the software development cycle a poorly indexed table could be the difference in 5 seconds and 5 minutes to call the database. This is annoying but is not the critical path to meeting a deadline. It takes many developer hours and iterations to build the code around that database call making that index issue less significant, but of course something that should be addressed.

In deep learning on the other hand, the difference between a model that performs twice as fast as another toolkit could mean the difference between 1 vs 2 days of training time. The iteration cycle is greatly impacted and retraining a model 5 times could be the difference between 1 week and 2 weeks to deliver results. This is significant!

Benchmarking Performance among leading toolkits

Benchmarking State-of-the-Art Deep Learning Software Tools is an academic journal (latest revision February 2017) comparing the most popular deep learning toolkits for CNN, FCN, and LTSM. These acronyms are neural network types you will want to familiarize yourself with if you are not already. There is a new eDX course that is just starting that you can learn all about these concepts.

Below i have included some links that may be to other frameworks but the content explanations seemed more easily understood

Convolutional Neural Network (CNN) – used primarily for image processing. Popular implementations include:
  • AlexNet – an 8 layer CNN circa 2012 that cut error rate nearly in half from previous versions
  • ResNet-50/101/152/etc – A deep residual learning network with 50/101/152/etc layers respectively circa 2015 achieving an error rate of 3.57% which was 4 times improvement from AlexNet

Fully Convolutional Neural Nework (FCN) – variation of CNN that doesn’t include the fully connected layer

Recurrent Neural Network (RNN) & Long Short Term Memory (LTSM) – widely used for natural language processing

This paper is extremely thorough and as our instincts are to scroll immediately to page 7 to start interpreting the bar charts, it is important to note how they ran these tests and gathered the results as described in pages 1-6.

One summary table that doesn’t fully represent all results is shown below.

Shaohuai Shi, Qiang Wang, Pengfei Xu, Xiaowen Chu, “BenchmarkingState-of-the-ArtDeepLearningSoftwareTool”

As you interpret these results, as well as the rest of them in the journal, you will notice three glaring observations

  • There is not one toolkit that has best performance across all neural network types. In fact, there can be wide variation in performance rank for a single toolkit based on # of CPUs or # of GPUs used.
  • Google TensorFlow is arguably the most popular of all of these toolkits, yet the results published in this paper other than in a few cases show it is quite average if not consistently poorer performing than others.
  • CNTK is orders of magnitude better than all of the competition in LTSM

Note on Google TensorFlow

CNTK performs better overall and by orders of magnitude in some cases to TensorFlow. As emerging data scientists start to pick toolkits for deep learning, TensorFlow seems to be a popular choice. In many cases, it will have desirable performance, but to put “all your eggs in one basket” so to speak, may not be the best approach here.

I actually am a fan of TensorFlow and picking a toolkit on performance alone would also not be wise. TensorFlow has some neat features one being TensorBoard that helps visualize the execution graph (note that CNTK also supports TensorBoard). Google has also recently introduced a dedicated TensorFlow processor (TPU) when running on their cloud platform that will surely speed up processing time. But if you are doing NLP (natural language processing), it is quite obvious you would want to use CNTK for performance reasons…

What is an emerging data scientist to do?

This is where Keras comes in…

Keras

Keras is an abstraction layer that allows you to run the same code on top of both TensorFlow and CNTK (as well as Theano, another deep learning toolkit) as the backend.

For Big Data people, I would make a correlation between Keras and the use of HIVE as an abstraction layer for Map/Reduce. It is rare to actually write Map/Reduce code anymore with the evolution of libraries around big data, and that is what Keras reminds me of compared to actually writing TensorFlow (or CNTK) code. For instance, TensorFlow on its own actually requires you to write the formula for Mean Squared Error to pass into the model. Although trivial, this is totally annoying and the use of Keras builds a lot of shortcuts for us that makes life much easier and reduces code often by 50%.

In the keras.json file that is created during installation, the backend can be configured by changing one line between “tensorflow” and “cntk”

to verify the backend that is being used, from python simply enter

or at anytime you can access the _BACKEND variable from the same library to see the result

These details are all described clearly on the keras.io site referenced above.

…and for all of the R users, there is a nice CRAN package available too:
https://cran.r-project.org/web/packages/kerasR/vignettes/introduction.html

from my somewhat limited experience, I can say that using Keras on top of TensorFlow or CNTK keeps me from pulling my hair out. Kudos to the creators and contributors to this library. Maybe we can dive deeper into this in a future post.

Transfer Learning

Transfer learning is the ability to take a preexisting model and use it as the base for another model. This allows you to take for instance a model that has classified millions of images and has trained for possibly weeks and apply it to new images that are more specific to your scenario. This allows for more rapid model development if you can build on preexisting work.

CNTK has a really nice tutorial on this technique here:
https://docs.microsoft.com/en-us/cognitive-toolkit/build-your-own-image-classifier-using-transfer-learning

TensorFlow also has it’s own “Inception” library that can be transferred.

This concept is the basis for the next section of “Deep Learning as a Service”

Deep Learning as a Service

I don’t believe this has actually become a term yet. I am just making it up as I go here 🙂

The concept of transfer learning opens some new capabilities to more easily apply your own scenarios to previously trained models.

Microsoft has developed a few interesting services that make deep learning very accessible to end users

One is the Custom Vision Service: https://www.customvision.ai/

This allows you to bring your own images to train on and allows you to reinforce in an iterative approach

Another is Q&A Maker: https://qnamaker.ai/

this allows you to build a bot in minutes to scroll through FAQ and document content on a subject that is important to your organization. This bot can then interact in an intelligent way without having to use a deep learning toolkit or a bunch of coding.

I did one using the Power BI FAQ pages and it worked really well

What is interesting about these services is that it is actually training a model on YOUR data. Not simply tapping into a pre-existing model. You are able to influence the results.

I believe we will continue to see many more services pop up like this that will continue to “democratize” AI for the masses

Conclusion

I would never claim to be a data scientist, but many of us are doing more and more data science like activities. For a person moving from data and business intelligence into machine learning and artificial intelligence, I feel like the above content would have saved me a lot of time. There is plenty of getting started content out there so start using your google/bing search skills to get deeper into it.