How to Train your MAML–Refining the data

by Guest Post on September 21, 2014

in SQL Server

In my last post we looked at how to load data into Microsoft Azure Machine Learning (MAML) using the browser-based ML Studio.  We also started to look at the data for predicting delayed flights and identified some problems with it. This post is all about getting that data into the right shape, so that the predictive algorithms in MAML have the best chance of giving us the right answer.

Our approach is threefold:

  • To discard the data we don’t need: columns that aren’t relevant or are derived from other data, and rows where data is missing for the columns we keep (features, in machine learning speak).
  • To tag the features correctly as being numbers or strings, and as categorical or not.  Categorical in this context means that the value puts a row in a group rather than being continuous. AirportID is categorical, as it puts a row into a group of rows for the same airport, and AirportID 1 has nothing to do with ID 3 or 4. Temperature, by contrast, is a continuous variable whose values do represent points on a line.
  • To join the flight delay dataset to the weather dataset on the airport and the date/time. In my last post I mentioned that we could join the weather data in twice, once to the departure airport and once to the arriving airport, and indeed the sample experiment on flight delay prediction does exactly this. I think a simpler approach is to join the weather only at the arrival airport and model the arrival delay on the fact that some flights have a delayed departure time, which may or may not be influenced by the weather at the departure airport.

Let’s get started..

Open ML Studio, create a new experiment, give it a suitable name and drag the Flight Delays and Weather datasets onto the design surface so it looks like this..

image

Clean the data

As before, we can right click on the circle at the bottom of a dataset and select visualize data to see what we are working with. For example, here’s the weather data.

image

What is odd here is that the data is not properly typed: some of the numeric data is in columns marked as string, such as the temperature columns in the weather dataset.  I spent ages trying to work out how to fix this, and the answer turns out to be the Convert to Dataset module, which does this automatically.  So our first step is to drag two of them onto the design surface and connect one to each of our datasets..

image

If we run our model (run is at the bottom of the screen) we can then visualize the output of the convert to dataset steps and now our data is correctly identified as being numeric etc.
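To make the idea concrete, here is a minimal Python sketch of what inferring numeric types from string columns involves. It only illustrates the concept and is not how the Convert to Dataset module is implemented. The `infer_value` helper, the sample rows and the `DryBulbCelsius` and `SkyCondition` column names are assumptions made for the example.

```python
def infer_value(text):
    """Return an int or float when the text parses cleanly, None for blanks, else the string."""
    stripped = text.strip()
    if stripped == "":
        return None  # assumption: a blank cell means a missing value
    for caster in (int, float):
        try:
            return caster(stripped)
        except ValueError:
            continue
    return stripped  # not numeric, so leave it as a string


# Hypothetical weather rows where every value arrived as a string.
rows = [
    {"AirportID": "14843", "DryBulbCelsius": "27.2", "SkyCondition": "FEW025"},
    {"AirportID": "14843", "DryBulbCelsius": "", "SkyCondition": "CLR"},
]
typed_rows = [{name: infer_value(value) for name, value in row.items()} for row in rows]
```

After this step the temperature is a float that can be used in arithmetic, and the blank reading is flagged as missing rather than silently kept as an empty string.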

The next step is to get rid of any unwanted columns, and this is simply a case of using the Project Columns module (to find it just use the search at the top of the modules list).  You can either start with a full list of columns and remove what you don’t need, or start with an empty list and add in what you do need. So let’s drag it onto the design surface and then drag a line from the Flight Delays data to it.  It’ll have a red X against it as it’s not configured yet, and we can configure it from the select columns option on the task pane.

image

Here I have selected all columns and then excluded Year, Cancelled, ArrDelay, DepDelay15, and CRSDeptime.  At this point we can check that we get what we wanted by clicking the run button at the bottom of the screen.
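The "start with everything and exclude" approach can be sketched in a few lines of Python. This is an illustration of the idea rather than anything ML Studio runs; the `project_columns` helper and the sample row are assumptions, while the excluded names are the ones listed above.

```python
# Column names excluded in the post, spelled as they appear there.
EXCLUDED = {"Year", "Cancelled", "ArrDelay", "DepDelay15", "CRSDeptime"}


def project_columns(rows, excluded):
    """Keep every column except the excluded ones, preserving column order."""
    return [{name: value for name, value in row.items() if name not in excluded}
            for row in rows]


# Hypothetical flight row for illustration only.
flights = [{"Year": 2013, "Month": 4, "Carrier": "DL", "DepDelay": 12, "Cancelled": 0}]
projected = project_columns(flights, EXCLUDED)
```

Starting from an empty list and adding columns is the same filter with the test inverted (`if name in included`).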

Note: we are only charged for computation time when we run something in ML Studio; the rest of the time we are just charged for the storage we are using (for our own datasets and experiments).

As before at each stage we can visualize the data that’s produced by right clicking on its output node..

image

Here we can see that one column, Depdelay, has missing values, so the next thing we need to do is get rid of those. We can use the Missing Values Scrubber module for this, so search for it, drag it onto the design surface and drag a connector from the output of the Project Columns module to it.  We then need to set its properties to say how to deal with the missing values.  As we have such a lot of clean data we can simply discard any rows with missing values by setting the top option to remove entire row..

image

We can now run the experiment again to check we have no more missing values.
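The "remove entire row" behaviour amounts to a simple filter. The sketch below assumes missing values are represented as `None`, which is a choice made for this example and not something the post specifies; the helper name and sample rows are also hypothetical.

```python
def remove_rows_with_missing(rows):
    """Drop any row in which at least one column is missing (None)."""
    return [row for row in rows if all(value is not None for value in row.values())]


# Hypothetical rows: the second one has no departure delay recorded.
flights = [
    {"Month": 4, "Carrier": "DL", "DepDelay": 12},
    {"Month": 4, "Carrier": "AA", "DepDelay": None},
]
clean = remove_rows_with_missing(flights)
```

Dropping rows is only reasonable because, as noted above, plenty of clean data remains; with a small dataset you would more likely fill the gaps instead.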

Now we need to do some of this again for the weather dataset. We add another Project Columns module to select the columns we need; this time I am starting with an empty list and specifying which columns to add..

image 

and the Missing Values Scrubber module, again set to remove the entire row…

image

Tag the Features

Now we need to change the metadata about some of the columns to ensure ML Studio handles them properly. Remember that some of the numbers in our data are codes rather than continuous values, for example the airport codes and the airline code. We need to tell ML Studio that these are categorical by using the Metadata Editor module. To do this we are going to cheat, which shows you another feature of ML Studio, by simply copying that module from another experiment.  Open another browser window, go to the ML Studio home page and navigate to the flight delay prediction sample experiment.  Find the Metadata Editor module there and copy it to the clipboard, then go back to the browser window with our experiment and paste it in. You should see that this module is set to make Carrier, OriginalAirportID and DepAirPortID categorical…

image
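Why does the categorical tag matter? One common way for a learning algorithm to consume a categorical column is one-hot encoding, where each distinct code becomes its own 0/1 indicator so that no ordering between codes is implied. The sketch below shows that idea only; it is not a description of what ML Studio does internally, and the `one_hot` helper and the airport IDs are made up for the example.

```python
def one_hot(values):
    """Return the sorted distinct levels and one 0/1 indicator row per input value."""
    levels = sorted(set(values))
    encoded = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, encoded


# Hypothetical airport IDs: the numbers are labels, not quantities.
airport_ids = [12478, 10397, 12478, 14843]
levels, encoded = one_hot(airport_ids)
```

Treated as a plain number, airport 14843 would look "bigger" than airport 10397; encoded this way, each airport is simply a separate group.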

Join the datasets

Now that we have two sets of clean data we need to join them.  They both have an airport ID, month and day, and the flight delay dataset has an arrival time to the nearest minute.  However, the weather data is taken at 56 minutes past the hour every hour and is in local time with a separate time zone column. So what we need to do is round both the flight arrival time and the weather reading time to a whole hour, as follows:

For the flight delay arrival time

1. Divide the arrival time by 100

2. Round down the arrival time to the nearest hour

For the weather data time

3. Divide the weather reading time by 100 to give the local time in hours

4. Round up to the nearest hour
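The four steps above are plain arithmetic on times stored as hhmm integers, and can be sketched in Python as follows. The function names are hypothetical; in ML Studio the same thing is done with modules rather than code.

```python
import math


def flight_hour(crs_arr_time):
    """Steps 1 and 2: divide the hhmm arrival time by 100, then round down."""
    return math.floor(crs_arr_time / 100)


def weather_hour(reading_time):
    """Steps 3 and 4: divide the hhmm reading time by 100, then round up."""
    return math.ceil(reading_time / 100)


# A flight scheduled to arrive at 14:37 and a weather reading taken at 13:56
# both end up in hour 14, so the two rows can be matched.
arrival = flight_hour(1437)
reading = weather_hour(1356)
```

One thing this sketch makes visible: a reading taken at 23:56 rounds up to hour 24, which cannot match any flight hour (those run from 0 to 23). The post does not say how that case is handled, so it is worth checking in your own data.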

So how do we do that in ML Studio? The answer is one step at a time, making repeated use of the Apply Math Operation module.  Help is pretty much non-existent for most of these modules at the time of writing, so experimentation is the name of the game, and I hope I have done that for you. We’ll place four copies of the Apply Math Operation module on the design surface, one for each step above (so two linked to the weather dataset and two to the flight delay dataset)..

image

Notice the comments I have added to each module (right click and select add comment). Here are the settings for each step..

Step 1

image

Note the output mode of Inplace, which means that the value is overwritten and we get all the other columns in the output as well, so make sure this is set for each of the four steps.

Step 2

image

Step 3

image

Step 4

image

Now we can use the Join module (again just search for Join and drag it onto the design surface) to connect our data sets together.  Not surprisingly this module has two inputs and one output and we’ll see several modules with multiple inputs and outputs in future.  Connect the last module in each of our data set chains into the join and set the properties for the join as shown..

image

So on the left (the flight data) we have Month, DayofMonth, CRSArrTime, DestAirportID and on the right (the weather data) we have Month, Day, Time, AirportID.
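For readers who think in code, the join on those four column pairs can be sketched as below. It assumes an inner join (the join type is set in the module properties and is not stated in the text) and assumes both time columns have already been reduced to whole hours. The `inner_join` helper and the sample rows are hypothetical.

```python
from collections import defaultdict

# Key columns from the post: left (flight data) and right (weather data), in matching order.
FLIGHT_KEYS = ("Month", "DayofMonth", "CRSArrTime", "DestAirportID")
WEATHER_KEYS = ("Month", "Day", "Time", "AirportID")


def inner_join(flights, weather):
    """Pair each flight with every weather reading that shares its four key values."""
    index = defaultdict(list)
    for reading in weather:
        index[tuple(reading[k] for k in WEATHER_KEYS)].append(reading)
    joined = []
    for flight in flights:
        for reading in index.get(tuple(flight[k] for k in FLIGHT_KEYS), []):
            merged = dict(flight)
            # Add only the non-key weather columns; the keys are already on the flight row.
            merged.update({k: v for k, v in reading.items() if k not in WEATHER_KEYS})
            joined.append(merged)
    return joined


# Hypothetical rows, with times already rounded to whole hours.
flights = [{"Month": 4, "DayofMonth": 19, "CRSArrTime": 14, "DestAirportID": 10397, "DepDelay": 12}]
weather = [
    {"Month": 4, "Day": 19, "Time": 14, "AirportID": 10397, "DryBulbCelsius": 27.2},
    {"Month": 4, "Day": 19, "Time": 15, "AirportID": 10397, "DryBulbCelsius": 26.0},
]
joined = inner_join(flights, weather)
```

With an inner join, any mismatch in how the keys are typed or rounded gives no matches at all, and a key that appears twice on the weather side produces two output rows for one flight.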

I have to be honest: it took a while to get here, and initially I got zero rows back.  Even now it’s not quite perfect, as I get slightly more rows than I started with, which I have tracked down to the odd hour in the weather data having two readings.  Finding that kind of data problem is beyond what you can do in ML Studio in the preview, so in my next post I’ll show you your options for examining this data outside of ML Studio.
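As a rough illustration of the kind of check involved, the Python sketch below counts how often each join key occurs in the weather data and reports those that occur more than once. It is one possible approach, not necessarily the one used in the follow-up post; the helper name and sample rows are hypothetical.

```python
from collections import Counter


def duplicate_keys(weather, keys=("Month", "Day", "Time", "AirportID")):
    """Return the join keys that appear more than once, with their counts."""
    counts = Counter(tuple(reading[k] for k in keys) for reading in weather)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical readings: hour 14 has two readings for the same airport and day.
weather = [
    {"Month": 4, "Day": 19, "Time": 14, "AirportID": 10397, "DryBulbCelsius": 27.2},
    {"Month": 4, "Day": 19, "Time": 14, "AirportID": 10397, "DryBulbCelsius": 27.0},
    {"Month": 4, "Day": 19, "Time": 15, "AirportID": 10397, "DryBulbCelsius": 26.0},
]
duplicates = duplicate_keys(weather)
```

Every key reported here would turn one flight row into two or more joined rows, which is consistent with the extra rows described above.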


Insufficient data from Andrew Fryer
