Digital (dis)content: April 2015

Apr 17, 2015

Test drive: real-time prediction in Java with Amazon Machine Learning

Following up on my two previous Amazon ML articles (batch prediction with Redshift and real-time prediction with the AWS CLI), here's a quickie on implementing real-time prediction with the AWS SDK for Java.

At the time of writing, I'm using SDK version 1.9.31:

Sources on GitHub: https://github.com/aws/aws-sdk-java
Javadoc: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/

Here's the Maven dependency:

The code (source on GitHub) is pretty self-explanatory and closely mirrors the CLI example in the previous article.

Get the list of models
Pick one
Build a prediction request: data record + endpoint
Fire the request
Get a result!
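To make the "data record + endpoint" step a bit more concrete, here is a rough sketch of the request body, built with plain Java and no SDK. The three field names (MLModelId, Record, PredictEndpoint) follow the Predict API request shape; the model id and the column names are made up for illustration. The real SDK also signs and sends the request, which this sketch does not attempt.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class PredictPayload {

    // Escapes backslashes and double quotes only: enough for simple field values.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the JSON body: model id, record (column name -> value as text), endpoint URL.
    static String toJson(String modelId, String endpoint, Map<String, String> record) {
        StringJoiner fields = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : record.entrySet()) {
            fields.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return "{\"MLModelId\":" + quote(modelId)
                + ",\"Record\":" + fields
                + ",\"PredictEndpoint\":" + quote(endpoint) + "}";
    }

    public static void main(String[] args) {
        // Hypothetical columns: the real ones come from your own datasource schema.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("age", "42");
        record.put("gender", "M");

        // "ml-EXAMPLE" is a placeholder model id, not a real one.
        System.out.println(toJson(
                "ml-EXAMPLE",
                "https://realtime.machinelearning.us-east-1.amazonaws.com",
                record));
    }
}
```

Note that every value in the record is passed as a string, even numeric ones, and the target column is left out of the record since it is what we want predicted.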

Here's the output:

Nice and simple. What about performance? My previous article measured a CLI call (aka 'aws machinelearning predict') within us-east-1 at around 300ms, well above the 100ms mentioned in the FAQ.

Believe it or not, the Amazon product team got in touch (I'm not worthy, I'm not worthy... and thanks for reading!). They kindly pointed out that this measurement includes much more than the call to the web service and of course, they're right.

Fortunately, the Java program above is finer-grained and allows us to measure only the actual API call. I packaged it, deployed it to a t2.micro EC2 instance running in us-east-1 (same configuration as before) and...

Average time is around 80ms, which is indeed under the 100ms limit mentioned in the FAQ. There you go :)
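For reference, timing only the call itself can be sketched along these lines. This is a minimal illustration, not the exact code used for the measurement above: the CallTimer class is hypothetical, and the Runnable stands in for whatever performs the prediction call (here a short sleep, so the example runs without AWS credentials).

```java
final class CallTimer {

    // Runs the call `runs` times and returns the mean wall-clock time in milliseconds.
    // Only the call is timed: client creation, JVM startup, etc. stay outside the loop.
    static double averageMillis(Runnable call, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            call.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        // Stand-in for the real prediction call.
        Runnable fakeCall = () -> {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.printf("average: %.1f ms%n", averageMillis(fakeCall, 20));
    }
}
```

Measuring inside the program like this leaves out process startup and CLI overhead, which is exactly the difference between the two measurements.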

I'm definitely looking forward to using this in production. Till next time, keep rockin'.

Apr 16, 2015

Test drive: real-time prediction with Amazon Machine Learning

As explained in my previous article, Amazon ML supports both batch prediction and real-time prediction.

I burned some rubber (and some cash!) on batch prediction, but there's still some gas in the tank and tires ain't totally dead yet, so let's do a few more laps and see how the real-time thing works :)

Assuming you've already built and evaluated a model, one single operation is required to perform real-time prediction: create a real-time endpoint, i.e. a web service URL to send your requests to.

Of course, we could do this in the AWS console, but why not use the CLI instead? A word of warning: at the time of writing, the CLI package available from the AWS website suffers from a nasty bug on prediction calls (see https://forums.aws.amazon.com/thread.jspa?messageID=615018). You'll need to download and install the latest version from GitHub:

While we're at it, let's take a quick look at the CLI for Amazon ML. Here's our model, built from 100,000 lines with the format described in my previous article:

Let's also look at the evaluation for this model. The log file URL gives away the fact that Amazon ML jobs are based on EMR... in case you had any doubts ;)

All right, everything looks good. Time to create a prediction endpoint for our model:

Pretty simple. We can now see that our model is accessible at https://realtime.machinelearning.us-east-1.amazonaws.com

And now, the moment of truth: let's hit the endpoint with a record.

Boom. We got a predicted value. This actually worked on the first call, I guess caffeine was stronger this morning ;)

One last question for today: how fast is this baby? I tested two different setups:

Round trip between my office in Paris and the endpoint in us-east-1: about 700 ms
Round trip between an EC2 instance in us-east-1 and the endpoint in us-east-1: about 300 ms

(And yes, I did run the tests multiple times: these numbers are averages.)

Slower than expected. The Amazon ML FAQ says:

Q: How fast can the Amazon Machine Learning real-time API generate predictions?
Most real-time prediction requests return a response within 100 MS, making them fast enough for interactive web, mobile, or desktop applications. The exact time it takes for the real-time API to generate a prediction varies depending on the size of the input data record, and the complexity of the data processing “recipe” associated with the ML model that is generating the predictions

300 ms feels slow, especially since my model doesn't strike me as super-complicated. Maybe I'm jaded ;)

Ok, enough bitching :) This product has only been out a week or so and it's already fuckin' AWESOME. It's hard to believe Amazon made ML this simple. If I can get this to work, anyone can.

Given the right level of price, performance and scale (all will come quickly), I see this product crushing the competition... and not only other ML SaaS providers. Hardware & software vendors should start sweating even more than they already do.

C'mon, give this thing a try and tell me you're STILL eager to build Hadoop clusters and write Map-Reduce jobs. Seriously?

Till next time, keep rockin'.

Apr 14, 2015

Test drive: Amazon Machine Learning + Redshift

Last week, AWS launched their flavor of "Machine Learning as a service", aka Amazon Machine Learning. It was not a moment too soon, given the number of existing cloud-based ML propositions. To name just a few: BigML, Qubole and yes, Azure Machine Learning (pretty impressive, I'm sorry to admit).

So, here it is finally. Let's take it for a ride.

First things first: some data is needed. Time to use a little Java program that I wrote to pump out test data simulating an e-commerce web log (see Generator.java in https://github.com/juliensimon/DataStuff).

Here's the format, columns are pretty self-explanatory:

Nothing fancy, but it should do the trick.

Next step: connect to my super fancy 1-node Redshift cluster and create an appropriate table for this data:

Next, let's generate 10,000,000 lines, write them in a CSV file and upload it to my favorite S3 bucket located in eu-west-1. And now the AWS fun begins! Right now, Amazon ML is only available in us-east-1, which means that your Redshift cluster must be in the same region, as well as the S3 bucket used to output files (as I later found out). Bottom line: if everything is in us-east-1 for now, your life will be easier ;)

Lucky me: the only cross-region operation allowed in this scenario is copying data from S3 to Redshift. Here's how:

For the record, this took just under a minute for 450MB. That's roughly 7.5MB per second sustained. Not bad :)

Let's take a quick look: SELECT * FROM mydata LIMIT 10;

Looks good. Time to fire up Amazon ML. The process is quite simple:

Create a datasource, either from an S3 file or from Redshift
Pick the column you want to predict the value for (in our case, we'll use 'basket')
Send some data to build and evaluate the model (we'll use the 10M-line file)
If the model is good enough, use it to predict values for new data

Creating the datasource from Redshift is straightforward: cluster id, credentials, table name, SQL statement to build the test data.

Once connected to Redshift, Amazon ML figures out the schema and data types:

Now, let's select our target column (the one we want to predict):

Next, we can customize the model. Since this is a numerical value, Amazon ML will use a numerical regression algorithm. If we had picked a boolean value, a different algorithm would have been used. Keep an eye on this in future releases: I'm sure AWS will add more algos and allow users to tweak them much more than today.

As you can see, 70% of data is used to build the model, 30% to evaluate it.
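To illustrate what a 70/30 split means in practice, here is a small sketch in plain Java. It is only an illustration of the idea: the TrainEvalSplit class is hypothetical, and shuffling with a fixed seed is my assumption, not a description of how Amazon ML splits data internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class TrainEvalSplit {

    // Returns two lists: index 0 holds the training rows, index 1 the evaluation rows.
    // The input list is copied, so the caller's data is left untouched.
    static <T> List<List<T>> split(List<T> rows, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed: reproducible split
        int cut = (int) Math.round(shuffled.size() * trainFraction);

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.7, 42L);
        System.out.println("train: " + parts.get(0).size() + " rows, eval: " + parts.get(1).size() + " rows");
    }
}
```

The evaluation rows are never shown to the model during training, which is what makes the evaluation score meaningful.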

Next, Amazon ML ingests the data. In our case, this means 10 million lines, which takes a little while. You can see the different tasks: splitting the data, building the model, evaluating it.

A few coffees later, all tasks are completed. The longest one was by far building the ML model. The whole process lasted just under an hour (reminder: 10 columns, 10 million lines).

So, is this model any good? Amazon ML gives limited information for now, but here it is:

That promising "Explore the model performance" button displays a distribution curve of residuals for the part of the data set used to evaluate the model. Nothing extraordinary.
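If residuals are new to you: a residual is simply the actual value minus the predicted value for one evaluation record, and the curve shows how those differences are spread out. Here is a rough sketch of the computation in plain Java. The Residuals class, the bucket width and the sample values are all made up for illustration; this is not what Amazon ML runs under the hood.

```java
import java.util.Map;
import java.util.TreeMap;

final class Residuals {

    // residual[i] = actual[i] - predicted[i]
    static double[] residuals(double[] actual, double[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double[] result = new double[actual.length];
        for (int i = 0; i < actual.length; i++) {
            result[i] = actual[i] - predicted[i];
        }
        return result;
    }

    // Counts residuals per bucket; bucket k covers [k * width, (k + 1) * width).
    static Map<Long, Integer> histogram(double[] residuals, double bucketWidth) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (double r : residuals) {
            long bucket = (long) Math.floor(r / bucketWidth);
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up basket values, purely for illustration.
        double[] actual = {10, 20, 30};
        double[] predicted = {12, 20, 25};
        System.out.println(histogram(residuals(actual, predicted), 5.0));
    }
}
```

A histogram centered on zero suggests the model is not systematically over- or under-predicting; a lopsided one suggests a bias in one direction.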

As a side note, I think it's pretty interesting to see that a model can be built from totally random data. What does this say about the Java random generator? I'm not sure.

Now, we're ready to predict! Amazon ML supports batch prediction and real-time prediction through an API. I'll use batch for now. Using a second data set of 10,000 lines missing the 'basket' column, let's build a second data source (from S3 this time):

Two new tasks are created: ingest the data from S3 and predict. After 3-4 minutes, prediction is complete:

A nice distribution curve of predicted values is also available.

Actual predicted values are available in a gzip'ed text file in S3:

Pretty cool... but one last question needs to be answered. How much does it cost? Well, I did push the envelope all afternoon and so...

Over a thousand bucks. Ouch! Expensive fun indeed. I guess I'll expense that one :D

One thousand predictions cost $0.10. So, the scenario I described (model building plus 10K predictions) only costs a few dollars (thanks Jamie @ideasasylum for pointing it out).

However, if you decide to use live prediction on a high-traffic website or if you want to go crazy on data-mining, costs will rise VERY quickly. Caveat emptor. AWS has a history of adapting prices pretty quickly, let's see what happens.
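To put numbers on "VERY quickly", here is a back-of-the-envelope sketch in Java. It uses only the $0.10 per thousand predictions quoted above and ignores model building and any other charges; the 100 requests per second traffic figure is a made-up example, not a measurement.

```java
final class PredictionCost {

    // Price quoted in this article: $0.10 per 1,000 predictions.
    static final double USD_PER_THOUSAND = 0.10;

    // Prediction charges only: model building and other fees are not included.
    static double costUsd(long predictions) {
        return predictions / 1000.0 * USD_PER_THOUSAND;
    }

    public static void main(String[] args) {
        long batch = 10_000L;                   // the batch scenario from this article
        long perDay = 100L * 60 * 60 * 24;      // hypothetical site: 100 predictions per second
        System.out.printf("10K predictions: $%.2f%n", costUsd(batch));
        System.out.printf("one day at 100 req/s: $%.2f%n", costUsd(perDay));
        System.out.printf("30 days at 100 req/s: $%.2f%n", costUsd(perDay * 30));
    }
}
```

Ten thousand predictions come to about a dollar, but at the hypothetical 100 requests per second that's 8.64 million predictions a day, or roughly $864 daily in prediction charges alone.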

Final words? Amazon ML delivers prediction at scale. Ease of use and documentation are what you'd expect from AWS. Features are pretty limited and the UI is still pretty rough but good things come to those who wait, I guess. Cost rises quickly, so make sure you set and track ROI targets on your ML scenarios. Easier said than done... and that's another story :)

Till next time, keep crunchin'!

(Update: want to learn about real-time prediction with Amazon ML? Read on!)

Apr 1, 2015

AWS Startup Day: "Moving Viadeo to AWS"

Happy to share my keynote presentation at the AWS Startup Day yesterday in Paris!