Dec 31, 2015

Installing a Minecraft server on AWS

This one is for all the kids (and geek dads & moms) out there. I'm surprised I didn't get to write this sooner :)

Running your own Minecraft server in AWS is pretty straightforward, so here goes!

1) Create an EC2 instance

I've picked a t2.small instance (2 GB RAM) running the Amazon Linux AMI and costing $0.028 per hour at the time of writing. Anything smaller feels... small for a complex Java app, but you're welcome to try. Details and pricing for EC2 instances may be found here.

Only the security group needs to be tweaked: on top of SSH access, you also need to open a port for the server (default port is 25565). If you want to play with friends (which is kind of the point), allow access from anywhere.
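
If you'd rather script it than click through the console, the equivalent CLI calls look something like this (the security group id and your IP address are placeholders):

$ aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 22 --cidr 1.2.3.4/32
$ aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 25565 --cidr 0.0.0.0/0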

Your inbound rules should look like this:

Select a key pair and launch your instance.

2) Install the Minecraft server

Once you've SSH'ed into your instance, it's time to select a Minecraft version. Let's retrieve the list (conveniently hosted in S3):

$ wget https://s3.amazonaws.com/Minecraft.Download/versions/versions.json

My customer (ok, my son) wants to run 1.8.9, so let's download and extract this version:

$ mkdir minecraft
$ cd minecraft
$ wget http://s3.amazonaws.com/Minecraft.Download/versions/1.8.9/minecraft_server.1.8.9.jar

3) Configure the Minecraft server

Start it once:

$ java -jar minecraft_server.1.8.9.jar

It will fail and complain that the user agreement hasn't been approved (bleh). All you need to do is edit the eula.txt file and set:

eula=true

Feel free to edit the server.properties file to add your favorite settings.

4) Start the Minecraft server

The best way to start a long-running server without losing access to the console is to use screen. Here is a simple tutorial.

$ sudo yum install screen -y
$ screen java -jar minecraft_server.1.8.9.jar

Your server is running, congratulations. You should be able to connect to it from the Minecraft client, using your instance's public IP address and port 25565.

In the console, you will see players joining and leaving the game, as well as their UUID, e.g.:

[17:08:11] [User Authenticator #1/INFO]: UUID of player _____Dante_____ is xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

This UUID is useful to declare players as operators in ops.json:

$ cat ops.json
[
  {
    "uuid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "name": "_____Dante_____",
    "level": 4
  }
]

That's it. Don't forget to detach from your server before ending your SSH session, or it will be killed (hint: CTRL-A D). And don't forget to stop the instance when you're not playing, it will save you money :)
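
If you want to do that from the command line, it's a one-liner each way (instance id is a placeholder):

$ aws ec2 stop-instances --instance-ids i-xxxxxxxx
$ aws ec2 start-instances --instance-ids i-xxxxxxxx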

Have a nice game and happy New Year!

Dec 27, 2015

A silly little script for Amazon Redshift

One of the great things about Amazon Redshift is that it's based on PostgreSQL. Hence, our favorite PostgreSQL tools can be used, notably psql. However, building and typing the full connection string for a Redshift cluster is a bit of a drag.

So, here's a simple script (source on Github) for the lazy ones among us. It simply requires a cluster name, a database name and a user name. You will be prompted for a password, unless you have a matching entry in your .pgpass file. As a bonus, your connection will be SSL-enabled. Ain't life grand?
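
For the curious, the whole thing boils down to a couple of commands. Here's a minimal sketch (the real script on Github may differ a bit):

$ cat redshift-psql
#!/bin/bash
# Usage: redshift-psql <cluster-id> <database> <user>
# Look up the cluster endpoint and port, then connect with psql over SSL
ENDPOINT=$(aws redshift describe-clusters --cluster-identifier $1 --query 'Clusters[0].Endpoint.Address' --output text)
PORT=$(aws redshift describe-clusters --cluster-identifier $1 --query 'Clusters[0].Endpoint.Port' --output text)
psql "host=$ENDPOINT port=$PORT dbname=$2 user=$3 sslmode=require"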


Dec 17, 2015

Amazon ECS @ Docker meetups

Here are my slides on Amazon ECS, which I had the pleasure to present at the Docker meetups in Marseille and Bordeaux. Thanks guys, I had a great time :)

Dec 10, 2015

A silly little script for Amazon ECS

I'm currently spending a lot of time playing with Amazon ECS and Docker.

The ecs-cli tool is very convenient to manage clusters and tasks, but I needed a few extra things (as always) and crammed them into the ecs-find script.

Hopefully, this might come in handy for you too :)

Dec 2, 2015

Upcoming meetup: Docker Paris, December 15

Amazon Web Services is now the official sponsor of the Docker Paris meetup :)

I'll be present at each event to keep the Docker community updated on the latest AWS news, especially on ECS and Lambda. See you there!


Nov 30, 2015

Upcoming meetup: Lyon Data Science, January 7

Busy day for meetups :) The kind folks at Lyon Data Science just announced their next meetup on January 7. It will be my pleasure to speak about Amazon Redshift and Amazon Machine Learning. See you there!


Upcoming meetup: Docker Marseille, December 16

I will speak about our managed Docker service, aka Amazon EC2 Container Service (ECS), at the next Docker meetup in Marseille on December 16. See you there!


Nov 16, 2015

Filtering the AWS CLI with jq

The JSON-formatted output of the AWS CLI may sometimes feel daunting, especially with a large number of objects. Running 'aws ec2 describe-instances' with more than 10 instances? Bad idea :)

Fortunately, jq - the self-proclaimed 'sed for JSON data' - is a pretty simple way to filter this output. This will especially come in handy when writing automation scripts :)

Installing jq is one command away:
$ sudo apt-get install jq  (Debian / Ubuntu)
$ sudo yum install jq  (CentOS)
$ brew install jq  (MacOS)

Let's fire up a few EC2 instances and try the following examples.

Show only one field, OwnerId:
$ aws ec2 describe-instances | jq '.Reservations[].OwnerId'

Show all information on the first instance: 
$ aws ec2 describe-instances | jq '.Reservations[].Instances[0]'

Show the instance id and the IP address of the first instance:

$ aws ec2 describe-instances | jq '.Reservations[].Instances[0] | {InstanceId, PublicIpAddress}'
{
  "InstanceId": "i-2181b498",
  "PublicIpAddress": "52.31.231.110"
}

Show the instance id and public IP address of all instances:

$ aws ec2 describe-instances | jq '.Reservations[].Instances[] | {InstanceId, PublicIpAddress}'
{
  "InstanceId": "i-2181b498",
  "PublicIpAddress": "52.31.231.110"
}
{
  "InstanceId": "i-2081b499",
  "PublicIpAddress": "52.31.208.25"
}
{
  "InstanceId": "i-2581b49c",
  "PublicIpAddress": "52.31.207.29"
}
{
  "InstanceId": "i-2781b49e",
  "PublicIpAddress": "52.31.228.234"
}
{
  "InstanceId": "i-2681b49f",
  "PublicIpAddress": "52.31.230.63"
}
 
Fields can also be concatenated, like so:
$ aws ec2 describe-instances | jq '.Reservations[].Instances[] | .InstanceId + " " + .PublicIpAddress'
"i-2181b498 52.31.231.110"
"i-2081b499 52.31.208.25"
"i-2581b49c 52.31.207.29"
"i-2781b49e 52.31.228.234"
"i-2681b49f 52.31.230.63"


Show the instance id and launch time of all instances, sorted by IP address:
$ aws ec2 describe-instances | jq '.Reservations[].Instances[] | .PublicIpAddress + " " + .InstanceId + " " +.LaunchTime' | sort
"52.31.207.29 i-2581b49c 2015-11-16T19:14:57.000Z"
"52.31.208.25 i-2081b499 2015-11-16T19:14:57.000Z"
"52.31.228.234 i-2781b49e 2015-11-16T19:14:57.000Z"
"52.31.230.63 i-2681b49f 2015-11-16T19:14:57.000Z"
"52.31.231.110 i-2181b498 2015-11-16T19:14:57.000Z"


Next, let's add an 'environment' tag on all instances, set to either 'dev' or 'prod'.

Now, how about the instance id and IP address of all instances with a 'prod' tag:
$ aws ec2 describe-instances | jq '.Reservations[].Instances[] | select(.Tags[].Value=="prod") | .InstanceId + " " + .PublicIpAddress'
"i-2581b49c 52.31.207.29"
"i-2781b49e 52.31.228.234"
"i-2681b49f 52.31.230.63"


We barely scratched the surface, but I'm sure you get the idea. Make sure you add this nifty tool to your collection!

Nov 11, 2015

Set up the AWS CLI on MacOS in 60 seconds

I had to do this AGAIN on a new Mac, so here goes:

1) Install Homebrew

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" 

2) Install python and pip

$ brew install python 
$ curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py" 
$ sudo python get-pip.py 

3) Install Iterm2 (c'mon, how can you live without it?)

4) Install zsh and oh-my-zsh

$ sh -c "$(curl -fsSL https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"

5) Install and configure AWS CLI

$ sudo pip install awscli
$ sudo pip install awsebcli (if you work with Elastic Beanstalk)
$ aws configure 

6) Enable auto-completion for the AWS CLI in zsh

Open ~/.zshrc  and add aws to the list of plugins, e.g.: plugins=(git aws)

$ source ~/.zshrc

7) Test :)

$ aws s3 ls (or any other AWS command that you like!)
 

Talk @ Velocity Conf 2015 (Amsterdam)

Hi, my colleague Antoine and I recently gave a talk at the Velocity conference in Amsterdam.

The topic was our all-in move to AWS. The session was great fun, with plenty of questions from the audience. Thanks to everyone who attended.

Here are the slides.




May 11, 2015

Lambdas in Spark: now you're talking!

In a previous article, I tried to figure out what the hell lambdas could be good for in Java. Short version: nice, but not compelling (mostly because of my silly example).

But lo and behold, ye of little faith. The Good (?) Lord of Functional Programming has inspired another much better example in me.

Take a look at this simple Spark program. Nothing fancy, just building key/value pairs from a text file:

In the pre-lambda world, you would build key/value pairs like this:
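
It went something like this (a simplified sketch; the exact pairing logic doesn't really matter):

// Pre-Java 8: an anonymous PairFunction just to turn each line into a (key, value) pair
// Imports: org.apache.spark.api.java.*, org.apache.spark.api.java.function.PairFunction, scala.Tuple2
// sc is a JavaSparkContext
JavaRDD<String> lines = sc.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(
    new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String line) {
            return new Tuple2<String, Integer>(line.split(" ")[0], 1);
        }
    });
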
  "WTF?!" I hear you say and yes, I'd have to agree. This is pretty hard to read and definitely not Java at its finest.

Here's the lambda version:
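
Same job, one line (again a simplified sketch):

// Java 8: the same PairFunction expressed as a lambda
JavaPairRDD<String, Integer> pairs = lines.mapToPair(line -> new Tuple2<String, Integer>(line.split(" ")[0], 1));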

Do you see the light now? Oh c'mon, just a little bit? ;)

Till next time, keep codin'.

Reading list (May 2015)

Tech books: one of the great loves of my life. There isn't much I'd rather do than grab a book, lay on the couch and immerse myself in complex systems. Feeding my insatiable curiosity. Learning new skills. Grinding my mind on the whetstone once more. Me against me, my favorite battle :)

And so here's my current reading list, with some early impressions.

AWS System Administration, by Mike Ryan (O'Reilly, 2015, early release version). Possibly the first book of its kind. Yes, the AWS online documentation is very good, but the humongous amount of information is sometimes a little overwhelming. This book is a nice introduction to most AWS building blocks, with lots of real-life advice and tons of examples. A useful compass to navigate the AWS ocean.

Designing Data-Intensive Applications, by Martin Kleppmann (O'Reilly, 2015, early release version). Subtitled "the big ideas behind reliable, scalable and maintainable systems", this book covers all major concepts and techniques used to build data stores, both for OLTP and analytics: data models, storage and retrieval (yes, you will understand B-trees at last), encoding, replication, etc. Lots of illustrations, lots of examples from current technologies, lots of complex stuff explained in plain English. I like it very much so far.

Learning Spark, Matei Zaharia et al. (O'Reilly, 2015). A beginner book written by the creator of Spark (O'Reilly has another Spark book for advanced readers). This one delivers exactly what the title says and is another fine example of why O'Reilly books are the best: straight to the point and lots of examples (Python, Java, Scala). You'll be coding Spark jobs in no time. Some advanced topics are covered at the end of the book, including machine learning with MLLib.

Next on the pile:

User Story Mapping, Jeff Patton (O'Reilly, 2014) - Key Agile concept! Short version here.

Data Science From Scratch, Joel Grus (O'Reilly, 2015): "Anyone who has some amount of mathematical aptitude and some amount of programming skill has the necessary raw materials to do data science". Sounds pragmatic and bullshit-free :)

PS: anyone from O'Reilly reading this? If you feel so inclined, I'll gladly accept a t-shirt or something. Thank you.

May 7, 2015

Video: Cloud Computing World Expo 2015

Here's the video (in French) from the roundtable at Cloud Computing World Expo 2015 back in April. The topic was: "is it realistic to build your IT systems 100% in the Cloud?".

Guess what my answer was? ;)


Apr 17, 2015

Test drive: real-time prediction in Java with Amazon Machine Learning

Following up on my two previous Amazon ML articles (batch prediction with Redshift and real-time prediction with the AWS CLI), here's a quickie on implementing real-time prediction with the AWS SDK for Java.

At the time of writing, I'm using SDK version 1.9.31; here's the Maven dependency:

The code (source on Github) is pretty self-explanatory and very similar to the CLI example in the previous article; here's the gist, with a quick sketch after the list:
  • Get the list of models
  • Pick one
  • Build a prediction request: data record + endpoint
  • Fire the request
  • Get a result!
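
A stripped-down version looks something like this (class and method names from the 1.9.x SDK as I remember them; the record contents are made up):

import com.amazonaws.services.machinelearning.AmazonMachineLearningClient;
import com.amazonaws.services.machinelearning.model.*;
import java.util.HashMap;
import java.util.Map;

public class RealtimePrediction {
    public static void main(String[] args) {
        AmazonMachineLearningClient client = new AmazonMachineLearningClient();

        // Get the list of models and pick the first one
        DescribeMLModelsResult models = client.describeMLModels(new DescribeMLModelsRequest());
        MLModel model = models.getResults().get(0);

        // Build a prediction request: one data record + the real-time endpoint
        Map<String, String> record = new HashMap<String, String>();
        record.put("someColumn", "someValue");   // made-up column and value

        PredictRequest request = new PredictRequest()
                .withMLModelId(model.getMLModelId())
                .withPredictEndpoint(model.getEndpointInfo().getEndpointUrl())
                .withRecord(record);

        // Fire the request and print the result
        PredictResult result = client.predict(request);
        System.out.println("Predicted value: " + result.getPrediction().getPredictedValue());
    }
}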

Here's the output:

Nice and simple. What about performance? My previous article measured a CLI call (aka 'aws machinelearning predict') within us-east-1 at around 300ms, well above the 100ms mentioned in the FAQ.

Believe it or not, the Amazon product team got in touch (I'm not worthy, I'm not worthy... and thanks for reading!). They kindly pointed out that this measurement includes much more than the call to the web service, and of course, they're right.

Fortunately, the Java program above is finer-grained and allows us to measure only the actual API call. I packaged it, deployed it to a t2.micro EC2 instance running in us-east-1 (same configuration as before) and...

Average time is around 80ms, which is indeed under the 100ms limit mentioned in the FAQ. There you go :)

I'm definitely looking forward to using this in production. Till next time, keep rockin'.

Apr 16, 2015

Test drive: real-time prediction with Amazon Machine Learning

As explained in my previous article, Amazon ML supports both batch prediction and real-time prediction.

I burned some rubber (and some cash!) on batch prediction, but there's still some gas in the tank and tires ain't totally dead yet, so let's do a few more laps and see how the real-time thing works :)

Assuming you've already built and evaluated a model, one single operation is required to perform real-time prediction: create a real-time endpoint, i.e. a web service URL to send your requests to.

Of course, we could do this in the AWS console, but why not use the CLI instead? A word of warning: at the time of writing, the CLI package available from the AWS website suffers from a nasty bug on prediction calls (see https://forums.aws.amazon.com/thread.jspa?messageID=615018). You'll need to download and install the latest version from Github:
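
Installing straight from the repository goes something like this:

$ git clone https://github.com/aws/aws-cli.git
$ cd aws-cli
$ sudo python setup.py install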

While we're at it, let's take a quick look at the CLI for Amazon ML. Here's our model, built from 100,000 lines with the format described in my previous article:
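
If you want to poke at it yourself, the relevant calls look something like this (ids are placeholders):

$ aws machinelearning describe-ml-models
$ aws machinelearning get-ml-model --ml-model-id ml-XXXXXXXXXXXX
$ aws machinelearning get-evaluation --evaluation-id ev-XXXXXXXXXXXX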

Let's also look at the evaluation for this model. The log file URL gives away the fact Amazon ML jobs are based on EMR... in case you had any doubts ;)
All right, everything looks good. Time to create a prediction endpoint for our model:
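
The call itself is a one-liner (model id is a placeholder):

$ aws machinelearning create-realtime-endpoint --ml-model-id ml-XXXXXXXXXXXX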

Pretty simple. We can now see that our model is accessible at https://realtime.machinelearning.us-east-1.amazonaws.com

And now, the moment of truth: let's hit the endpoint with a record.
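
Something like this, with a made-up record (use your own columns):

$ aws machinelearning predict --ml-model-id ml-XXXXXXXXXXXX --predict-endpoint https://realtime.machinelearning.us-east-1.amazonaws.com --record '{"someColumn": "someValue"}'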

Boom. We got a predicted value. This actually worked on the first call, I guess caffeine was stronger this morning ;)

One last question for today: how fast is this baby? I tested two different setups:
  • Return trip between my office in Paris and the endpoint in us-east-1: about 700 ms
  • Return trip between an EC2 instance in us-east-1 and the endpoint in us-east-1: about 300 ms

(and yes, I did run the tests multiple times. These numbers are average values).

Slower than expected. The Amazon ML FAQ says:

Q: How fast can the Amazon Machine Learning real-time API generate predictions?
Most real-time prediction requests return a response within 100 MS, making them fast enough for interactive web, mobile, or desktop applications. The exact time it takes for the real-time API to generate a prediction varies depending on the size of the input data record, and the complexity of the data processing “recipe” associated with the ML model that is generating the predictions

300 ms feels slow, especially since my model doesn't strike me as super-complicated. Maybe I'm jaded ;)

Ok, enough bitching :) This product has only been out a week or so and it's already fuckin' AWESOME. It's hard to believe Amazon made ML this simple. If I can get this to work, anyone can.

Given the right level of price, performance and scale (all will come quickly), I see this product  crushing the competition... and not only other ML SaaS providers. Hardware & software vendors should start sweating even more than they already do.

C'mon, give this thing a try and tell me you're STILL eager to build Hadoop clusters and write Map-Reduce jobs. Seriously?

Till next time, keep rockin'.

Apr 14, 2015

Test drive: Amazon Machine Learning + Redshift

Last week, AWS launched their flavor of "Machine Learning as a service", aka Amazon Machine Learning. It was not a moment too soon, given the number of existing cloud-based ML offerings. To name just a few: BigML, Qubole and yes, Azure Machine Learning (pretty impressive, I'm sorry to admit).

So, here it is finally. Let's take it for a ride.

First things first: some data is needed. Time to use a little Java program that I wrote to pump out test data simulating an e-commerce web log (see Generator.java in https://github.com/juliensimon/DataStuff).

Here's the format, columns are pretty self-explanatory:

Nothing fancy, but it should do the trick.

Next step: connect to my super fancy 1-node Redshift cluster and create an appropriate table for this data:


Next, let's generate 10,000,000 lines, write them in a CSV file and upload it to my favorite S3 bucket located in eu-west-1. And now the AWS fun begins! Right now, Amazon ML is only available in us-east-1, which means that your Redshift cluster must be in the same region, as well as the S3 bucket used to output files (as I later found out). Bottom line: if everything is in us-east-1 for now, your life will be easier ;)

Lucky me, the only cross-region operation allowed in this scenario is copying data from S3 to Redshift. Here's how:
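
The command looks roughly like this (bucket name and credentials are placeholders):

COPY mydata FROM 's3://my-bucket/mydata.csv'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
CSV REGION 'eu-west-1';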

For the record, this took just under a minute for 450MB. That's roughly 8MB per second sustained, across regions. Not bad :)


Let's take a quick look: SELECT * FROM mydata LIMIT 10;


Looks good. Time to fire up Amazon ML. The process is quite simple:
  1. Create a datasource, either from an S3 file or from Redshift
  2. Pick the column you want to predict the value for (in our case, we'll use 'basket')
  3. Send some data to build and evaluate the model (we'll use the 10M-line file)
  4. If the model is good enough, use it to predict values for new data

Creating the datasource from Redshift is straightforward: cluster id, credentials, table name, SQL statement to build the test data.


Once connected to Redshift, Amazon ML figures out the schema and data types:


Now, let's select our target column (the one we want to predict):


Next, we can customize the model. Since this is a numerical value, Amazon ML will use a numerical regression algorithm. If we had picked a boolean value, a different algorithm would have been used. Keep an eye on this in future releases; I'm sure AWS will add more algorithms and allow users to tweak them much more than today.

As you can see, 70% of data is used to build the model, 30% to evaluate it.

Next, Amazon ML ingests the data. In our case, this means 10 million lines, which takes a little while. You can see the different tasks: splitting the data, building the model and evaluating it.

A few coffees later, all tasks are completed. The longest one was by far building the ML model. The whole process lasted just under an hour (reminder: 10 columns, 10 million lines).

So, is this model any good? Amazon ML gives limited information for now, but here it is:

That promising "Explore the model performance" button displays a distribution curve of residuals for the part of the data set used to evaluate the model. Nothing extraordinary.

As a side note, I think it's pretty interesting to see that a model can be built from totally random data. What does this say about the Java random generator? I'm not sure.

Now, we're ready to predict! Amazon ML supports batch prediction and real-time prediction through an API. I'll use batch for now. Using a second data set of 10,000 lines missing the 'basket' column, let's build a second data source (from S3 this time):


Two new tasks are created: ingesting the data from S3 and predicting. After 3-4 minutes, the prediction is complete:

A nice distribution curve of predicted values is also available.

Actual predicted values are available in a gzip'ed text file in S3:

Pretty cool... but one last question needs to be answered. How much does it cost? Well, I did push the envelope all afternoon and so...


Over a thousand bucks. Ouch! Expensive fun indeed. I guess I'll expense that one :D

One thousand predictions cost $0.1. So, the scenario I described (model building plus 10K predictions) only costs a few dollars (thanks Jamie @ideasasylum for pointing it out).

However, if you decide to use live prediction on a high-traffic website or if you want to go crazy on data mining, costs will rise VERY quickly. Caveat emptor. AWS has a history of adapting prices pretty quickly, so let's see what happens.

Final words? Amazon ML delivers prediction at scale. Ease of use and documentation are what you'd expect from AWS. Features are pretty limited and the UI is still pretty rough but good things come to those who wait, I guess. Cost rises quickly, so make sure you set and track ROI targets on your ML scenarios. Easier said than done... and that's another story :)

Till next time, keep crunchin'!

(Update: want to learn about real-time prediction with Amazon ML? Read on!)

Mar 19, 2015

Java 8 and lambdas : ooooh, that's how it works, then.


Disclaimer: several people "complained" about this code being overly complicated & verbose. Trust me, I did it on purpose to illustrate what many legacy Java apps look like these days. Hopefully lambdas can help us clean up some of the mess that has accumulated over the years. Read on!

One of the key new features of Java 8 is the introduction of functional programming and lambda expressions. This article will give you a real-life example of how lambdas can be introduced in existing code, reducing verbosity and increasing flexibility... or so they say ;)

All code for this article is part of a larger repository available at:
github.com/juliensimon/SortAndSearch

Let's get started. Imagine you had to code a linear search method for lists, which could apply different policies when an element has been found:
  • move the element to the front of the list (aka "Most Recently Used"),
  • move it to the back of the list (aka "Least Recently Used"),
  • move it one position closer to the front of the list.

After some experimentation ("Cut and paste? Yuck. Anonymous classes? Meh!"), you would realize that these policies only differ by a couple of instructions and that there's an elegant way to factor them out nicely: behavior injection.

Surely, you would get something similar to:
  • A linearSearch() method implementing the Strategy pattern through an object implementing the LinearSearchMode interface,
  • A LinearSearchMode interface defining a single method in charge of performing the actual move,
  • Three LinearSearchModeXXX classes implementing the interface for each move policy,
  • A LinearSearchModeFactory class, implementing the Factory pattern to build LinearSearchModeXXX objects while hiding their internals.

Here's the corresponding code. Pretty canonical, I guess.
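
In a nutshell, it looks like this (a simplified sketch; the real code in the repository has all three policies and a bit more polish):

import java.util.List;

// The strategy: what to do with the element once it has been found
interface LinearSearchMode<T> {
    void move(List<T> list, int index);
}

// One class per policy (only one shown here, the other two are similar)
class LinearSearchModeMoveFirst<T> implements LinearSearchMode<T> {
    public void move(List<T> list, int index) {
        list.add(0, list.remove(index));    // move the element to the front (MRU)
    }
}

// The factory hides the concrete classes from callers
class LinearSearchModeFactory {
    static final int modeMoveFirst = 0, modeMoveLast = 1, modeMoveUp = 2;

    static <T> LinearSearchMode<T> build(int mode) {
        switch (mode) {
            case modeMoveFirst: return new LinearSearchModeMoveFirst<T>();
            // modeMoveLast and modeMoveUp omitted for brevity
            default: throw new IllegalArgumentException("unknown mode");
        }
    }
}

class LinearSearch {
    // Find the element, then delegate the move to the injected policy
    static <T> int linearSearch(List<T> list, T target, LinearSearchMode<T> mode) {
        int index = list.indexOf(target);
        if (index != -1) {
            mode.move(list, index);
        }
        return index;
    }
}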



You would use the whole thing like this:

linearSearch(list, t, LinearSearchModeFactory.build(LinearSearchModeFactory.modeMoveFirst));

Looks familiar? Yeah, thought so ;) Now, although this is pretty neat and tidy, this is awfully verbose, isn't it? There's a lot of boilerplate code and technical classes just to change a couple of lines of code. So yes, we did avoid cut-and-paste and code duplication, but in the end, we still had to write quite a bit of code. For lazy programmers, this is always going to be a problem :)

And so, tada! Lambda expressions. You can read the theory elsewhere; here's what it means for our code:
  • No change to our linearSearch() method,
  • No change to our LinearSearchMode interface,
  • No more need for LinearSearchModeXXX and LinearSearchModeFactory, which is great because they really added nothing to the logic of our application.
All you have to do, then, is to replace the mode parameter in linearSearch() with a lambda expression implementing the move policy :
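
For instance (reusing the LinearSearchMode signature from the sketch above; Collections is java.util.Collections):

// Most Recently Used: move the element we just found to the front
linearSearch(list, t, (l, index) -> l.add(0, l.remove(index)));

// Least Recently Used: move it to the back
linearSearch(list, t, (l, index) -> l.add(l.remove(index)));

// Or move it one position closer to the front
linearSearch(list, t, (l, index) -> { if (index > 0) Collections.swap(l, index, index - 1); });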

Pretty nerdy, huh? I like it. Less code, less bugs, less problems, more sleep! Adding another move policy would only mean adding a couple of lines, so less verbosity and more flexibility indeed.

For the record, let me say that I've always avoided functional languages like the plague (shoot me). I don't expect this to change in the near future, but I have to admit that the introduction of lambdas in Java 8 does solve a number of problems, so there. I'll keep digging :)

That's it for today. Till next time, keep rockin'.

Feb 25, 2015

CTO Crunch @ France Digitale

Here are my slides from last night's "CTO Crunch" event by France Digitale. Thanks to everyone in the audience, lots of great questions!


Feb 19, 2015

Public appearances :-P

Today is PR day ;) I've been invited to talk at two very cool technical events happening in Paris in the next few weeks.

The first one is the "CTO Crunch" event organized by France Digitale on February 24, where I will talk about the challenges faced by fast-growing teams & platforms, how to detect them early and how to fix them, aka CTO lessons from the trenches :) Only a few seats left, act fast if you're interested.

The second one is the AWS Startup Day, where I will be giving a keynote speech on how we're currently moving Viadeo to the AWS cloud. Registration is open as well.

If you're in the Paris area, please consider dropping by. I'm happy to meet and maybe have a beer or two :)

Feb 16, 2015

Viadeo @ Devoxx France 2015

Hello,

Working at Viadeo (www.viadeo.com) now. The Aldebaran adventure didn't work out as planned: details in this Rude Baguette post, which feels to me like a pretty accurate description of what happened.

Anyway, I'm super happy to announce that my team will be giving two technical talks at the upcoming Devoxx France conference in Paris (April 8-10).

The first talk is entitled "Un News Feed temps réel personnalisé pour 65 millions d'utilisateurs" (a personalized, real-time news feed for 65 million users) and will present how we rebuilt the Viadeo newsfeed, how we deliver personalized content to our users in real time, why we love Elasticsearch, etc.

The second talk is entitled "Toute la stack technique de Viadeo sur un poste de développement avec Docker" (Viadeo's full technical stack on a development machine with Docker) and will present how a Viadeo developer can run our entire stack on his own machine by launching a simple Docker script!

If you're attending Devoxx France, don't miss these sessions and please drop by to say hi!