Thursday, 4 September 2014

Getting started with Hive

Introducing Hive

Installing Hive is straightforward (there is not much to configure):
$ wget
$ tar xzf apache-hive--bin.tar.gz
$ cd apache-hive--bin/bin/
hive> show tables;

Note that the environment variable HIVE_HOME is not required (unlike for Hadoop, HBase or Tez). Also, hive-site.xml is not required, but if we want to use an HDFS directory, it should contain something like:
  <description>location of default database for the warehouse</description>

Introducing HQL

Let's create a table 'quotes' and make it available to other Hadoop programs as a text file:
hive> CREATE EXTERNAL TABLE quotes (symbol STRING, name STRING, price DOUBLE)
    > LOCATION '/tmp/quotes.txt';

Then we can load data into this table, for instance from a local file quotes.csv that looks like:
"GE","General Electric ",28.09
"MSFT","Microsoft Corpora",41.66
"GOOG","Google Inc.",604.83
"GM","General Motors Co",41.85
"FB","Facebook, Inc.",72.59
"AAPL","Apple Inc.",607.33
"T","AT&T Inc.",37.15
"VZ","Verizon Communica",52.06
"TM","Toyota Motor Corp",134.94

with the following query:
hive> LOAD DATA LOCAL INPATH '/path/to/quotes.csv'
    > INTO TABLE quotes;

Once the table is filled, we can query it with things like:
hive> SELECT * FROM quotes;
hive> SELECT symbol FROM quotes;

We can export the result of a query and save it locally, say under /tmp/..:
    > FROM quotes
    > WHERE quotes.price > 100;
The result of this export is a set of files under the quotes_100 directory; the quotes that match the criterion can be found in a file named 000000_0.

Tuning Hive

Understanding how Hive plans and executes queries is essential for performance tuning. One way to inspect the query plan is the EXPLAIN keyword:
hive> EXPLAIN SELECT *
    > FROM quotes
    > WHERE quotes.price > 100;
hive> EXPLAIN SELECT SUM(price) FROM quotes;
The result shows how these queries are translated into a sequence of operations called stages, for instance map-reduce, sampling, merge, or limit stages.
Adding the EXTENDED keyword after EXPLAIN (i.e. EXPLAIN EXTENDED followed by the query) provides even more details about the query execution plan.

By default, Hive executes one stage at a time. This default behavior can be overridden by setting the hive.exec.parallel property to true in hive-site.xml:
  <description>Whether to execute jobs in parallel</description>

The number of mappers/reducers launched is determined by the size of the input files divided by the default size attributed to a given task; it can be configured via:
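To make the arithmetic concrete, here is a minimal Java sketch of the "input size divided by size per task" rule. It is only an illustration: the class name, the method name and the figures are made up for this example, and a real Hive/Hadoop job also takes file boundaries and other split settings into account.

```java
// Rough estimate of the number of tasks: ceiling of (input size / size per task).
// All names and figures here are hypothetical; they are not Hive settings.
class TaskCountEstimate {

    static long estimateTasks(long totalInputBytes, long bytesPerTask) {
        if (bytesPerTask <= 0) {
            throw new IllegalArgumentException("bytesPerTask must be positive");
        }
        if (totalInputBytes <= 0) {
            return 0; // nothing to read, no task needed
        }
        // integer ceiling division
        return (totalInputBytes + bytesPerTask - 1) / bytesPerTask;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 1 GB of input with 256 MB per task
        System.out.println(estimateTasks(1024 * mb, 256 * mb)); // prints 4
    }
}
```

Halving the size attributed to a task roughly doubles the number of tasks, which is why this setting matters for tuning.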


Monday, 1 September 2014

Troubleshooting an Ubuntu Server network interface

I installed Ubuntu Server on VirtualBox, and when I activated a second network adapter in bridged mode, it was not automatically configured in Ubuntu.
In fact, the interface could not be seen with ifconfig, and ifconfig -a showed it as disabled.
I tried to bring it up and restart the networking service:
$ifconfig eth1 up
$/etc/init.d/networking restart
Now the interface was active, but it only had an IPv6 address, and when I restarted the virtual machine the interface was disabled again.
When I checked /etc/network/interfaces, there was no entry for eth1, so I added one so that it would be configured automatically:
$vi /etc/network/interfaces

auto eth1
iface eth1 inet dhcp

That's it, now the interface works fine.

Monday, 18 August 2014

Comparison between caching systems for Java

Servers are getting more and more powerful, with a lot of RAM (from hundreds up to thousands of gigabytes). However, it is still not possible to use most of this capacity directly in Java applications because of inherent limitations of the JVM's GC (Garbage Collector), which may pause the application for a long time (even up to several minutes) to move objects between generations.

What follows is a description and comparison of some solutions, also called data grids, that can be used to face this problem, like the Infinispan project of JBoss (ex. JBoss Cache), DirectMemory (an Apache proposal), EhCache (from Terracotta), etc.


1. Infinispan (JBoss Data Grid Platform)
  • Does not provide support for expiration events, as discussed in the forum.
  • SingleFileCacheStore: a file-based cache store that manages data activation (loading from store to cache) and passivation (saving data to store).
  • List of possible attributes in the XML configuration for infinispan 4.0 and infinispan 6.0.

2. MapDB
  • Exists only in embedded mode
  • Enables the creation of on-heap and off-heap collections (map, queue), as well as file-backed collections
  • Listeners registered to cache events are notified in the main thread (i.e. asynchronous notification has to be implemented by the user)
  • Can be used for lazy loading (e.g.
  • Provides means for pumping the integral data available on memory to disk (e.g.
  • Transaction isolation level is Serializable, which is the highest level and means a new transaction can be started only once the previous one has been committed.
  • Transactions use a global lock, which considerably reduces the cache performance.

3. Akiban's Persistit - github
4. JCS (Java Caching System)
5. Hazelcast
6. GridGain

7. Others: LArray, Cache2K, DirectMemory (an off-heap memory store; initial project on github, Apache proposal for incubation), MVStore (the storage subsystem of the H2 database), Spring cache, HugeCollections.

  • A good explanation of the use of ByteBuffer to build non-heap memory caches by Keith Gregory: blog post, JUG presentation, another one.
  • An article on InfoQ about a HashMap implementation for an off-heap map.
  • An IBM Redbook on capacity for big data and off-heap memory.
  • Examples related to the use of EhCache from a Devoxx 2014 presentation.
  • Cache2K vs Infinispan/EhCache/JCS - bench
  • Radargun a framework for benchmarking data grids
Memory storage

In-memory databases (a detailed description can be found at Information Week):
  • NoSQL approaches (covers the class of nonrelational and horizontally scalable databases) like Aerospike.
  • NewSQL approaches (emerging databases offering NoSQL scalability but with familiar SQL query capabilities, i.e. SQL-compliant) like VoltDB, Oracle TimesTen, IBM solidDB, MemSQL.
Companies like Microsoft, Oracle and IBM chose to add in-memory support to their traditional databases (e.g. moving tables to memory), whereas SAP adopted another approach with its Hana platform, which aims to put everything in memory.

Some traditional RDBMSs, like SQLite or MySQL, can be configured to store their data in memory instead of on disk.

Friday, 13 June 2014

Getting started with HBase

HBase indexes data based on 4D coordinates: row key, column family (a collection of columns), column qualifier and version. As a result, HBase can be considered a key-value store where the key is the 4D coordinates and the value is the cell. Depending on how many of these coordinates are specified in a query, the value may be a map or a map of maps.
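One way to picture these coordinates is as nested sorted maps. The sketch below uses only plain Java collections, not the HBase API; the class and method names are made up for illustration. It shows why specifying fewer coordinates yields a map instead of a single value.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model (not HBase code): row key -> family -> qualifier -> version -> value.
class FourDimensionalStore {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    void put(String rowKey, String family, String qualifier, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            // versions are sorted newest first
            .computeIfAbsent(qualifier, k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(version, value);
    }

    // Three coordinates given: the result is a map of version -> value.
    NavigableMap<Long, String> versions(String rowKey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families = rows.get(rowKey);
        if (families == null) {
            return new TreeMap<>();
        }
        NavigableMap<String, NavigableMap<Long, String>> qualifiers = families.get(family);
        if (qualifiers == null) {
            return new TreeMap<>();
        }
        NavigableMap<Long, String> cells = qualifiers.get(qualifier);
        if (cells == null) {
            return new TreeMap<>();
        }
        return cells;
    }

    // Newest value of a cell, or null when the cell does not exist.
    String latest(String rowKey, String family, String qualifier) {
        NavigableMap<Long, String> cells = versions(rowKey, family, qualifier);
        return cells.isEmpty() ? null : cells.firstEntry().getValue();
    }

    public static void main(String[] args) {
        FourDimensionalStore store = new FourDimensionalStore();
        store.put("TheRealJD", "info", "password", 1L, "pass00");
        store.put("TheRealJD", "info", "password", 2L, "securepass");
        System.out.println(store.latest("TheRealJD", "info", "password")); // prints securepass
    }
}
```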


Installing the latest stable version of HBase:
$ mkdir hbase-install
$ cd hbase-install
$ wget
$ tar xvfz hbase-0.98.3-hadoop2-bin.tar.gz
$ export HBASE_HOME=`pwd`/hbase-0.98.3-hadoop2

Adding the HBase program to path
$ export PATH=$PATH:$HBASE_HOME/bin/

# the JAVA_HOME variable needs to be set already; if you're using OpenJDK, you can set it to:
$ export JAVA_HOME=/usr/lib/jvm/default-java

Running a standalone version

Once the master is launched, you can access the web admin interface at http://localhost:60010/

By default, hbase will write data into /tmp directory. You can change this by editing $HBASE_HOME/conf/hbase-site.xml and setting the following property (the complete list of properties can be found in the official documentation):

The $HBASE_HOME/conf/ bash file can be run to set up the HBase configuration, for instance to set environment variables. For further information on configuring HBase, check the official documentation.

Shell-based interaction
Along with the installation binaries, there is a JRuby-based shell that wraps a Java client to interact with HBase, either interactively (sending commands and receiving responses directly on the terminal) or via bash scripts.

To validate the installation, let's run the HBase shell and manipulate some data:
$ hbase shell
# check existing tables
hbase(main):001:> list
# create a table with a column family 'cf'
hbase(main):002:> create 'mytable', 'cf'
# write 'hello hbase' in first row of column 'cf:message' of table 'mytable'
hbase(main):003:> put 'mytable', 'first', 'cf:message', 'hello HBase'
# create a users table with an 'info' family
hbase(main):004:> create 'users', 'info'
hbase(main):005:> put 'mytable', 'second', 'cf:foo', 3.14159
hbase(main):006:> put 'users', 'first', 'info:username', "John Doe"
# reading the first row from a table
hbase(main):007:> get 'mytable', 'first'
# reading all the rows from a table
hbase(main):008:> scan 'mytable'

Java-based interaction

// define a custom configuration (by default the content of hbase-site.xml is used)
Configuration myConf = HBaseConfiguration.create();
myConf.set("param_name", "param_value");

// e.g. to connect to a remote HBase instance you need to set Zookeeper quorum address and port number
myConf.set("hbase.zookeeper.quorum", "serverip");
myConf.set("", "2181");

// establish a connection
HTableInterface myTable = new HTable(myConf, "users");

// Use pool for a better reuse of connections which are expensive resources
HTablePool pool = new HTablePool(myConf, max_nb_connection);
HTableInterface myTable = pool.getTable("mytable");
// close the connection, which returns it to the pool
myTable.close();

In HBase, data is manipulated as bytes, so Java types have to be converted into raw bytes with the help of the utility class Bytes. The HBase API for manipulating data is divided into commands: Get, Put, Delete, Scan and Increment. For example, data can be stored as follows:
// create a command with row key TheRealJD

Put p = new Put(Bytes.toBytes("TheRealJD"));

// add information about user
p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("John Doe"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes(""));
p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("pass00"));

Once the entry is ready, we can send it to HBase for persistence:
HTableInterface usersTable = pool.getTable("users");
// send the Put prepared above, then release the table
usersTable.put(p);
usersTable.close();

The Put command can also be used to update the user information:
Put p = new Put(Bytes.toBytes("TheRealJD"));

p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("securepass"));


The HBase client does not interact directly with the storage layer, which is made of HFiles. Instead, HBase first writes every operation to a Write-Ahead Log (WAL) for durability and failure recovery, while the data itself is stored in an in-memory region called the MemStore. Once the MemStore is full, its entire content is flushed to a new immutable file called an HFile (existing HFiles are never modified).
This can be customized. For instance, the size of this region can be set via the hbase.hregion.memstore.flush.size parameter. Also, the WAL can be disabled with:
Put p = new Put();
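The write path described above (log first, buffer in memory, flush to an immutable file when full) can be sketched with plain Java collections. This is a toy model and not HBase code: the class and field names are invented, and the flush threshold is counted in entries whereas HBase uses a size in bytes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the write path: WAL, then MemStore, then immutable "HFiles".
class ToyRegionStore {
    private final int flushThreshold;                     // number of entries, not bytes
    final List<String> wal = new ArrayList<>();           // append-only log
    private NavigableMap<String, String> memStore = new TreeMap<>();
    final List<NavigableMap<String, String>> hFiles = new ArrayList<>();

    ToyRegionStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);   // 1. record the operation for recovery
        memStore.put(key, value);     // 2. buffer it in sorted order in memory
        if (memStore.size() >= flushThreshold) {
            flush();                  // 3. write a new immutable file when full
        }
    }

    private void flush() {
        hFiles.add(Collections.unmodifiableNavigableMap(memStore));
        memStore = new TreeMap<>();   // existing files are never modified
    }

    // Reads check the MemStore first, then the files from newest to oldest.
    String get(String key) {
        String value = memStore.get(key);
        for (int i = hFiles.size() - 1; value == null && i >= 0; i--) {
            value = hFiles.get(i).get(key);
        }
        return value;
    }

    public static void main(String[] args) {
        ToyRegionStore store = new ToyRegionStore(2);
        store.put("a", "1");
        store.put("b", "2");   // threshold reached: first file is written
        store.put("a", "3");   // newer value stays in the MemStore
        System.out.println(store.get("a") + " " + store.hFiles.size()); // prints 3 1
    }
}
```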

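The Get command is used to query data from a given set of columns:
Get g = new Get(Bytes.toBytes("TheRealJD"));
// fetch the whole 'info' family, since both the email and the password are read below
g.addFamily(Bytes.toBytes("info"));
// ask for the two most recent versions so that the history can be inspected
g.setMaxVersions(2);
Result r = usersTable.get(g);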
byte[] b = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
String email = Bytes.toString(b);
As HBase is versioned, we can look at particular values in the history:
List<KeyValue> passwords = r.getColumn(Bytes.toBytes("info"), Bytes.toBytes("password"));
b = passwords.get(0).getValue();
String currentPassword = Bytes.toString(b);
b = passwords.get(1).getValue();
String previousPassword = Bytes.toString(b);

// by default, the version is the timestamp in milliseconds of the moment when the operation was performed
long version = passwords.get(0).getTimestamp();

The Delete command is used to delete data from HBase:
Delete d = new Delete(Bytes.toBytes("TheRealJD"));

// remove the latest version of one column
d.deleteColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));

// remove all versions of a column (a Delete with no column specified removes the entire row)
d.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("email"));


The delete operation is logical, meaning the record concerned is flagged as deleted and will no longer be returned by a get or a scan. It is not until compaction (merging two HFiles into a single bigger one) that the record is effectively deleted. More details on the compaction operation can be found in this article.
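The idea of a logical delete followed by a physical one at compaction time can be sketched as follows. This is a simplified model with plain Java maps standing in for HFiles; the marker value and all names are assumptions made for the illustration and do not reflect how HBase encodes delete markers.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model: a delete writes a marker; compaction merges files and drops marked keys.
class TombstoneCompaction {
    static final String TOMBSTONE = "__deleted__"; // hypothetical marker value

    // Reads look at the newest file first; a marker hides any older value.
    static String read(List<? extends Map<String, String>> filesOldestFirst, String key) {
        for (int i = filesOldestFirst.size() - 1; i >= 0; i--) {
            String value = filesOldestFirst.get(i).get(key);
            if (value != null) {
                return TOMBSTONE.equals(value) ? null : value;
            }
        }
        return null;
    }

    // Merges the files (newer entries win) and physically removes deleted keys.
    static TreeMap<String, String> compact(List<? extends Map<String, String>> filesOldestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Map<String, String> file : filesOldestFirst) {
            merged.putAll(file);
        }
        merged.values().removeIf(TOMBSTONE::equals);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> older = new TreeMap<>();
        older.put("email", "jd@example.org");
        older.put("name", "John Doe");
        Map<String, String> newer = new TreeMap<>();
        newer.put("email", TOMBSTONE); // logical delete of the email
        List<Map<String, String>> files = Arrays.asList(older, newer);
        System.out.println(read(files, "email"));      // prints null
        System.out.println(compact(files).keySet());   // prints [name]
    }
}
```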

Creating a table programmatically
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("UserFeed");
// create a column family
HColumnDescriptor c = new HColumnDescriptor("stream");
// attach the family to the table descriptor and create the table
desc.addFamily(c);
admin.createTable(desc);

Once the table is created, we can insert data into it. We may hash the row key used for users (i.e. TheRealJD) to avoid variable-length row keys and to get better performance:
// prepare the value of the row key
int longLength = Long.SIZE / 8;
byte[] userHash = Md5Utils.md5sum("TheRealJD");
byte[] timestamp = Bytes.toBytes(-1 * System.currentTimeMillis());
byte[] rowKey = new byte[Md5Utils.MD5_LENGTH + longLength];
int offset = 0;
offset = Bytes.putBytes(rowKey, offset, userHash, 0, userHash.length);
Bytes.putBytes(rowKey, offset, timestamp, 0, timestamp.length);
// prepare the put command
Put put = new Put(rowKey);
// we may need to store the real value of user id to be able to find the associated user when scanning the feeds table
put.add(Bytes.toBytes("UserFeed"), Bytes.toBytes("user"), Bytes.toBytes("TheRealJD"));
put.add(Bytes.toBytes("UserFeed"), Bytes.toBytes("feed"), Bytes.toBytes("Hello world!"));
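
For readers without the HBase helpers at hand, the same row key layout (a fixed-length MD5 hash of the user id followed by the negated timestamp) can be sketched with the JDK only. The class below is an illustration, not the code used in this post: MessageDigest and ByteBuffer stand in for Md5Utils and Bytes, and all names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Fixed-length row key: 16 bytes of MD5(user id) + 8 bytes of (-1 * timestamp).
class FeedRowKey {
    static final int MD5_LENGTH = 16;

    static byte[] md5(String text) {
        try {
            return MessageDigest.getInstance("MD5").digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    static byte[] rowKey(String userId, long timestampMillis) {
        // ByteBuffer writes the long in big-endian order
        return ByteBuffer.allocate(MD5_LENGTH + Long.BYTES)
                .put(md5(userId))
                .putLong(-1 * timestampMillis)
                .array();
    }

    // Recovers the timestamp stored after the hash.
    static long timestampOf(byte[] rowKey) {
        return -1 * ByteBuffer.wrap(rowKey).getLong(MD5_LENGTH);
    }

    public static void main(String[] args) {
        byte[] key = rowKey("TheRealJD", System.currentTimeMillis());
        System.out.println(key.length); // prints 24
    }
}
```

Whatever the length of the user id, the key is always 24 bytes long, and all the rows of one user share the same 16-byte prefix.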

When it comes to scanning the feeds table, things get easy because the row key starts with a hash of the user row key.
byte[] userHash = Md5Utils.md5sum(user);
// start key: the user hash followed by zeros
byte[] startRow = Bytes.padTail(userHash, longLength);
// create a stop key equal to the increment of the last byte of the user hash
byte[] stopRow = Bytes.padTail(userHash, longLength);
stopRow[Md5Utils.MD5_LENGTH - 1]++;
Scan s = new Scan(startRow, stopRow);
ResultScanner rs = feedsTable.getScanner(s);
// extract the columns (as created previously) from each result
for(Result r: rs) {
  // extract the username
  byte[] b = r.getValue(Bytes.toBytes("UserFeed"), Bytes.toBytes("user"));
  String user = Bytes.toString(b);
  // extract the feed
  b = r.getValue(Bytes.toBytes("UserFeed"), Bytes.toBytes("feed"));
  String feed = Bytes.toString(b);
  // extract the timestamp
  b = Arrays.copyOfRange(r.getRow(), Md5Utils.MD5_LENGTH, Md5Utils.MD5_LENGTH+longLength);
  DateTime dt = new DateTime(-1 * Bytes.toLong(b));
}
By default, each RPC call from the client to HBase returns only 1 row (i.e. no caching), which is inefficient when scanning a whole table. We can make each call return n rows by setting the property hbase.client.scanner.caching or by calling Scan.setCaching(int).

Continue here.


Wednesday, 28 May 2014

Indexing keys and values in MapDB

MapDB is a high-performance pure Java database. It provides concurrent collections (Maps, Sets and Queues) backed by disk storage or off-heap memory.
It also provides a powerful mechanism to synchronize collections, which can be used to build multiple indexes on a primary collection. The following example shows how to index the keys and also the values of a main collection.

1. Define a serializable class
// this class should implement Serializable in order to be stored
public class Person implements Serializable {
  String firstname;
  String lastname;
  Integer age;
  boolean male;

  public Person(String f, String l, Integer a, boolean m) {
    this.firstname = f;
    this.lastname = l;
    this.age = a;
    this.male = m;
  }

  public boolean isMale() {
    return male;
  }

  public String toString() {
    return "Person [firstname=" + firstname + ", lastname=" + lastname + ", age=" + age + ", male=" + male + "]";
  }
}

2. Define a map of persons by id
// stores person under id
BTreeMap<Integer, Person> primary = DBMaker.newTempTreeMap();
primary.put(111, new Person("bIs9r", "NWmqoxFf", 92, true));
primary.put(222, new Person("4KXp8", "QrPsabf1", 31, false));
primary.put(333, new Person("eJLIo", "SJwJidWk", 6, true));
primary.put(444, new Person("LGW58", "vteM4khp", 42, false));
primary.put(555, new Person("tIM8R", "Rzq75ONh", 57, false));
primary.put(666, new Person("KqKRE", "BnpUV4dW", 26, true));

3. Define a gender-based index
// stores value hash from primary map
NavigableSet<Fun.Tuple2<Boolean, Integer>> genderIndex = new TreeSet<Fun.Tuple2<Boolean, Integer>>();

//1. gender-based index: bind secondary to primary so it contains secondary key
Bind.secondaryKey(primary, genderIndex, new Fun.Function2<Boolean, Integer, Person>() {
  public Boolean run(Integer key, Person value) {
    return Boolean.valueOf(value.isMale());
  }
});
4. Use the gender-index to read all male persons
Iterable<Integer> ids = Fun.filter(genderIndex, true);
for(Integer id: ids) {
  // e.g. print each matching person
  System.out.println(primary.get(id));
}

MapDB offers multiple ways to define indexes on a given collection. It can also be extended to define specific kinds of indexes. The following is an example of implementing a bitmap index in MapDB:
public static <K, V, K2> void secondaryKey(MapWithModificationListener<K, V> map, final Map<K2, Set<K>> secondary,
      final Fun.Function2<K2, K, V> fun) {
  // fill if empty
  if (secondary.isEmpty()) {
    for (Map.Entry<K, V> e : map.entrySet()) {
      K2 k2 = fun.run(e.getKey(), e.getValue());
      Set<K> set = secondary.get(k2);
      if (set == null) {
        set = new TreeSet<K>();
        secondary.put(k2, set);
      }
      set.add(e.getKey());
    }
  }
  // hook listener
  map.modificationListenerAdd(new MapListener<K, V>() {
    public void update(K key, V oldVal, V newVal) {
      if (newVal == null) {
        // removal
        secondary.get(fun.run(key, oldVal)).remove(key);
      } else if (oldVal == null) {
        // insert
        K2 key2 = fun.run(key, newVal);
        Set<K> set = secondary.get(key2);
        if (set == null) {
          set = new TreeSet<K>();
          secondary.put(key2, set);
        }
        set.add(key);
      } else {
        // update, must remove old key and insert new
        K2 oldKey = fun.run(key, oldVal);
        K2 newKey = fun.run(key, newVal);
        if (oldKey == newKey || oldKey.equals(newKey))
          return;
        Set<K> set1 = secondary.get(oldKey);
        if (set1 != null) {
          set1.remove(key);
        }
        Set<K> set2 = secondary.get(newKey);
        if (set2 == null) {
          set2 = new TreeSet<K>();
          secondary.put(newKey, set2);
        }
        set2.add(key);
      }
    }
  });
}
This new index can be used as follows:
final Map<Boolean, Set<Integer>> bitmapIndex = new HashMap<Boolean, Set<Integer>>();
secondaryKey(primary, bitmapIndex, fun);

Continue here

Saturday, 3 May 2014

Exploiting Big RAMs

These are notes from a talk given by Neil Ferguson about how to take advantage of very large amounts of memory to improve the performance of server-side applications.

With the increase in the amount of data managed by any enterprise or web application, there is a continuous need to store more and more data while providing real-time access to it. The performance of such applications can be improved by serving data directly from memory and by efficiently using the huge amounts of memory available, which may reach many terabytes in the near future.

In fact, memory prices are continuously decreasing while capacity increases, to the point where terabytes of RAM will be available for servers in the near future. The cost of 1MB of RAM was about $0.01 in Jan 2009 while it was $0.005 in 2013 (source: Memory Prices (1957-2013)). In fact, we could buy a workstation with 512GB of RAM for $2299, and new Intel processors (e.g. Xeon) allow up to 144GB of RAM, and more (around terabytes) for new-generation processors dedicated to server-class machines. However, it is still not practical to do anything with such an amount of RAM. Why?

Garbage Collection Limitations
In any garbage-collected environment (like JVMs), if the object allocation rate overtakes the rate at which the GC collects these objects, then long GC pauses (time during which the JVM stops the application just to run the garbage collector) may become very frequent. One way to avoid this problem is to leave plenty of free space in the heap. The thing is, leaving a third of 3GB free is not really a big deal, whereas leaving a third of 300GB free is, even though the ratio between free space and live data is the same.
The bad news is that even with a large free space there may be situations where GC pauses are too long, typically because of memory defragmentation.
You can improve application performance with -XX:+ExplicitGCInvokesConcurrent as a workaround to avoid long pauses when System.gc() or Runtime.getRuntime().gc() is explicitly called (e.g. during Direct ByteBuffer allocation).

Off-Heap storage
To overcome some of these limitations in JVMs or in Garbage-collected environment, allocation of memory off-heap can be a solution. This can be done in different ways:

1. Direct ByteBuffers
The NIO API allows the allocation of off-heap memory (i.e. not part of the process heap and not subject to GC) for storing long-lived data via ByteBuffer.allocateDirect(int capacity). The capacity is limited to what was specified with the JVM option -XX:MaxDirectMemorySize.
Allocation through ByteBuffer has implications for the GC (long pauses) when the memory is not freed fast enough, which makes it unsuitable for short-lived objects, i.e. allocating and freeing a lot of memory frequently.
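As a small illustration of ByteBuffer.allocateDirect, the sketch below copies a string into a direct buffer and reads it back. The class and method names are invented for this example; only the java.nio calls are part of the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Stores a length-prefixed string in a direct (off-heap) buffer and reads it back.
class DirectBufferExample {

    static ByteBuffer store(String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        // memory allocated outside of the Java heap
        ByteBuffer buffer = ByteBuffer.allocateDirect(Integer.BYTES + data.length);
        buffer.putInt(data.length).put(data);
        buffer.flip(); // switch from writing to reading
        return buffer;
    }

    static String load(ByteBuffer buffer) {
        // duplicate() shares the memory but has its own position, so load() can be called repeatedly
        ByteBuffer view = buffer.duplicate();
        byte[] data = new byte[view.getInt()];
        view.get(data);
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer buffer = store("hello off-heap");
        System.out.println(buffer.isDirect() + " " + load(buffer)); // prints true hello off-heap
    }
}
```

Only the small ByteBuffer object lives on the heap; the bytes themselves are outside of it, which is why this suits long-lived data better than short-lived objects.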

2. sun.misc.Unsafe 
Direct ByteBuffer itself relies on sun.misc.Unsafe.allocateMemory to allocate a big block of memory off-heap and on sun.misc.Unsafe.freeMemory to explicitly free it.
Here is a very simple implementation of a wrapper class based on the Unsafe API for managing off-heap memory:
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// T is the type of the wrapped object; serialize() and deserialize() are helper methods not shown here
public class OffHeapObject<T> {
  // fields
  private static Unsafe UNSAFE;

  static {
    try {
      // get instance using reflection
      Field field = Unsafe.class.getDeclaredField("theUnsafe");
      // theUnsafe is a private field, so make it accessible first
      field.setAccessible(true);
      UNSAFE = (Unsafe) field.get(null);
    } catch (Exception e) {
      throw new IllegalStateException("Could not access theUnsafe instance field", e);
    }
  }

  private static final int INT_SIZE = 4;
  // base address for the allocated data
  private long address;

  // constructor
  public OffHeapObject(T heapObject) {
    // serialize data
    byte[] data = serialize(heapObject);
    // allocate off-heap memory
    address = UNSAFE.allocateMemory(INT_SIZE + data.length);
    // save the data size in first bytes
    UNSAFE.putInt(address, data.length);
    // Write data byte by byte to the allocated memory
    for (int i = 0; i < data.length; i++) {
      UNSAFE.putByte(address + INT_SIZE + i, data[i]);
    }
  }

  public T get() {
    int length = UNSAFE.getInt(address);
    // read data from the memory
    byte[] data = new byte[length];
    for (int i = 0; i < data.length; i++) {
      data[i] = UNSAFE.getByte(address + INT_SIZE + i);
    }
    // return the deserialized data
    return deserialize(data);
  }

  // free the allocated space to avoid memory leaks
  public void deallocate() {
    //TODO make sure to not call this more than once
    UNSAFE.freeMemory(address);
  }
}
The OffHeapObject can be used, for instance, to store the values of cached data, e.g. using Google Guava to store key-OffHeapObject pairs where the latter wraps data in off-heap memory. This way GC pauses can be considerably reduced, as these objects are just references and do not occupy big blocks of heap memory. Also, the process size may not grow indefinitely as fragmentation is reduced.

Note that this implementation of the OffHeapObject is very basic, and that using off-heap memory has a performance impact. In fact, everything needs to be serialized on writes to off-heap memory and deserialized on reads from it, and these operations add overhead and reduce throughput compared to on-heap storage.
Furthermore, not everything can be stored in off-heap memory: for instance, the OffHeapObject itself, which keeps a reference to a block of off-heap memory, is actually stored on the heap.
The performance of this implementation may be enhanced with techniques like data alignment.

Some existing caches based on off-heap storage

continue from 28:39
Big RAM: How Java Developers Can Fully Exploit Massive Amounts of RAM 

  • Understanding Java Garbage Collection presentation at JUG Victoria Nov/2013 - Azul Systems
  • Measuring GC pauses with jHiccup - Azul Systems
  • A good documentation of the Unsafe API can be found in this blog post.
  • How Garbage Collection works in Java - blog post.

Random resources related to Docker



Continuous Integration

Environment configuration



  • Atomic project - Deploy and Manage your Docker Containers 
  • GearD - The Intersection of PaaS, Docker and Project Atomic 
  • Classification of the ecosystem of startups based on Docker 
  • Slides from DockerFr Meetup on Docker ecosystem
  • OpenCore a Big Data (Hadoop) as a Service provider

API Client
work in progress

Sunday, 13 April 2014

Managing Docker images and containers

In addition to managing Docker resources (including containers, images, hosts) through the official CLI, there are plenty of solutions available in the community to manage Docker resources in a comprehensive way from a single web-based interface.


Once our containers are running, DockerUI can be used to manage the overall system. It's a simple web app with basic features for:
 - Checking the state of the images (running, stopped)
 - Removing images
 - Starting, stopping, killing and removing containers

DockerUI can be used with the following commands

1. Build the web app from the github repository and tag the built image
$docker build -t crosbymichael/dockerui

2. Launch the built container, make the web app available on port 9000 and connect it to the docker unix socket to remotely control docker
$docker run -p 9000:9000 -v /var/run/docker.sock:/docker.sock crosbymichael/dockerui -e /docker.sock

Then, in the browser, visit localhost:9000 to see the DockerUI interface.


Shipyard is a more advanced Docker management solution based on a client-server architecture where the agents (i.e. clients) collect information on Docker resources and report it to the Shipyard server. In addition to the features available in DockerUI, it provides:
 - Authentication
 - Building new images by uploading local Dockerfile or providing URLs to a remote location
 - In-browser terminal emulation for attaching to containers
 - Visualizing CPU and memory utilization of the running images
 - ...

1. To use Shipyard, issue the following command, which pulls the image from the Docker public index:
$docker run -i -t -v /var/run/docker.sock:/docker.sock shipyard/deploy setup

Now, we can register as admin to Shipyard on http://localhost:8000/

2. Install the latest release (e.g. v0.2.5) of the Shipyard agent on every host to collect the information on Docker resources:
$curl -L -o /usr/local/bin/shipyard-agent
$chmod +x /usr/local/bin/shipyard-agent

3. Run the agent and register to the main host where Shipyard is running
$/usr/local/bin/shipyard-agent -url http://localhost:8000 -register

4. On the Shipyard interface, authorize the agents already deployed to enable them.
5. Run the agent with the given key at registration:
$/usr/local/bin/shipyard-agent -url http://localhost:8000 -key agent_key

Troubleshooting, in case you get this message:
Error requesting images from Docker: Get
Then stop the Docker service and re-start it while enabling Remote API access for any IP address:
$sudo service docker stop
$docker -H tcp:// -H unix:///var/run/docker.sock -d &

happy dockering

Sunday, 6 April 2014

Automating Docker image builds with Dockerfiles

Hello Dockerfile
This is a continuation of a previous post on Docker, with the aim of using specific scripts called dockerfiles to automate the steps that we have been issuing to build docker images. When docker parses the script file, it sequentially executes the commands starting from a base image, creating a new image after each command.
The syntax of a dockerfile instruction is as simple as :
command argument1 argument2 ... 
command ["argument1", "argument2", ...]  only for the entry-point command !!

It's preferable to write the command in uppercase!

Dockerfile instructions
There are a dozen instructions that can be present in a dockerfile; a detailed list can be found in the official documentation. The most common ones are:
  • FROM: every dockerfile should start with this command, which specifies the name of the image to use as a working or base image;
  • RUN: runs a command in the current container and (automatically) commits the changes to a new image;
  • MAINTAINER: specifies information (name, email) on the person responsible for maintaining this script;
  • ENTRYPOINT: specifies which command should be executed first once the container is started;
  • USER: specifies the user account with which the commands inside the container have to be executed;
  • EXPOSE: specifies which port to expose for the running container;
  • ENV: sets environment variables;
  • ADD: copies files from the build context (it does not work if the dockerfile is read from stdin) into a physical directory in the image (e.g. copying a war file into the tomcat webapps folder).
Here you can find the official tutorial to experiment with these commands.

Parsing dockerfiles
Once you have finished editing the build script, issue docker build to parse the dockerfile and create a new image. There are different ways to use this command:
  • dockerfile is in current directory docker build .
  • from stdin docker build - < Dockerfile
  • from a github repository docker build docker will then clone the repo and parse the files in the repo directory.

Now let's take the instructions from the previous post and gather them into a dockerfile:
# Use ubuntu as a base image
FROM ubuntu

# update package repository
RUN echo "deb precise main universe" > /etc/apt/sources.list

RUN echo "deb precise-security main universe" >> /etc/apt/sources.list
RUN apt-get update

# install java, tomcat7
RUN apt-get install -y default-jdk
RUN apt-get install -y tomcat7

RUN mkdir /usr/share/tomcat7/logs/
RUN mkdir /usr/share/tomcat7/temp/

# set tomcat environment variables
ENV JAVA_HOME=/usr/lib/jvm/default-java
ENV JRE_HOME=/usr/lib/jvm/default-java/jre
ENV CATALINA_HOME=/usr/share/tomcat7/

# copy war files to the webapps/ folder
ADD path/to/war /usr/share/tomcat7/webapps/

# launch tomcat once the container started
#ENTRYPOINT service tomcat7 start
ENTRYPOINT /usr/share/tomcat7/bin/ run

# expose the tomcat port number
EXPOSE 8080

Save this script to Dockerfile, build it from the directory that contains it (ADD needs the build context, so reading the dockerfile from stdin would not work) and tag the image as tomcat7, then launch the container while publicly exposing the tomcat server port 8080, and finally check that the container is running
$docker build -t tomcat7 .
$docker run -p 8080 tomcat7
$docker ps

to be continued;

Monday, 31 March 2014

Build your own SaaS with Docker - Part I

Hello Docker
Docker enables sandboxing of applications and their dependencies in virtual containers so that they can run in isolation. It provides an easy-to-use API for automating deployment operations that looks very close to Git commands. More introductory information can be found on its Wikipedia page.

Docker installation on 64-bit Ubuntu (for other OSes, check the official documentation)
$sudo sh -c "curl | apt-key add -" 
$sudo sh -c "echo deb docker main > /etc/apt/sources.list.d/docker.list" 
$sudo apt-get update
$sudo apt-get install lxc-docker

Once docker is installed, run a shell from within a container as follows
$sudo docker run -i -t ubuntu /bin/bash

As it will not find the ubuntu image locally, docker will pull it from the registry. Once it is installed, you can issue:
  • #exit to leave the container
  • $sudo docker images to see all local images.
  • $sudo docker inspect image_name to see detailed information on an image.
  • $sudo docker ps to see the status of the container
  • $sudo docker stop CONTAINER_ID to stop a running container
  • $sudo docker logs CONTAINER_ID to see all the logs of a given container
  • $sudo docker commit CONTAINER_ID image_name to commit changes made to a container

Installing Tomcat within a container
Start a new container using the ubuntu base image:
$sudo docker run -i -t ubuntu /bin/bash

Update the image's system packages
#apt-get update

1. Install the Apache Tomcat application server:
#apt-get install -y tomcat7

Once installed the following directories are created (more details can be found here):
  • /etc/tomcat7 for configuration
  • /usr/share/tomcat7 for runtime, called $CATALINA_HOME
  • /usr/share/tomcat7-root for webapps
2. Install the Java JDK
#apt-get install -y default-jdk

3. Configure environment variables
#pico ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/default-java
export CATALINA_HOME=~/path/to/tomcat
#. ~/.bashrc to make the changes effective

Now when typing #echo $CATALINA_HOME you should see the exact path set to tomcat7.

4. Start the Tomcat7 server
#service tomcat7 start

The start-up may fail with something like "cannot create directory '/usr/share/tomcat7/logs/catalina.out/'". To solve this, you may just have to create the logs directory:
#mkdir /usr/share/tomcat7/logs

To check whether Tomcat is running, issue
#ps -ef | grep tomcat
#service tomcat7 status

then check in your browser http://container_ip_address:8080/
to get the IP address of the container issue

5. Shutdown  Tomcat7
#service tomcat7 stop

Save the image to
The changes we made to the base image created a new one; we should commit them so as not to lose them.

1. Login to
$sudo docker login
Username: your_user_name
Password: your_password
Email: your_email
Login Succeeded

If you don't have an account, sign up here.

2. Commit changes to your repository

3. Push changes to this repository
$sudo docker push USERNAME/REPO_NAME

4. Start a new container using the image committed to your repository as the base image
$sudo docker run -i -t USERNAME/REPO_NAME /bin/bash

To run Tomcat in the container
$sudo docker run -i -t USERNAME/REPO_NAME $CATALINA_HOME/bin/
$sudo docker run -i -t USERNAME/REPO_NAME service tomcat7 start

To clean up old containers
$sudo docker ps -a -q | xargs sudo docker rm
$sudo docker ps -a | awk '{print $1}' | xargs sudo docker rm

To clean up old and untagged images
$sudo docker images | grep "^<none>" | awk '{print $3}' | xargs sudo docker rmi -f

If you are confused by docker terminology (e.g. container, image, etc.), check this official documentation.
General-purpose instructions for installing Tomcat7 on an Ubuntu machine can be found here.