Big Data Watchdog Homepage

We are proud to announce that a full working public demo of the framework is now available!

You can try it here or by clicking the "Try it" section of the navigation menu on the left.

BDWatchdog has now been published! If you like our work or find it interesting, you can cite us with:

Jonatan Enes, Roberto R. Expósito, Juan Touriño. BDWatchdog: Real-time monitoring and profiling of Big Data applications and frameworks. Future Generation Computer Systems, vol. 87, pages 56-65, February 2018. Online Preprint

BDWatchdog is a framework for in-depth, real-time analysis of the execution of Big Data frameworks and applications. It uses two approaches to get an accurate picture of what an application is doing with the resources available to it (e.g., CPU, memory, disk and network): 1) per-process resource monitoring using time series and 2) mixed system and JVM profiling using flame graphs. This puts the focus on applications rather than on hosts, so BDWatchdog can answer richer queries that, for example, show how much CPU a Spark job is using across a cluster.
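As a rough illustration of the application-centric query described above, the following sketch sums per-process CPU samples by application across hosts. The sample layout (host, application tag, timestamp, CPU percentage) and the function name `cpu_per_app` are assumptions made for this example; they are not BDWatchdog's actual data model or API.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    """One per-process CPU reading (hypothetical layout for this sketch)."""
    host: str
    app: str            # application tag, e.g. "spark-job-1"
    timestamp: int      # seconds since the start of the run
    cpu_percent: float  # 100.0 means one fully used core


def cpu_per_app(samples: Iterable[Sample], app: str) -> Dict[int, float]:
    """Sum the CPU usage of every process of `app`, across all hosts, per timestamp."""
    totals: Dict[int, float] = defaultdict(float)
    for s in samples:
        if s.app == app:
            totals[s.timestamp] += s.cpu_percent
    return dict(sorted(totals.items()))


# Invented readings from two nodes; values are for illustration only.
samples = [
    Sample("node1", "spark-job-1", 0, 350.0),
    Sample("node2", "spark-job-1", 0, 400.0),
    Sample("node1", "hdfs-datanode", 0, 50.0),
    Sample("node1", "spark-job-1", 5, 800.0),
    Sample("node2", "spark-job-1", 5, 1200.0),
]

series = cpu_per_app(samples, "spark-job-1")
for ts, cpu in series.items():
    # Dividing by 100 converts a percentage into an equivalent number of cores.
    print(f"t={ts}s cpu={cpu:.0f}% (~{cpu / 100:.1f} cores)")
```

Because the samples are keyed by application rather than by host, the same function answers the question for any tagged application without knowing in advance which nodes it ran on.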

Monitoring and profiling, used individually or combined, also make it easy to identify both resource and code bottlenecks, to account for resource utilization, and to spot patterns that frameworks or applications may exhibit.
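To make the idea of spotting code bottlenecks more concrete, here is a minimal sketch of how sampled call stacks can be collapsed into the semicolon-separated "folded" lines commonly used as flame graph input, and how the share of samples containing a given frame can be computed. The frame names and helper functions are hypothetical and do not represent BDWatchdog's profiling pipeline.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def fold_stacks(stacks: Iterable[Sequence[str]]) -> List[str]:
    """Collapse sampled stacks (root frame first) into 'frame;frame;... count' lines."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{folded} {n}" for folded, n in sorted(counts.items())]


def frame_share(stacks: Iterable[Sequence[str]], frame: str) -> float:
    """Fraction of samples whose stack contains `frame` (its width in a flame graph)."""
    stacks = list(stacks)
    hits = sum(1 for stack in stacks if frame in stack)
    return hits / len(stacks) if stacks else 0.0


# Hypothetical samples; the frame names are invented for illustration.
sampled = [
    ("main", "runJob", "readCsv"),
    ("main", "runJob", "readCsv"),
    ("main", "runJob", "readCsv"),
    ("main", "runJob", "joinRows"),
]

folded = fold_stacks(sampled)
print("\n".join(folded))
print(f"readCsv share: {frame_share(sampled, 'readCsv'):.0%}")
```

A frame that appears in a large share of the samples shows up as a wide bar in the flame graph, which is what makes a dominant method easy to spot.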

This framework has been tested on both Docker containers and virtual machines, although we encourage the serverless paradigm and thus focus strongly on containers as a lightweight form of virtualization.

Disk time series for several Hadoop workloads.
CPU time series for several Hadoop workloads.
Disk plot of Kmeans as executed with Spark.
CPU plot of Kmeans as executed with Spark; peaks of about 2000% (20 CPUs) can be observed.
Disk plot of the average disk bandwidth in a node for a Terasort workload
Network plot showing the aggregated bandwidth of outgoing and incoming traffic for a Terasort workload across multiple nodes
Flame graph for a SQL join task in Spark, showing that the CSV readers are the most expensive methods
Flame graph for a Spark job that combines JVM and application stacks, showing a bottleneck in JVM administration