BDWatchdog has now been published! If you like our work or are interested in it, you can cite it as:

Jonatan Enes, Roberto R. Expósito, Juan Touriño. BDWatchdog: Real-time monitoring and profiling of Big Data applications and frameworks. Future Generation Computer Systems, vol. 87, pages 420-437, October 2018. Online Preprint

BDWatchdog is a framework for in-depth, real-time analysis of the execution of Big Data frameworks and applications. Two approaches are used to get an accurate picture of what an application is doing with the resources it has available: 1) per-process resource monitoring and 2) mixed system and JVM profiling using flame graphs.

With monitoring and profiling, used individually or combined, it is possible to identify both resource and code bottlenecks, to account for resource utilization, and to spot usage patterns that frameworks or applications may exhibit.

We are proud to announce that a fully working public demo of the framework is now available!

You can try it here or by clicking on the "Try it" section of t

Per-process monitoring

For the per-process monitoring, resource usage metrics for CPU, memory, disk and network are retrieved, processed and pushed to a time series database in a continuous stream. Thanks to this stream-based approach, the data can be seen and used in real time for purposes such as visualization, reporting, or automated actions that react to situations like resource bottlenecks, loss of efficiency or resource limitation.
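As a rough illustration of what "processed and pushed" could look like, the sketch below turns one per-process reading into tagged time series points and serializes a batch of them. The `ProcessSample` fields, the metric names and the JSON layout are all assumptions made for this example; they are not BDWatchdog's actual schema or API.

```python
import json
from dataclasses import dataclass


@dataclass
class ProcessSample:
    """One per-process resource reading (all field names are hypothetical)."""
    timestamp: int      # seconds since the epoch
    host: str
    pid: int
    command: str
    cpu_percent: float
    mem_bytes: int


def to_datapoints(sample: ProcessSample) -> list[dict]:
    """Turn one sample into tagged time series points, one per metric."""
    tags = {"host": sample.host, "pid": str(sample.pid), "command": sample.command}
    return [
        {"metric": "proc.cpu.percent", "timestamp": sample.timestamp,
         "value": sample.cpu_percent, "tags": tags},
        {"metric": "proc.mem.bytes", "timestamp": sample.timestamp,
         "value": sample.mem_bytes, "tags": tags},
    ]


def serialize_batch(samples) -> str:
    """Flatten a batch of samples into one JSON payload ready to be pushed."""
    points = [p for s in samples for p in to_datapoints(s)]
    return json.dumps(points)
```

Tagging every point with the host, PID and command is what would later allow the stream to be filtered or grouped per process, per command or per host.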

Moreover, the per-process approach allows this monitoring solution to be deployed on a broader range of infrastructures, from the already common bare-metal hosts and virtual machines (Cloud instances) to the newer containers.

Finally, because the data is stored as time series, it can be filtered and aggregated to produce richer reports (e.g., the aggregated CPU usage of a command or process across a cluster, or the average disk bandwidth of several hosts).
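The two example reports mentioned above can be sketched with plain Python over in-memory points. This is only an illustration: a real time series database would run such filters and aggregations server-side, and the point layout used here (`metric`, `timestamp`, `value`, `tags`) as well as the metric names are assumptions for the example.

```python
from collections import defaultdict


def filter_points(points, metric, **tags):
    """Keep points of one metric whose tags match all the given values."""
    return [
        p for p in points
        if p["metric"] == metric
        and all(p["tags"].get(k) == v for k, v in tags.items())
    ]


def aggregate(points, how="sum"):
    """Combine points that share a timestamp into one value per timestamp."""
    buckets = defaultdict(list)
    for p in points:
        buckets[p["timestamp"]].append(p["value"])
    if how == "sum":
        return {ts: sum(vals) for ts, vals in sorted(buckets.items())}
    if how == "avg":
        return {ts: sum(vals) / len(vals) for ts, vals in sorted(buckets.items())}
    raise ValueError(f"unsupported aggregation: {how}")


# Hypothetical sample data: two hosts, two commands, two metrics.
points = [
    {"metric": "proc.cpu.percent", "timestamp": 10, "value": 50.0,
     "tags": {"host": "node1", "command": "java"}},
    {"metric": "proc.cpu.percent", "timestamp": 10, "value": 30.0,
     "tags": {"host": "node2", "command": "java"}},
    {"metric": "proc.cpu.percent", "timestamp": 10, "value": 5.0,
     "tags": {"host": "node1", "command": "python"}},
    {"metric": "host.disk.mbps", "timestamp": 10, "value": 100.0,
     "tags": {"host": "node1"}},
    {"metric": "host.disk.mbps", "timestamp": 10, "value": 300.0,
     "tags": {"host": "node2"}},
]

# Aggregated CPU usage of one command across the cluster.
cluster_cpu = aggregate(filter_points(points, "proc.cpu.percent", command="java"))
# Average disk bandwidth over all hosts.
avg_disk = aggregate(filter_points(points, "host.disk.mbps"), how="avg")
```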


Figure 1: Single instance CPU time series for a PageRank workload executed with Hadoop
Figure 2: Cluster-aggregated CPU time series for a TeraSort workload executed with Scala
Figure 3: Cluster-aggregated memory time series for a TeraSort workload executed with Scala
Figure 4: Cluster-aggregated disk time series for a TeraSort workload executed with Scala
Figure 5: Cluster-aggregated network time series for a TeraSort workload executed with Scala

Real-time profiling

For the real-time profiling of Java-based applications, the perf utilities are used to continuously sample system-wide CPU stacks at a configurable frequency. The stacks are then processed, tagged so that they can be told apart across applications, and pushed to a document-based database. Once in the database, the stacks can be retrieved and processed with several tools (console script, web interface) to create interactive flame graphs.
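The "processed and tagged" step can be pictured with the sketch below, which collapses sampled stacks into the semicolon-separated folded form commonly used as flame graph input and attaches an application name to each one. The input shape (`(pid, frames)` tuples), the `app_by_pid` mapping and the document fields are hypothetical choices for this example, not the framework's real pipeline.

```python
from collections import Counter


def fold_stacks(samples, app_by_pid):
    """Collapse sampled stacks into counts keyed by (application, folded stack).

    Each sample is (pid, frames), with frames ordered from root to leaf.
    """
    counts = Counter()
    for pid, frames in samples:
        # Samples from processes we cannot attribute are kept, but marked.
        app = app_by_pid.get(pid, "unknown")
        counts[(app, ";".join(frames))] += 1
    return counts


def to_documents(counts, timestamp):
    """Build one document per distinct stack, ready for a document store."""
    return [
        {"app": app, "timestamp": timestamp, "stack": stack, "count": n}
        for (app, stack), n in sorted(counts.items())
    ]
```

Storing a count per distinct stack, rather than every raw sample, keeps the documents small while preserving what a flame graph needs: how often each call path was observed.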

These flame graphs can be created in real time from the data collected between any two points in time of an application's execution. Once a graph is created, the percentage of time spent in each class can be explored interactively and recursively by following the call stacks.
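A minimal sketch of the idea, assuming each stored document has a `timestamp`, a semicolon-separated `stack` (root first) and a sample `count` (all hypothetical field names): stacks inside the chosen time window are merged into a call tree, and each node's share of the total samples is obtained by walking that tree recursively.

```python
def build_tree(documents, start, end):
    """Merge the stacks sampled in [start, end] into a nested call tree."""
    root = {"count": 0, "children": {}}
    for doc in documents:
        if not start <= doc["timestamp"] <= end:
            continue
        root["count"] += doc["count"]
        node = root
        for frame in doc["stack"].split(";"):
            node = node["children"].setdefault(frame, {"count": 0, "children": {}})
            node["count"] += doc["count"]
    return root


def percentages(node, total, prefix=""):
    """Yield (call path, share of total samples in percent) for every node."""
    for frame, child in sorted(node["children"].items()):
        path = f"{prefix};{frame}" if prefix else frame
        yield path, 100.0 * child["count"] / total
        yield from percentages(child, total, path)


# Hypothetical documents; the last one falls outside the window used below.
documents = [
    {"timestamp": 5, "stack": "main;run;read", "count": 3},
    {"timestamp": 6, "stack": "main;run;sort", "count": 1},
    {"timestamp": 50, "stack": "main;gc", "count": 10},
]
tree = build_tree(documents, start=0, end=10)
shares = dict(percentages(tree, tree["count"]))
```

The width of each box in a flame graph corresponds to such a share, which is why narrowing the time window changes the picture without any new profiling run.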


Figure 6: Example of a flame graph extracted during the execution of a Hadoop TeraSort
Figure 7: Flame graph for a SQL join task in Spark, showing that the most time-consuming methods are the CSV readers
Figure 8: Flame graph for a Spark job that combines JVM and application stacks, showing a bottleneck in JVM administration