BDWatchdog has now been published! If you like or if you are interested in our work, you can cite us with:
Jonatan Enes, Roberto R. Expósito, Juan Touriño. BDWatchdog: Real-time monitoring and profiling of Big Data applications and frameworks. Future Generation Computer Systems, vol. 87, pages 420-437, October 2018. Online Preprint
BDWatchdog is a framework to assist in the in-depth and real-time analysis of the execution of Big Data frameworks and applications. Two approaches are used in order to get an accurate picture of what an application is doing with the resources it has available 1) per-process resource monitoring and 2) mixed system and JVM profiling using flame graphs.
With monitoring and profiling, used individually or combined, it is also possible to easily identify both resources and code bottlenecks as well as account for resource utilization or spot certain patterns that frameworks or applications may have.
For the per-process monitoring, resource usage metrics for CPU, memory, disk and network are retrieved, processed and pushed to a time series database in a continuous stream. Thanks to this stream-based approach, it is possible, in real-time, to see and use this data for several purposes like visualization, reporting or to take automated actions to tackle several situations like resource bottlenecks, loss of efficiency or even resource limitation.
Moreover, the per-process approach allows to deploy this monitoring solution on a broader type of virtualized infrastructures, from the already common bare-metal hosts or virtual machines (Cloud instances) to the newer type of containers.
Finally, thanks to the use of time series as the means to store the data, it is possible to perform several operations through filtering and aggregation, which may overall provide richer reports (e.g., Show the aggregated CPU usage of a command or process across a cluster, show the average disk bandwidth of several hosts).
For the real-time profiling of Java-based applications, the perf utils are used to continuously get system-wide CPU stacks, using a configurable frequency. The stacks are then processed, marked to differentiate them across applications and pushed to a document-based database. Once in the database, the stacks can be retrieved and processed with several tools (console script, web interface) to create the interactive flame graphs.
These flame graphs can be created in real time with data collected between two time points on an application's execution. Once created, information regarding the percentage of spent time for each class can be interactively analyzed in a recursive way by following the stack calls.