Big Data Watchdog Homepage


To account for the resources that an application or process uses during its execution, resource usage is plotted as time series. For each resource, several "metrics" are defined to capture the different ways in which it is used (e.g., CPU time is split into user and kernel execution modes, and disk bandwidth into reads and writes).

Furthermore, because this monitoring focuses on processes and is abstracted from the underlying instances (e.g., bare-metal hosts, virtual machines or software containers), operations such as filtering and aggregation make it easier to study how applications and frameworks use the infrastructure resources as a whole.
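As a rough illustration of the filtering and aggregation idea, the sketch below sums one metric over all hosts for each timestamp, optionally keeping only the samples of a given process. The sample layout (the keys "timestamp", "host", "process", "metric" and "value"), the metric name "disk.write" and the function name aggregate are assumptions made for this example only. They are not the tool's actual data model or API.

```python
from collections import defaultdict


def aggregate(samples, metric, process=None):
    """Sum one metric across all hosts per timestamp.

    `samples` is a list of dicts in a hypothetical layout (assumed here,
    not taken from the tool). If `process` is given, only samples that
    belong to that process name are kept (the filtering step).
    """
    totals = defaultdict(float)
    for sample in samples:
        if sample["metric"] != metric:
            continue  # keep only the requested metric
        if process is not None and sample["process"] != process:
            continue  # filter by process, regardless of the host it runs on
        totals[sample["timestamp"]] += sample["value"]  # aggregate over hosts
    return sorted(totals.items())


# Illustrative values only (MB/s), not measurements.
samples = [
    {"timestamp": 0, "host": "node1", "process": "DataNode", "metric": "disk.write", "value": 40.0},
    {"timestamp": 0, "host": "node2", "process": "DataNode", "metric": "disk.write", "value": 35.0},
    {"timestamp": 0, "host": "node1", "process": "NodeManager", "metric": "disk.write", "value": 5.0},
    {"timestamp": 5, "host": "node1", "process": "DataNode", "metric": "disk.write", "value": 50.0},
    {"timestamp": 5, "host": "node2", "process": "DataNode", "metric": "disk.read", "value": 20.0},
]

for timestamp, total in aggregate(samples, "disk.write", process="DataNode"):
    print(timestamp, total)
```

Because the samples are keyed by process rather than by host, the same call works whether the processes run on bare-metal hosts, virtual machines or containers.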

Overall, this process-centered monitoring makes it possible to quickly spot bottlenecks or to gauge how heavily frameworks and applications use resources over their execution time. This information can in turn guide corrective measures, such as rescaling instances or resources, or adjusting the configuration of the environment or the application.

The next plots show some examples of time series that illustrate aggregations of resources across a Hadoop cluster during the execution of a Spark Terasort workload.

Aggregated DataNode disk bandwidth example

Aggregated disk bandwidth for 6 DataNodes in a Hadoop cluster

Aggregated DataNode net bandwidth example

Aggregated network bandwidth for 6 DataNodes in a Hadoop cluster


For profiling, flame graphs are used. They present profiling information in a form that makes it easy to spot code hot spots or, when a JVM is profiled, to see the amount of time spent in any class, including JVM-internal ones such as the garbage collector. Moreover, because a system profiler (perf) is used for this type of profiling, minimal configuration is required and, more importantly, no code instrumentation or agent attachment is needed.
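To make the idea behind a flame graph more concrete, the sketch below shows one common way to summarise sampled call stacks: identical stacks are folded into a single line with a count, and the share of samples that contain a given frame approximates the time spent in it. This is only an illustration under stated assumptions. The frame names are invented, the function names fold_stacks and frame_share are hypothetical, and the code does not reproduce how this tool processes the output of perf.

```python
from collections import Counter


def fold_stacks(stacks):
    """Collapse identical call stacks into 'root;...;leaf' -> sample count."""
    return Counter(";".join(frames) for frames in stacks)


def frame_share(stacks, frame):
    """Fraction of samples in which `frame` appears anywhere in the stack."""
    if not stacks:
        return 0.0
    hits = sum(1 for frames in stacks if frame in frames)
    return hits / len(stacks)


# Hypothetical samples, each listed from the root frame to the leaf frame.
stacks = [
    ["java", "Executor.run", "sort"],
    ["java", "Executor.run", "sort"],
    ["java", "Executor.run", "shuffleWrite"],
    ["java", "GCTaskThread::run"],
]

# Each folded line corresponds to one tower in a flame graph; the count
# determines how wide that tower is drawn.
for stack, count in sorted(fold_stacks(stacks).items()):
    print(stack, count)

print(frame_share(stacks, "GCTaskThread::run"))
```

In a rendering of these samples, the garbage collector frame would take up a quarter of the graph's width, which is the kind of hot spot a flame graph is meant to make visible at a glance.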

Overall, this type of profiling makes it possible to analyse how the code of any already deployed application is being executed, at any time, while it is running and with low overhead. It can also be performed in real time to get a quick picture of where a possible code bottleneck is occurring, even when resource usage appears to be optimal.

The next example shows a flame graph for an instance with 4 JVMs running a Spark Terasort workload. Although it is embedded as an image, it is interactive: each profiled stack can be analysed and zoomed independently. Try it here!

example of flame graph