BDWatchdog currently consists of two main subprojects that provide monitoring and profiling capabilities, plus a simple, fully client-side web user interface that can plot and analyze the results of both at the same time: metric time series and flame graphs. Each of these subprojects is developed independently and has its own GitHub page.
Although each subproject is described below, users are encouraged to visit its GitHub page for up-to-date information and a more detailed quickstart guide.
Monitoring (GitHub repo): This subproject provides a stream of data, specifically time series for every process running in an instance (process-focused monitoring) and for the resources of the instance itself (system-wide monitoring). It builds these from the raw output of programs that report such information, such as atop, nethogs or turbostat. To do so, the output of these generator programs has to be processed in a series of stages that together form a pipeline architecture.
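The staged processing described above can be sketched with chained Python generators. This is only an illustration of the pipeline idea, not the project's actual code: the raw line format, the metric name `proc.cpu.user` and the functions `parse`, `tag` and `run_pipeline` are all assumptions made for the example, and real atop, nethogs or turbostat output looks different.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical raw format: "<timestamp> <pid> <command> <cpu_percent>".
# This is NOT the real output of atop, nethogs or turbostat.
RAW_LINES = [
    "1700000000 101 java 12.5",
    "malformed line",
    "1700000000 102 python 3.0",
]


def parse(lines: Iterable[str]) -> Iterator[Dict]:
    """Stage 1: turn raw text lines into records, skipping malformed ones."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        timestamp, pid, command, cpu = fields
        try:
            yield {
                "timestamp": int(timestamp),
                "pid": pid,
                "command": command,
                "cpu": float(cpu),
            }
        except ValueError:
            # Non-numeric timestamp or value: drop the line.
            continue


def tag(records: Iterable[Dict], host: str) -> Iterator[Dict]:
    """Stage 2: shape each record as a tagged time series data point."""
    for record in records:
        yield {
            "metric": "proc.cpu.user",  # assumed metric name
            "timestamp": record["timestamp"],
            "value": record["cpu"],
            "tags": {
                "host": host,
                "pid": record["pid"],
                "command": record["command"],
            },
        }


def run_pipeline(lines: Iterable[str], host: str) -> List[Dict]:
    """Chain the stages; a final stage would send the result to a database."""
    return list(tag(parse(lines), host))


documents = run_pipeline(RAW_LINES, host="node0")
```

Because each stage is a generator, lines flow through one at a time, which suits a continuous stream coming from a long-running generator program.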
In the final stage of this pipeline, the data is sent to a time series database to be persisted for later analysis. This kind of database is specifically designed to store large amounts of unstructured numerical data from a very large number of sources while supporting concurrent access. Moreover, when backed by a scalable storage system such as HDFS, a distributed and redundant filesystem, these databases can scale horizontally if needed while remaining efficient. Currently, OpenTSDB is the only database supported by this framework.
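As a rough sketch of what this final stage involves, the snippet below builds a request for OpenTSDB's HTTP `/api/put` endpoint, which accepts JSON data points with `metric`, `timestamp`, `value` and `tags` fields. The base URL, the metric name and the helper functions `build_datapoint` and `build_put_request` are assumptions for illustration; how BDWatchdog itself ships data is documented in the Monitoring repository.

```python
import json
import urllib.request
from typing import Dict, List


def build_datapoint(metric: str, timestamp: int, value: float,
                    tags: Dict[str, str]) -> Dict:
    """Build one data point in the JSON shape accepted by /api/put."""
    if not tags:
        # OpenTSDB requires every data point to carry at least one tag.
        raise ValueError("at least one tag is required")
    return {
        "metric": metric,
        "timestamp": timestamp,
        "value": value,
        "tags": dict(tags),
    }


def build_put_request(base_url: str,
                      datapoints: List[Dict]) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying the data points."""
    body = json.dumps(datapoints).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/put",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


point = build_datapoint(
    metric="proc.cpu.user",  # assumed metric name
    timestamp=1700000000,
    value=12.5,
    tags={"host": "node0", "command": "java", "pid": "101"},
)
# Assumed address of a local OpenTSDB instance.
request = build_put_request("http://localhost:4242", [point])

# Sending needs a reachable OpenTSDB instance, so it is left commented out:
# urllib.request.urlopen(request)
```

Batching several data points into one request, as the list argument allows, keeps the number of HTTP round trips low when many processes are monitored.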
Finally, it is worth noting that the metrics that make up the time series are not stored using an SQL paradigm but in a NoSQL, column-oriented store (OpenTSDB uses HBase underneath). This, combined with a tagging system applied to the metrics (e.g., to specify the host a process is running on, its command name and/or its PID), allows for analysis that is easy to use and understand yet extremely flexible. For example, by filtering on a tag so that only the processes named Spark remain and then applying the average aggregator, it is straightforward to get the average usage of resources like CPU, memory, disk or network of all the Spark processes across all the instances. For examples, see the Homepage.
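The tag-filter-then-average idea can be illustrated in memory, without a database. The sketch below mimics what an average aggregator does: it keeps only the data points whose tags match a filter and averages the surviving values per timestamp across hosts. The sample data, the metric name and the function `average_by_timestamp` are made up for this example; OpenTSDB performs the equivalent work on the server side.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Made-up data points for two hosts; values are CPU percentages.
POINTS = [
    {"metric": "proc.cpu.user", "timestamp": 0, "value": 40.0,
     "tags": {"host": "node0", "command": "Spark"}},
    {"metric": "proc.cpu.user", "timestamp": 0, "value": 60.0,
     "tags": {"host": "node1", "command": "Spark"}},
    {"metric": "proc.cpu.user", "timestamp": 0, "value": 5.0,
     "tags": {"host": "node0", "command": "sshd"}},
    {"metric": "proc.cpu.user", "timestamp": 10, "value": 30.0,
     "tags": {"host": "node0", "command": "Spark"}},
    {"metric": "proc.cpu.user", "timestamp": 10, "value": 50.0,
     "tags": {"host": "node1", "command": "Spark"}},
]


def average_by_timestamp(points: List[Dict], metric: str,
                         tag_filter: Dict[str, str]) -> Dict[int, float]:
    """Average the values of all matching series at each timestamp."""
    buckets = defaultdict(list)
    for point in points:
        if point["metric"] != metric:
            continue
        # Keep the point only if every filtered tag has the wanted value.
        if all(point["tags"].get(k) == v for k, v in tag_filter.items()):
            buckets[point["timestamp"]].append(point["value"])
    return {ts: mean(values) for ts, values in sorted(buckets.items())}


spark_cpu = average_by_timestamp(POINTS, "proc.cpu.user", {"command": "Spark"})
# spark_cpu == {0: 50.0, 10: 40.0}
```

Adding `"host": "node0"` to the filter would narrow the same query to a single instance, which is the kind of flexibility the tagging system provides.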
Profiling (GitHub repo):
This subproject provides the profiling feature by periodically sampling the CPU stack and dumping the samples into a structured format and a database.
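To give an idea of how sampled stacks become flame graph input, the sketch below collapses stack samples into the "folded" text format commonly consumed by flame graph tools: frames joined by semicolons, followed by a sample count. The sample data and the functions `fold_stacks` and `to_folded_lines` are assumptions for illustration; the structured format actually used by this subproject is described in its repository.

```python
from collections import Counter
from typing import Iterable, List, Sequence

# Made-up stack samples, each listed from the outermost frame to the innermost.
SAMPLES = [
    ["main", "run_job", "map_task"],
    ["main", "run_job", "map_task"],
    ["main", "run_job", "shuffle"],
    ["main", "idle"],
]


def fold_stacks(samples: Iterable[Sequence[str]]) -> Counter:
    """Count how many times each distinct stack was sampled."""
    return Counter(";".join(frames) for frames in samples)


def to_folded_lines(counts: Counter) -> List[str]:
    """Render the counts as '<frame;frame;...> <count>' lines, sorted by stack."""
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


folded = to_folded_lines(fold_stacks(SAMPLES))
# folded == ["main;idle 1", "main;run_job;map_task 2", "main;run_job;shuffle 1"]
```

The count attached to each stack determines the width of the corresponding box in the flame graph, so stacks that are sampled more often appear wider.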
Web User Interface (GitHub repo):
This subproject provides a very simple web user interface that displays both time series plots for monitoring and flame graphs for profiling.