BDWatchdog currently consists of two main subprojects that provide Monitoring and Profiling capabilities, plus a simple, fully client-side web user interface that can be used to simultaneously plot and analyze the results of such projects, namely metric time series and flame graphs. Each of these subprojects is developed independently and has its own GitHub page.
Although each of the subprojects is further explained next, the user is encouraged to visit their GitHub pages for up-to-date information and better quickstart guidance.
Monitoring (Time series)(GitHub repo): This subproject or module provides a stream of data, specifically time series for every process running in an instance (process-focused monitoring) and for the resources of the instance itself (system-wide monitoring), using the raw output of programs capable of providing such information, such as atop, nethogs or turbostat. To do so, the output of these generator programs has to be processed and treated in a series of stages that together form a pipeline architecture.
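The stage-based processing can be sketched as follows; the line format and metric names here are hypothetical stand-ins, as the real pipeline has a dedicated parser for each tool's actual output:

```python
# Minimal sketch of two pipeline stages: parse a simplified, space-separated
# line of per-process usage data into a structured record, then filter.
# The field layout 'timestamp pid command cpu% mem_kb' is hypothetical.
def parse_process_line(line):
    """Turn 'timestamp pid command cpu% mem_kb' into a dict."""
    ts, pid, command, cpu, mem = line.split()
    return {
        "timestamp": int(ts),
        "pid": int(pid),
        "command": command,
        "proc.cpu.user": float(cpu),
        "proc.mem.resident": float(mem),
    }

def pipeline(lines):
    """Chain the stages: parse, then drop idle processes (0% CPU)."""
    parsed = (parse_process_line(line) for line in lines)
    return [record for record in parsed if record["proc.cpu.user"] > 0]

records = pipeline([
    "1500000000 4242 java 75.5 1048576",
    "1500000000 99 idle 0.0 1024",
])
# Only the active 'java' process survives the filtering stage.
```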
In the final stage of this pipeline, the data is sent to a time series database to be persisted for later analysis. This kind of database is specifically designed to store large amounts of unstructured numerical data from a very large number of sources while supporting high write concurrency. Moreover, by relying on scalable underlying storage systems, such as a distributed and redundant filesystem like HDFS, these databases can easily scale horizontally if needed and remain efficient overall. As of now, only OpenTSDB is supported in this framework.
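As a sketch of this final stage, datapoints can be formatted as JSON for OpenTSDB's /api/put endpoint; the metric and tag names below are illustrative, not the module's actual schema:

```python
import json

# Sketch: format one monitored value as a JSON datapoint for OpenTSDB's
# /api/put endpoint. Metric and tag names are hypothetical examples.
def to_opentsdb_points(timestamp, host, pid, command, cpu_value):
    return [{
        "metric": "proc.cpu.user",   # hypothetical metric name
        "timestamp": timestamp,      # UNIX timestamp in seconds
        "value": cpu_value,
        "tags": {"host": host, "pid": str(pid), "command": command},
    }]

payload = json.dumps(to_opentsdb_points(1500000000, "node1", 4242, "java", 75.5))
# This payload would then be POSTed to the database's /api/put endpoint.
```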
Finally, it is worth noting that the metrics that make up the time series are not stored using an SQL paradigm but rather a more object-oriented architecture (OpenTSDB uses HBase underneath). This, combined with a tagging system applied to the metrics (e.g., to specify the host a process is running on, its command name and/or its PID), allows for analysis that is very easy to use and understand, yet extremely flexible. As an example of this flexibility, by using a tag that leaves unfiltered only the processes whose command name is Spark, and then applying the average aggregator, it is straightforward to get the average usage of resources like CPU, memory, disk or network of all the Spark processes across all the instances. For examples see the Homepage.
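The Spark example above maps directly onto an OpenTSDB metric query expression, combining an aggregator with a tag filter; the metric name is a placeholder:

```python
# Sketch: build an OpenTSDB metric query expression of the form
# aggregator:metric{tag=value}. With the 'command=Spark' tag filter,
# only datapoints from processes named Spark are kept, and 'avg'
# averages them across all hosts and PIDs.
def build_query_expr(aggregator, metric, tags):
    tag_filter = ",".join("%s=%s" % (k, v) for k, v in sorted(tags.items()))
    return "%s:%s{%s}" % (aggregator, metric, tag_filter)

expr = build_query_expr("avg", "proc.cpu.user", {"command": "Spark"})
# expr is usable as the 'm' parameter of a request to OpenTSDB's /api/query
```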
Profiling (Flame Graphs)(GitHub repo):
This module or subproject takes care of the profiling feature by periodically polling the CPU stack and dumping such data into a structured format and a database.
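One common structured representation for such stack samples is the "folded" format used to build flame graphs (stack frames joined with ';', plus a sample count). A minimal sketch of aggregating polled stacks into that format, with illustrative sample data:

```python
from collections import Counter

# Sketch: aggregate periodically sampled CPU stacks into the 'folded'
# format commonly used to render flame graphs. Each output line is the
# stack's frames joined by ';' followed by how often it was sampled.
def fold_stacks(samples):
    counts = Counter(";".join(stack) for stack in samples)
    return ["%s %d" % (folded, n) for folded, n in sorted(counts.items())]

samples = [
    ["main", "read_input", "parse"],
    ["main", "read_input", "parse"],
    ["main", "compute"],
]
folded = fold_stacks(samples)
# 'main;read_input;parse 2' records that this stack was sampled twice.
```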
Web User Interface (GitHub repo):
You can also try a live version of it here.
Applications Timestamps Snitch(GitHub repo):
With this subproject it is possible to automatically track the begin and end timestamps for both experiments and applications, that is, the moments when an experiment or an application starts and ends, recorded as UNIX timestamps. As conceived, an experiment is thought of as a set of applications that can be grouped according to some rule (e.g., a set of Hadoop applications and a set of Spark jobs). An application is considered a workload or job (e.g., a Hadoop WordCount, a Spark TeraSort...).
The timestamping control is fully automatic and the generated data is pushed to a MongoDB database for later querying. Such data can later be used with other tools, such as the CLI commands to retrieve time series or flame graphs, or visualized in a more usable and automatic way through the Web User Interface.
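A sketch of the kind of document such a snitch could push to MongoDB; the field names are illustrative assumptions, not the subproject's actual schema:

```python
# Sketch: build a MongoDB-ready document describing one application run
# inside an experiment. Field names are hypothetical; timestamps are
# plain UNIX seconds, as described above.
def make_app_document(experiment, app_name, start_ts, end_ts):
    return {
        "experiment_id": experiment,
        "app_id": app_name,
        "start_time": start_ts,   # UNIX timestamp when the app started
        "end_time": end_ts,       # UNIX timestamp when the app ended
    }

doc = make_app_document("hadoop_batch_1", "WordCount", 1500000000, 1500000600)
# With pymongo, such a document could be stored via insert_one() and later
# queried by time range to fetch the matching time series or flame graphs.
```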