..
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
..

******************************
Traffic Monitor Administration
******************************

.. _tm-golang:

Installing Traffic Monitor
==========================
The following are hard requirements for Traffic Monitor to operate:

* CentOS 7 or later
* Successful install of Traffic Ops (usually on a separate machine)
* Administrative access to Traffic Ops (usually on a separate machine)

These are the recommended hardware specifications for a production deployment of Traffic Monitor:

* 8 CPUs
* 16GB of RAM
* It is also recommended that you know the geographic coordinates and/or mailing address of the site where the Traffic Monitor machine lives for optimal performance

#. Enter the Traffic Monitor server into Traffic Portal

   .. note:: For legacy compatibility reasons, the 'Type' field of a new Traffic Monitor server must be 'RASCAL'.

#. Make sure the :abbr:`FQDN (Fully Qualified Domain Name)` of the Traffic Monitor is resolvable in DNS.
#. Install Traffic Monitor, either from source or by installing a :file:`traffic_monitor-{version string}.rpm` package generated by the instructions in :ref:`dev-building` with :manpage:`yum(8)` or :manpage:`rpm(8)`
#. Configure Traffic Monitor according to `Configuring Traffic Monitor`_
#. Start Traffic Monitor, usually by starting its :manpage:`systemd(1)` service
#. Verify Traffic Monitor is running by e.g. opening your preferred web browser to port 80 on the Traffic Monitor host.

.. _tm-configure:

Configuring Traffic Monitor
===========================

Configuration Files
-------------------
Traffic Monitor is configured via two JSON configuration files, :file:`traffic_ops.cfg` and :file:`traffic_monitor.cfg`, by default located in the ``conf`` directory in the install location.

traffic_ops.cfg
"""""""""""""""
:file:`traffic_ops.cfg` contains Traffic Ops connection information. Specify the URL, username, and password for the instance of Traffic Ops of which this Traffic Monitor is a member. However, this file *also* configures some settings relating to the Traffic Monitor API server.

:``cdnName``: The name of the CDN to which this Traffic Monitor belongs. Used to fetch configuration and to determine which :term:`cache servers` to monitor.
:``certFile``: The path to an SSL certificate file that corresponds to ``keyFile`` which will be used for Traffic Monitor's HTTPS API server.
:``httpListener``: Sets the address and port on which Traffic Monitor will listen for HTTP requests in the format :samp:`{address}:{port}`. If ``address`` is omitted, Traffic Monitor will listen on all available addresses.
:``httpsListener``: Sets the address and port on which Traffic Monitor will listen for HTTPS requests in the format :samp:`{address}:{port}`. If ``address`` is omitted, Traffic Monitor will listen on all available addresses. If not provided, ``null``, or the empty string, Traffic Monitor will only serve HTTP, and ``keyFile`` and ``certFile`` are not used. If this is provided, the ``httpListener`` address will be used only to redirect clients to use HTTPS.
:``insecure``: A boolean that controls whether to validate the HTTPS certificate presented by the Traffic Ops server.
:``keyFile``: The path to an SSL key file that corresponds to ``certFile`` which will be used for Traffic Monitor's HTTPS API server.
:``password``: The password of the user identified by ``username``.
:``url``: The URL at which Traffic Ops may be reached e.g. ``"https://trafficops.infra.ciab.test"``.
:``username``: The username of the user as whom to authenticate with Traffic Ops.
:``usingDummyTO``: A boolean with no real effect. This value is used internally within the runtime of Traffic Monitor, and should never be set manually in its configuration file.

    .. deprecated:: ATCv7
        The dependency on this field being valid will be removed in the future. It already has no effect.

traffic_monitor.cfg
"""""""""""""""""""
:file:`traffic_monitor.cfg` contains log file locations, as well as detailed application configuration variables such as processing flush times, initial poll intervals, and the polling protocols. Once started with the correct configuration, Traffic Monitor downloads its configuration from Traffic Ops. Any :term:`Parameters` set on the Monitor's :term:`Profile` that configure the same thing as a field in this configuration file will take precedence over that field. The :term:`Parameters` known to override configuration here are:

- ``tm.polling.interval``
- ``health.polling.interval``
- ``peers.polling.interval``
- ``heartbeat.polling.interval``

Upon receiving this configuration, Traffic Monitor begins polling :term:`cache servers`. Once every :term:`cache server` has been polled, :ref:`health-proto` state is available via RESTful JSON endpoints and a web browser UI.

:``cache_polling_protocol``: Defines the internet protocol used to communicate with :term:`cache servers`. This can be "ipv4only" to only allow IPv4 communication, "ipv6only" to only allow IPv6 communication, or "both" to alternate between each version. Default is "both".

    .. note:: ``both`` will poll IPv4 and IPv6 and report on availability based on whether the respective IP addresses are defined on the server. So if only an IPv4 address is defined and the protocol is set to ``both`` then it will only show the availability over IPv4, but if both addresses are defined then it will show availability based on IPv4 and IPv6.

:``crconfig_backup_file``: The path to a file within which a backup of the most recently fetched CDN :term:`Snapshot` will be stored. Default is ``/opt/traffic_monitor/crconfig.backup``.
:``crconfig_history_count``: The number of historical CDN Snapshots to store, which can then be retrieved through the :ref:`tm-api`. Default is 100.
:``distributed_polling``: A boolean that controls whether `Distributed Polling`_ is enabled. Default is ``false``.

    .. seealso:: The `Distributed Polling`_ section has more information on this setting.

:``health_flush_interval_ms``: Defines an interval as a number of milliseconds on which Traffic Monitor will flush its collected health data such that it is made available through the :ref:`tm-api`. Default is 200.

    .. seealso:: The `Stat and Health Flush Configuration`_ section has more information on this setting.

:``http_polling_format``: A MIME-Type that will be sent in the :mailheader:`Accept` HTTP header in requests to :term:`cache servers` for health and stats data. Default is :mimetype:`text/json` (**not** :mimetype:`application/json`).

    .. seealso:: The `HTTP Accept Header Configuration`_ section has more information on this setting.

:``http_timeout_ms``: Sets the timeout duration - in milliseconds - for all HTTP operations (both peer-polling and stat/health data polling). Default is 2000.
:``log_location_access``: A logfile location to which access logs will be written, or ``null`` to not log access events.\ [#log-locations]_ Default is ``null``.
:``log_location_debug``: A logfile location to which debug logs will be written, or ``null`` to not log debug messages.\ [#log-locations]_ Default is ``null``.
:``log_location_error``: A logfile location to which error logs will be written, or ``null`` to not log error messages.\ [#log-locations]_ Default is "stderr".
:``log_location_event``: A logfile location to which event logs will be written, or ``null`` to not log events.\ [#log-locations]_ Default is "stdout".
:``log_location_info``: A logfile location to which informational logs will be written, or ``null`` to not log informational messages.\ [#log-locations]_ Default is ``null``.
:``log_location_warning``: A logfile location to which warning logs will be written, or ``null`` to not log warning messages.\ [#log-locations]_ Default is "stdout".
:``max_events``: The maximum number of changes to stored aggregate data that should be retained at any one time. Default is 200.
:``monitor_config_polling_interval_ms``: The interval - in milliseconds - on which to poll Traffic Ops for this Traffic Monitor's "monitoring configuration" as returned by :ref:`to-api-cdns-name-configs-monitoring`.
:``peer_optimistic_quorum_min``: Specifies the minimum number of peers that must be available in order to participate in the optimistic health protocol. Default is zero.

    .. seealso:: The `Peering and Optimistic Quorum`_ section has more information on this setting.

:``serve_read_timeout_ms``: Sets the timeout - in milliseconds - of the Traffic Monitor API server for reading incoming requests. Default is 10,000.
:``serve_write_timeout_ms``: Sets the timeout - in milliseconds - of the Traffic Monitor API server for writing responses. Default is 10,000.
:``short_hostname_override``: Sets a hostname for the Traffic Monitor. It will behave as though this were its hostname, rather than the hostname actually reported by the operating system. If not provided, ``null``, or the empty string, the Traffic Monitor will use the hostname provided by its host operating system. Default is the empty string.
:``stat_buffer_interval_ms``: An interval - in milliseconds - for which to buffer collected stats before processing them. If this is not provided, ``null``, or zero, then all stats will be processed immediately. Default is zero.

    .. seealso:: The `Stat and Health Flush Configuration`_ section has more information on this setting.

:``stat_flush_interval_ms``: Defines an interval as a number of milliseconds on which Traffic Monitor will flush its collected stats data such that it is made available through the :ref:`tm-api`. Default is 200.

    .. seealso:: The `Stat and Health Flush Configuration`_ section has more information on this setting.

:``stat_polling``: A boolean that controls whether :term:`cache servers` are polled for stats data. Default is ``true``.

    .. seealso:: The `Optional Stat Polling`_ section has more information on this setting.

:``static_file_dir``: The directory within which Traffic Monitor will look for its web interface's static files. Default is ``/opt/traffic_monitor/static``.
:``tmconfig_backup_file``: A file location to which a backup of the "monitoring configuration" as returned by :ref:`to-api-cdns-name-configs-monitoring` currently in use by Traffic Monitor will be written. Default is ``/opt/traffic_monitor/tmconfig.backup``.
:``traffic_ops_disk_retry_max``: The number of times Traffic Monitor should attempt to log in to Traffic Ops before using its backup monitoring configuration and CDN Snapshot (if those exist). Default is 2.
:``traffic_ops_max_retry_interval_ms``: Traffic Monitor will exponentially increase the amount of time it waits between attempts to log in to Traffic Ops each time it fails (up to a maximum number of times set by ``traffic_ops_disk_retry_max``). This controls the maximum amount of time - in milliseconds - that this waiting duration will be. Default is 60,000.
:``traffic_ops_min_retry_interval_ms``: Traffic Monitor will exponentially increase the amount of time it waits between attempts to log in to Traffic Ops each time it fails (up to a maximum number of times set by ``traffic_ops_disk_retry_max``). This controls the minimum amount of time - in milliseconds - that this waiting duration will be. Default is 100.

Optional Stat Polling
---------------------
Traffic Monitor has the option to disable stat polling via the ``stat_polling`` (default: ``true``) option in :file:`traffic_monitor.cfg`. If set to ``false``, Traffic Monitor will not poll caches for stats; it will only poll caches for health. This can be useful in lowering the amount of resources (CPU, bandwidth) used by Traffic Monitor while still allowing it to retain its core functionality (determining cache availability) via health polling alone. However, disabling stat polling also prevents some other ATC features from working properly (basically anything that requires stats data from caches, e.g. Traffic Stats data), so it should only be disabled when absolutely necessary.

Distributed Polling
-------------------
Traffic Monitor has the option to enable distributed polling via the ``distributed_polling`` (default: ``false``) option in :file:`traffic_monitor.cfg`. If set to ``true``, Traffic Monitor groups will each poll their own disjoint subsets of the CDN. In order to enable this option, ``stat_polling`` must be disabled. In order to function properly, all Traffic Monitors in a CDN must have ``distributed_polling`` enabled; otherwise, the results are undefined.

.. note:: Traffic Monitors are said to be in the same "Traffic Monitor group" if they are in the same :term:`Cache Group`.
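
As an illustrative sketch, a :file:`traffic_monitor.cfg` for a Traffic Monitor taking part in distributed polling would contain at least the two fields below. The fragment only restates the requirements given above; all other fields are omitted, and the same two settings would be needed on every Traffic Monitor in the CDN.

.. code-block:: json

    {
        "distributed_polling": true,
        "stat_polling": false
    }
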
Each Traffic Monitor in the same Traffic Monitor group (referred to as local peers) polls the same disjoint subset of the CDN and combines availability states with its local peers via the Health Protocol. This is similar to how Traffic Monitor behaves in its legacy, non-distributed mode except Traffic Monitor is not polling the entire CDN. In order to get availability data for the rest of the CDN, each Traffic Monitor also polls every other Traffic Monitor group in parallel (these are referred to as distributed peers). It does this by selecting one distributed peer per group at a time, cycling through each distributed peer in the group for subsequent polls in a round-robin manner.

Upon startup, Traffic Monitor will retrieve its config (either from TO or on-disk backup file), then begin polling the :term:`Cache Groups` for which its Traffic Monitor group is responsible. Once it has polled the :term:`Cache Groups`, it will start serving requests for ``/publish/CrStates?raw`` (the raw, uncombined health states of its local caches) and ``/publish/CrStates?local`` (the combined health states of its local caches derived from all Traffic Monitors in its group). Once Traffic Monitor has received ``/publish/CrStates?local`` responses from all other Traffic Monitor groups, it will start serving requests for ``/publish/CrStates`` (the combined health states of all caches in the CDN).

Peering and Optimistic Quorum
-----------------------------
As mentioned in the :ref:`health-proto` section of the :ref:`tm-overview` overview, peering a Traffic Monitor with one or more other Traffic Monitors enables the optimistic health protocol. In order to leverage the optimistic quorum feature along with the optimistic health protocol, a minimum of three Traffic Monitors are required. The optimistic quorum feature allows a Traffic Monitor to withdraw itself from the optimistic health protocol when it loses connectivity to a number of its peers.

To enable the optimistic quorum feature, the ``peer_optimistic_quorum_min`` property in ``traffic_monitor.cfg`` should be configured with a value greater than zero that specifies the minimum number of peers that must be available in order to participate in the optimistic health protocol. If at any time the number of available peers falls below this threshold, the local Traffic Monitor will serve 503s whenever the aggregated, optimistic health protocol enabled view of the CDN's health is requested. Traffic Monitor will continue serving 503s and logging errors in ``traffic_monitor.log`` until the minimum number of peers are available. Once the minimum number of peers are available, the local Traffic Monitor can resume participation in the optimistic health protocol. This prevents negative states caused by network isolation of a Traffic Monitor from propagating to downstream components such as Traffic Router.

Stat and Health Flush Configuration
-----------------------------------
The Monitor has a health flush interval, a stat flush interval, and a stat buffer interval. Recall that the monitor polls both stats and health. The health poll is so small and fast that a buffer is largely unnecessary. However, in a large CDN, the stat poll may involve thousands of :term:`cache servers` with thousands of stats each, or more, and CPU may be a bottleneck.

The flush intervals, ``health_flush_interval_ms`` and ``stat_flush_interval_ms``, indicate how often to flush stats or health, if results are continuously coming in with no break. This prevents starvation. Ideally, if there is enough CPU, the flushes should never occur. The default flush times are 200 milliseconds, which is suggested as a reasonable starting point; operators may adjust them higher or lower depending on the need to get health data and stop directing client traffic to unhealthy :term:`cache servers` as quickly as possible, balanced by the need to reduce CPU usage.
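
For reference, the default flush behaviour described above corresponds to a :file:`traffic_monitor.cfg` fragment like the following sketch. Only the two flush fields are shown and all other fields are omitted; the values are the documented defaults rather than tuning advice.

.. code-block:: json

    {
        "health_flush_interval_ms": 200,
        "stat_flush_interval_ms": 200
    }
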
The stat buffer interval, ``stat_buffer_interval_ms``, also provides a temporal buffer for stat processing. Stats will not be processed except after this interval, whereupon all pending stats will be processed, unless the flush interval occurs as a starvation safety. The stat buffer and flush intervals may be thought of as a state machine with two states: the "buffer state" accepts results until the buffer interval has elapsed, whereupon the "flush state" is entered, and results are accepted while outstanding, and processed either when no results are outstanding or the flush interval has elapsed.

Note that this means the stat buffer interval acts as "bufferbloat," increasing the average and maximum time a :term:`cache server` may be down before it is processed and marked as unhealthy. If the stat buffer interval is non-zero, the average time a :term:`cache server` may be down before being marked unavailable is half the poll time plus half the stat buffer interval, and the maximum time is the poll time plus the stat buffer interval. For example, if the stat poll time is 6 seconds, and the stat buffer interval is 4 seconds, the average time a :term:`cache server` may be unhealthy before being marked is :math:`\frac{6}{2} + \frac{4}{2} = 5` seconds, and the maximum time is :math:`6+4=10` seconds. For this reason, if operators feel the need to add a stat buffer interval, it is recommended to start with a very low duration, such as 5 milliseconds, and increase as necessary.

It is not recommended to set either flush interval to 0, regardless of the stat buffer interval. This will cause new results to be immediately processed, with little to no processing of multiple results concurrently. Result processing does not scale linearly. For example, processing 100 results at once does not cost significantly more CPU usage or time than processing 10 results at once. Thus, a flush interval which is too low will cause increased CPU usage, and potentially increased overall poll times, with little or no benefit. The default value of 200 milliseconds is recommended as a starting point for configuration tuning.

HTTP Accept Header Configuration
--------------------------------
The Accept header sent to caches for stat retrieval can be modified with the ``http_polling_format`` option. This is a string that will be inserted into the Accept header of any requests. The default value is ``text/json``, which is the default value currently used by the astats plugin. However, newer versions of astats also support CSV output, which can have some CPU savings. To enable that format, set ``http_polling_format`` to ``"text/csv"`` in :file:`traffic_monitor.cfg`; the Accept header will then be set accordingly.

Troubleshooting and Log Files
=============================
Traffic Monitor log files are in :file:`/opt/traffic_monitor/var/log/`.

.. _admin-tm-extensions:

Extensions
==========
Traffic Monitor allows extensions to its parsers for the statistics returned by :term:`cache servers` and/or their plugins. The formats supported by Traffic Monitor by default are ``astats``, ``astats-dsnames`` (which is an odd variant of ``astats`` that probably shouldn't be used), and ``stats_over_http``. The format of a :term:`cache server`'s health and statistics reporting payloads must be declared on its :term:`Profile` as the ``health.polling.format`` :term:`Parameter`, or the default format (``astats``) will be assumed.

For instructions on how to develop a parsing extension, refer to the :atc-godoc:`traffic_monitor/cache` package's documentation.

Importantly, though, a statistics provider *must* respond to HTTP GET requests over either plain HTTP or HTTPS (which is controlled by the ``health.polling.url`` :term:`Parameter`), and it *must* provide the following statistics, or enough information to calculate them:

- System "loadavg" (only requires the one-minute value)

    .. seealso:: For more information on what "loadavg" is, refer to the :manpage:`proc(5)` manual page.

- Input bytes, output bytes, and speeds for all monitored network interfaces

When using the ``stats_over_http`` extension this can be provided by the ``system_stats`` plugin, which will inject that information into the ATS stats which then get returned by ``stats_over_http``. The ``system_stats`` plugin can be used with any custom implementations as it is already included and built with ATS when building with experimental-plugins enabled.

There are other optional and/or :term:`Delivery Service`-related statistics that may cause Traffic Stats to not have the right information if not provided, but the above are essential for implementing :ref:`health-proto`.

.. [#log-locations] These respect the rules and special string constants of :atc-godoc:`lib/go-log`.