SHARP - Scalable Hierarchical Aggregation Protocol
-------------------------------------------------------------------------------

Copyright (c) 2016-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

License
-------------------------------------------------------------------------------

See LICENSE file.

Overview
-------------------------------------------------------------------------------
This document addresses system-level management of Scalable Hierarchical
Aggregation Protocol (SHARP) resources. It covers the system-wide resource
manager (Aggregation Manager, AM); the SHARP library (Libsharp), which is local to each
compute node and provides access to switch-based collective communication
capabilities; and libsharp_coll, the user-level communication library.


### Terminology

* __AN (Aggregation Node)__:  ASIC hardware and local firmware implemented in Quantum switches.
* __Tree (Aggregation Tree)__: A SHARP tree represents a reduction tree.
  The tree is composed of leaves representing data sources and internal nodes representing
  aggregation nodes; the edges entering an internal node represent the association of its children with that parent node.
* __Job__: SHARP resources are allocated for a job.
* __CP__: Computation Process. An OOB (Out Of Band) process, e.g. an MPI process. In the notation CP#__n__, n is the process ID.
* __Group__: A SHARP group is an aggregation collective group that describes the vertices
  (leaves and aggregation nodes) associated with a given concrete reduction operation.
  For example, the leaves of a collective group may be mapped to an MPI communicator,
  with the rest of the elements being mapped to switches. Specific reduction operations have their data sources
  on a subset of the system nodes. The subset of leaf nodes and the aggregation nodes that form the reduction tree is called the aggregation
  group and corresponds to a subtree of the SHARP tree.
* __AM (Aggregation Manager)__: system-wide entity responsible for SHARP resource management.
* __libsharp API__: a library (shared object) responsible for connection
  establishment from the customer app to AM and for resource management at the job level.
* __libsharp_coll API__: high-level API that exposes a collective abstraction over SHARP.
* __SMX__: communication library used for Libsharp-to-AM and Libsharp-to-Libsharp messaging.
* __OST__: Outstanding Operation.
* __Group channel__: a client process (Computation Process) on the node, selected for
  sending collective operations to the assigned AN.
* __Radix__: the number of children of an Aggregation Node.
* __Child index__: the index of a group member in the list of a node's children.
* __"Job Scheduler" ("JS")__: a system for managing resources in an HPC cluster, for example SLURM or IBM Platform LSF.

### Aggregation Manager

The Aggregation Manager (AM) is a system management component used for system
level configuration and management of the switch-based reduction capabilities.
It is used to set up the SHARP trees and to manage their use.

AM is responsible for:

* SHARP resource discovery.
* Creating topology aware SHARP trees.
* Configuring SHARP switch capabilities.
* Managing SHARP resources.
* Assigning SHARP resource on request.
* Freeing SHARP resources on job termination.

For a new job launch, AM allocates SHARP resources. The resource allocation
includes two main steps:

* __Tree matching.__ AM selects an available tree that has a non-broken subtree spanning
  all job hosts. For each host, AM assigns an AN to which the host may connect.
* __Resource allocation.__ AM sets the resources for each AN that serves the job. These include
  buffers, OSTs, the maximum number of groups, and the QPs available for child connections.
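
The tree matching step can be pictured with a small Python sketch. It is only an illustration under simplifying assumptions, not AM's actual algorithm: the `Tree` class, its fields, and `match_tree` are hypothetical names, and the check looks only at the AN directly attached to each host, whereas a real check would also have to validate the rest of the subtree.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Tree:
    """Hypothetical model of a SHARP tree (not an AM data structure)."""
    tree_id: int
    host_to_an: Dict[str, str]                 # host name -> attached AN name
    broken_ans: Set[str] = field(default_factory=set)


def match_tree(trees: List[Tree],
               job_hosts: List[str]) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (tree_id, host -> AN) for the first tree that serves all hosts."""
    for tree in trees:
        assignment = {}
        for host in job_hosts:
            an = tree.host_to_an.get(host)
            if an is None or an in tree.broken_ans:
                break                          # this tree cannot serve the host
            assignment[host] = an
        else:
            return tree.tree_id, assignment    # every host got an AN
    return None                                # no tree spans the job
```

Returning `None` stands in for the case where no available tree spans all job hosts.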


AM can be configured via config parameters.
To get a list of all supported config parameters and their descriptions,
the best approach is to create a config file and use it.

To create a config file, run `sharp_am -c <config file full path>`.

To run sharp_am with the config file, run `sharp_am -B -O <config file full path>`.


A user can configure pre-defined trees in AM. In the user-defined trees file,
the ANs are identified by the node names, as in the topology file created by the SM.
The file format is as follows:

```
tree <tree-id>
node {node description} [GUID:<port_guid_num>]
subNode {node description} [GUID:<port_guid_num>]
subNode {node description} [GUID:<port_guid_num>]
...
node {node description} [GUID:<port_guid_num>]
subNode {node description} [GUID:<port_guid_num>]
...
node {node description} [GUID:<port_guid_num>]
computePort {node description} [GUID:<port_guid_num>]
computePort {node description} [GUID:<port_guid_num>]
...
tree <tree-id>
node {node description} [GUID:<port_guid_num>]
```
See also [Trees Configuration Reference](doc/TreesConfigurationFile.md).

Relevant parameters (AM):

* `trees_file`
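
To make the layout of the trees file more concrete, here is a minimal Python sketch that reads it into nested dictionaries. It rests on assumptions that the Trees Configuration Reference may contradict: each line starts with one of the four keywords shown above, the description is everything up to an optional `GUID:` token, and the GUID is a single alphanumeric token. `parse_trees` and the dictionary keys are hypothetical names.

```python
import re

# Assumed line shape: <keyword> <description> [optional GUID:<value>]
LINE_RE = re.compile(
    r"^(tree|node|subNode|computePort)\s+(.*?)(?:\s+\[?GUID:(\w+)\]?)?\s*$"
)


def parse_trees(text):
    """Parse a user-defined trees file into a list of tree dictionaries."""
    trees = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        kind, desc, guid = match.groups()
        if kind == "tree":
            trees.append({"id": desc, "nodes": []})
        elif kind == "node":
            if not trees:
                raise ValueError("node entry before any tree entry")
            trees[-1]["nodes"].append(
                {"desc": desc, "guid": guid, "children": []}
            )
        else:  # subNode or computePort belongs to the most recent node
            if not trees or not trees[-1]["nodes"]:
                raise ValueError(f"{kind} entry before any node entry")
            trees[-1]["nodes"][-1]["children"].append(
                {"kind": kind, "desc": desc, "guid": guid}
            )
    return trees
```

A structure like this makes it easy to sanity-check a file, for example by verifying that every `subNode` also appears as a `node` in the same tree.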



### Libsharp

Libsharp runs inside each customer process, local to each compute node.
Libsharp interacts with the following entities:

 * AM. Job startup/termination.
 * SM. Service record fetching.
 * Other Libsharps. Group creation and destruction.

Only Rank#0 interacts with AM. The interaction is limited to sending a resource allocation
request for a job, receiving job data, sending a termination request, and requesting and
receiving data from openSM's ftree-ca-order dump file. Distributing job data
among the Libsharp instances participating in the job is out of scope of SHARP software and has to be done at the
OOB (Out Of Band) level using the push API.
Rank#0 is responsible for resource management at the communicator level. Rank#n>0 interacts with
Rank#0 and requests resources for a group. Each group can be allocated a fraction of the
available resources.
A user application can control resource allocation policy using the following environment variables:

* `SHARP_COLL_GROUP_RESOURCE_POLICY`. Possible values: 1 - equal; 2 - take_all by first group; 3 - user input percent.
* `SHARP_COLL_USER_GROUP_QUOTA_PERCENT`
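
As a small illustration, the Python sketch below builds an environment for launching an application with one of these policies. The helper name `sharp_group_env` is hypothetical, and two points are assumptions rather than documented behavior: that the quota variable matters only for policy 3, and that it takes a percentage in the range 1 to 100.

```python
import os

# Policy values as listed above.
POLICIES = {1: "equal", 2: "take_all by first group", 3: "user input percent"}


def sharp_group_env(policy, quota_percent=None, base_env=None):
    """Return a copy of the environment with the SHARP policy variables set."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy {policy!r}; expected one of {sorted(POLICIES)}")
    env = dict(os.environ if base_env is None else base_env)
    env["SHARP_COLL_GROUP_RESOURCE_POLICY"] = str(policy)
    if policy == 3:
        # Assumption: the quota is a percentage and is only used with policy 3.
        if quota_percent is None or not 0 < quota_percent <= 100:
            raise ValueError("policy 3 needs quota_percent in the range 1..100")
        env["SHARP_COLL_USER_GROUP_QUOTA_PERCENT"] = str(quota_percent)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the application.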

Each group has to be joined to the Aggregation Tree before it sends
collective operations. If multiple processes on the same node participate in the group, HCOLL
can group these processes based on socket locality and use multiple processes for sending collective
operations to the network. Inside each sub-group, shared memory is used for the collective. A group channel
process is a process selected to participate in the SHARP group. An application can request a number of
group channels from AM. Using multiple group channels affects the tree radix and, as a result, buffer allocation in the AN.
If the AN can't allocate the requested number of group channels, the computation job fails. See [Multi-channel group].

Libsharp discovers the AM address by fetching the Service Record from SM, but can also get the service record from AM directly via rdma_cm.


### Inter-component messaging

The SMX messaging library is responsible for communication between SHARP software components.
There are three communication protocols:

* AM <-> Libsharp#0. This protocol is used at the job level. It includes the following messages:

   * SHARP_MSG_TYPE_BEGIN_JOB
   * SHARP_MSG_TYPE_END_JOB
   * SHARP_MSG_TYPE_JOB_DATA

Libsharp#0 initiates the connection to AM. Libsharp discovers AM's address using the service record.


* Libsharp <-> Libsharp#0. This protocol is used at the communicator level and includes the following messages:

   * SHARP_MSG_TYPE_ALLOC_GROUP
   * SHARP_MSG_TYPE_GROUP_DATA
   * SHARP_MSG_TYPE_GET_JOB_DATA
   * SHARP_MSG_TYPE_RELEASE_GROUP

Libsharp#>0 knows Libsharp#0 address from job information distributed among Libsharps.
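
The split of message types between the two protocols listed above can be summarized in a short Python sketch. Only the message names come from this document; the `Protocol` enum, the `MESSAGES` table, and `protocol_of` are hypothetical names used for illustration and do not mirror SMX internals.

```python
from enum import Enum


class Protocol(Enum):
    AM_LIBSHARP0 = "AM <-> Libsharp#0"               # job level
    LIBSHARP_LIBSHARP0 = "Libsharp <-> Libsharp#0"   # communicator level


# Message names as listed above, grouped by the protocol that carries them.
MESSAGES = {
    Protocol.AM_LIBSHARP0: frozenset({
        "SHARP_MSG_TYPE_BEGIN_JOB",
        "SHARP_MSG_TYPE_END_JOB",
        "SHARP_MSG_TYPE_JOB_DATA",
    }),
    Protocol.LIBSHARP_LIBSHARP0: frozenset({
        "SHARP_MSG_TYPE_ALLOC_GROUP",
        "SHARP_MSG_TYPE_GROUP_DATA",
        "SHARP_MSG_TYPE_GET_JOB_DATA",
        "SHARP_MSG_TYPE_RELEASE_GROUP",
    }),
}


def protocol_of(message_type):
    """Return the protocol that carries the given message type."""
    for protocol, names in MESSAGES.items():
        if message_type in names:
            return protocol
    raise KeyError(f"unknown message type: {message_type}")
```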

SMX wraps the following underlying communication mechanisms:

*   TCP socket. This is the main communication mechanism used in production environments. A user
    has to configure at least one network interface.
*   Files. This mode serves debug and verification purposes.
*   UCX. This mode allows in-band message communication and uses
    [UCX - Unified Communication X library](https://github.com/openucx/ucx).

### MAD communication

AM uses ibis for high-performance, parallel processing: [ibis](https://github.com/Mellanox/ibis_tools).
Libsharp is a libibumad- or ibverbs-based application.

### APIs

SHARP includes the following APIs:

* [libsharp_coll](src/api/sharp.h). This is the high-level public API,
  available for third-party integration.
* [libsharp](src/api/sharp_ctl.h). This is the low-level private API.



System configuration
-------------------------------------------------------------------------------

* Only one instance of AM is allowed.
* AM and SM have to share the same server.

```
  +--------------------------------------+    +---------------------------------+
  |  Compute host                        |    | MGMT NODE                       |
  |                                      |    |                                 |
  |                                      |TCP |                                 |
  |  +--------------------------------+  |UCX |  +------------+  +-----------+  |
  |  | SMX                            +-----------> SMX       |  |           |  |
  |  +--------------------------------+  |    |  +------------+  |           |  |
  |  | libsharp                       |  |    |  |            |  |           |  |
  |  +--------------------------------+  |    |  |            |  |           |  |
  |  | libsharp_coll                  |  |    |  |  AM        |  |  SM       |  |
  |  +--------------------------------+  |    |  |            |  |           |  |
  |  | hcoll/UCC/NCCL                 |  |    |  |            |  |           |  |
  |  +--------------------------------+  |    |  |            |  |           |  |
  |  | Computation Process (CP)       |  |    |  |            |  |           |  |
  |  +--------------------------------+  |    |  +------------+  +-----------+  |
  |                                      |    |                                 |
  +--------------------------------------+    +---------------------------------+
  ```

Installation - Linux
-------------------------------------------------------------------------------

To build SHARP from source, the following tools are needed:
 * autoconf
 * automake
 * libtool
 * pkg-config

If you get the SHARP sources from GitHub, you first have to generate
the configure script:

```shell
% ./autogen.sh
```

To build and install SHARP, run the following:

```shell
% module load mofed/hpcx
% ./configure --with-mpi=$OMPI_HOME --prefix=$PWD/install
% make install
% module unload mofed/hpcx
```

To build SHARP with ROCm support:

Install ROCm following the installation guide at https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html. ROCm packages are installed to /opt/rocm by default.

```shell
./configure --with-cuda=/usr/local/cuda --with-rocm=/opt/rocm
```

To compile debugging code, configure the project with the `--enable-debug` option:
```shell
% ./configure --with-mpi=$OMPI_HOME --prefix=$PWD/install --enable-debug
```

RPM installation:
```shell
# rpm -ivh <sharp.rpm>
```

DEB installation:
```shell
# dpkg -i <sharp.deb>
```

After installation, the following daemon is set up:

- *sharp_am* - disabled in all runlevels

To install or remove daemons manually (requires root permission), use:
```
$prefix/bin/sharp_daemons_setup.sh or $top_source_dir/contrib/sharp_daemons_setup.sh
```

How to use the script:

```
Usage: sharp_daemons_setup.sh <-s SHARP location dir> <-r> <-a> <-d daemon> <-b>
    -s - Setup SHARP daemons
    -r - Remove SHARP daemons
    -a - All daemons (sharp_am)[default]
    -d - Daemon name (sharp_am)
```

Example of daemons configuration:

```shell
# $prefix/sharp_daemons_setup.sh -s -d sharp_am    # Setup sharp_am daemon
# $prefix/sharp_daemons_setup.sh -r                # Remove sharp_am
```

After the setup procedure, the daemon startup scripts are:

```
/etc/init.d/sharp_am
```
The daemon config files are:
```
/etc/sysconfig/sharp_am
```

`$SHARP_OPTIONS`, as defined in the sysconfig file, is passed to the corresponding daemon as a parameter.

SHARP daemons can be run from a nonstandard location by setting the `SHARP_STARTUP_SCRIPT` environment variable.

Example:
```shell
# SHARP_STARTUP_SCRIPT=/my/script/location/sharp_am /etc/init.d/sharp_am start
```

Also `SHARP_DEVEL` may be set to force daemon script to use `$prefix` dir for `lockfile` and `pidfile` instead of default location (`/var/lock/subsys` and `/var/run` respectively).


**Unit Test**

```
% make unittest
% make gtest
```

**Run Valgrind Test**

```
% make valgrind
```

**Run MPI Test**

```
% make runtest
```

Using libsharp
-------------------------------------------------------------------------------

To compile a package that uses libsharp, you need to provide
CFLAGS and LDFLAGS. libsharp is integrated with pkg-config,
so you can use pkg-config to obtain the compilation
flags:

```
PKG_CONFIG_PATH=<sharp destination folder> pkg-config --cflags sharp # prints CFLAGS
PKG_CONFIG_PATH=<sharp destination folder> pkg-config --libs sharp # prints LDFLAGS
```
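
A build script can collect the same flags programmatically. The Python sketch below simply wraps the two pkg-config invocations shown above; `sharp_flags` is a hypothetical helper name, and the `run` parameter exists only so the subprocess call can be substituted, for example in tests.

```python
import os
import shlex
import subprocess


def sharp_flags(pkg_config_path, run=subprocess.run):
    """Query pkg-config for the 'sharp' package and return its flags as lists."""
    env = dict(os.environ, PKG_CONFIG_PATH=pkg_config_path)
    flags = {}
    for name, option in (("cflags", "--cflags"), ("ldflags", "--libs")):
        result = run(
            ["pkg-config", option, "sharp"],
            env=env,
            check=True,            # raise if pkg-config cannot find the package
            capture_output=True,
            text=True,
        )
        flags[name] = shlex.split(result.stdout)
    return flags
```

Calling `sharp_flags("<sharp destination folder>")` yields a dictionary with `cflags` and `ldflags` entries that can be appended to a compiler command line.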

Logging
-------------------------------------------------------------------------------

The following logs are useful for SHARP troubleshooting:

* AM log. Default location is `/var/log/sharp_am.log`.
  The following parameters control log creation in AM:

  * `log_file`
  * `log_verbosity`. Possible values: 1 - Errors; 2 - Warnings; 3 - Info;
    4 - Debug; 5 - Trace.
  * `verbose`
  * `log_max_backup_files`
  * `log_file_max_size`

* SHARP_COLL logging.

   * `SHARP_COLL_LOG_LEVEL` - Messages whose severity is at or above the selected level are printed
     (that is, levels numerically less than or equal to the selected value).
     Possible values are: 0 - fatal, 1 - error, 2 - warn, 3 - info, 4 - debug, 5 - trace.

* SM log.

* SMX doesn't have its own logging system. It reports messages to the application log (AM or Libsharp).
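
The effect of `SHARP_COLL_LOG_LEVEL` can be sketched in a few lines of Python. This is an illustration under one assumption that is not confirmed here: that a "higher" level means higher severity, so a smaller number is more severe and selecting a level also prints everything more severe than it. `LEVELS` and `is_printed` are hypothetical names.

```python
# Level names in numeric order, as listed above (0 - fatal ... 5 - trace).
LEVELS = ("fatal", "error", "warn", "info", "debug", "trace")


def is_printed(message_level, selected_level):
    """Return True if a message is printed at the selected numeric level.

    Assumption: a message is printed when it is at least as severe as the
    selected level, i.e. its numeric value is less than or equal to it.
    """
    if not 0 <= selected_level < len(LEVELS):
        raise ValueError(f"level must be in 0..{len(LEVELS) - 1}")
    return LEVELS.index(message_level) <= selected_level
```

Under this reading, a setting of 3 prints fatal, error, warn, and info messages and suppresses debug and trace output.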

Releases
------------------------------------------------------------------------------

### Naming convention

 __XX.YY.ZZ-PRERELEASE__

*  __XX__: Major version
*  __YY__: Minor version
*  __ZZ__: Fix for released version
*  __PRERELEASE__: This is an optional tag that indicates a pre-release. Pre-release is not yet stable enough for production use.
                 This is an essential milestone before release.
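
For scripts that need to compare versions, the naming convention can be parsed as in the Python sketch below. `SharpVersion` and `parse_version` are hypothetical names, and the pattern assumes the three numeric fields are plain decimal integers.

```python
import re
from typing import NamedTuple, Optional


class SharpVersion(NamedTuple):
    major: int                  # XX
    minor: int                  # YY
    fix: int                    # ZZ
    prerelease: Optional[str]   # optional PRERELEASE tag


VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-(.+))?$")


def parse_version(text):
    """Parse an XX.YY.ZZ[-PRERELEASE] string into a SharpVersion."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a SHARP version string: {text!r}")
    major, minor, fix, prerelease = match.groups()
    return SharpVersion(int(major), int(minor), int(fix), prerelease)
```

Comparing the first three fields, for example `parse_version(a)[:3] < parse_version(b)[:3]`, orders releases numerically; how a pre-release tag should rank against its final release is left open here.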

### How to check SHARP version

* Open `<SHARP FOLDER>/share/doc/sharp/SHARP_VERSION`

	```
	PACKAGE VERSION: 1.3.0
	PRERELEASE: 0dev
	SOURCE REVISION: c51b664
	IBIS SOURCE REVISION: ab5c9f9
	BUILD DATE: Jan/30/2017 11:40:30
	```

* Run `sharp_am --version`

	```
	sharp_am (sharp) 3.11.0
	Copyright (C) 2025 NVIDIA CORPORATION & AFFILIATES, Inc.
	License: See LICENSE file
	There is NO WARRANTY, to the extent permitted by law.

	Build Date: Apr 13 2025
	Last commit: 2e1de3b
	```

* Search for "Version:" in the log file

```
[Mar 19 18:42:48 912450][GENERAL][518468][output] - Package: sharp
[Mar 19 18:42:48 912472][GENERAL][518468][output] - Version: 3.10.3
[Mar 19 18:42:48 912492][GENERAL][518468][output] - Build Date: Mar 13 2025
[Mar 19 18:42:48 912512][GENERAL][518468][output] - Last commit: 6718c98
[Mar 19 18:42:48 912532][GENERAL][518468][output] - IBIS last commit: 8f06e82
```
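
When many log files have to be checked, the lookup can be automated. The Python sketch below extracts the fields from lines shaped like the sample above; it assumes that exact layout (a `] - ` separator followed by `Name: value`), and `version_info` is a hypothetical helper name.

```python
import re

# Matches the "<...>] - <Field>: <value>" tail of the log lines shown above.
FIELD_RE = re.compile(
    r"\]\s*-\s*(Package|Version|Build Date|Last commit|IBIS last commit):\s*(.+?)\s*$"
)


def version_info(lines):
    """Collect the version-related fields found in an iterable of log lines."""
    info = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if match:
            info[match.group(1)] = match.group(2)
    return info
```

Passing an open log file to `version_info` returns a dictionary whose `"Version"` entry holds the reported version.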

### Branches

Development is done in the "master" branch. Once a version is released, a new maintenance branch is created.


Contributing to the project
-------------------------------------------------------------------------------

See [CONTRIBUTING.md](.github/CONTRIBUTING.md)

References
------------------------------------------------------------------------------
[Multi-channel group]: https://github.com/Mellanox/sharp/wiki/Multi-channel-group
