4. How To
4.1. Debug PS-Lite
One way to debug is to log all communications. We can do this by setting
the environment variable PS_VERBOSE:
PS_VERBOSE=1: logging connection information
PS_VERBOSE=2: logging all data communication information
For example, first run
make test; cd tests in the root directory. Then
export PS_VERBOSE=1; ./local.sh 1 1 ./test_connection
Possible outputs are
[19:57:18] src/van.cc:72: Node Info: role=schedulerid=1, ip=127.0.0.1, port=8000
[19:57:18] src/van.cc:72: Node Info: role=worker, ip=126.96.36.199, port=58442
[19:57:18] src/van.cc:72: Node Info: role=server, ip=188.8.131.52, port=40112
[19:57:18] src/van.cc:336: assign rank=8 to node role=server, ip=184.108.40.206, port=40112
[19:57:18] src/van.cc:336: assign rank=9 to node role=worker, ip=220.127.116.11, port=58442
[19:57:18] src/van.cc:347: the scheduler is connected to 1 workers and 1 servers
[19:57:18] src/van.cc:354: S is connected to others
[19:57:18] src/van.cc:354: W is connected to others
[19:57:18] src/van.cc:296: H is stopped
[19:57:18] src/van.cc:296: S is stopped
[19:57:18] src/van.cc:296: W is stopped
Here H, S, and W stand for scheduler, server, and worker, respectively.
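When the log is long, it can help to capture it in a file and pull out a few key lines. The following is a minimal sketch and not part of PS-Lite: the helper names scheduler_summary and stopped_nodes are made up for illustration, and it assumes the log lines look like the sample output above.

```shell
# Hypothetical helper: print the scheduler's connection summary line
# from a log captured with PS_VERBOSE=1.
scheduler_summary() {
    # $1: path to a file holding the captured log output
    grep 'the scheduler is connected to' "$1"
}

# Hypothetical helper: count how many nodes reported that they stopped.
stopped_nodes() {
    # $1: path to a file holding the captured log output
    grep -c 'is stopped' "$1"
}
```

A possible use, assuming the log is written to stdout or stderr and saved under the illustrative name run.log: run `./local.sh 1 1 ./test_connection > run.log 2>&1` with PS_VERBOSE=1 exported, then call `stopped_nodes run.log` to check that every node shut down.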
4.2. Use a Particular Network Interface
By default, PS-Lite automatically chooses an available network interface. But for
machines that have multiple interfaces, we can specify the network interface to use
with the environment variable
DMLC_INTERFACE. For example, to use the interface
ib0, we can run
export DMLC_INTERFACE=ib0; commands_to_run
If all PS-Lite nodes run on the same machine, we can set
DMLC_LOCAL to use
memory copy rather than the local network interface, which may improve
performance:
export DMLC_LOCAL=1; commands_to_run
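A mistyped interface name is easy to miss, so it may be worth checking that the interface exists before setting DMLC_INTERFACE. The sketch below is an illustration only: iface_exists and run_on_iface are hypothetical helper names, and the check assumes a Linux system that lists interfaces under /sys/class/net.

```shell
# Hypothetical helper: succeed only if the named interface is listed in the
# given directory (Linux-specific assumption: /sys/class/net by default).
iface_exists() {
    # $1: interface name, $2: optional directory to look in
    [ -e "${2:-/sys/class/net}/$1" ]
}

# Hypothetical helper: run a command with DMLC_INTERFACE set, but only
# when the interface is actually present on this machine.
run_on_iface() {
    # $1: interface name, rest: command to run
    iface=$1
    shift
    if iface_exists "$iface"; then
        DMLC_INTERFACE=$iface "$@"
    else
        echo "interface $iface not found" >&2
        return 1
    fi
}
```

For example, `run_on_iface ib0 commands_to_run` would start the command only if ib0 is present, and otherwise print an error and return a non-zero status.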
4.3. Environment Variables to Start PS-Lite
This section is useful if we want to port PS-Lite to cluster resource
managers other than the provided ones.
To start a PS-Lite node, we need to give proper values to the following environment variables.
DMLC_NUM_WORKER: the number of workers
DMLC_NUM_SERVER: the number of servers
DMLC_ROLE: the role of the current node, can be worker, server, or scheduler
DMLC_PS_ROOT_URI: the ip or hostname of the scheduler node
DMLC_PS_ROOT_PORT: the port that the scheduler node is listening on
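The variables above can be put together in a small single-machine launcher, sketched below. It is not the launcher shipped with PS-Lite: the function name launch_local and its argument order are choices made for this illustration, and it assumes that address 127.0.0.1 and port 8000 are free for the scheduler and that the command is a PS-Lite program reading the DMLC_* variables.

```shell
# Sketch of a single-machine launcher (hypothetical, for illustration).
launch_local() {
    # $1: number of servers, $2: number of workers, rest: command to run
    export DMLC_NUM_SERVER=$1
    export DMLC_NUM_WORKER=$2
    shift 2
    # Assumed scheduler address and port; change them as needed.
    export DMLC_PS_ROOT_URI=127.0.0.1
    export DMLC_PS_ROOT_PORT=8000

    # One scheduler process.
    DMLC_ROLE=scheduler "$@" &

    # One process per server.
    i=0
    while [ "$i" -lt "$DMLC_NUM_SERVER" ]; do
        DMLC_ROLE=server "$@" &
        i=$((i + 1))
    done

    # One process per worker.
    i=0
    while [ "$i" -lt "$DMLC_NUM_WORKER" ]; do
        DMLC_ROLE=worker "$@" &
        i=$((i + 1))
    done

    # Block until every node has exited.
    wait
}
```

With this sketch, `launch_local 1 2 ./my_app` would start one scheduler, one server, and two workers, where ./my_app is a placeholder for a PS-Lite program. A port to another resource manager would set the same variables but start each role on the machine the manager assigns.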
4.4. Retransmission for Unreliable Network
It is not uncommon for a message to be lost while it is sent from one node to another. The program hangs when a critical message is not delivered. In that case, we can let PS-Lite send an additional ACK for each message and resend the message if the ACK is not received within a given time. To enable this feature, we can set the following environment variables:
PS_RESEND: whether to enable retransmission. Default is 0.
PS_RESEND_TIMEOUT: timeout in milliseconds. If an ACK message is not received within this time, PS-Lite resends the message. Default is 1000.
For testing, we can set
PS_DROP_MSG, the probability in percent of dropping a received
message. For example,
PS_DROP_MSG=10 will let a node drop a
received message with 10% probability.
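To try retransmission together with simulated message loss, the three variables can be set for a single command. The wrapper below is a sketch under assumptions: run_with_resend is a hypothetical name, and the values passed to it are examples rather than recommended settings.

```shell
# Hypothetical wrapper: run a command with retransmission enabled and a
# chosen drop rate, without changing the caller's environment.
run_with_resend() {
    # $1: ACK timeout in milliseconds, $2: drop percentage, rest: command
    timeout_ms=$1
    drop_pct=$2
    shift 2
    PS_RESEND=1 PS_RESEND_TIMEOUT=$timeout_ms PS_DROP_MSG=$drop_pct "$@"
}
```

For example, `run_with_resend 500 10 ./local.sh 1 1 ./test_connection` would run the connection test with a 500 ms ACK timeout while each node drops received messages with 10% probability. If the test still finishes, the resend path is likely working.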