Operation of MPTCP
TCP can carry up to 40 bytes of TCP options in the TCP header, as indicated by the Data Offset
value. If the Data Offset value is greater than 5, then the space between the final 32 bit word of the
fixed TCP header (Checksum and Urgent Pointer) and the first octet of the data can be used for options.
MPTCP uses the Option Kind value of 30 to denote MPTCP options. All MPTCP signaling is
contained in this TCP header options field.
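
As an illustration of how the Data Offset value delimits the options area, the following is a minimal
sketch in Python that walks the options of a raw TCP segment and picks out those with Option Kind 30.
The function names are hypothetical, and the subtype table is taken from RFC 6824 rather than from the
text above. It assumes the segment starts at the first byte of the TCP header.

```python
MPTCP_KIND = 30  # TCP Option Kind assigned to MPTCP

# MPTCP subtype names, taken from RFC 6824 (an assumption, not from the text).
SUBTYPES = {0: "MP_CAPABLE", 1: "MP_JOIN", 2: "DSS", 3: "ADD_ADDR", 4: "REMOVE_ADDR"}


def tcp_options(segment: bytes):
    """Yield (kind, payload) for each option in a raw TCP segment."""
    data_offset = segment[12] >> 4          # header length in 32 bit words
    options = segment[20:data_offset * 4]   # empty when data_offset == 5
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                       # End of Option List
            break
        if kind == 1:                       # No-Operation padding byte
            i += 1
            continue
        if i + 1 >= len(options):           # truncated option, stop parsing
            break
        length = options[i + 1]             # covers kind, length and payload
        if length < 2 or i + length > len(options):
            break                           # malformed option, stop parsing
        yield kind, options[i + 2:i + length]
        i += length


def mptcp_subtypes(segment: bytes):
    """Return the names of the MPTCP options carried in the segment."""
    names = []
    for kind, payload in tcp_options(segment):
        if kind == MPTCP_KIND and payload:
            subtype = payload[0] >> 4       # subtype is the top 4 bits
            names.append(SUBTYPES.get(subtype, f"subtype {subtype}"))
    return names
```

A segment with a Data Offset of 5 has no options area, so both functions return nothing for it.
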
The MPTCP operation starts with the initiating host passing an MP_CAPABLE capability message in
the MPTCP options field to the remote host as part of the initial TCP SYN message when opening the
TCP session. If the other end is also MPTCP capable, the SYN+ACK response carries an MP_CAPABLE
option in its MPTCP options field. The combined TCP and MPTCP
handshake concludes with the ACK, which also carries MP_CAPABLE, confirming that both ends now have
each other’s MPTCP session data. This capability negotiation exchanges 64 bit keys for the session, and
each party generates a 32 bit hash of each session key. The keys serve as a shared secret between the
two hosts for this particular session, and the hashes (tokens) identify subsequent subflow join attempts.
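
The key-to-token step can be sketched as follows. The text above only says that a 32 bit hash is
generated. The choice of SHA-1 and of the most significant 32 bits is an assumption taken from RFC 6824
(the later version of the protocol specifies SHA-256 instead), and the function names are hypothetical.

```python
import hashlib
import os


def new_session_key() -> bytes:
    """Generate a random 64 bit key, as exchanged in MP_CAPABLE."""
    return os.urandom(8)


def token_from_key(key: bytes) -> int:
    """Derive the 32 bit session token from a 64 bit key.

    Assumption: SHA-1, most significant 32 bits, as in RFC 6824.
    """
    if len(key) != 8:
        raise ValueError("MPTCP session keys are 64 bits long")
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big")


# Each host derives the peer's token from the key it received in the handshake.
local_key = new_session_key()
local_token = token_from_key(local_key)
```

Because the token is a deterministic function of the key, neither host needs to send its token during
the initial handshake. Each end can compute both tokens from the two exchanged keys.
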
Further TCP subflows can be added to the MPTCP session by a conventional TCP SYN exchange
with the MPTCP option included. In this case the exchange contains the MP_JOIN values in the
MPTCP options field. The values in the MP_JOIN exchange include the token value from the initial
session, which is the hash of the original receiver’s session key, so that both ends can
associate the new TCP session with the existing session, as well as a random value intended to prevent
replay attacks. The MP_JOIN option also includes the sender’s address index value to allow both ends
of the conversation to reference a particular address even when NATs on the path perform address
transforms. MPTCP allows these MP_JOINs to be established on any port number, and by either end
of the connection. This means that while an MPTCP web session may start using a port 80 service on the
server, subsequent subflows may be established on any port pair, and it is not necessary for the
server to have a LISTEN open on the new port. The MPTCP session token allows the 5-tuple of the
new subflow (Protocol number, source and destination addresses, source and destination port
numbers) to be associated with the originally established MPTCP flow. Two hosts can also inform each
other of new local addresses without opening a new session by sending ADD_ADDR messages, and
remove them with the complementary REMOVE_ADDR message.
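
The association between tokens, 5-tuples and advertised addresses can be pictured as a pair of lookup
tables. The sketch below is illustrative only: the class and method names are hypothetical, and it
omits the authentication steps of the MP_JOIN exchange entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    token: int
    subflows: set = field(default_factory=set)          # 5-tuples in use
    peer_addresses: dict = field(default_factory=dict)  # address id -> address


class SessionTable:
    """Maps tokens and subflow 5-tuples to MPTCP sessions (hypothetical)."""

    def __init__(self):
        self.by_token = {}
        self.by_subflow = {}

    def _attach(self, session, five_tuple):
        session.subflows.add(five_tuple)
        self.by_subflow[five_tuple] = session

    def open(self, token, five_tuple):
        """Record the initial subflow created by the MP_CAPABLE handshake."""
        session = Session(token)
        self.by_token[token] = session
        self._attach(session, five_tuple)
        return session

    def join(self, token, five_tuple):
        """Handle an MP_JOIN: attach the new 5-tuple if the token is known."""
        session = self.by_token.get(token)
        if session is not None:
            self._attach(session, five_tuple)
        return session  # None means the join is refused

    def add_addr(self, token, address_id, address):
        """Handle ADD_ADDR: remember an address without opening a subflow."""
        self.by_token[token].peer_addresses[address_id] = address

    def remove_addr(self, token, address_id):
        """Handle REMOVE_ADDR: forget a previously advertised address."""
        self.by_token[token].peer_addresses.pop(address_id, None)
```

Note that the join is keyed on the token alone, which is why the new subflow can use any port pair.
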
Individual subflows use conventional TCP signaling. However, MPTCP adds a Data Sequence Signal
(DSS) to the connection that describes the overall state of the data flow across the aggregate of all of
the TCP subflows that are part of this MPTCP session. The sender sequence numbers include the
overall data sequence number and the subflow sequence number that is used for the mapping of this
data segment into a particular subflow. The DSS Data ACK sequence number is the aggregate
acknowledgement of the highest in-order data received by the receiver. MPTCP does not use SACK at the
session level, as selective acknowledgement is left to the individual subflows.
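
The receiver's side of this can be sketched as a reassembly buffer keyed on the overall data sequence
number. This is a simplified model, not the protocol's actual state machine: the class name is
hypothetical, sequence number wrap-around is ignored, and segments are assumed to arrive with
consistent boundaries when they are retransmitted.

```python
class DataReassembler:
    """Merge segments from any subflow into one in-order byte stream."""

    def __init__(self, initial_dsn=0):
        self.next_dsn = initial_dsn   # next data sequence number expected
        self.pending = {}             # out-of-order segments: dsn -> payload
        self.delivered = bytearray()  # bytes handed to the application

    def receive(self, dsn, payload):
        """Accept a segment mapped at `dsn`; return the cumulative Data ACK."""
        if dsn < self.next_dsn:
            # Drop the part that has already been delivered.
            payload = payload[self.next_dsn - dsn:]
            dsn = self.next_dsn
        if payload:
            self.pending.setdefault(dsn, payload)
        # Deliver everything that is now contiguous.
        while self.next_dsn in self.pending:
            chunk = self.pending.pop(self.next_dsn)
            self.delivered += chunk
            self.next_dsn += len(chunk)
        return self.next_dsn
```

The returned value only advances when the gap at the head of the stream is filled, regardless of
which subflow delivered the missing segment.
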
To prevent data loss on an individual subflow from blocking the whole session, a sender can retransmit data on
additional subflows. Each subflow uses a conventional TCP sequencing algorithm, so an unreliable
connection will cause that subflow to stall. In this case MPTCP can use a different subflow to resend
the data, and if the stalled condition is persistent it can reset the stalled subflow with a TCP RST within
the context of the subflow.
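
One way to picture this decision is shown below. The policy is purely hypothetical: the choice of the
lowest-RTT healthy subflow, the `stall_limit` threshold and all of the names are assumptions made for
illustration, and are not taken from the MPTCP specification.

```python
from dataclasses import dataclass


@dataclass
class Subflow:
    name: str
    srtt: float         # smoothed round trip time, in seconds
    stalled_for: float  # seconds without forward progress, 0.0 if healthy


def plan_retransmit(subflows, stuck, stall_limit=5.0):
    """Return (subflow to resend on, whether to reset the stuck subflow).

    Hypothetical policy: resend on the healthy subflow with the lowest
    RTT, and reset the stuck subflow only once the stall has persisted
    for `stall_limit` seconds and an alternative exists.
    """
    healthy = [s for s in subflows if s is not stuck and s.stalled_for == 0.0]
    target = min(healthy, key=lambda s: s.srtt, default=None)
    reset = target is not None and stuck.stalled_for >= stall_limit
    return target, reset
```

If no healthy subflow exists the function returns no target and leaves the stuck subflow alone,
since resetting the only remaining path would gain nothing.
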
Individual subflows are stopped by a conventional TCP exchange of FIN messages, or through the
TCP RST message. The shutting down of the MPTCP session is indicated by a DATA_FIN message,
which is part of the data sequencing signaling within the MPTCP option space.
Congestion control appears still to be an open issue for MPTCP. An experimental approach is to
couple the congestion windows of each of the subflows, increasing the sum of the total window sizes at
a linear rate per RTT interval, and applying the greatest increase to the subflows with the largest
existing window. In this way the aggregate flow is no worse than a single TCP session on the best
available path, and the individual subflows take up a fair share of each of the paths they use. Other
approaches are being considered that may reduce the level of coupling of the individual subflows.
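
As a worked example of window coupling, the sketch below follows the "linked increases" form
published in RFC 6356. It is offered as one concrete instance of the idea and may differ in detail
from the experimental approach described above. The function names are hypothetical, windows are in
bytes and round trip times in seconds.

```python
def lia_alpha(subflows):
    """Aggressiveness factor for a list of (cwnd_bytes, rtt_seconds) pairs.

    alpha = total_cwnd * max(cwnd_i / rtt_i^2) / (sum(cwnd_i / rtt_i))^2
    """
    total = sum(cwnd for cwnd, _ in subflows)
    best = max(cwnd / (rtt * rtt) for cwnd, rtt in subflows)
    denom = sum(cwnd / rtt for cwnd, rtt in subflows) ** 2
    return total * best / denom


def lia_increase(subflows, i, bytes_acked, mss):
    """Window increase, in bytes, for subflow i on receipt of an ACK."""
    total = sum(cwnd for cwnd, _ in subflows)
    cwnd_i = subflows[i][0]
    coupled = lia_alpha(subflows) * bytes_acked * mss / total
    uncoupled = bytes_acked * mss / cwnd_i  # what a lone TCP flow would add
    # Never grow faster than an ordinary TCP flow would on the same path.
    return min(coupled, uncoupled)
```

With a single subflow the factor evaluates to 1 and the increase matches ordinary TCP. With two
identical subflows it evaluates to 0.5, so each subflow grows at a quarter of the single-flow rate
per ACK and the aggregate behaves like one TCP session rather than two.
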