The Firmata protocol is used for everything from home automation to robots. Any time a computer needs to control a low-power device, you might find it in use. It’s built upon MIDI, a very old protocol used mostly for music. Because of MIDI’s simplicity, microcontrollers (like the Arduino) can parse it without much overhead. Even though MIDI was standardized in 1983, it’s still in use today in the music, theater control, and robotics industries.
Standard Message Value Encoding
All MIDI messages (or “packets”) are made up of two parts:
- A header byte that has to be equal to or over 0x80 (that is, its 8th bit is set)
- A fixed number of “7-bit” bytes (between 0 and 2 of them). They are not allowed to use the 8th bit, which makes their max value 0x7F (127)
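As a rough sketch of what this means in practice: a header byte has its 8th bit set (0x80 or above), each data byte tops out at 0x7F, and so any value wider than 7 bits has to be split across data bytes. The function names below are made up for illustration; they are not taken from any Firmata library.

```python
def is_header_byte(b: int) -> bool:
    """A header (status) byte has its 8th bit set."""
    return b >= 0x80


def split_14bit(value: int) -> tuple[int, int]:
    """Split a 14-bit value into two 7-bit data bytes, low bits first."""
    if not 0 <= value <= 0x3FFF:
        raise ValueError("value does not fit in 14 bits")
    return value & 0x7F, (value >> 7) & 0x7F


def join_14bit(lsb: int, msb: int) -> int:
    """Rebuild the 14-bit value from its two 7-bit halves."""
    return (msb << 7) | lsb


lsb, msb = split_14bit(1023)
print(hex(lsb), hex(msb))    # 0x7f 0x7
print(join_14bit(lsb, msb))  # 1023
```

Two data bytes carry 14 bits of payload rather than 16, which is the overhead the rest of this post is trying to get rid of.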
SYSEX Message Value Encoding
SYSEX, or System Exclusive, messages are where you can add your own types. They follow the format:
- A header byte of 0xF0
- Any number of “7-bit” bytes
- A footer byte of 0xF7
Technically you’re supposed to have a vendor ID as the first part of your data, but if you control both ends of the serial link this isn’t needed.
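A minimal sketch of that framing, using the standard MIDI SysEx start and end bytes (0xF0 and 0xF7). The helper names are hypothetical, and the vendor ID is left out on the assumption that we control both ends of the link.

```python
SYSEX_START = 0xF0  # standard MIDI SysEx header byte
SYSEX_END = 0xF7    # standard MIDI SysEx footer byte


def frame_sysex(payload: bytes) -> bytes:
    """Wrap 7-bit payload bytes in a SysEx header and footer."""
    if any(b > 0x7F for b in payload):
        raise ValueError("SysEx payload bytes must not use the 8th bit")
    return bytes([SYSEX_START]) + bytes(payload) + bytes([SYSEX_END])


def unframe_sysex(message: bytes) -> bytes:
    """Strip the header and footer, returning the 7-bit payload."""
    if len(message) < 2 or message[0] != SYSEX_START or message[-1] != SYSEX_END:
        raise ValueError("not a complete SysEx message")
    return bytes(message[1:-1])


framed = frame_sysex(b"\x01\x02\x7f")
print(framed.hex())                 # f001027ff7
print(unframe_sysex(framed).hex())  # 01027f
```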
The message formats mean you can only use the lower 7 bits of a byte to encode values in MIDI. This was done for two reasons.
First, there is a class of messages called “System Real-Time Messages” which can interrupt any other message. They consist only of a header byte and should be processed immediately, before going back to the packet you were reading. The standard MIDI baud rate is pretty slow by today’s standards, so this was done for timing reasons, among other purposes. Firmata doesn’t need “System Real-Time Messages”. It uses one to report the protocol version (and incorrectly provides data bytes), but it never has them interrupt other messages.
Second, reserving the 8th bit for header bytes makes the protocol self-synchronizing: if you lose data, you can wait for the start of a new message and continue parsing. It’s easy to tell when a new message begins, even if you start in the middle of a previous message.
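Here is a small sketch of what that resynchronization can look like. It is a simplification: it ignores real-time messages, SysEx, and MIDI’s running status, and the `data_lengths` table (header byte to number of data bytes) is an assumption supplied by the caller, not something read off the wire.

```python
def read_messages(stream, data_lengths):
    """Yield (header, data) pairs, resynchronizing on every header byte."""
    header = None
    data = []
    for b in stream:
        if b >= 0x80:
            # 8th bit set: a new message starts here. Any partial message
            # we were collecting is dropped; unknown headers are ignored.
            header = b if b in data_lengths else None
            data = []
        elif header is None:
            continue  # stray data byte; skip until the next header
        else:
            data.append(b)
        if header is not None and len(data) == data_lengths[header]:
            yield header, bytes(data)
            header, data = None, []


# Illustrative table: 0x90 carries two data bytes, 0xC0 carries one.
lengths = {0x90: 2, 0xC0: 1}
# We join mid-message (0x05 0x06), and the second 0x90 message is cut short.
wire = bytes([0x05, 0x06, 0x90, 0x01, 0x02, 0x90, 0x03, 0xC0, 0x07])
for header, data in read_messages(wire, lengths):
    print(hex(header), data.hex())
# 0x90 0102
# 0xc0 07
```

The leading stray bytes and the truncated message are thrown away, and parsing picks up cleanly at the next header byte.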
Efficiently encoding strings, floats, and other values into 7-bit bytes can prove troublesome. While there are some novel practices in production today, there’s always a bit of overhead.
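To make the overhead concrete, here is a sketch of the simplest approach: sending each 8-bit byte as two 7-bit bytes, low bits first. The function names are hypothetical, and this is only one of several possible schemes.

```python
def encode_7bit_pairs(data: bytes) -> bytes:
    """Send each 8-bit byte as two 7-bit bytes: low 7 bits, then the top bit."""
    out = bytearray()
    for b in data:
        out.append(b & 0x7F)
        out.append(b >> 7)
    return bytes(out)


def decode_7bit_pairs(encoded: bytes) -> bytes:
    """Reverse encode_7bit_pairs."""
    if len(encoded) % 2:
        raise ValueError("expected an even number of 7-bit bytes")
    return bytes(
        (encoded[i] | (encoded[i + 1] << 7)) & 0xFF
        for i in range(0, len(encoded), 2)
    )


text = "Hi".encode("ascii")
wire = encode_7bit_pairs(text)
print(wire.hex())                      # 48006900
print(len(text), "->", len(wire))      # 2 -> 4
print(decode_7bit_pairs(wire))         # b'Hi'
```

Every payload byte costs two bytes on the wire, even when the top bit is never used.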
A new binary protocol would have to satisfy a number of requirements.
- It has to support message types. There are currently a number of messages that encode 1 byte of data into two bytes to satisfy the MIDI message requirements, and plenty of SYSEX-based messages.
- It should allow a similar byte addressing of message types, even if the messages are self-describing (e.g. they contain a length). I want to change the packaging, not the product.
- It needs to be simple to parse, with hopefully no more code than MIDI.
And finally one unknown:
- Does it need to be self synchronizing?
I think to answer this question we need to look at the full stack.
If you think in terms of layers (similar to the OSI model) you can paint a picture of a computer talking to an Arduino over Serial like so:
- Application (Firmata)
- Data (MIDI)
- Physical (UART/Serial)
The Physical layer moves the bytes. The UART protocol used for serial communication is self-synchronizing. It’s able to use the analog properties of the digital signal to figure out when a byte starts and ends. This allows us to rely on having whole bytes available higher up the stack.
The Data layer describes how we encode messages. In MIDI’s case we use predefined message sizes and use 7-bit bytes for data.
The Application layer decides what to send and what the data it receives means. Firmata does this for command and control of the Arduino.
Because UART is already self-synchronizing, do we need our Data layer to be too? I think so. The only guarantees we have are that each byte might be received, and when they are they will be in order. Interference, bytes being dropped by the UART (when its internal buffer is too full, or errors are detected), and starting to listen in the middle of a transmission are all possible scenarios. These would lead to undefined behavior if there is no way to detect when a packet starts.
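To see why a dropped byte matters, here is a sketch of a framing scheme with no start marker. The length-prefixed format and the name `parse_length_prefixed` are invented for this illustration; they are not part of Firmata or MIDI.

```python
def parse_length_prefixed(stream: bytes) -> list[bytes]:
    """Parse messages framed as [length][body...] with no start marker."""
    messages = []
    i = 0
    while i < len(stream):
        length = stream[i]
        body = stream[i + 1 : i + 1 + length]
        if len(body) < length:
            break  # incomplete message; a real parser would wait for more
        messages.append(bytes(body))
        i += 1 + length
    return messages


wire = bytes([2, 0xAA, 0xBB, 1, 0xCC, 2, 0xDD, 0xEE])
print([m.hex() for m in parse_length_prefixed(wire)])
# ['aabb', 'cc', 'ddee']

# Drop a single byte (0xAA) in transit:
damaged = wire[:1] + wire[2:]
print([m.hex() for m in parse_length_prefixed(damaged)])
# ['bb01']
```

After one lost byte, a length byte is read as data and a data byte (0xCC) is read as a length, so the parser never recovers on its own.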
In The Wild
Let’s see what’s being used in the wild.
- Ethernet has a long preamble (7 bytes of repeating data) with an ending frame check sequence (FCS) which is used for error detection.
- ccTALK (used for money taking devices) assumes you will always receive full packets in order.
- OBD, for car diagnostics, uses a similar scheme to MIDI to synchronize.
- CAN, for parts of a car to talk to each other, uses a large preamble of 0s and injects opposite polarity bits into the stream (“bit stuffing”).
- Modbus, a common protocol used in industrial controls, has a couple of different schemes, including a CRC check and restricting the data in a message to not include delimiters (such as ‘\r\n’ and ‘:’)
- USB is a spec I’m still learning about. The packet has a sync header (1-4 bytes) that also includes the message type (“PID”) and a footer of some sort. There is also some sort of bit stuffing but I haven’t yet found a source with details.
- SLIP is an early IP-over-serial protocol. It frames data by sending an END byte after the message. From Wikipedia: “if the END byte occurs in the data to be sent, the two byte sequence ESC, ESC_END is sent instead, if the ESC byte occurs in the data, the two byte sequence ESC, ESC_ESC is sent.” Some variants also start messages with an END byte.
- KISS (TNC) is used in ham radio. It’s very similar to SLIP but uses different bytes to mark a message’s end and includes a command byte as the first byte of the message to control some aspects of the sending and receiving.
- ROS Serial, for the Robot Operating System, uses a short preamble, checksums, and length bytes in its message format. It would be possible to keep dropping packets with invalid checksums until you find the real start of a packet.
(Have any other protocols I should look at? Please let me know!)
I think my favorite is SLIP. It’s really simple and does everything we need, including synchronizing. It does lack a checksum, but we haven’t had one anyway. Most importantly, it has very low overhead and none of Firmata’s messages obviously conflict with the ESCAPE bytes. Firmata’s “report analog pin” message is 0xC0, which is the same as END, but this request for data sets a flag and isn’t sent too often. It should be low overhead and save us bytes on the wire when compared to MIDI.
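For reference, here is a minimal sketch of SLIP framing using the byte values from RFC 1055 (END 0xC0, ESC 0xDB, ESC_END 0xDC, ESC_ESC 0xDD). The function names are my own, and this is not a proposal for how Firmata should implement it.

```python
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD  # values from RFC 1055


def slip_encode(payload: bytes) -> bytes:
    """Escape END/ESC bytes in the payload and terminate it with END."""
    out = bytearray()
    for b in payload:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)


def slip_decode(stream: bytes) -> list[bytes]:
    """Split a byte stream into frames, undoing the escaping."""
    frames, frame, escaped = [], bytearray(), False
    for b in stream:
        if escaped:
            frame.append(END if b == ESC_END else ESC if b == ESC_ESC else b)
            escaped = False
        elif b == ESC:
            escaped = True
        elif b == END:
            if frame:  # empty frames (back-to-back ENDs) are ignored
                frames.append(bytes(frame))
            frame = bytearray()
        else:
            frame.append(b)
    return frames


# A payload that happens to contain 0xC0 costs one extra byte, plus END.
wire = slip_encode(bytes([0xC0, 0x01])) + slip_encode(bytes([0xFF, 0x80]))
print(wire.hex())                           # dbdc01c0ff80c0
print([f.hex() for f in slip_decode(wire)])  # ['c001', 'ff80']
```

Payload bytes can use all 8 bits, and only the two reserved values cost an extra byte. A receiver that joins mid-stream still gets a truncated first frame, which is where a checksum would help.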
That still leaves a few open questions:
- If this is a new Firmata version, how do we know which protocol version a device is talking? Can we silently upgrade if both ends of a connection support the newer standard?
- Would this be better released as an alternate Firmata instead of a new version of it?
- We’d have to rework some of the encoding of messages to take advantage of being able to use the full byte.
In any case, a prototype implementation and real-world testing are needed to know for sure, but I think we can have a slightly faster, more efficient Firmata by adopting SLIP.