MSC3245: Voice messages (using extensible events) #3245

Open
wants to merge 13 commits into
base: old_master
from
139 changes: 139 additions & 0 deletions proposals/3245-voice-messages.md
@@ -0,0 +1,139 @@
# MSC3245: Voice messages (using extensible events)

Voice messages are a useful way to quickly send a message to someone without having to use the more
awkward keyboard. Typically short in length, voice messages can be sent as annotated audio files
to recipients.
Comment on lines +3 to +5
Contributor

This MSC doesn't motivate why voice messages should exist as a separate concept in addition to the more generic audio messages. I had to go over to MSC2516 to find some (potential) reasons. Could that be added?


More information about voice messages and what they can be used for can be found in
[MSC2516 - Voice messages via msgtype](https://github.com/matrix-org/matrix-doc/pull/2516). This
MSC inherits many of the beliefs and use cases of that MSC, but packages the event contents
a bit differently. Specifically, this makes use of [MSC1767 - Extensible Events](https://github.com/matrix-org/matrix-doc/pull/1767).

This MSC additionally relies upon [MSC3927](https://github.com/matrix-org/matrix-doc/pull/3927) and
[MSC3551](https://github.com/matrix-org/matrix-doc/pull/3551).

## Proposal

Much like MSC2516, voice messages are defined as OGG files, encoded with Opus, using relatively sane
settings for voice recordings. This proposal does not define specific settings for clients to use,
but does strongly recommend reducing file size as much as possible without noticeably losing audio
quality. Some suggested default settings are:

* Sample rate: 48kHz
* Bitrate: 24kbps
* Mono (single channel)
* Appropriate complexity and resample quality for the platform.
* Encoder application: 2048 (voice, default is typically 2049 as audio). This doesn't have any significant
impact on the resulting recording.
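
As a rough illustration of the suggested defaults above, the following sketch builds a command line for `ffmpeg` with its `libopus` encoder. The choice of `ffmpeg` is an assumption made for this example; the proposal does not prescribe any tool, and the helper name `opus_voice_ffmpeg_args` is hypothetical. The function only assembles arguments and does not run anything.

```python
from typing import List


def opus_voice_ffmpeg_args(
    input_path: str, output_path: str, bitrate_kbps: int = 24
) -> List[str]:
    """Build an ffmpeg argument list for a mono Opus-in-Ogg voice recording.

    Assumption: ffmpeg built with libopus is available. This is one possible
    encoder, not something the proposal requires.
    """
    return [
        "ffmpeg",
        "-i", input_path,
        "-c:a", "libopus",           # encode audio with Opus
        "-b:a", f"{bitrate_kbps}k",  # suggested bitrate: 24 kbps
        "-ar", "48000",              # suggested sample rate: 48 kHz
        "-ac", "1",                  # mono (single channel)
        "-application", "voip",      # voice-oriented encoder application
        "-f", "ogg",                 # Ogg container
        output_path,
    ]


if __name__ == "__main__":
    print(" ".join(opus_voice_ffmpeg_args("recording.wav", "voice.ogg")))
```

Complexity and resample quality are left at the encoder's defaults here, since the proposal leaves them to the platform.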

We use Opus to be compatible with other messaging platforms, particularly the ones that can be bridged
easily to Matrix. This proposal aims to avoid having bridges (and to a degree, clients) transcode
voice messages as that would likely push voice messages further away from the "faster communication"
use case. Bridges already need to process events and can see seconds' worth of latency:
an extra couple of seconds to re-encode a voice message would not help.

No maximum duration is specified, however clients are encouraged not to send long-running recordings
as they might be rejected/ignored on the receiving end for file size reasons. Typically, this should
be less than 5 minutes worth of audio.

Using [MSC1767](https://github.com/matrix-org/matrix-doc/pull/1767)'s system, a new `m.voice` event
type is introduced.

An example is:

```json5
{
  "type": "m.voice",
  "content": {
    "m.text": [
      // Format of the fallback is not defined, but should have enough information for a text-only
      // client to do something with the voice message, just like with plain file uploads.
      //
      // Another option might be to include speech-to-text conversion here, so text-only clients can
      // "see" the contents without having to download them.
      {"body": "Voice Message (8 KB, 1:30) https://example.org/_matrix/media/v3/download/example.org/abcd1234"}
    ],
    "m.file": {
      "mimetype": "audio/ogg",
      "url": "mxc://example.org/abcdef",
      "name": "Voice message.ogg",
      "size": 7992
    },
    "m.audio_details": {
      "duration": 90,
      "waveform": [0, 256, /*...*/ 1024] // https://github.com/matrix-org/matrix-doc/pull/3246
    }
  }
}
```
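
For implementers, an event shaped like the example above could be assembled roughly as follows. This is a sketch only: the function names are hypothetical, and the fallback text format simply mirrors the example, since this proposal deliberately leaves that format undefined.

```python
from typing import Any, Dict, List


def format_duration(seconds: int) -> str:
    """Render a duration in seconds as M:SS, e.g. 90 -> '1:30'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def build_voice_event(
    mxc_url: str,
    download_url: str,
    size_bytes: int,
    duration_seconds: int,
    waveform: List[int],
    name: str = "Voice message.ogg",
) -> Dict[str, Any]:
    """Assemble an m.voice event with the three blocks shown in the example.

    Assumption: the fallback body copies the example's wording; any text that
    lets a text-only client act on the message would do.
    """
    size_kb = round(size_bytes / 1000)
    fallback = (
        f"Voice Message ({size_kb} KB, {format_duration(duration_seconds)}) "
        f"{download_url}"
    )
    return {
        "type": "m.voice",
        "content": {
            "m.text": [{"body": fallback}],
            "m.file": {
                "mimetype": "audio/ogg",
                "url": mxc_url,
                "name": name,
                "size": size_bytes,
            },
            "m.audio_details": {
                "duration": duration_seconds,
                "waveform": list(waveform),
            },
        },
    }
```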

No new content blocks are introduced in this MSC.

Together with content blocks from other proposals, an `m.voice` is described as:

* **Required** - An `m.text` block to act as a fallback for clients which can't process voice messages.
* **Required** - An `m.file` block to contain the audio itself. Clients use this to represent the voice
message. Per above, it MUST be in the file format described by this MSC.
* **Required** - An `m.audio_details` block to describe any audio-specific metadata, such as duration.
* Under this MSC, the `waveform` is required in this usage. `duration` is already required.
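
A receiver might check these minimum requirements with something like the sketch below. The function name and the wording of the problem strings are assumptions for illustration; the checks cover only the presence and basic shape of the three required blocks, not the schemas defined by the other proposals.

```python
from typing import Any, Dict, List


def voice_content_problems(content: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the
    minimum m.voice requirements listed above appear to be met."""
    problems: List[str] = []

    text = content.get("m.text")
    if not isinstance(text, list) or not text:
        problems.append("m.text: missing or empty fallback")

    if not isinstance(content.get("m.file"), dict):
        problems.append("m.file: missing block")

    details = content.get("m.audio_details")
    if not isinstance(details, dict):
        problems.append("m.audio_details: missing block")
    else:
        if "duration" not in details:
            problems.append("m.audio_details: missing duration")
        # The waveform is required specifically for voice messages.
        if not isinstance(details.get("waveform"), list):
            problems.append("m.audio_details: missing waveform")

    return problems
```

Additional blocks in `content` are ignored by this check rather than treated as errors, since senders are allowed to include them.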

The above describes the minimum requirements for sending an `m.voice` event. Senders can add additional
blocks, however as per the extensible events system, receivers which understand voice messages may not
honour them.

Note that `m.file` supports encryption and therefore it's possible to encrypt audio too.

If a client does not support rendering voice messages inline, it would typically represent the event
as a regular audio file instead, falling back to a plain file upload and finally to a plain text message.
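
The fallback order described here can be sketched as a simple capability check. The capability names (`"voice"`, `"audio"`, `"file"`) and the function name are hypothetical and exist only for this example.

```python
from typing import Any, Dict, Set


def choose_rendering(content: Dict[str, Any], supports: Set[str]) -> str:
    """Pick the richest representation the client can render.

    Order: voice message, then regular audio, then plain file upload, and
    finally the plain text fallback.
    """
    has_file = isinstance(content.get("m.file"), dict)
    has_audio = has_file and isinstance(content.get("m.audio_details"), dict)

    if "voice" in supports and has_audio:
        return "voice"
    if "audio" in supports and has_audio:
        return "audio"
    if "file" in supports and has_file:
        return "file"
    return "text"
```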

## Potential issues

The schema duplicates some of the information into the text fallback, though this is unavoidable
and intentional for fallback considerations.

## Alternatives

As mentioned, [MSC2516](https://github.com/matrix-org/matrix-doc/pull/2516) exists to make voice messages a
dedicated `msgtype`, which would work in the near term. There's additionally discussion on that MSC on whether
or not a flag within the event would be more appropriate than a `msgtype`. This proposal believes that either a
`msgtype` or a flag within the `content` would accomplish the same thing as this MSC, but that the fallback
scenarios are less desirable. Flags in the `content` are subject to extensive proliferation if accepted, making
it harder to migrate to event formats like Extensible Events. Dedicated message types have the issue of not all
clients handling unknown message types the same way. Some clients would render it as a plain file upload while
many others would simply ignore the message entirely. This is not a great representation for an audio message.
This proposal counters both problems by using Extensible Events out of the box, which results in clients being
able to render whatever they can, which will typically be either the voice message or an audio event, and fall
back accordingly to the not-great representations if they need to.

## Security considerations

Voice messages by nature are human voices being sent over the internet. This can be used for malicious purposes
outside the control of Matrix: users are cautioned to not send voice messages to untrusted places, such as large
public rooms or unknown individuals.

As with all media events, clients should not assume that the contained file is actually an audio file. Playing JPEGs
or executables over the user's speakers is unlikely to go very well.
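
One cheap sanity check a client could run before playback is to look at the first bytes of the downloaded (and, where applicable, decrypted) file. The sketch below checks for the Ogg page capture pattern `OggS` and for the `OpusHead` signature that begins the first packet of an Ogg Opus stream. The function name is hypothetical, and this is a heuristic only: it does not verify page checksums or that the rest of the file decodes.

```python
def looks_like_ogg_opus(data: bytes) -> bool:
    """Heuristically check that `data` starts like an Ogg Opus stream."""
    # An Ogg page header is 27 bytes, followed by the segment table.
    if len(data) < 27 or data[:4] != b"OggS":
        return False
    segment_count = data[26]  # number of entries in the segment table
    payload_start = 27 + segment_count
    # The first packet of an Opus stream begins with the identification header.
    return data[payload_start:payload_start + 8] == b"OpusHead"
```

A file that fails this check can be treated as a plain file upload instead of being handed to an audio decoder.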

Voice messages are likely best used in encrypted rooms due to the high likelihood that the members of the room are
trusted, and the user's voice is not uploaded plainly to the media repo. Typically, this will be DMs or other
forms of private chats in most clients.

## Unstable prefix
Member Author

@jonnyandrew says:

@turt2live we're planning to implement this MSC in Element X but we're conscious that an older version (v1) has already been implemented in the Element Classic apps. Should we prefer to implement v2 in Element X in order to help progress the MSC? Or will it be okay to stick with v1 so that EC and EX can talk to each other? cc @nimau

Member Author

I don't recall which version is already deployed out in the wild, but would recommend using the one that other clients use. If it turns out that no one implements the v2 stuff (like in the case with Polls), we can fix this MSC.


While this MSC is not considered stable, implementations should use `org.matrix.msc3245.voice.v2` in place
of the `m.voice` event type, additionally using any applicable prefixes for content blocks and similar.
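
A sender could apply the unstable identifiers just before sending, along the lines of the sketch below. Only the event type mapping comes from this proposal; the content block prefixes are defined by the other MSCs, so the sketch takes them as a caller-supplied mapping rather than guessing them. The function name is hypothetical.

```python
from typing import Any, Dict

STABLE_EVENT_TYPE = "m.voice"
UNSTABLE_EVENT_TYPE = "org.matrix.msc3245.voice.v2"


def to_unstable(
    event: Dict[str, Any], block_prefixes: Dict[str, str]
) -> Dict[str, Any]:
    """Return a copy of `event` using unstable identifiers.

    `block_prefixes` maps stable content block names to their unstable names,
    as defined by the MSCs that introduce those blocks. Blocks without an
    entry are left unchanged.
    """
    content = {
        block_prefixes.get(key, key): value
        for key, value in event.get("content", {}).items()
    }
    event_type = event.get("type")
    if event_type == STABLE_EVENT_TYPE:
        event_type = UNSTABLE_EVENT_TYPE
    return {**event, "type": event_type, "content": content}
```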

As this is a new event type and clients would not be massively impacted by seeing the event, clients are
specifically permitted to send this event type into rooms which *don't* support extensible events: clients
which understand voice messages should be parsing the event as such, and clients which understand extensible
events but not voice messages should *not* attempt to represent the event (unless it's in an applicable room
version).

**Note**: We use a "v2" event here because a prior draft of this MSC was implemented in the wild. The MSC's
version history represents that possible schema, which used `m.room.message` and an older version of extensible
events instead. This version of the proposal does not describe that schema.

## Dependent MSCs

This MSC requires [MSC1767 - Extensible Events](https://github.com/matrix-org/matrix-doc/pull/1767) in order
to make the most sense in the specification.