Text formatting in Flarum

JoshyPHP · Jul 5, 2015

Hello Flarum people!

I'm the author of s9e\TextFormatter, a text formatting library that supports a bunch of different markup as plugins. It is uniquely geared towards forum software in that it's designed to handle input from untrusted sources, while being customisable and performant. Two years ago I contacted @Franz about FluxBB 2 and I'd love to extend the same offer to @Toby and Flarum.

Flarum's default markup is Markdown. The library supports Markdown-like markup (here's a demo) but I would consider looking into using league/commonmark's parser as a plugin if there was a need for it.

The main differences between s9e\TextFormatter and other libraries are:

Most (every?) other libraries are designed to transform plain text to HTML. That means you either completely reparse every post on every page and take the performance hit, or you have to cache the HTML and pay for it with increased storage requirements while making it harder to customise the output without nuking the cache.

s9e\TextFormatter separates parsing from rendering, letting it do 90% of the work at posting time rather than when the text is displayed. Here's how it works.
Every part of it is meant to handle malicious content. Checks and limits are baked into every component. For example:
- The BBCodes plugin can be used to create custom BBCodes but it won't allow a custom BBCode to use raw user input inside of an onclick attribute.
- The HTML plugin can be used to allow a whitelisted subset of HTML to be used but again, you cannot accidentally enable harmful markup such as ``.
- URLs used in links and/or images can support a whitelist or a blacklist of hosts. Allowed URL schemes are configurable and won't allow javascript: pseudo-URLs.
- There are default limits to the amount of markup that can be used. For instance, a user cannot post a million emotes or nest a hundred blockquotes inside of each other unless you choose to set those limits that high.
It supports custom markup that's definable by the user. I'm not just talking about PHP extensions. For users who want BBCodes, custom BBCodes can be created using the same syntax as phpBB (phpBB 3.2 uses s9e\TextFormatter by default). For others, safe PCRE-style replacements can be defined.
It's got a JavaScript port. I don't know how relevant this is for you but since Flarum uses JavaScript heavily it may be of interest. The parsing and rendering part (not the configuration) can be run in a browser. Here's a couple of demos: BBCodes / Markdown.

TL;DR

I'd love it if you chose s9e\TextFormatter for your text formatting needs. phpBB 3.2 uses it. Here's how it works. Here's a simple example of how to use it. Hey, there's a JavaScript demo too! Hit me up.

ameo · Jul 6, 2015

hi @JoshyPHP

I'd like to ask, unrelated to this, would you make a FlarumMediaBBCodes plugin, as you did with XenForo?
That would be hella awesome!

Franz · Jul 6, 2015

JoshyPHP Hey Joshy, nice to see you here, too! =)

Toby and I already briefly talked about considering your library, but didn't yet get very far in making any decisions.

First of all, I'd love to see TextFormatter support the CommonMark spec.
Also, can you expand a bit on where your library stands performance-wise?

JoshyPHP · Jul 6, 2015

I'd like to ask, unrelated to this, would you make a FlarumMediaBBCodes plugin, as you did with XenForo?
That would be hella awesome!

For those who don't know, this is the extension ameo is talking about. That will depend on how Flarum handles markup and extensions. Working with XenForo was easy because they have a "media site" feature that already supports half of what one needs to embed third party content and the rest can be done via callbacks. Porting the extension to phpBB was possible because they store posts already parsed.

If Flarum stores posts as HTML, anytime a third-party content provider changes their embed code the stored HTML would have to be invalidated/regenerated, which would be complicated and/or resource intensive. Alternatively, if most of the markup handling is done when the post is displayed, you run into another kind of performance problem. For some media sites you need to contact the original site to get some values/tokens required for the embed code, and doing so whenever a post is displayed would delay the page too much. Or finally, it would be possible to create a "Media" button like XenForo's editor but that would be relatively complicated too.

Short version: it depends on whether it can be done in a simple and performant manner.

First of all, I'd love to see TextFormatter support the CommonMark spec.

I could look into plugging in a third-party parser like league/commonmark but you have to ask yourself: how much of the CommonMark spec do you want to use? CommonMark, like Markdown, allows arbitrary HTML to be used and I assume that's not something you want.

Also, can you expand a bit on where your library stands performance-wise?

Parsing performance is good, rendering performance is excellent.

Using my first post above as an example of a moderately-sized text that contains a bunch of markup, parsing it with that bundle on PHP 5.6 with Opcache takes 7.3ms on first call, ~6.2ms on subsequent calls. Rendering the same post takes ~650µs on first call, ~550µs on subsequent calls. On PHP 7, parsing takes 2.9-2.7ms, rendering 240-180µs.

Using your post as an example of a shorter text gives me:

PHP 5.6 -- parsing: 1ms-450µs, rendering 130µs-50µs
PHP 7.0 -- parsing: 660µs-200µs, rendering 60µs-20µs

The script is there if you want to reproduce locally: https://gist.github.com/anonymous/1fc1f4519eeed1efbdb7
You'll either need a copy of the repository or to install the library via Composer and use its autoloader.

Toby · Jul 6, 2015

Hi JoshyPHP,

First of all – fantastic work on the library! It's a clever approach and seems to be very well executed.

I must say, the idea of a live JavaScript preview excites me. A few concerns/questions though:

Regarding performance: It's really good to see that rendering performance is fast. However, my thought is still that any extra work that has to be done on every page request needs to be very carefully considered. I altered the benchmark script you provided to render 20 pre-parsed posts (the amount we load per page in Flarum), each the size of JoshyPHP's, and was getting results between 0.01-0.02 seconds (PHP 5.5, MacBook Air). That's negligible, but still, these things can add up. What do you think @Franz?

Regarding strategy: Flarum's current approach is to store raw, unformatted posts in the database, and then render + cache them when they need to be displayed. This way they only have to be rendered once, until the cache is cleared (if a formatting extension is enabled/disabled, for example), at which point they are rendered again, only once, on demand. I think the only downside to this approach is that it uses more storage space – but storage is dirt cheap these days, so that honestly doesn't concern me.

If we forget about storage space, what really is the benefit of TextFormatter's strategy over this one? If a formatting extension is enabled/disabled (say, BBCode is turned on/off), wouldn't that affect the structure of the XML – and thus, wouldn't you have to unparse and then reparse every post in order to update output?

One other minor concern is that it tightly couples all post content to a library. Storing posts as XML kind of obfuscates their content in the database, so if you want to go in and manually edit things, it's a lot harder. If people want to export their data, or migrate to another forum, it makes it harder.

Anyway, I really appreciate you reaching out to us. Definitely keen to consider it!

JoshyPHP · Jul 7, 2015

Glad you like it.

If we forget about storage space, what really is the benefit of TextFormatter's strategy over this one?

For one, you don't have to invalidate your whole cache when a template is changed. Some general examples, not all may apply to Flarum specifically:

Adding a class name to links
Changing the markup for an emoticon, or the path to their images
Whenever Twitch decides to ditch their embeddable Flash player for a `` tag

It's also more dynamic. You can more easily localise some things like dates, or use variables in templates. For example, in the XenForo add-on I was talking about people can post a link to an Amazon product and it's displayed as an Amazon widget. If they become Amazon Associates, they can enter their affiliate ID and the widgets will earn them revenue. The ID is stored in a template variable so when they add an affiliate ID, it applies to old posts as well as future ones.

In phpBB, template variables are used for localised strings, and to apply some user preferences such as toggling images, emoticons or Flash applets. They can also be used to display things differently whether the current user is logged in ("Hide from guests" extensions are a common user request in every forum) whether it's a bot or depending on the user's style preference.

As for storage, the extra space is only one aspect of the storage requirements. DB storage is relatively cheap these days so it's rarely an issue. I'm more interested in the extra column(s) or table(s) an HTML cache would need, especially if you ever want to cache content from more than posts. Perhaps you want to allow users to leave each other messages on their profiles or have some sort of Twitter-like microblogging, and if it's not built using the same structure as posts then you may end up with multiple HTML caches.

If a formatting extension is enabled/disabled (say, BBCode is turned on/off), wouldn't that affect the structure of the XML – and thus, wouldn't you have to unparse and then reparse every post in order to update output?

If you want to retroactively remove BBCodes from old posts while enabling them in future posts then yes, you'd have to reparse everything. Otherwise, you could just replace every template used for BBCodes with something that contains only text.

One other minor concern is that it tightly couples all post content to a library. Storing posts as XML kind of obfuscates their content in the database, so if you want to go in and manually edit things, it's a lot harder. If people want to export their data, or migrate to another forum, it makes it harder.

The very first decision I took about the library, before I wrote the first line of PHP, was about the storage format. It's XML and if you load it in any XML parser/DOMDocument and retrieve the textContent of its root node, you end up with the original message. The original message can be retrieved in any language with little effort and no knowledge of the format.

```xml
Hello world
```
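To illustrate the point about recovering the original message, here is a minimal sketch in Python rather than PHP, using only the standard library. The sample XML is an assumption about what a stored post might look like (a root element with the user's markup kept inside child elements); the only property the sketch relies on is the one described above, that the text content of the root node is the original message.

```python
import xml.etree.ElementTree as ET


def original_text(stored_xml: str) -> str:
    """Return the text content of the root node, ignoring every tag."""
    root = ET.fromstring(stored_xml)
    # itertext() walks all text and tail nodes in document order.
    return "".join(root.itertext())


# Hypothetical stored post; the element names are illustrative assumptions.
sample = "<r>Hello <B><s>[b]</s>world<e>[/b]</e></B></r>"
print(original_text(sample))
```

The same few lines could be written with DOMDocument in PHP or any other XML parser, which is the point: no knowledge of the format is needed to get the text back.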

Dominion · Jul 7, 2015

I'd like to add a quick comment, from a prospective Flarum admin's point of view.

Although I can't say anything specifically for or against the use of JoshyPHP's library, I am interested in the idea that Flarum could easily support multiple markup methods "out of the box" as it were. My reasons for this are partly selfish: since I've just spent a bit of time writing some BBcode documentation for my Japanese users, I'd just as soon not have to tell them, just a few months down the road, that we'll now be using a new method with a completely different syntax.

That isn't too big an issue, since you guys have said that the Flarum editor will have GUI buttons, which will help ease the learning curve. But I also have a feeling that my users may find BBcode easier to learn and deal with than Markdown, simply because it uses clearly defined tags. One of these days I need to find the time to do a few tests combining Markdown with Japanese, just to see if there's anything to that feeling.

In the meantime, I'm glad you're willing to consider the possibility of supporting (via JoshyPHP's library or some other technology) markup systems other than Markdown.

Toby · Jul 7, 2015

JoshyPHP Thanks for that, all good points. The benefit of dynamics you described does indeed sound powerful.

Originally our plan with Flarum was to have a "progressive enhancement" kind of approach to formatting. Flarum's core would only contain a formatter that wraps paragraphs in `<p>` tags and linkifies URLs. A Markdown extension would add Markdown support, and a BBCode extension would add BBCode support. Additional extensions like Textile, custom BBCode tags, word filters, etc. could be made. Forum admins could pick and choose which they wanted to support on their forum.

So to reiterate what you said about retroactively removing BBCodes/enabling them in future, it seems we have two options if we want to keep this extension-based approach:

Parse all of the formatting styles in core, but only render them if the appropriate extension is enabled. Still, this would limit the possibilities of what extensions can do, without having to...
Reparse everything whenever formatting config is changed. Either do it in one hit, or mark each post as dirty and then reparse on demand. (The latter being roughly equivalent to what we do now with our HTML caching.)

How does something like the censor plugin work, where the words that are being formatted could be changed quite frequently? Would that require a global reparse?

Another question I forgot to ask before: Does this library have any dependencies? We want to make sure Flarum is installable on low-end shared hosts.

JoshyPHP · Jul 7, 2015

As far as multiple formatting styles/markup languages go, I rather like that once it's been posted, the structure of a message isn't affected by changes in the supported markup languages. For example, Textile and Markdown have an incompatible syntax for emphasis/italic vs strong emphasis/bold, or for headings vs enumerated lists. If the admin switches off Markdown for Textile then old posts could look completely different after reparsing.

Realistically, I don't think admins change their markup language very often though.

Parse all of the formatting styles in core, but only render them if the appropriate extension is enabled.

I don't understand how that would work. For example, in Textile ## is an enumerated list but in Markdown it's a heading. It's the user's choice to decide which one it should be. What you could have is an option to choose markup languages by default and/or at posting time.

Still, this would limit the possibilities of what extensions can do, without having to...

It does but it's a blessing and a curse. For example, let's say someone installs an extension that turns YouTube links into videos. Old posts will still display links, not videos. It sucks for short posts that say "Hey! look at this video: http://youtube..." but in other cases it can be preferable. For example, someone may have posted the list of the 100 dankest YouTube memes as a list of 100 URLs and if they turn into embedded videos overnight the thread may become unusable because of the 100 iframes it would contain.

How does something like the censor plugin work, where the words that are being formatted could be changed quite frequently? Would that require a global reparse?

The Censor plugin works two ways, as a normal plugin or as a standalone helper, and you can use either or both. The normal plugin applies at parsing time; changing which words are censored does not affect old posts. The standalone is a helper class, a completely separate object that can be used at rendering time.

I don't have a strong opinion on what's the best approach there. Censoring at rendering time incurs a small performance penalty, proportionate to the complexity of the list of censored words. Censoring at parsing time means it costs no resources at rendering time and it can still be toggled (via a template parameter/conditional) to display the text censored or uncensored. On the other hand, if you add a word to the list, that's one case where you'll want it to be retroactive and apply to old posts. On that note, the helper class has a method to quickly update what's censored in an old post without touching any of the markup.
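To make the rendering-time option concrete, here is a rough Python sketch of a standalone censor helper. It is not the library's helper class: `build_censor` and its parameters are hypothetical names, and the matching rule (whole words, case-insensitive) is an assumption. It only illustrates why the cost grows with the word list, since the list is compiled into a single pattern that is applied to the text on every render.

```python
import re


def build_censor(words, replacement="****"):
    """Compile a word list into one pattern and return a censoring function."""
    if not words:
        # Nothing to censor: return the text unchanged.
        return lambda text: text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )

    def censor(text):
        # Replace every whole-word match; partial words are left alone.
        return pattern.sub(replacement, text)

    return censor


censor = build_censor(["darn"])
print(censor("Darn it, darned thing"))
```

Because the word list lives outside the stored posts, changing it takes effect on old posts immediately, at the price of running the pattern each time a post is displayed.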

Another question I forgot to ask before: Does this library have any dependencies? We want to make sure Flarum is installable on low-end shared hosts.

No dependencies.

Toby · Jul 7, 2015

JoshyPHP Thanks again for the detailed reply. Again, all good points.

No dependencies.

Awesome.

Well, I think I'm sold. You really do seem to have thought of everything with this library, kudos. We'll wait and see what @Franz thinks... but all of my concerns have been addressed.

Oh, one more: Are you going to do SemVer? We need something we can rely on to not break BC.

It sucks for short posts that say "Hey! look at this video: http://youtube..." but in other cases it can be preferable.

Ha, looks like you found a bug in our current formatter. All the more reason to switch to yours.

Franz · Jul 7, 2015

I think I'm pretty sold, too.

JoshyPHP What you could have is an option to choose markup languages by default and/or at posting time.

Can you expand on that?
I always thought your intermediate XML syntax is so generic that it can be converted into more or less any of the possible markup languages (by unparsing). Is that true?
Because if so, I'd think it would be really cool to have a setting for users where they can select their preferred markup language, and even change it on a post-by-post basis.

JoshyPHP Yes, we don't want raw HTML, that is true. Apart from that, though, CommonMark has a nice specification that defines behaviour for many edge cases. I think it's the future for Markdown implementations.

Toby · Jul 7, 2015

Franz I'd think it would be really cool to have a setting for users where they can select their preferred markup language, and even change it on a post-by-post basis.

The intermediate XML syntax contains the original syntax that the user used to compose the post, so I don't think this would be necessary. Users can just use their preferred markup language without a setting.

Franz Apart from that, though, CommonMark has a nice specification that defines behaviour for many edge cases. I think it's the future for Markdown implementations.

I agree, although this is by no means a deal-breaker. Especially given that the JS implementation of CommonMark is huge compared to @JoshyPHP's implementation of Litedown. Still, it'd be an extra win if the CM spec could be implemented concisely.

JoshyPHP · Jul 7, 2015

Oh, one more: Are you going to do SemVer? We need something we can rely on to not break BC.

I'm still working on my release process. Up till now I didn't need a version number so I never stamped one on the sources. SemVer could be the way to go.

I always thought your intermediate XML syntax is so generic that it can be converted into more or less any of the possible markup languages (by unparsing). Is that true?

Unparsing always returns the original message. The XML could be used to transform the original message from one markup language to another, but that's not a current feature. It's been on my TODO list for years but I haven't prioritized it for lack of application.

A user could select whichever markup language but switching it would not change [b]bold[/b] into **bold** or *bold* in the original text.

Franz · Jul 7, 2015

Okay, thanks for clarifying.
Will we need to store the used markup format as meta information with the post (in order to unparse it to the correct input format, e.g. when editing)?

Toby · Jul 7, 2015

Franz If you type:

```
Hello world
```

The XML generated and stored is:

```
Hello world
```

The content inside the `<s>` and `<e>` tags is removed when the post is rendered. If the user wants to edit their post, the XML is unparsed by removing all the tags, so their original choice of syntax remains intact.
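As a rough illustration of the rendering side, here is a toy renderer in Python. It assumes the user's own markup is wrapped in elements named `s` (start) and `e` (end), and it uses a made-up `TEMPLATES` table mapping tag names to HTML elements. The real library renders through its own templates, so this is only a sketch of the idea that the markup wrappers are dropped while the surrounding text is kept.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from stored tag names to HTML element names.
TEMPLATES = {"STRONG": "strong", "EM": "em"}


def render(node):
    """Render a stored element to HTML, dropping the user's original markup."""
    if node.tag in ("s", "e"):
        # Markup wrappers produce no output; their tail text is
        # added by the parent in the loop below.
        return ""
    inner = escape(node.text or "")
    for child in node:
        inner += render(child) + escape(child.tail or "")
    html_tag = TEMPLATES.get(node.tag)
    return f"<{html_tag}>{inner}</{html_tag}>" if html_tag else inner


# Assumed stored form of a post typed as: Hello **world**
stored = "<r>Hello <STRONG><s>**</s>world<e>**</e></STRONG></r>"
print(render(ET.fromstring(stored)))
```

Changing an entry in `TEMPLATES` changes how every stored post is displayed without reparsing anything, which is the benefit discussed earlier in the thread.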

Franz · Jul 7, 2015

Ah, cool.

Toby · Jul 22, 2015

Started implementing this today. Including live previews!

@JoshyPHP Is there anywhere I can look to get more of an idea about how to write a custom plugin? I need to get @mentions and @replies#123 to format properly. Here are the requirements:

Parser: find @mentions and @replies#123, look up the username/post numbers in the database, and convert them into tags which store the ID of the user or post that was mentioned. Also keep a record of which users/post IDs have been mentioned so we can store that information in the database, send notifications, etc.
Renderer: replace the parsed mention/reply tags with `<a href="{href}">{username}</a>` tags, where {href} is a link to the user/post based on the ID that was stored, and {username} is the username of the user/post author retrieved from the database (not what the user originally typed). Obviously we would want to avoid an n+1 query situation, which is why we store a record of which users/posts a post mentions in the database.

How would I go about doing this? (Let me know if any of that doesn't make sense!)

JoshyPHP · Jul 22, 2015

I don't remember writing any doc about custom plugins. One of the reasons is that every time I started thinking about an example of a custom plugin, the use case was better handled using an existing plugin. I wrote a page about custom parsers if that's any help. A plugin is a parser + a configurator.

I don't know how @replies#123 is supposed to work, can you break it down for me please? I assume that 123 is the ID of the post, but what's the replies part?

About @mentions, the simplest way would be to use the Preg plugin to match them. Then you have to retrieve the user ID. The simplest way is to use a tag filter to add the user ID to each tag during parsing. That means it'll perform one query per username mentioned. If you're concerned about a malicious user mentioning thousands of usernames, you can set a limit to the number of mentions in a post. Here's a working example that matches /@(?\w+)/ and creates a tag named MENTION for each mention.

If you really want to run only one query to retrieve all user IDs, you could preload user IDs before parsing, or implement your own parser that would preload all IDs before creating any tags. The downside is that it would query IDs for things that look like mentions but are not. For example, things in inline code such as @mention.

Alternatively, you could not add any ID during parsing and inject them after parsing in the XML. You wouldn't query IDs for inline code and such. On the other hand, it's more complicated.
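To show what the "preload all IDs in one query" option could look like, here is a small Python sketch. Everything in it is hypothetical: the `users` dict stands in for the database, `lookup_ids` stands in for a single `WHERE username IN (...)` query, and the output is a simplified tagged string, not the library's actual XML. It also has the downside mentioned above, since it will look up anything that merely looks like a mention.

```python
import re

MENTION = re.compile(r"@(\w+)")


def find_usernames(text):
    """Return distinct mention-like usernames in order of first appearance."""
    seen = []
    for match in MENTION.finditer(text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


def lookup_ids(usernames, users):
    """Stand-in for one batched query that maps usernames to user IDs."""
    return {name: users[name] for name in usernames if name in users}


def tag_mentions(text, users, limit=10):
    """Tag known mentions with their user ID, using a single lookup."""
    # Cap the number of distinct names to bound the work per post.
    names = find_usernames(text)[:limit]
    ids = lookup_ids(names, users)

    def replace(match):
        user_id = ids.get(match.group(1))
        if user_id is None:
            return match.group(0)  # unknown user: leave the text as typed
        return f'<MENTION id="{user_id}">{match.group(0)}</MENTION>'

    return MENTION.sub(replace, text)


print(tag_mentions("hi @toby and @nobody", {"toby": 1}))
```

The per-tag filter approach trades this batching for simplicity: one lookup per mention, but only for mentions the parser actually accepted.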

Toby · Jul 22, 2015

That's a huge help, thanks!

JoshyPHP I don't know how @replies#123 is supposed to work, can you break it down for me please? I assume that 123 is the ID of the post, but what's the replies part?

The "replies" part is the username. So for example, that reply in the quote just above is @JoshyPHP#18. Going by the logic in my previous post, the username part is actually useless in parsing/rendering, because the post number would be used to get the post/user/username from the database. But I guess it's good to have there for reference when you're writing a post. Anyway, I think I can adapt the @mentions example you gave me to work for this too.

Regarding that example, is there a way to have it so that only the user ID is stored in the XML, and then when rendering we sub in the username? This is so that if a user ever changes their username, mentions of their name would stay up-to-date. Or perhaps if we ever had something like a "full name" extension, it would sub in the user's full name instead of their username – but only for logged in users (so guests wouldn't see full names). Could something like that be done?
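As an illustration of the idea in this question (store only the ID, substitute the current username at render time), here is a minimal Python sketch. The `MENTION` tag with an `id` attribute, the `/u/{id}` URL scheme and the `fetch_usernames` callback are all assumptions made up for the example; how the library would actually expose this is a question for JoshyPHP.

```python
import xml.etree.ElementTree as ET
from html import escape


def mentioned_ids(root):
    """Collect every distinct user ID referenced by a MENTION tag."""
    return sorted({int(el.get("id")) for el in root.iter("MENTION")})


def render_mentions(stored_xml, fetch_usernames):
    """Render stored XML, replacing mentions with links to current usernames."""
    root = ET.fromstring(stored_xml)
    # One batched lookup for the whole post avoids an n+1 query pattern.
    names = fetch_usernames(mentioned_ids(root))

    def walk(node):
        out = escape(node.text or "")
        for child in node:
            out += walk(child) + escape(child.tail or "")
        if node.tag == "MENTION":
            user_id = int(node.get("id"))
            # Ignore what the user typed; show the name as it is today.
            name = names.get(user_id, "[deleted]")
            return f'<a href="/u/{user_id}">@{escape(name)}</a>'
        return out

    return walk(root)


stored = '<r>hi <MENTION id="7">@old_name</MENTION>!</r>'
print(render_mentions(stored, lambda ids: {7: "new_name"}))
```

Swapping `fetch_usernames` for a function that returns full names for logged-in users and usernames for guests would cover the "full name" case without touching the stored posts.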

Franz · Jul 22, 2015

I think we definitely need to make it possible to do only one query, even if that means writing our own plugin.