• Dev
  • Text formatting in Flarum

JoshyPHP Thanks for that, all good points. The benefit of dynamics you described does indeed sound powerful.

Originally our plan with Flarum was to have a "progressive enhancement" kind of approach to formatting. Flarum's core would only contain a formatter that wraps paragraphs in `<p>` tags and linkifies URLs. A Markdown extension would add Markdown support, and a BBCode extension would add BBCode support. Additional extensions like Textile, custom BBCode tags, word filters, etc. could be made. Forum admins could pick and choose which they wanted to support on their forum.

So to reiterate what you said about retroactively removing BBCodes/enabling them in future, it seems we have two options if we want to keep this extension-based approach:

  • Parse all of the formatting styles in core, but only render them if the appropriate extension is enabled. Still, this would limit the possibilities of what extensions can do, without having to...
  • Reparse everything whenever formatting config is changed. Either do it in one hit, or mark each post as dirty and then reparse on demand. (The latter being roughly equivalent to what we do now with our HTML caching.)
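
The "mark as dirty and reparse on demand" option can be sketched roughly as follows. This is only an illustration in Python, not Flarum's or the library's code: `Post`, `parse`, `get_parsed` and `FORMATTER_VERSION` are all hypothetical names, and `parse` is a stand-in for whatever real parser is configured.

```python
from dataclasses import dataclass

# Hypothetical counter, bumped whenever an admin changes the formatting config.
FORMATTER_VERSION = 2


@dataclass
class Post:
    source: str              # text exactly as the user typed it
    parsed: str = ""         # cached parsed representation
    parsed_version: int = 0  # formatter version that produced `parsed`


def parse(source: str, version: int) -> str:
    """Stand-in for the real parser; tags the output with the version used."""
    return f"<v{version}>{source}</v{version}>"


def get_parsed(post: Post, current_version: int = FORMATTER_VERSION) -> str:
    """Return the cached parse, reparsing first if the cache is stale (dirty)."""
    if post.parsed_version != current_version:
        post.parsed = parse(post.source, current_version)
        post.parsed_version = current_version
    return post.parsed
```

With this shape a config change costs nothing up front: each post is reparsed the first time it is read after the change, and never again until the next change.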

How does something like the censor plugin work, where the words that are being formatted could be changed quite frequently? Would that require a global reparse?

Another question I forgot to ask before: Does this library have any dependencies? We want to make sure Flarum is installable on low-end shared hosts.

As far as multiple formatting styles/markup languages go, I rather like that once it's been posted, the structure of a message isn't affected by changes in the supported markup languages. For example, Textile and Markdown have an incompatible syntax for emphasis/italic vs strong emphasis/bold, or for headings vs enumerated lists. If the admin switches off Markdown for Textile then old posts could look completely different after reparsing.

Realistically, I don't think admins change their markup language very often though.

Parse all of the formatting styles in core, but only render them if the appropriate extension is enabled.

I don't understand how that would work. For example, in Textile ## is an enumerated list but in Markdown it's a heading. It's the user's choice to decide which one it should be. What you could have is an option to choose markup languages by default and/or at posting time.

Still, this would limit the possibilities of what extensions can do, without having to...

It does but it's a blessing and a curse. For example, let's say someone installs an extension that turns YouTube links into videos. Old posts will still display links, not videos. It sucks for short posts that say "Hey! look at this video: http://youtube..." but in other cases it can be preferable. For example, someone may have posted the list of the 100 dankest YouTube memes as a list of 100 URLs and if they turn into embedded videos overnight the thread may become unusable because of the 100 iframes it would contain.

How does something like the censor plugin work, where the words that are being formatted could be changed quite frequently? Would that require a global reparse?

The Censor plugin works two ways, as a normal plugin or as a standalone, and you can use either or both. The normal plugin applies at parsing time; changing which words are censored does not affect old posts. The standalone is a helper class, a completely separate object that can be used at rendering time.

I don't have a strong opinion on what's the best approach there. Censoring at rendering time incurs a small performance penalty, proportionate to the complexity of the list of censored words. Censoring at parsing time means it costs no resources at rendering time and it can still be toggled (via a template parameter/conditional) to display the text censored or uncensored. On the other hand, if you add a word to the list, that's one case where you'll want it to be retroactive and apply to old posts. On that note, the helper class has a method to quickly update what's censored in an old post without touching any of the markup.
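
To make the rendering-time cost concrete, here is a minimal sketch of censoring at display time. It is written in Python purely for illustration and is not the Censor plugin's helper class: `build_censor` and `censor` are hypothetical names, and the sketch works on plain text rather than on the stored markup.

```python
import re


def build_censor(words):
    """Compile one case-insensitive pattern matching any listed word as a whole word."""
    # Longest words first so that longer entries win over their own prefixes.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def censor(text, pattern, mask="*"):
    """Replace each match with a mask of the same length."""
    return pattern.sub(lambda m: mask * len(m.group(0)), text)
```

Because the pattern is applied every time a post is displayed, a longer or more complex word list makes each render slightly slower, but a newly added word takes effect on old posts immediately.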

Another question I forgot to ask before: Does this library have any dependencies? We want to make sure Flarum is installable on low-end shared hosts.

No dependencies.

    JoshyPHP Thanks again for the detailed reply. Again, all good points.

    No dependencies.

    Awesome.

    Well, I think I'm sold. You really do seem to have thought of everything with this library, kudos. We'll wait and see what @Franz thinks... but all of my concerns have been addressed 🙂

    Oh, one more: Are you going to do SemVer? We need something we can rely on to not break BC.

    It sucks for short posts that say "Hey! look at this video: http://youtube..." but in other cases it can be preferable.

    Ha, looks like you found a bug in our current formatter. All the more reason to switch to yours 😉

    I think I'm pretty sold, too.

    JoshyPHP What you could have is an option to choose markup languages by default and/or at posting time.

    Can you expand on that?
    I always thought your intermediate XML syntax is so generic that it can be converted into more or less any of the possible markup languages (by unparsing). Is that true?
    Because if so, I'd think it would be really cool to have a setting for users where they can select their preferred markup language, and even change it on a post-by-post basis.

    JoshyPHP Yes, we don't want raw HTML, that is true. Apart from that, though, CommonMark has a nice specification that defines behaviour for many edge cases. I think it's the future for Markdown implementations.

      Franz I'd think it would be really cool to have a setting for users where they can select their preferred markup language, and even change it on a post-by-post basis.

      The intermediate XML syntax contains the original syntax that the user used to compose the post, so I don't think this would be necessary. Users can just use their preferred markup language without a setting 🙂

      Franz Apart from that, though, CommonMark has a nice specification that defines behaviour for many edge cases. I think it's the future for Markdown implementations.

      I agree, although this is by no means a deal-breaker. Especially given that the JS implementation of CommonMark is huge compared to @JoshyPHP's implementation of Litedown. Still, it'd be an extra win if the CM spec could be implemented concisely.

      Oh, one more: Are you going to do SemVer? We need something we can rely on to not break BC.

      I'm still working on my release process. Up till now I didn't need a version number so I never stamped one on the sources. SemVer could be the way to go.

      I always thought your intermediate XML syntax is so generic that it can be converted into more or less any of the possible markup languages (by unparsing). Is that true?

      Unparsing always returns the original message. The XML could be used to transform the original message from one markup language to another, but that's not a current feature. It's been on my TODO list for years but I haven't prioritized it for lack of application.

      A user could select whichever markup language but switching it would not change [b]bold[/b] into **bold** or *bold* in the original text.

      Okay, thanks for clarifying.
      Will we need to store the markup format used as meta information with the post (in order to unparse it to the correct input format, e.g. when editing)?

        Franz If you type:

        ```
        Hello world
        ```

        The XML generated and stored is:

        ```
        Hello world
        ```

        The content inside the `<s>` and `<e>` tags is removed when the post is rendered. If the user wants to edit their post, the XML is unparsed by removing all the tags, so their original choice of syntax remains intact.
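
A rough sketch of that idea, in Python for illustration only. The sample XML, the `<s>`/`<e>` marker tag names and both function names are assumptions made for this example; this is not the library's own unparser or renderer, and the regex-based tag handling is deliberately naive.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")
MARKUP_SPAN = re.compile(r"<([se])>.*?</\1>", re.DOTALL)


def unparse(xml: str) -> str:
    """Recover the original text: drop every tag, keep all text, decode entities."""
    return html.unescape(TAG.sub("", xml))


def strip_markup(xml: str) -> str:
    """Drop the <s>...</s> and <e>...</e> spans, as a renderer would before output."""
    return MARKUP_SPAN.sub("", xml)


# Hypothetical stored XML for a post typed as: Hello **world**
stored = "<r><p>Hello <STRONG><s>**</s>world<e>**</e></STRONG></p></r>"
original_text = unparse(stored)
render_input = strip_markup(stored)
```

Since the markup characters are kept inside the stored XML, no separate record of which markup language the user chose is needed to get the original text back.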

        15 days later

        Started implementing this today. Including live previews! 😃

        @JoshyPHP Is there anywhere I can look to get more of an idea about how to write a custom plugin? I need to get @mentions and @replies#123 to format properly. Here are the requirements:

        • Parser: find @mentions and @replies#123, look up the username/post numbers in the database, and convert them into tags which store the ID of the user or post that was mentioned. Also keep a record of which users/post IDs have been mentioned so we can store that information in the database, send notifications, etc.

        • Renderer: replace the parsed mention/reply tags with `<a href="{href}">{username}</a>` tags, where {href} is a link to the user/post based on the ID that was stored, and {username} is the username of the user/post author retrieved from the database (not what the user originally typed). Obviously we would want to avoid an n+1 query situation, which is why we stored a record of which users/posts a post mentions in the database.

        How would I go about doing this? (Let me know if any of that doesn't make sense!)

        I don't remember writing any doc about custom plugins. One of the reasons is that every time I started thinking about an example of a custom plugin, the use case was better handled using an existing plugin. I wrote a page about custom parsers if that's any help. A plugin is a parser + a configurator.

        I don't know how @replies#123 is supposed to work, can you break it down for me please? I assume that 123 is the ID of the post, but what's the replies part?

        About @mentions, the simplest way would be to use the Preg plugin to match them. Then you have to retrieve the user ID. The simplest way is to use a tag filter to add the user ID to each tag during parsing. That means it'll perform one query per username mentioned. If you're concerned about a malicious user mentioning thousands of usernames, you can set a limit to the number of mentions in a post. Here's a working example that matches /@(?<username>\w+)/ and creates a tag named MENTION for each mention.

        If you really want to run only one query to retrieve all user IDs, you could preload user IDs before parsing, or implement your own parser that would preload all IDs before creating any tags. The downside is that it would query IDs for things that look like mentions but are not. For example, things in inline code such as @mention.

        Alternatively, you could not add any ID during parsing and inject them after parsing in the XML. You wouldn't query IDs for inline code and such. On the other hand, it's more complicated.
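
The "preload before parsing" variant could look something like the sketch below. It is Python for illustration, not the library's API: `find_candidates`, `preload_user_ids` and the `fetch_ids` callable are hypothetical, with `fetch_ids` standing in for a single batched database lookup.

```python
import re

MENTION = re.compile(r"@(\w+)")


def find_candidates(text):
    """Return the distinct usernames that look like mentions, in order of appearance."""
    return list(dict.fromkeys(MENTION.findall(text)))


def preload_user_ids(text, fetch_ids, limit=50):
    """Resolve all candidate usernames with one batched lookup.

    `fetch_ids` is a hypothetical callable that takes a list of usernames and
    returns a {username: id} dict, e.g. backed by one `WHERE username IN (...)` query.
    `limit` caps how many candidates a single post can trigger.
    """
    candidates = find_candidates(text)[:limit]
    return fetch_ids(candidates) if candidates else {}
```

As noted above, this scans the raw text, so something that merely looks like a mention (inside inline code, for instance) is still included in the lookup; the parser can then ignore cache entries it never needs.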

          That's a huge help, thanks!

          JoshyPHP I don't know how @replies#123 is supposed to work, can you break it down for me please? I assume that 123 is the ID of the post, but what's the replies part?

          The "replies" part is the username. So for example, that reply in the quote just above is @JoshyPHP#18. Going by the logic in my previous post, the username part is actually useless in parsing/rendering, because the post number would be used to get the post/user/username from the database. But I guess it's good to have there for reference when you're writing a post. Anyway, I think I can adapt the @mentions example you gave me to work for this too 🙂

          Regarding that example, is there a way to have it so that only the user ID is stored in the XML, and then when rendering we sub in the username? This is so that if a user ever changes their username, mentions of their name would stay up-to-date. Or perhaps if we ever had something like a "full name" extension, it would sub in the user's full name instead of their username – but only for logged in users (so guests wouldn't see full names). Could something like that be done?

          I think we definitely need to make it possible to do only one query, even if that means writing our own plugin.

            is there a way to have it so that only the user ID is stored in the XML, and then when rendering we sub in the username

            You could inject the username in the XML before rendering. There's a method for that, Utils::replaceAttributes() (API). You'd need to preload usernames if you want to keep the number of queries flat. There's currently no method for that but it's not complicated. I just need to think of the right API. Maybe something like Utils::getAllAttributeValues($xml, $tagName, $attrName) except with a name that doesn't suck. It would return an array of every value used in the given attribute of the given tag. It could also be used after parsing to get a list of all the mentioned users.

            https://gist.github.com/s9e/0e553fb7f2c5c0ef916d
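
The two steps described here (collect every ID used in the stored XML, then inject usernames before rendering) can be sketched as follows. This is a Python stand-in built on the standard library's ElementTree, not the PHP methods themselves; the function names, the `username` attribute and the sample XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def get_attribute_values(xml, tag_name, attr_name):
    """Return every distinct value of `attr_name` found on `tag_name` elements."""
    root = ET.fromstring(xml)
    values = [el.get(attr_name) for el in root.iter(tag_name)
              if el.get(attr_name) is not None]
    return list(dict.fromkeys(values))


def inject_usernames(xml, usernames_by_id):
    """Set a `username` attribute on each MENTION tag from a preloaded {id: name} map."""
    root = ET.fromstring(xml)
    for el in root.iter("MENTION"):
        name = usernames_by_id.get(el.get("id"))
        if name is not None:
            el.set("username", name)
    return ET.tostring(root, encoding="unicode")
```

Collecting the IDs first means the usernames for a whole post (or a whole page of posts) can be fetched in one query and then injected, so a renamed user shows up correctly without reparsing anything.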

              JoshyPHP Perfect. Thanks so much. That API sounds useful – let me know if it makes it in!

              It's really just a question of when. I'll add it as soon as I settle on a method name. (I'm open to suggestions btw 😃) I want to provide enough support for reading and post-processing the XML that users of the library never have to do it themselves.

              What about something like scanAttributeValues or scanAttributes?

              I settled on getAttributeValues().

              You can use it like this: Utils::getAttributeValues($xml, 'MENTION', 'id') (API)

              Franz You won't have to implement a plugin. Using my previous example as a base, you could capture what looks like mentions with a preg_match_all() before parsing, prefetch user IDs and fill the mentionator cache in advance.

              @Toby Forgot to add: if you spec out how mentions should work, I don't mind writing the initial implementation.

                JoshyPHP Thanks! I'm gonna give it a shot based on your examples, but if I'm struggling I'll take you up on that offer. 🙂

                Would it be possible to get fenced code blocks in the Litedown plugin? We use them occasionally on this forum, and I don't want those posts to break when we upgrade!