Oops, definitely a bit of an oversight on my part, I should have seen that coming!

I'm using the same code to loop through images for both OpenGraph and the fallback crawler. OpenGraph generally returns a limited number of images, but the fallback crawler could find many more.

I'm not sure what a reasonable image count is, so I think I'll add this as a configurable value, and probably apply the same value to both OpenGraph and the fallback, in case you want to limit everything to 1.

9 days later

If I just paste an image link without any formatting, is it this extension that renders it? I guess not, because the image doesn't seem to get the dimension overlays. But it's still rendered and even clickable, and opens the image in a new tab… What other extension is taking precedence, if that's the case? FoF Formatting? Is there any way of reordering which extension renders which types of raw links?

    CyberGene that's AutoImage from FoF Formatting, I would guess.

    I wouldn't expect any conflict between the two extensions, because AutoImage only modifies the post parser and my extension only impacts the rendering templates.

    Can you confirm there's still an issue when both my extension and AutoImage are used? I'll take a look if so

    Imgur and other hosting websites have a full MediaEmbed template, in which case my extension won't do anything.

      clarkwinkelmann Can you confirm there's still an issue when both my extension and AutoImage are used? I'll take a look if so

      No, no, sorry, there's no issue. I'm just wondering if I still benefit from your extension's ability to reserve the image dimensions so that there are no page scroll jumps while loading. Frankly, it's the main reason I use your extension; it's terrific functionality, and I would like to make sure your extension takes precedence over FoF Formatting for any type of link it can work with, even raw ones.

      Just found a small rendering issue. If the URL is too long, it is drawn outside the boundaries of the little rectangular frame that contains the preview:

      I'm not sure if it has something to do with the fact that it's an MP3 file that can't be previewed. Besides, I'm using the Inline Audio extension that shows a little play button; maybe it interferes with Rich Embeds somehow?

        CyberGene it's normal for that popup to appear even for links that can't be previewed. I added it because otherwise there's no way to access the refresh button in case there was a temporary error with the target website.

        I will add something to prevent URLs from going off-screen.

        Customer feedback. 😉

        For me, this works better than ever. But, I’ll admit, I leave most settings off. I may try turning on the proxy and scraping again one day. But, keeping the settings/features minimal, I’ve had no issues and it’s so fast! 😍

          010101 thanks!

          Just using the OpenGraph feature is perfectly fine!

          The proxy features are only really necessary if you are using advanced security measures on your website like a CSP configuration that blocks external image loading.

          I will release a cloud subscription in the near future that allows using all the advanced options without the need to configure any API tokens. It will also come with a CDN domain to load images from, which gives a tiny performance improvement by stripping cookies out of the requests.

          a month later

          How can I blacklist file extensions? For instance, I don't want this extension to show the preview for audio files. They are covered by the Inline Audio Player extension; however, when I click on a link to play it, a preview is shown below and it obscures the other audio file links that follow. An option to provide a regex for blacklisting would suffice, I guess, such as .*\.mp3

            CyberGene on the extension's page in the admin panel, under URL Blacklist, add a new line containing /\.mp3$/

            This should do it.

            If you wanted to still retrieve the embed (for example for Separated layout) but just never show the failed embed inline hover style, I could try adding a new option. But that seems like a very niche situation.

            a month later

            I went to install this on a new forum and it won't activate. The error is:

            Syntax error or access violation: 1071 Specified key was too long; max key length is 3072 bytes

            Update: I changed the error column in the database from utf8 to latin1, which uses fewer bytes per character. That let me activate the extension. Then I changed it back to utf8 in the database. We'll see if I have issues in the future, but for now, this hack got me going again.
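            In case it helps anyone else, the statements I ran looked roughly like this (a sketch from memory; the table name assumes no prefix and the column type is a guess, so adjust to your own schema):

            -- shrink the charset so any index on the column fits under 3072 bytes
            ALTER TABLE kilowhat_rich_embeds MODIFY error VARCHAR(1024) CHARACTER SET latin1;
            -- enable the extension, then restore the original charset
            ALTER TABLE kilowhat_rich_embeds MODIFY error VARCHAR(1024) CHARACTER SET utf8;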

            Maybe something to think about for a future update: Can anything be reduced in the database to prevent this error in more restrictive environments?

              010101 can you clarify which column was affected by the error? You mean the error column on kilowhat_rich_embeds? I'm a bit confused, because that column is not supposed to be a key at all, so there shouldn't be any opportunity for this error to arise.

              If you have any error related to the url_hash column, please let me know. Changing the index or column type of that column might break the extension.

              If the error relates to the url column, it doesn't really matter, because the index on that column is deleted by one of the migrations that follow. That column should stay utf8 though.

              If the problem relates to the key name, maybe your database table prefix is too long?

              EDIT: please include your MySQL/MariaDB version information. You can run select version() as version via SQL on the database server, and I think this information is also shown in the php flarum info output and in the admin dashboard.
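              For reference, these are the queries that surface that information (assuming the default kilowhat_rich_embeds table name without a prefix):

              SELECT VERSION();                        -- MySQL/MariaDB server version
              SHOW INDEX FROM kilowhat_rich_embeds;    -- key names and indexed columns
              SHOW CREATE TABLE kilowhat_rich_embeds;  -- column charsets and index definitions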

              2 months later

              Version 1.2.4 - November 27, 2022

              This is a security update. All users should upgrade as soon as possible.

              • Changed: SVG files are now sanitized to prevent the image proxy endpoint from being used as part of an XSS attack.
              • Changed: common image MIME types are now explicitly whitelisted instead of the previous image/*, to reduce attack vectors.
              • Changed: proxy file size is now limited to 5MB to make it harder to use the proxy as part of a denial-of-service attack.
              • Changed: the proxy endpoint can no longer be used when the image proxy feature is disabled.
              • Changed: proxy errors are now returned as images. The previous JSON responses weren't very useful, as they always resulted in broken image tags in the frontend.

              I am not aware of these issues having been actively exploited; I discovered them through internal review. The XSS was only possible by tricking a user into clicking a URL that points to the proxy script.

              If you are using a whitelist of trusted domains, the XSS was only possible if an attacker could upload a malicious SVG file to one of the trusted domains.

              5 months later

              Amazon.com returns what looks like a CAPTCHA (in the picture), and the rest of the information can't be captured. Is there anything I can do? Not sure if it's on my end.

                Darkle do you have an example Amazon link I could try on my test site to see if it happens on the first try?

                Otherwise, in general I assume it must be their anti-bot protection kicking in, though it makes me wonder how they expect lesser-known search engines to crawl the page meta tags if they refuse to return the page 🤔 If it doesn't happen all the time, or only recently started happening, your server IP might have been added to their bot list. Maybe it'll be automatically whitelisted again after a given time, as those bans are rarely permanent and are often just temporary web firewall rules.

                I could add an option in a future version to retry failed requests, either for all links or for specific domains. Though if the IP was temporarily restricted due to bot activity, retrying too soon might just make things worse.

                I could also introduce an API-based embed for Amazon; it might be more efficient and less likely to break if you have many Amazon links on the forum. I'm not sure if Amazon has a free API for that kind of use though.

                  clarkwinkelmann Actually, it's any amazon.com link, not a particular one. It must be what you say, some kind of anti-bot protection. Honestly, I couldn't tell you if it's been happening for a long time or a short time. I'll keep an eye out to see if it's lifted.

                    Darkle I have this problem when I use anti-tracking or ad-blocking extensions in my browser (the ad blockers block tracking cookies too). Disable all browser extensions temporarily and see if it helps.

                      Darkle CyberGene good point, it could make sense to check the image URL in the browser dev tools and see whether there's any redirect. My initial assumption was that the server-side crawler saw a captcha page and saved the captcha image in the database, but maybe the crawler actually gets the correct meta image and Amazon switches the client-side response based on information sent by the browser.

                      This would still be unexpected behavior if they do that with the OpenGraph image, since that's one of the exact use cases OpenGraph is meant for.

                      If you use the image proxy feature, then it's probably still down to the server IP, as the proxy script does not forward any client header to the final website.

                      Do you have the "fallback" HTML crawler enabled? If the crawler saves a captcha as the image, I assume it must be enabled; it would be odd for Amazon to put that image in the OpenGraph data of the error page.

                      16 days later

                      Version 1.2.5 - May 22, 2023

                      This is a security update. All users should upgrade as soon as possible.

                      • Changed: the OpenGraph/Rich/Image crawler download is now limited to 5MB per URL to make it harder to use as part of a denial-of-service attack.
                      • Fixed: an issue where the Image crawler could be exploited to access meta information of arbitrary images on the server filesystem or intranet, or to leak the server IP despite a blacklist.

                      The vulnerability affected all versions of the extension since 1.1.0.

                      Attack vector: an attacker could post a link to a malicious HTTP endpoint that would return a special payload. The endpoint would have to be an attacker-controlled server or a file hosting service that can be fooled into returning an incorrect MIME type in the HTTP headers. The extension would then automatically access any arbitrary URL or file path contained in the malicious payload.

                      Exposed information: if the arbitrary URL points to a valid image on the filesystem or intranet, the image width, height, and EXIF data would be made available through the Flarum REST API to anyone with permission to view embeds. If an asynchronous queue is not used, an attacker could time the request to guess whether a file (image or not) exists at a given path or intranet URL. The server IP is sent along with the request to the arbitrary URL, which could leak the server IP in setups where a blacklist would normally prevent it from being shared.

                      Mitigating circumstances: if a whitelist/blacklist was used to restrict embeds to trusted domains, it's unlikely that an attacker could host the required attack payload on a regular, uncompromised website. If a whitelist/blacklist was not used, the IP leak is not a vulnerability, since any user could already publish a link to a server they control and have it accessed by the crawler.

                      Additional remedial steps: scan the kilowhat_rich_embeds.exif column of your database for any maliciously exposed information, and set the value to MySQL NULL to redact it.
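                      For example, something along these lines should work (a sketch assuming a standard install without a table prefix and an auto-increment id primary key):

                      -- list rows that have EXIF data recorded
                      SELECT id, url, exif FROM kilowhat_rich_embeds WHERE exif IS NOT NULL;
                      -- redact a single row after review (123 is a placeholder id)...
                      UPDATE kilowhat_rich_embeds SET exif = NULL WHERE id = 123;
                      -- ...or clear all stored EXIF data in one go
                      UPDATE kilowhat_rich_embeds SET exif = NULL;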

                      There is no evidence of this vulnerability being exploited; it was discovered through an internal audit.

                      This version is compatible with Flarum versions 1.2 to 1.8.