RTE: cleanup imported document code

Thomas_Rother · October 25, 2025, 7:20am

I am currently moving some content from DokuWiki (https://www.dokuwiki.org/dokuwiki) to Discourse. DokuWiki syntax is not clean Markdown, so it needs manual editing. I normally use the old editor because I can see all characters there. But with the old editor I see strange “jumping effects”: when you select a text block and try to format it, the cursor jumps up and down. Reformatting longer text is nearly impossible that way, because you have to reposition your editing window every time. It is hard to describe; I could only show it with screencasts … The effect was described earlier in Cursor jumping around in composer / editor text box

The RTE editor does not show this effect. But I am missing an option to clean up junky code imported from other systems …

renato · October 25, 2025, 10:40am

Can you share what this junk code looks like?

pfaffman · October 25, 2025, 4:14pm

How much? Tens, hundreds, thousands of posts?

If it’s more than a few, it probably makes sense to get an import script to do it. If it’s not that many, it likely still makes sense to get some code to fix the markdown rather than trying to edit it by hand. (An even wackier solution would be to have a plugin handle the DokuWiki oddities.)
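To make "some code to fix the markdown" a bit more concrete, here is a minimal sketch in Python. It only covers a few DokuWiki constructs (headings, `[[target|label]]` links, `//italic//` and `''monospace''`), and the function name `dokuwiki_to_markdown` is made up for illustration. It is not an official importer, and real pages will likely need more rules.

```python
import re

# DokuWiki headings: six '=' is level 1, down to two '=' for level 5.
HEADING = re.compile(r"^\s*(={2,6})\s*(.+?)\s*={2,6}\s*$")
# [[target|label]] or [[target]]
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
# //italic//, while skipping the '//' that follows ':' in URLs
ITALIC = re.compile(r"(?<!:)//(.+?)(?<!:)//")
# ''monospace''
MONO = re.compile(r"''(.+?)''")


def dokuwiki_to_markdown(text: str) -> str:
    """Convert a small subset of DokuWiki syntax to Markdown (hypothetical helper)."""
    out = []
    for line in text.splitlines():
        heading = HEADING.match(line)
        if heading:
            level = 7 - len(heading.group(1))  # '======' -> '#', '=====' -> '##'
            out.append("#" * level + " " + heading.group(2))
            continue
        # Links without a label reuse the target as link text.
        line = LINK.sub(lambda m: f"[{m.group(2) or m.group(1)}]({m.group(1)})", line)
        line = ITALIC.sub(r"*\1*", line)
        line = MONO.sub(r"`\1`", line)
        out.append(line)  # '**bold**' is already valid Markdown
    return "\n".join(out)
```

Internal wiki links such as `[[some:page]]` would come out as `[some:page](some:page)` and still need a mapping to Discourse URLs, which this sketch does not attempt.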

Thomas_Rother · October 26, 2025, 7:14am

It’s not that many, but maybe enough to think about a script-based/programmatic solution. The tricky thing is that the code is DokuWiki syntax ( https://www.dokuwiki.org/wiki:syntax ) plus enhanced UI code from a bootstrap3 template (https://getbootstrap.com). It looks nice, but I did not have content migration in mind when I set it up this way. The main issue is not the DokuWiki syntax, but the bootstrap

… stuff. Code example:

<div class="level1"> </div> <h2 class="page-header pb-3 mb-4 mt-5">Plattenplatz ermitteln</h2> <div class="level2"> <p>Filtern auf ext4, was ist verfügbar?</p> <pre class="code"> root@tokoeka ~ # df -h -t ext4 --total Filesystem Size Used Avail Use% Mounted on /dev/mapper/pve-root 196G 39G 148G 21% / /dev/md0 486M 400M 57M 88% /boot /dev/mapper/pve-data 3.0T 560G 2.3T 20% /mnt/data /dev/mapper/pve-backup 414G 40K 393G 1% /mnt/backup total 3.6T 598G 2.8T 18% - </pre> <p> </p> <p>Filtern auf ext4, was wird genutzt?</p> <pre class="code"> root@tokoeka ~ # df -h -t ext4 --output=used Used 39G 400M 560G 40K 598G </pre> <p> </p> </div>

pfaffman · October 26, 2025, 11:38am

Yeah. That’s a mess. You can probably spend a little time with nokogiri and get it into markdown.
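For a rough idea of what such a cleanup could look like, here is a minimal sketch. It uses Python's standard-library `html.parser` rather than nokogiri, purely so the example is self-contained. The class and function names are made up, and it only handles headings, paragraphs and `pre` blocks like the ones in the sample above; everything else (divs, classes) is dropped.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable
BLOCK_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6", "p", "pre")


class HtmlToMarkdown(HTMLParser):
    """Keep headings, paragraphs and <pre> blocks; ignore divs and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []  # finished Markdown blocks
        self.tag = None   # block tag currently being collected
        self.buf = []     # text collected for that tag

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.tag, self.buf = tag, []

    def handle_data(self, data):
        if self.tag is not None:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag != self.tag:
            return
        text = "".join(self.buf)
        if tag == "pre":
            # Preserve inner whitespace, trim only the surrounding newlines.
            self.blocks.append(FENCE + "\n" + text.strip("\n") + "\n" + FENCE)
        elif text.strip():  # skip empty <p> </p> spacers
            prefix = "#" * int(tag[1]) + " " if tag[0] == "h" else ""
            self.blocks.append(prefix + " ".join(text.split()))
        self.tag = None


def html_to_markdown(html: str) -> str:
    parser = HtmlToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Calling `html_to_markdown(...)` on the fragment above should give a `##` heading, the two paragraphs, and two fenced blocks. Inline markup, lists and tables are not handled and would need extra cases.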

renato · October 26, 2025, 2:03pm

If you paste clipboard content with this exact text/html into the rich editor, you get this Markdown:

## Plattenplatz ermitteln

Filtern auf ext4, was ist verfügbar?

```
 root@tokoeka ~ # df -h -t ext4 --total Filesystem Size Used Avail Use% Mounted on /dev/mapper/pve-root 196G 39G 148G 21% / /dev/md0 486M 400M 57M 88% /boot /dev/mapper/pve-data 3.0T 560G 2.3T 20% /mnt/data /dev/mapper/pve-backup 414G 40K 393G 1% /mnt/backup total 3.6T 598G 2.8T 18% - 
```

 

Filtern auf ext4, was wird genutzt?

```
 root@tokoeka ~ # df -h -t ext4 --output=used Used 39G 400M 560G 40K 598G 
```

It’s lossy regarding stuff we don’t care about (divs, classes, etc.), but it will understand hN, pre, or anything else defined in our ProseMirror schema. It respects the parseDOM definitions that our various editor extensions register with ProseMirror’s parser, including those from theme components or plugins.

As for the original request:

I think when the rich editor is loading the document, it’s not this same HTML anymore, is it?

A post whose raw contains HTML blocks should be rendered as a “pass-through” code editor node.

This can then be edited the same way it could in Markdown mode.
