Get back the real "raw" data that created a post?

If I paste some HTML into the composer to create a topic, Discourse automatically reformats it (it “cooks” it), keeping some (but not all) of the underlying formatting and removing the HTML tags from the view.

Is there a way for me to later get back the original HTML that I pasted into the topic when I created it? The closest I can find is making an API call with “raw=true” in the endpoint and reading the result under response.data.post_stream.posts[0].raw, as described here.
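A minimal sketch of that lookup in Python, assuming the topic JSON has already been fetched and parsed into a dict. The helper name `first_post_raw` and the sample payload are made up for illustration; only the nesting (post_stream, posts, raw) comes from the path quoted above.

```python
from typing import Optional


def first_post_raw(topic_payload: dict) -> Optional[str]:
    """Return the raw text of the first post in a parsed topic payload, if present."""
    posts = topic_payload.get("post_stream", {}).get("posts", [])
    if not posts:
        return None
    return posts[0].get("raw")


# Made-up payload with the same nesting as post_stream.posts[0].raw
sample = {"post_stream": {"posts": [{"id": 1, "raw": "# Heading\n\nSome text"}]}}
print(first_post_raw(sample))
```

Returning None rather than raising keeps the helper usable on payloads where the raw field was not requested or not included.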

However, the “raw” text that comes back is not the original HTML. I’m not sure exactly what it is; it looks like the basic cooked topic with all the spacing removed.

Is there a way for me to get back that original HTML that I pasted in?

Sure, there’s a route to see the raw post content. Use your post as an example. The URL is

https://meta.discourse.org/t/get-back-the-real-raw-data-that-created-a-post/189183

To see the raw content, replace the /t/slug with /raw.

https://meta.discourse.org/raw/189183
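The URL rewrite above can be sketched in Python. `raw_url` is a hypothetical helper, and it assumes the topic URL has the /t/slug/id shape shown here; the optional post number follows the /raw/189183/7 form that appears later in this thread.

```python
from typing import Optional
from urllib.parse import urlparse


def raw_url(topic_url: str, post_number: Optional[int] = None) -> str:
    """Turn a /t/<slug>/<topic_id> URL into the matching /raw/<topic_id> URL."""
    parts = urlparse(topic_url)
    segments = [s for s in parts.path.split("/") if s]
    # Expected shape: ["t", "<slug>", "<topic_id>", ...]
    if len(segments) < 3 or segments[0] != "t" or not segments[2].isdigit():
        raise ValueError(f"not a /t/<slug>/<id> topic URL: {topic_url}")
    url = f"{parts.scheme}://{parts.netloc}/raw/{segments[2]}"
    if post_number is not None:
        # Address a single post within the topic.
        url += f"/{post_number}"
    return url


print(raw_url("https://meta.discourse.org/t/get-back-the-real-raw-data-that-created-a-post/189183"))
```

The sketch only builds the URL; fetching it is an ordinary HTTP GET.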

Thanks, but that doesn’t seem to be working for me. I’d like to get back data that keeps the exact HTML I originally pasted in, so if the original HTML had a div tag, for instance, I want the returned data to have that div tag in it.

What I’ve found in the “raw” response, for example, has markdown formatting when the original data had none.

I’ll try to put together an example now to show what I mean.


Ok. Here’s my attempt at an example:

I wrote a few lines of different formatting in a word processor, and the html produced was:


Original HTML:

<style>
     ...
     /* Style Definitions */
     table.MsoNormalTable
    	{mso-style-name:"Table Normal";
    	mso-tstyle-rowband-size:0;
    	mso-tstyle-colband-size:0;
    	mso-style-noshow:yes;
    	mso-style-priority:99;
    	mso-style-parent:"";
    	mso-padding-alt:0in 5.4pt 0in 5.4pt;
    	mso-para-margin:0in;
    	mso-pagination:widow-orphan;
    	font-size:12.0pt;
    	font-family:"Calibri",sans-serif;
    	mso-ascii-font-family:Calibri;
    	mso-ascii-theme-font:minor-latin;
    	mso-hansi-font-family:Calibri;
    	mso-hansi-theme-font:minor-latin;
    	mso-bidi-font-family:"Times New Roman";
    	mso-bidi-theme-font:minor-bidi;}
    </style>
    <![endif]-->
    </head>

    <body lang=EN-US style='tab-interval:.5in;word-wrap:break-word'>
    <!--StartFragment-->

    <p class=MsoNormal>Here is some text. Font family = Calibri, size 12pt<o:p></o:p></p>

    <p class=MsoNormal><o:p>&nbsp;</o:p></p>

    <p class=MsoNormal><b>This text is bold</b>.<o:p></o:p></p>

    <p class=MsoNormal><o:p>&nbsp;</o:p></p>

    <p class=MsoNormal><i>This text is italicized</i>.<o:p></o:p></p>

    <p class=MsoNormal><o:p>&nbsp;</o:p></p>

    <p class=MsoNormal><span style='font-size:18.0pt'>This text is in a larger font
    = Calibri, size 18</span>.<o:p></o:p></p>

    <p class=MsoNormal><o:p>&nbsp;</o:p></p>

    <p class=MsoNormal align=center style='text-align:center'><span
    style='font-size:11.0pt;font-family:"Garamond",serif'>And here is some more random
    text that is centered = Garamond, size 11. Lorem ipsum dolor sit amet,
    consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et
    dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco
    laboris nisi ut aliquip ex ea commodo consequat.<o:p></o:p></span></p>

    <!--EndFragment-->
    </body>

    </html>

Cooked Discourse Post:

If I paste this HTML into a Discourse post, it looks like this (after it goes through the “cook” process):

Here is some text. Font family = Calibri, size 12pt

 

This text is bold.

 

This text is italicized.

 

This text is in a larger font = Calibri, size 18.

 

And here is some more random text that is centered = Garamond, size 11. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.


What I’m Trying to Get Back

It’s fine for Discourse to cook the post and display it like it does. But what I want now is to be able to later take a post and get back the original, underlying HTML. So the data I get back should look like this:

<body lang=EN-US style='tab-interval:.5in;word-wrap:break-word'>
<!--StartFragment-->

<p class=MsoNormal>Here is some text. Font family = Calibri, size 12pt<o:p></o:p></p>

<p class=MsoNormal><o:p>&nbsp;</o:p></p>

<p class=MsoNormal><b>This text is bold</b>.<o:p></o:p></p>

(etc.)

Right now, if I use the “raw” endpoint you provided, it does not return this HTML. Rather, it seems to return just the text, with most formatting and spacing removed.

Is it possible to get the original, underlying html back?

So I’m still not following, or am missing a step. I just created a post over at try.discourse.org: https://try.discourse.org/t/testing-html-cooking/1405. The raw content is https://try.discourse.org/raw/1405. The raw content sure looks like what you said it should look like.

Can you provide step-by-step instructions for reproducing this? I took the “Original HTML” you shared above, added an <html> and <head> tag to the top so it was valid HTML, and created a post with the HTML in the body.


You’re right. The raw link does look like what I mean. Not sure why this didn’t work before. I’ll test a few things out and respond.


I think that the issue is that the HTML you want gets converted when you paste into the browser, so that stuff never gets to raw.

Stuff from your fragment that you can see in raw

See https://meta.discourse.org/raw/189183/7

Here is some text. Font family = Calibri, size 12pt

 

This text is bold.

But if you copy the stuff from Word and then paste it, all of the HTML gets converted to markdown on the front end.


Thanks, all. Yes, @pfaffman that is at least in part what some of my confusion is here.

Let’s say the goal is to copy from Word, paste into a Discourse topic, and then later look at the “raw” data of the topic and get back the original formatting from Word.

Is that possible?


EDIT: Here’s a (hopefully) clearer example. Here’s some HTML:

<html>
<body>

<h1>Heading</h1>

<p style="font-size: 35pt">This text should be very big</p>

</body>
</html>

If I render it (for example, in the W3Schools try it editor here), it produces two lines of text, with the words “This text should be very big” showing up larger than the first line, consistent with the styling “font-size: 35pt”.

If I copy that output (not the HTML code, but the rendered output) and then paste it in most places, the formatting is kept. For example, if I paste it into Gmail or into Microsoft Word, the formatting is kept: in both cases the second line is, correctly, bigger than the first.

In Discourse, however, the formatting gets lost even in the raw editor window: all formatting is removed, except that a Markdown “#” is added for the h1.

So, if I copy actual HTML code and paste it into the editor, then the HTML code is indeed preserved and I can get it back in its raw form. But if I paste in the rendered output of that HTML code, the formatting is lost, even in the raw form.

When I paste the output into Gmail or Word and they keep the formatting, they must be keeping the HTML. Discourse, however, seems to strip away the HTML when I paste in the rendered output.

Is it possible to not strip it away?

No. You can look in the composer and see the raw text that gets created when you paste from Word. That’s all there is because the paste-buffer-to-markdown magic happens in the browser between when you paste and when you see it in the composer window.


Here’s the crucial misunderstanding: the raw is Markdown, not HTML.

We have magic :sparkles: that converts HTML into Markdown when you paste into the composer.
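To illustrate why the styling cannot be recovered from the raw afterwards, here is a toy HTML-to-Markdown converter in Python. It is not Discourse's actual paste code, which runs as JavaScript in the browser; the class name and the tiny set of supported tags are assumptions made for this sketch. The point is only that a conversion of this kind never reads attributes such as style or class, so they are gone before anything is saved.

```python
from html.parser import HTMLParser


class TinyHtmlToMarkdown(HTMLParser):
    """Toy converter: keeps h1, bold and italics; ignores every attribute."""

    INLINE = {"b": "**", "strong": "**", "i": "*", "em": "*"}

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished Markdown paragraphs
        self.current = ""  # paragraph being built

    def _flush(self):
        # Collapse whitespace and keep the block only if something is left.
        text = " ".join(self.current.split())
        if text:
            self.blocks.append(text)
        self.current = ""

    def handle_starttag(self, tag, attrs):
        # attrs (style, class, align, ...) are never read, so styling is dropped.
        if tag in ("p", "h1"):
            self._flush()
            if tag == "h1":
                self.current = "# "
        elif tag in self.INLINE:
            self.current += self.INLINE[tag]

    def handle_endtag(self, tag):
        if tag in ("p", "h1"):
            self._flush()
        elif tag in self.INLINE:
            self.current += self.INLINE[tag]

    def handle_data(self, data):
        self.current += data


def to_markdown(html):
    parser = TinyHtmlToMarkdown()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks)


print(to_markdown('<h1>Heading</h1><p style="font-size: 35pt">This text should be very big</p>'))
```

Fed the h1 example from earlier in the thread, the sketch keeps the heading as a Markdown “#” line and the paragraph text, while the font-size information has nowhere to go.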


And what makes it more confusing is that the markdown can contain HTML. :man_shrugging:


But this conversion does not happen (in the composer) if you paste in actual HTML code. That’s what I think I missed earlier.

Or perhaps, like @pfaffman said, the conversion does indeed happen no matter what, but the Markdown can keep HTML if it is pasted in as plain code, and not if the rendered output of that code is pasted in.

Right, it has to have a content type of text/html in your clipboard to work.


Can other formats, such as RTF (rich text format), also be preserved in the markdown like HTML?

HTML isn’t preserved per se in the markdown, it’s just that some of the important tags are supported (e.g. <em> emphasis) as we try and follow Postel’s Law where it doesn’t cause problems.

What problem are you trying to solve?

The goal is to understand how the process works (relevant for a few projects). This conversation has already helped with that. I am still wondering whether you can get access to the originally formatted text when you copy from Microsoft Word.

No. You would need to look into the JavaScript code that processes it on paste.


This is due to the “HTML to Markdown” handling, which occurs on paste.

Here’s the relevant source file. You will notice several bits of Microsoft Word-specific handling (e.g. MsoListParagraphCxSpFirst).

To ensure you get raw text, pass your paste through Notepad or similar first.


This is very helpful. Just to clarify: why would passing through Notepad make a difference?

Doing that would strip out all of the Word formatting and leave you with just plain text.

This is happening because when you “copy” something, the application actually presents multiple different views of the data to apps you paste it into. Discourse has opted for the “fancy” version with all the Word formatting, while Notepad will opt for the “plain” version. By passing it through Notepad, you leave only the “plain” version for Discourse to see.
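The “multiple views” idea can be modelled in a few lines of Python. This is a simplified sketch, not how any real clipboard or editor is implemented: the clipboard is represented as a dict keyed by content type, and `pick_flavor`, `through_plain_text_editor` and the sample contents are hypothetical names and data chosen for illustration.

```python
def pick_flavor(clipboard, preferred):
    """Return (content_type, content) for the first preferred type the clipboard offers."""
    for content_type in preferred:
        if content_type in clipboard:
            return content_type, clipboard[content_type]
    raise KeyError("clipboard offers none of the preferred types")


def through_plain_text_editor(clipboard):
    """Model pasting into a plain-text editor and copying back out: only text/plain survives."""
    return {"text/plain": clipboard["text/plain"]}


# Hypothetical clipboard contents after copying formatted text from a word processor.
copied_from_word = {
    "text/html": "<p class=MsoNormal><b>This text is bold</b>.</p>",
    "text/plain": "This text is bold.",
}

RICH_FIRST = ["text/html", "text/plain"]  # an editor that prefers the "fancy" view
PLAIN_ONLY = ["text/plain"]               # a Notepad-style editor

# Pasted directly, the rich-first editor sees the HTML view.
print(pick_flavor(copied_from_word, RICH_FIRST)[0])
# After a detour through a plain-text editor, only the plain view is left to pick.
print(pick_flavor(through_plain_text_editor(copied_from_word), RICH_FIRST)[0])
```

The same preference list gives different results before and after the detour, because the plain-text editor only ever puts the plain view back on the clipboard.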