The great UserSearch refactor, inline SQL vs scopes


(Sam Saffron) #1

I have been having a long discussion with both @blowmage and @agraves

Much of this discussion is hidden away in gists, and if feels kind of wrong, I would prefer it to be out here in the open so future Discourse devs can learn from it. I am going to start by adding some links to the current state:

Current refactored UserSearch class (based on work by myself and @blowmage ):

Proposed changes by @agraves (commentary in gist link)

Following proposed changes by @blowmage

###My commentary on the last proposal

On a personal level I am very comfortable with SQL, it reads better to me and easier for me to follow. I guess it is a side effect of dealing with so many frameworks and languages over the years. When hopping frameworks having SQL in common makes it way easier for me to get my footing. Particular annoyances in this sample:

  • AREL has no clean way of passing params in a joins clause and the syntax is somewhat weird scope.joins("left join(...)") often you find yourself composing SQL to match the DSL at hand as opposed to implementing your mental model.

  • A whole bunch of calls to sanitize_sql_array really bother me, this is something I feel should be handled transparently, its a huge thing to type for such a common case. Further more, sanitize_sql_array returns a string, so when you get down to executing the kaboodle you have no concept of what is a parameter and what is not. This basically makes prepared SQL impossible, and outlaws a whole class of optimisations that are possible with sql_builder.

  • Parameter locality is critical for me, for example take this invented syntax: scope.left_join('table on a = :a'), I really want to attach :a at that point and not later down the line, if I am forced to attach it at the end code gets very complex.

  • I am bothered by the chaining pattern used … I understand the reasoning (composability) … it just bothers me to see scope = scope.bla over and over again, this can be easily resolve by jquery style coding.

  • It bothers me that a method is now broken down into so many bits and pieces. When I am diagnosing this stuff I find it very very hard to build a mental model of how the SQL will look cause now it is sprawled.


On the other hand, one could argue that “this is the way people are used to getting stuff done in Rails”, don’t swim against the stream why re-invent the wheel. Additionally, by using AREL scopes you now have a result you can compose later on in almost every way you can imagine. For example Post.limit(1).limit(20).to_a will return 20 results.

To me, if the scope solution can be done cleanly, in particular not outlaw prepared SQL I am ok with it. Its just as is the final proposal feels a tad over engineered and for me is a bit hard to follow.

I understand that people feel very strongly about certain things, but do not think it is “The Rails Way” or “The Highway”. I think there is some middle ground. When I started Ruby one of the wonderful things was that I was able to define DSLs where existing ones did not fit. These kind of DSLs / shortcuts etc. are strewn through discourse, for example:


SiteSetting.bla = "some setting" # safe in an environment where multiple dbs live in the same process

MessageBus.publish('/channel', :hello, user_id: {100}); # send a message to user_id 100 on the channel, funnels its way to the browser

class MyOnebox < HandlebarsOnebox
   matcher /some_regex/
   ...
end

# and Post.sql_builder(sql)

The last thing I want to do is cause a big stir in the community or discourage contribution and collaboration, I don’t think stuff all boils down to “your way is right, my way is wrong” or the opposite. I do feel uneasy though being forced to use a DSL that does not fit properly to a problem at hand, and AR is less of a good fit for hairy SQL. We are trying to localize our hairy SQL to as few spots as possible the majority of Discourse uses good old AR. However there are 2 cases where we fine dropping to SQL is important

  • Performance critical stuff, our topic page and topic list page are holy, we can not afford any performance regressions there, if we find AR bottlenecks with mini profiler we may be forced to side step AR
  • Hairy SQL, occasionally we want to make use of DB specific features like fulltext or an update statement with a returns clause, or build complex aggregate reports.

(Dan Neumann) #2

Thanks for rounding up the discourse.

After using Sequel instead of Active Record on a personal project that saved me from writing a good amount of SQL strings, I wonder if Squeel (AR-compatible layer over Arel) could help clean up and decompose some SQL.

Also, the trade-off between clarity and indirection is always challenging. The second I need custom SQL with AR, I feel alone in the world. Alone with my menagerie SQL strings.


(Sam Saffron) #3

@blowmage fyi do support refactor that adds a unique index on lower(username) provided perf is identical and this does not complicate international ie. turkish installs, etc.


(Robin Ward) #4

I would like to chime in and say I vastly prefer @blowmage’s approach. There are several key advantages to using AR whenever possible that we lose with direct SQL:

  • It is easier to test
  • It allows us to use plugins such as acts_as_paranoid for safety, so we don’t need to remember to include AND deleted_at IS NULL type of clauses.
  • ActiveRecord has a local query cache to prevent duplicate queries in the same chunk of code. Your manual SQL approach does not.

Lately I have seen this style of code pop up in more than one place in our code base, and it is quite worrisome:

(from flags_controller.rb)

sql =
"select id, username, name, email from users
where id in (?)"
    users = User.exec_sql(sql, users.to_a).to_a

Why would this query ever need to be in SQL rather than active record? The AR is shorter and has all the advantages I outlined above: User.select(:id, :username, :email).where(id: users.to_a).

Can we, at the very least, agree that :

  1. We should NOT be using raw SQL when the active record query is shorter and easier to understand?

  2. If we have to use raw SQL, it should be placed in a model or service class and not in a controller?


(Sam Saffron) #5

I am fine with

  1. using AR where its easier to understand, if I need to drop to SQL for perf, leave a big fat comment
  2. placing raw SQL in a model or service class and not in the controller

(Sam Saffron) #6

Also, I am not saying I am totally against refactoring this bit into AR, just disturbed by the fact it is over-decomposed, sanitize_sql_array calls are messy and excludes a class of db plan caching, also got lots of repetition. I did not see you respond to any of that.


(Robin Ward) #7

You’re right I should have responded to those points too, but I’ll have to get back to it tonight :slight_smile:


(Sam Saffron) #8

Since the initial post is unwieldy, here is the code at hand:

Option 1:

class UserSearch

  def self.search term, topic_id = nil
    sql = User.sql_builder(
"select id, username, name, email from users u 
/*left_join*/
/*where*/
/*order_by*/")

    if topic_id
      sql.left_join "(select distinct p.user_id from posts p where topic_id = :topic_id) s on s.user_id = u.id", topic_id: topic_id
    end
    
    if term.present? 
      sql.where("username_lower like :term_like or
              to_tsvector('simple', name) @@
              to_tsquery('simple',
                regexp_replace(
                  regexp_replace(
                    cast(plainto_tsquery(:term) as text)
                    ,'\''(?: |$)', ':*''', 'g'),
                '''', '', 'g')
              )", term: term, term_like: "#{term.downcase}%")

      sql.order_by "case when username_lower = :term then 0 else 1 end asc"
    end
      
    if topic_id
      sql.order_by "case when s.user_id is null then 0 else 1 end desc"
    end

    sql.order_by "case when last_seen_at is null then 0 else 1 end desc, last_seen_at desc, username asc limit(20)"

    sql.exec
  end

end

Option 2:

class UserSearch
 
  # This is a helper method to make your code prettier.
  # Uses the default search method.
  def self.search term, topic_id = nil
    UserSearch.new(term, topic_id).search_ar
  end
 
  def initialize term, topic_id
    @term, @topic_id = term, topic_id
  end
 
  def search_ar
    scope = User.select [:id, :name, :username, :email]
 
    if @term.present?
      scope = scope.select select_username_exact_match_sql
      scope = scope.where where_username_search_sql, term: @term, term_like: "#{@term.downcase}%"
      scope = scope.order "username_exact_match DESC"
    end
 
    if @topic_id.present?
      scope = scope.select select_user_in_topic_sql
      scope = scope.order "user_in_topic DESC"
    end
 
    scope = scope.order "CASE WHEN last_seen_at IS NULL THEN 0 ELSE 1 END DESC"
    scope = scope.order "last_seen_at DESC"
    scope = scope.order "username ASC"
    scope = scope.limit(20)
    scope
  end
 
  private
 
  def select_username_exact_match_sql
    sql = "CASE WHEN username_lower = :term_lower THEN 1 ELSE 0 END AS username_exact_match"
    # FWIW, rather have an index on LOWER(username) than the username_lower column
    # sql = "CASE WHEN LOWER(username) = LOWER(:term) THEN 1 ELSE 0 END AS username_exact_match"
    sanitize_sql_array sql, term_lower: @term.downcase
  end
 
  def select_user_in_topic_sql
    sql = "CASE WHEN (SELECT COUNT(posts.user_id) FROM posts WHERE posts.user_id = users.id AND posts.topic_id = :topic_id) > 0 THEN 1 ELSE 0 END AS user_in_topic"
    sanitize_sql_array sql, topic_id: @topic_id
  end
 
  def where_username_search_sql
    # Do you need both the LIKE *or* the TSEARCH? Can't you just have the TSEARCH?
    # Or, are you trying to avoid the TSEARCH by matching the LIKE first?
    # I would think a properly indexed TSEARCH would be faster than a LIKE.
    # The intent is hard to derive here...
    # I'd still prefer an index on LOWER(username) to a username_lower column
    <<-SQL
      username_lower LIKE :term_like OR
      TO_TSVECTOR('simple', name) @@
      TO_TSQUERY('simple',
        REGEXP_REPLACE(
          REGEXP_REPLACE(
            CAST(PLAINTO_TSQUERY(:term) AS text)
            ,'\''(?: |$)', ':*''', 'g'),
        '''', '', 'g')
        )
    SQL
  end
 
  def sanitize_sql_array *args
    ActiveRecord::Base.send(:sanitize_sql_array, args)
  end
 
end

(Jeff Atwood) #9

I think this should be bedrock guidance. Agreed.

I am, however, unapologetically a fan of SQL where where performance actually matters on the hot code paths in the site.

I do think Ruby is a little behind other languages in terms of realizing that ORMs are a giant Vietnam, and moving back to simpler, easier, faster SQL-with-assistance constructs. I don’t want to see us insisting on Active Record because it is dogma, the apple of 2005’s eye. It is now 2013.


(Robin Ward) #10

(I still plan in responding to Sam’s points later tonight)

Nobody is disagreeing about using SQL where performance matters. All of these discussions have been about using SQL where performance is not an issue at all. In fact, @blowmage’s implementation with AR runs faster in early benchmarks than Sam’s version. The code I linked above had no performance benefit with raw SQL. Outside of the prepared statements, all of Sam’s other points refer to aesthetics.

Citation needed?


(Jeff Atwood) #11

http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern

The chief offender that I’m talking about is ActiveRecord, made famous by Ruby on Rails and ported to half a dozen languages since then. However, the same criticisms largely apply to other ORM layers like Hibernate in Java and Doctrine in PHP.

And Ted, who is a Python programmer, he of the “Node is cancer”


(Mike Moore) #12

The funny thing is that PostgreSQL is way better at dealing with text than Ruby is. For instance, Ruby won’t downcase anything but ASCII characters, but PostgreSQL will.

Ruby:

# encoding: UTF-8
"Özgür Çevik".downcase #=> Özgür Çevik

PostgreSQL:

select LOWER('Özgür Çevik'); -- özgür çevik

So if dealing nicely with Turkish names is a priority then lowering the strings in PostgreSQL is a better solution. Assuming, of course, that extended characters are even allowed in usernames, which they aren’t. The UsernameValidator restricts usernames to /[^A-Za-z0-9_]/, most likely because usernames are in the URL. (Like so: 301 Moved Permanently) So how extended characters in username are lowered doesn’t seem to be a problem.

But the point isn’t what supports text better, it’s whether concerns like search affect the design of the domain models. The username_lower attribute/column is an artifact of a solution that can be better solved at a different layer in the application IMO. The solution is misplaced; username_lower shouldn’t be an attribute in the domain model or a column in the database table. Instead, the solution should isolate the concern from infecting the rest of the domain, and placing it in how the data is queried seems like a better choice to me.


(Sam Saffron) #13

Sure, I really like this, I totally support a refactor that dumps the username_lower column from the db and replaces it with a unique index on lower(username), it is cleaner than having this float around the db.


(Mike Moore) #14

Agreed. Parameterized SQL has been an historically difficult sell because MySQL has no support for it. Or worse, degraded when using it. But Aaron has been making awesome progress in bringing AR forward. Just because those two instances of sanitize_sql_array are needed to maintain compatibility with the original SQL now, it doesn’t mean they will be necessary tomorrow.

Rails in general, and ActiveRecord in particular, have a very specific set of contraints. You have 3 real options for dealing with those constraints:

  1. Change your app to work within the constraints. 99% of the time this can be done, and a majority of the time the result I’ve found my apps are better for reduced complexity.

  2. Use a different query mechanism. Other ORMs like Sequel may better suited to your comfort zone. Or Stay in AR and provide the SQL to call. Or call a function.

  3. Fix or enhance ActiveRecord. I bet it would be possible to make the select and order methods accept parameters, removing the need for sanitize_sql_array. The following should be supported, IMO:

    User.order [“CASE WHEN username = :username THEN 1 ELSE 0 END DESC”, username: “blowmage”]


(Mike Moore) #15

You don’t have to attach it later. You can put those conditions right next to each other in ActiveRecord.

User.joins(:topics).where( topics: { title: "Welcome to Try Discourse!" } )

(Sam Saffron) #16

Totally agree there, maybe we should pull aaron in to this conversation. My preference here is 3) making AR better is wins all around


(Mike Moore) #17

You mentioned the scope = scope. repetition here and above as well. I think you are nitpicking. I intentionally chose to make that code verbose so the code would be easier to vertically scan and so it would be extra clear what was going on. This was not a pull request, this was in answer to your tweet asking for an implementation in AR.

Yes, the majority of those reassignments could “easily” be replaced by chaining the calls together. That was not the point. The point was to show that you can build that query as you specified it, using ActiveRecord/Arel, and maintain the existing sorting behavior.


(Sam Saffron) #18

Cool, keep in mind, we have many priorities in play here:

  • We would like to conform to “best practices” to maximise the pool of contributors

  • We would like to keep open the door for dropping to SQL where performance is critical (UserSearch is
    not an example of that as perf is similar)

  • We want a very readable code base

  • We would like to extend AR so we can rid ourselves of hackiness (like the mess sanitize_sql_array calls)

The last thing I want to do is lose contributes such as yourself due to ego, I don’t care much for my ego or for being right or wrong. I want to do what is best for Discourse.


Totally fine for you to submit a patch to remove sql_builder there, but would much prefer if that sanitize_sql_array stuff was not there. Also, it does feel a tad “over” decomposed making it hard for me to follow. But you know, many concerns are at play here.


(Robin Ward) #19

As promised, a more in-depth rebuttal:

Inner Join

User.joins(:topics).where('topics.archived').limit(3)

Outer Join:

User.includes(:topics).where('topics.archived').limit(3)

In which case would the above not work for you?

(emphasis mine)

We have hundreds of ActiveRecord queries in our code base. The only place I imagine we are building custom columns in order to sort by them are the search components. Also, if the two calls (not a whole bunch!) are truly a “huge thing to type”, we could easily add a new method to our freedom_patches near exec_sql that does the same thing. Even better, we could submit the patches to AR itself so other people could use them!

@blowmage’s approach will not use prepared statements. However, neither does yours. If you’re talking about investing the time in adding them to sql_builder, why not help add them to ActiveRecord proper? I don’t understand why it would be better to create a new ORM when these features could b

This is a personal aesthetic. Myself, I prefer having all my conditions in the WHERE clause rather than the JOIN clause. I find it’s all too easy to skip those JOIN lines when reading the query. When I compose SQL, I prefer to keep the JOIN statements to refer to table names only. ActiveRecord does this by default, you normally don’t have to include them in your JOIN statement (See the examples above).

Not sure why you brought this up because AR supports chaining. Those could be trivially re-written, for example:

scope = scope.select(select_username_exact_match_sql)
                   .where(where_username_search_sql, term: @term, term_like: "#{@term.downcase}%")
                   .order("username_exact_match DESC")

The Rails way is currently to enable TurboLinks by default. It is to use Test::Unit. DHH hates rspec and client side JS.

We do a lot of things that aren’t the Rails way and our project is better for it. I think there is a lot of evidence that we aren’t just towing a line here, that we actually think there are major benefits to minimizing SQL in our code base, and there are significant drawbacks to your SqlBuilder such as those I wrote in my earlier post.

If we can prove empirically that there is a reason to use SQL for a particular piece of code, I am all for it. And note I am not talking about "Hey, our page loads 0.1ms faster, so it’s worth throwing lots of SQL strings into our model. I mean a serious performance benefit that is measurable and repeatable.

No argument here on this point. Although I do have a suspicion that enabling views in our database might solve the fulltext and aggregate reports problem in a way that doesn’t involve as much SQL in our code base. Not to mention there are performance benefits to views. So we should consider doing that!


(Mike Moore) #20

You are confusing Ruby for Rails, and Rails for ActiveRecord. Any pushback to the initial code release I saw was not because it specified SQL, it was because of untested SQL. It wasn’t implementing yet another ORM, it was releasing an incomplete and untested ORM. Moving the queries to AR made it easier to verify and test. It also provided an opportunity for separating concerns and improving the design.

ORMs are a terrible solution, except for all the other solutions we’ve tried. But I’m just a Ruby person, stuck in 2005, blinded by my zeal…