In this ORM I killed myself

An ORM can be one of the best tools at the start of a project, and it can become a nightmare during maintenance. Let's analyze the main problems we find in projects built with this technology and how to solve them.

An ORM (Object Relational Mapping) is an abstraction that maps a relational data model (the database) onto the objects of an application. There are multiple ORM implementations, each with its strengths and drawbacks, but before getting to them, let's review the love-hate relationship between applications and relational databases over time.

Historically, an application communicated with a database through a proprietary interface: the database shipped with its own access library, and one used it to insert, delete, or select data using SQL, which had to be written carefully to keep performance up and memory problems away. These tasks were handled by the DBA (Database Administrator), who was expected to have exhaustive knowledge of the SQL language.


Over time, different relational database vendors appeared, and with them the need to abstract away from any particular one. The idea was that if the database vendor changed, the application's data-access code should not have to change. This need was met by ODBC (Open Database Connectivity) and, later, by JDBC (Java Database Connectivity). Developers still needed DBAs to write SQL that took advantage of database features without compromising performance.


Later, we developers began choosing the database ourselves (we do think we're clever), and the first ORMs appeared, mapping rows of database tables directly onto objects inside the application.

These first ORMs (for example Apache Torque, which against all odds still exists) were based on generating code from the database schema, regenerating it after every schema change. Likewise, ORMs began to generate database queries dynamically, translating source-language code into SQL.

DBAs began to fall out of use, as the ORMs seemed to be taking over their job of optimizing operations on the database. Their role was reduced to defining the data model, indexes, views and the rest of the DDL (Data Definition Language), while the ORMs did "part of the work" by executing the DML (Data Manipulation Language).

And today we have ORMs with which we define the entire database in code (Hibernate), including entities, relationships, indexes and even views, and from this definition we can generate the complete database. With this, the DBA's work became an after-the-fact analysis of what was happening in an application. Mostly, DBAs are now the last line of defense against an application with poor data-access performance. The phrase a DBA utters most often is "WTF!", upon seeing the queries an application executes.


Later, we continued to abstract away from the database with JPA (Java Persistence API), an abstraction of which Hibernate is one implementation.


A short time later came Spring Data, an abstraction over data access itself.


You can't talk about ORMs without saying a little more about DBAs. Those profiles knew SQL, and they knew it as well as a programmer knows Java or PHP.

Today we find it hard to imagine an excellent Java programmer who also builds excellent backends in Ruby on Rails or excellent frontends in Angular, because we each specialize in our own vertical of knowledge. And we have "assimilated" the SQL vertical into the application's programming language. Or so we believe.

How does this become a problem? Easy: ORMs have become very powerful tools that hide the database "too much" (and too well), letting us write application code that turns into true catastrophes from the SQL point of view.

Next, let's look at some of the most common problems, the ones that drive the maintenance developer to despair.

Queries that fetch everything, the whole house in tow.

Suppose we have the following entity:

@Entity
@Table(name = "customers")
public class Customer {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private long id;

    private String name;

    private String email;

    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getEmail() {
        return email;
    }

    public void setEmail(String email) {
        this.email = email;
    }
}

And the following repository (Spring-Data):

@Repository
public interface CustomerRepository extends CrudRepository<Customer, Long> {
}

The following code, which gets the customer's name, might look correct:

Customer foundCustomer = customerRepository.findById(userId).get();
System.out.println("Customer name is:" + foundCustomer.getName());

But what is really happening is that, as programmers, we are leaning on CrudRepository's existing (magical) methods to fetch the object by id and, from there, read the name. We are lazy programmers (and that's good), but we don't know what's going on behind the magic (and that's bad). What's going on is this:

select customer0_.id as id1_0_0_, customer0_.email as email2_0_0_, customer0_.name as name3_0_0_ from customers customer0_ where customer0_.id=?

That is, although our code only wants the user's name (and nothing else), we are pulling the entire row from the database. And this is a problem: if tomorrow someone adds fields to the customer entity, we would be fetching them inadvertently.

What is the solution? Simple: treat the database as a place where you query exactly the information you want, in the most direct way possible, even if that requires more code.

    @Query("SELECT c.name from Customer c where c.id=:id")
    String getCustomerName(@Param("id") Long id);

Partial updates that become full updates, just in case.

We execute the following code:

// Create customer
Customer customer=new Customer();
customer.setEmail("johndoe@sample.com");
customer.setName("John Doe");
customerRepository.save(customer);

// Search same user by id
Customer foundCustomer=customerRepository.findById(customer.getId()).get();
foundCustomer.setName("Other name");

// Save same user again
customerRepository.save(foundCustomer);

Seems right? From the programmer's point of view nothing serious appears to be happening, but let's look at the code running underneath, at the JDBC layer.

// Create customer
Hibernate: insert into customers (email, name, id) values (?, ?, ?)

// Search same user by id
Hibernate: select customer0_.id as id1_0_0_, customer0_.email as email2_0_0_, customer0_.name as name3_0_0_ from customers customer0_ where customer0_.id=?

// Save same user again
Hibernate: select customer0_.id as id1_0_0_, customer0_.email as email2_0_0_, customer0_.name as name3_0_0_ from customers customer0_ where customer0_.id=?
Hibernate: update customers set email=?, name=? where id=?

Oops, what happened? Two things happened in that last save. First, to resolve the call to save, the library defends itself against us: assuming we don't know what we are doing, it asks itself whether we want an update or an insert of a new object. So, just to be safe, it runs a select by id and, since the object exists, decides (correctly) that we want an update. But an update of what? Here comes the second dilemma: it updates everything.
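What "update everything" costs can be illustrated without any framework. The sketch below (plain Java, purely illustrative; this is not Hibernate's actual dirty-checking code) contrasts the SET clause emitted by default, which lists every column, with the minimal one that a column-by-column diff would allow:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class UpdateSketch {

    // Build the SET clause either from every column (the default we saw in
    // the log) or only from the columns whose values actually changed.
    static String updateSql(Map<String, String> loaded,
                            Map<String, String> modified,
                            boolean onlyChanged) {
        StringJoiner set = new StringJoiner(", ");
        for (Map.Entry<String, String> col : modified.entrySet()) {
            if (!onlyChanged || !col.getValue().equals(loaded.get(col.getKey()))) {
                set.add(col.getKey() + "=?");
            }
        }
        return "update customers set " + set + " where id=?";
    }

    public static void main(String[] args) {
        Map<String, String> loaded = new LinkedHashMap<>();
        loaded.put("email", "johndoe@sample.com");
        loaded.put("name", "John Doe");

        Map<String, String> modified = new LinkedHashMap<>(loaded);
        modified.put("name", "Other name");

        // Default behaviour: every column travels, changed or not.
        System.out.println(updateSql(loaded, modified, false));
        // What a per-column diff would send instead.
        System.out.println(updateSql(loaded, modified, true));
    }
}
```

For what it's worth, Hibernate does offer an opt-in for the second behaviour at the entity level: the @DynamicUpdate annotation makes it generate UPDATE statements with only the modified columns, at the cost of building the statement on every flush.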

Doctor, is it serious? Maybe not today. In this case we can ignore it, because our entity weighs almost nothing, and we will notice little or no impact. Unfortunately, this is one of those source-code diseases that can get out of hand… Let's see what happens if another programmer, "B", decides to add a field to the entity.

 @Lob
 @Column(length = 10240)
 private byte[] userIcon;

What happens now to programmer "A"'s code? The last save now produces the following SQL:

Hibernate: select customer0_.id as id1_0_0_, customer0_.email as email2_0_0_, customer0_.name as name3_0_0_, customer0_.user_icon as user_ico4_0_0_ from customers customer0_ where customer0_.id=?
Hibernate: update customers set email=?, name=?, user_icon=? where id=?

That is, we fetch the byte array from the database only to send it straight back in the update. We have just inflated (and not a little) the network traffic needed to simply update the user's name. And that is a problem for programmer "A", whose function, for no obvious reason, now takes 3 times longer to complete.

Why does this happen? Because "with great power comes great responsibility". If we use an ORM, in this case Spring Data/JPA/Hibernate, we must keep in mind that whatever we do not program, it has to invent "as best it can", and usually without enough information, because that information lives in the programmer's head. If we bet on reducing the amount of code (and that's good), we can fall into the trap of ending up with totally inefficient software (and that's bad).

What is the solution? Easy. However much an abstraction layer hides the database, we have to keep thinking like a DBA: look at what is going to happen in the database and act accordingly. If what programmer "A" wanted was to update the name given the id, they should have issued an "UPDATE customers SET name=? WHERE id=?", which in JPA is a fairly simple method:

@Modifying
@Transactional
@Query("UPDATE Customer c set c.name=:name where c.id=:id")
void updateCustomerSetName(@Param("id") Long id,@Param("name") String name);

Inserting multiple rows, little by little

With an ORM, in this case JPA, we can define relationships between two tables. For example, our CUSTOMER table could have an associated ORDER table storing that user's orders. At first glance the relationship between both objects is clear: for each Customer row there can be 0 or more Order rows.

Suppose that, in addition to the Customer table, our model has an Order table, whose class within the application is as follows:

@Entity
@Table(name = "orders")
public class Order {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private long id;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name="customerId")
    private Customer customer;

    private Double total=0.0;

    // Setters and Getters
}

and our Customer class would be like this:

package com.viafirma.blog.orm.entity;

import javax.persistence.*;
import java.util.List;

@Entity
@Table(name = "customers")
public class Customer {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private long id;

    private String name;

    private String email;

    @OneToMany(mappedBy = "customer",cascade = CascadeType.ALL)
    private List<Order> orders;
    
    // Setters and Getters
}

Let’s try to create a Customer object and associate Order objects:

Customer customer = new Customer();
customer.setName("Name");
customer.setEmail("Email");

List<Order> orders = new ArrayList<>();
for (int i = 0; i != 10; i++) {
    Order order = new Order();
    order.setCustomer(customer);
    orders.add(order);
}
customer.setOrders(orders);
customerRepository.save(customer);

And let’s see the SQL result:

Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: call next value for hibernate_sequence
Hibernate: insert into customers (email, name, id) values (?, ?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)
Hibernate: insert into orders (customer_id, id) values (?, ?)

We see that JPA obtained the ids of the objects to insert, inserted the Customer object, and then inserted the Order objects one statement at a time. But… can it be done better? Yes, through an old friend of the DBA: the batch insert, which lets a single statement insert multiple rows. That functionality is not enabled in JPA by default; it must be indicated explicitly to the implementation, in this case Hibernate. With Spring, for example, we would set it in application.properties:

spring.jpa.properties.hibernate.jdbc.batch_size=5
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
spring.jpa.properties.hibernate.jdbc.batch_versioned_data=true

The batch_size property indicates the size of the blocks in which inserts are grouped, in this case blocks of 5. With that, we avoid unpleasant slowness surprises in bulk inserts.
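Conceptually, what the batch_size setting does is group the pending statements before they cross the network. A minimal plain-Java sketch of that grouping (illustrative only, not Hibernate's internals):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {

    // Partition pending statements into groups of batchSize, the way a
    // batching layer does before each executeBatch() round trip.
    static <T> List<List<T>> chunk(List<T> pending, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < pending.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                    pending.subList(i, Math.min(i + batchSize, pending.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> inserts = new ArrayList<>();
        for (int i = 0; i != 10; i++) {
            inserts.add("insert into orders (customer_id, id) values (?, ?)");
        }
        // With batch_size=5, our 10 order inserts travel in 2 round trips
        // instead of 10.
        System.out.println(chunk(inserts, 5).size());
    }
}
```

Note also that some drivers need their own flag to turn a JDBC batch into a single multi-row statement; for example, MySQL's Connector/J expects rewriteBatchedStatements=true on the JDBC URL.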

SQL built in code: hives.

Assume the following code:

StringBuilder sb = new StringBuilder("<html>")
        .append("<head>")
        .append("</head>")
        .append("<body>");
for (int i = 0; i != 10; i++) {
    sb.append("<p>Hello paragraph " + i + "</p>");
}
sb.append("</body>")
        .append("</html>");

Did anyone feel a disturbance in the Force, a tingle of spider-sense, a burning sensation? If so, it's because we have two languages, Java and HTML, intimately mixed in the same code: changing any of the HTML means modifying the Java source and, worse, we are assuming the Java developer knows HTML, which is a guess at best. (The example is simple, but you get the idea.)

Now let’s see an example of JPQL inside code:

List<Customer> foundCustomer = em.createQuery(
        "SELECT c from Customer c where c.name like :name", Customer.class)
        .setParameter("name", "Name")
        .getResultList();

Or simplifying:

@Query("SELECT c from Customer c where c.name=:name")
List<Customer> findCustomersByName(@Param("name") String name);

Why does this code raise fewer alarms? Isn't it the same case? We are mixing two languages, JPQL (or SQL, if it were a native query) with Java, with exactly the same consequences: a Java programmer is not necessarily good at writing queries and, of course, embedding SQL inside Java makes it anything but comfortable for a DBA to analyze.

Can we avoid this? Yes. JPA lets us keep the queries in files external to the source: a file named jpa-named-queries.properties inside a META-INF directory in the project resources, as a key-value list of the following form:

Customer.findCustomersByName=SELECT c from Customer c Where c.name like ?1

And in code, within a repository, it suffices to declare a method whose name matches the property in the file, with the format [Entity].[methodName]:

@Repository
public interface CustomerRepository extends CrudRepository<Customer, Long> {
    List<Customer> findCustomersByName(String name);
}

In this way, we keep the SQL separate from the Java source, in a centralized place that a DBA can easily review.

The infinite selects, or how to run out of memory in a SELECT.

Suppose we have to generate a report in a file with the results of a query, an ORM would invite us to do something like this:

Iterable<Customer> customers=customerRepository.findAll();
saveToDisk(customers);

This code has an advantage: it is short and direct. And a great disadvantage: it is a very simplified view of the problem.

Suppose our customer object is relatively large, say 16 KB, and that after 3 years in production the table holds 1,500,000 rows. In a single call we would materialize roughly 24 GB in memory, which the garbage collector would have to reclaim once the file is written to disk. The ORM gives us total freedom to launch that query, but if we are careless we will cause serious performance problems.
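The arithmetic deserves to be explicit, because the numbers escalate quickly. With the figures above (16 KB per entity, 1,500,000 rows), a findAll() materializes on the order of 24 GB at once:

```java
public class MemorySketch {
    public static void main(String[] args) {
        long rows = 1_500_000L;
        long bytesPerRow = 16L * 1024;        // ~16 KB per customer entity
        long totalBytes = rows * bytesPerRow; // 24,576,000,000 bytes
        System.out.println(totalBytes / 1_000_000_000L + " GB");
    }
}
```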

Is there a way to solve it? Yes: processing the query rows one by one. How? First, forget about JPA; you won't find the solution there. We have two options:

  • We go to the world of JDBC and go through the result with a ResultSet.
  • Do it directly with Hibernate.
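Either way, the principle is the same: a forward-only cursor hands the application one row at a time, so only one row is ever resident. Stripped of any database, the consumption pattern looks like this (plain Java; the Iterator here is a stand-in for the real cursor, not JDBC or Hibernate code):

```java
import java.util.Iterator;

public class StreamingSketch {

    // A fake forward-only cursor that produces rows lazily, the way a
    // ResultSet or ScrollableResults does: rows exist only when asked for.
    static Iterator<String> cursor(int rows) {
        return new Iterator<String>() {
            int next = 0;
            public boolean hasNext() { return next < rows; }
            public String next() { return "row-" + next++; }
        };
    }

    public static void main(String[] args) {
        Iterator<String> results = cursor(1_000_000);
        long processed = 0;
        while (results.hasNext()) {
            String row = results.next(); // saveToDisk(row) would go here
            processed++;
        }
        // A million rows processed while holding a single row in memory.
        System.out.println(processed);
    }
}
```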

With Hibernate, it would be something like this:

StatelessSession session = ((Session) em.getDelegate()).getSessionFactory().openStatelessSession();

Query query = session
      .createQuery("SELECT c FROM Customer c ");
query.setFetchSize(1000);
query.setReadOnly(true);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
   Customer customer=(Customer)results.get(0);
   saveToDisk(customer);
}
results.close();
session.close();

Conclusion

In summary, using an ORM (Spring Data/JPA/Hibernate) abstracts us from the database and tries to simplify our work as much as possible, but software that talks to a database requires many optimizations that do not fit well with that simplification. Without them, developments that were written quickly thanks to the abstraction end up causing serious memory and performance problems during maintenance.

The best tool during development with JPA/Hibernate is to enable SQL logging, inspect the queries directly, and understand why our ORM executes them. And, in case of doubt, dig up a DBA; they will surely know what is going on.