Over the last couple of months in the iECM standards committee, we have been having a discussion about what the best language would be for content management. SQL presents a standard and most content management systems are built upon SQL databases. In JSR-170, the expert group decided to go with XML and Xpath since that followed the hierarchical model espoused by JCR. In March, we had decided on SQL for iECM, but some in the group had misgivings.
In some ways, the discussion that we have been having around query and domain models has been going on for a very long time. It is natural that it is cropping up again because of the dramatic rise in use of XML, which is much more abstract than tables and columns. It also says a lot about the relational model as we try to squeeze XML into the relational model. Just like the object database wars 15 years ago, we come up against how XML requires hierarchical data models that don’t fit into relational structures. (See Mike Stonebraker's "What Comes Around Goes Around")
In trying to distance the discussion from the specific language in my own mind, it seemed to me that we needed a model to determine what we are going to query. Even in discussing requirements, you naturally need a set of concepts for describing what you are going to query. At its core, the relational model has shown flexibility in describing lots of different types of constructs, except for a few which I’ll get further on. At the core are a ENTITIES and RELATIONSHIPS. On these, you can perform the following useful operations:
- SELECTION / RESTRICTION – This is the predicate that limits the results of a query to only items of interest
- PROJECTION – This returns only the properties that we are interesting in
- JOIN – This connects two ENTITIES together through correlating proprerties (entity1.property1 = entity2.property2)
- SORT – Order the results in a useful sequence for organizing or understanding the results
- UNION – Attach one or more ENTITIES together as though one follows another as a single entity
David Choy of IBM pointed out that the relational model falls short on two main areas:
- Hierarchies – Traversing the structure (transitive-closure) of a tree and evaluating predicates on portions of that tree. Hierarchies are critical for folder-file containment, hierarchical classifications in taxonomies and in compound, complex documents.
- Multi-valued properties – These are not allowed because they break the laws of relational structures. However, that doesn’t mean that they aren’t useful. It is much easier to set up a multi-valued property than to set up an association and new entity to model this, especially since the multiple values are in conjunction with a single entity.
If iECM were to modify the relational model to include these two concepts then we have a model that can both describe and query a content repository. It is extremely important to note as well that the content models are almost exclusively (Alfresco and Jackrabbit excluded) on relational models. It is worth noting how the relational operators are already used in conjunction to content:
- SELECTION – In set-oriented operations on content, predicates filter millions of records down to a few relevant content items
- PROJECTION – Present to the user only the information necessary. This is useful for both efficiency and quality of information eliminating extraneous information. The JCR model does not really support projection except to return everything. In reality, this is the same as just returning an ID and fetching whole objects. (Let’s just hope there is a lazy load.)
- JOIN – The core of the join is almost always the content item, but it helps to add User information or find content related to a particular set of users. It is also useful to join process information that may be stored separately from the content item, such as workflow information
- SORT – This is a construct that needed to be added to XPath since it is so important for presenting to users for them to consume
- UNION – Less common, but critical for cross repository searching. Consider [uk.accounts.documents UNION us.accounts.documents UNION asia.sales.documents] and SELECT a relevant subset.
All of these operations are SET oriented operations. In general, retrieval is set oriented and update is object oriented a single object at a time. Since the content repository is organized into folders and files generally and other types of hierarchies like classification, it is worth thinking of SET operations in tree structures as well. In fact, it has been argued that tree structures are more important than any other type of organizing structure. This is understandable since the vast majority of SELECTION and PROJECTION is on a single entity, the CONTENT or DOCUMENT entity. So it is worth discussing how hierarchies are used.
But first it is worth thinking about how queries are used and what set-oriented operations are commonly performed on content repositories. I would classify SELECTION process as against three main criteria:
- CONTENT – what the content contain (i.e. full-text search or XML structural search)
- CONTEXT – how the content relate to PROCESSES, OWNERSHIP/CONTROL, LOCATION or CLASSIFICATION. LOCATION is a particularly important paradigm since it aligns with the way the human mind works. This is why folders and files is such a popular metaphor.
- CHARACTERISTICS – how to describe the document based upon its type, size, date, resolution, etc.
Hierarchies fit more into the CONTENT and CONTEXT of a content item. In content they are the structure of the document. It is no surprise that XPath evolved significantly as part of XSLT, with the author of XSLT, Michael Kay, contributing significantly to XPath’s development. However, hierarchical structures as part of document filtering are much different than filtering based upon CONTEXT. In context, the document is part of an overarching organizational structure that organizes content and makes it easier to browse and fit into a simpler mental model than a huge flat structure.
Thus, I think iECM can simplify querying hierarchies by focusing on:
- COMPOUND STRUCTURES – Where we walk a tree of heterogeneous nodes to compose a larger more complex structure – essentially the case for XSLT – and to find relevant parts. This is really useful for compound documents and dynamic web sites.
- LOCATION – To isolate a search to a particular part of a tree that has special semantic meaning. Thus search within a particular path – normally a folder, but potentially a classification path. In the vast majority of cases these structures are *homogenous*, unlike the compound structures.
The distinction is that predicates may be applied at any level because structure can be different. The scale is also much different in that LOCATION can span billions of documents, where COMPOUND STRUCTURES may span thousands at most. This has implications for the relational model for two main extensions:
- HIERARCHICAL ENUMERATION – Or simply “Walk the Tree”. Since this concept is foreign to the relational model, a recursion capability has been added to relational structures. In SQL1999, this has been presented as a recursive UNION. It is not a new problem and has often been described as Bill of Materials or BOMs. Actually, folder hierarchies are a simple form of BOM.
- HIERARCHICAL SELECTION – This provides the location selection services against a path. In general, what is being searched are the leaf nodes within a path. Therefore this is generally just simple pattern matching against a path.
For multi-valued properties, these generally have been addressed in other relational structures. ARRAY types are valid relational types. In addition, there are JDBC and ODBC constructs to handle these multi-valued structures on PROJECTION. If not, we can introduce an XML() or CSV() (comma separated value) function to return values as a simple string type. a multi-valued property in SELECTION is generally not a problem, although we introduced a special “ANY entity.property in _some_set_ to clarify semantics and is consistent with the existing ANY modifier in SQL. So in general, multi-value doesn’t require the relational model to be bent too out of shape.
So this leaves us with a model that is workable and describes most content structures without necessarily defining a language. We have SELECTION, PROJECTION, JOIN, SORT, UNION plus HIERARCHY-ENUMERATION and HIERARCHY-SELECTION as a special type of predicate. This combined with the JSR-170 data model (minus residual values) gives us a relational model plus multi-valued properties and the aforementioned relational operators.
What this rules out, however, is XPath. Generally relational models are queried using relational algebras. SQL and XQuery are examples of relational algebras, with XQuery actually being a purer form of a relational algebra. XPath is sort of a relational calculus, a declarative way of expressing queries. However, XPath does not have PROJECTION and does not really have a JOIN. From a PROJECTION perspective, a simple “//fred/flinstone” is really equivalent to:
- SELECT * FROM nt:node WHERE path=”//fred/flintstone” -or- SELECT nt:uuid FROM nt:node WHERE path=”//fred/flintstone”
Since there is no way to only get the properties that are important, the entire object would need to be retrieved and what is really being projected is the ID so that the object can be fetched. For some systems, this can be horrendously expensive. XPath provides a good predicate, because that is what it was designed to do. It is worth providing this as the HIERARCHY-SELECTION extension.
XQuery is the only language that provides all of these capabilities and was designed to handle all of these cases. It has PROJECTION and SORT where XPath does not. JOIN is a implemented as FLOWER (For Let Where Repeat) operations. The major database vendors of Oracle, IBM and Microsoft support it. There are now some basic open source implementations that people can use. However, it is not the easiest language to get to grips with. As far as I am aware, it is still not an official standard.
SQL does not easily support the HIERARCHY-ENUMERATION or HIERARCHY-SELECTION operations. However, in Documentum, we extended the SQL language to accomplish these, with a recursive descent clause and a special predicate called “in folder(“/path”)” to search in hierarchies. I believe that the Oracle CONNECT-BY statement is a simpler HIERARCHY-ENUMERATION mechanism.It would be good to add an XPath clause for HIERARCHY SELECTION. An XPath qualification in SQL also solves any simplicity issues. As noted earlier, the HIERARCHY-SELECTION operation is generally applied to leaf nodes, so XPath can be more of a simple pattern matching and attribute qualifications pulled out and applied to the outer SELECT statement.
Supporting these extensions may or may not be easy in SQL. Some like Documentum, already support them. Search within hierarchies probably requires bridge tables to simplify the path search putting a significant table management issue on the repository. Hierarchy enumeration requires a recursion mechanism, but systems may be able to use existing database implementations. (Note that MySQL does not support these.)
Finally, I think it is important to note that the query language in content repositories is read-only. This is because querying is a set only operation and updates generally need to be evaluated at the object level. The complexity of multi-value storage, complex associations and security require updates to be performed on objects not through set oriented operations. Therefore, read-only operations on either XQuery or SQL are neutral in evaluation.
To summarize:
- XPath is incomplete, but is an important component of a query language, not a query language on its own
- XQuery
- + Already exists and supports this extended relational model
- + Supports XPath
- + Supported by Oracle, IBM and Microsoft databases
- + Existing open source implementations for those who don’t have it
- - Not yet a standard (but probably will be)
- - Not very easy to formulate
- - Immature development bindings usually resulting in DOM manipulation (not very efficient)
- - Difficult language to optimize which may result in poor performance
- - No developer community
- - No developer tools
- - It’s currently a read-only language and the update additions are not focused on transactional systems
- SQL
- + Millions of developers know (and love) it
- + Most repositories already support it
- + Supports PROJECTION, JOIN and UNION
- + Long established standard
- + Efficient language bindings with JDBC and ODBC
- + Lots of development tools
- - Need to specify and implement HIERARCHY ENUMERATION
- - Need to work around the multi-value problem with PROJECTION and SELECTION functions
From a technical perspective, I could go either way. Perhaps XQuery is the future, but I have lost some faith in this when XQuery failed to be standardized by W3C back in 2004. On the other hand, the millions of developers available to build content applications persuades me that SQL is the way to go. To really boost iECM and development of content applications, SQL can build on an existing base. Therefore, the choice becomes a political one more than a technical one. Sill, it was IBM that made SQL (along with a little help from its _friend_ Oracle) and it could make XQuery successful too.
And now, what is the different between XPath and SQL Query?
I have no idea. I just can find the differences between XML and SQL query.
thanks
Posted by: deean | 2008.07.01 at 05:59 PM
It's taken way too long, but the XQuery specifications are now Proposed Recommendations. The final Recommendation won't be until after 31 December 2006, but will probably be relatively soon thereafter (unless some unforeseen glitch is discovered).
See http://blogs.datadirect.com/jonathan_robie/2006/11/xquery_is_a_proposed_recommend.html.
Posted by: Jonathan Robie | 2006.11.30 at 05:51 PM
Very good article expressing the differences between those three.
However, I'm curious why WebDAV and the DASL basicsearch query mechanism was left out. This has seemed to solved most of the problems related to Document and Content management searches, but what is your opinion?
thanks,
-D
Posted by: Darren Hartford | 2006.11.28 at 03:58 PM
G'day John,
Have you looked at what's happening in the .NET world, in particular LINQ? My initial impression is that it might be a step in the right direction.
For the non-projection cases I'm also intrigued by the idea of transparent, lazy-initialised object graph traversal via proxies (ala Hibernate, but with the data provider being an app rather than an RDBMS). There's a stagnant SourceForge project called CarrierWave that started to explore some of these ideas.
Cheers,
Peter
Posted by: Peter Monks | 2006.09.07 at 11:02 PM
Hi Jeffery -
This is an argument that several have made in the committee.
In general SQL92 is close enough that we were able to create a fairly portable system with Documentum. It is only when you get into the esoterics of cursors, stored procedures and triggers that you have a real problem. Surprisingly, there has been a lot of concensus on bindings such as ODBC, JDBC and OleDB.
Fortunately, this is a standard and we can define the language how we want. We are definitely not going anywhere near things like cursors and stored procedures, and probably not venturing far into updates.
This really is about querying, how do we discover information in a repository? How do we entice developers into building applications that discover information in a repository?
A nice, simple SQL that adheres to the requirements above could do the trick. So could XQuery. Which would attract the most developers? My guess is something that resembles SQL, even if it were slightly quirky.
Posted by: John Newton | 2006.09.05 at 11:02 PM
The problem with SQL, though is: whose SQL are you talking about? As with (X)HTML SQL has suffered from a pretty big divergence between the textbook standard and vendor implementations.
Posted by: Jeffrey McManus | 2006.09.05 at 12:32 AM