By John Newton, David Caruana, and Paul Holmes-Higgin
(Warning: The content in this blog is technical and Java in nature. Proceed at your own risk.)
Alfresco is a complete Enterprise Content Management System and 100% open source. It is a comprehensive content management development platform and scalable repository supporting JSR-170, so it is suitable for building complex enterprise-scale content applications. Access to the repository is provided through APIs, such as JSR-170 and web services, as well as through a virtual file system interface that implements the CIFS (Common Internet File System) protocol to emulate a Microsoft shared file system, and through FTP and WebDAV. The Alfresco system is built upon Spring, taking full advantage of Spring’s dependency injection model to extend repository functionality, and incorporates Hibernate for persistence, Lucene for querying and indexing, jBPM for business process management, and the Mozilla Rhino JavaScript engine.
The Alfresco system has a web-based application that provides document management, web content management and records management capabilities. The web application, based upon the MyFaces implementation of JSF, is extensible, programmable and scriptable. Through Spring configuration it is possible to add new dialogs, views and wizards. The web application also provides dashboards to track repository and workflow activity. These dashboards are programmable either through Java or high-level templating languages, such as the open source FreeMarker templating engine.
As an example of building an enterprise content application, we use the scenario of an email archiving application. For this, we use the JSR-170 interface to add content; add a new action to the repository specifically for email; use JavaScript to process the email; and then use a FreeMarker template to create an RSS feed that gives a web view of email activity.
Defining a Metadata Aspect in Alfresco
For the purposes of our application, we will add a new metadata aspect to tag incoming emails. A metadata aspect is similar to a type, except that it can be added after the content has been created and more than one aspect can be added to the content object. It is similar to a JSR-170 mixin, except that in Alfresco it can also have behavior attached to it. We will use a predefined aspect in this example, the “cm:emailed” aspect, which includes the following metadata:
- Originator
- Addressee
- Subject Line
- Sent Date
Alfresco models are defined in XML and can be loaded dynamically. In this example, we create a new aspect called “tagged”, which is defined as follows:
<aspect name="cm:tagged">
  <title>Tagged</title>
  <properties>
    <property name="cm:tag">
      <type>d:text</type>
      <multiple>false</multiple>
    </property>
  </properties>
</aspect>
In order to view this in the Alfresco web client, we can extend the properties sheet with the following configuration:
<config evaluator="aspect-name" condition="cm:tagged">
  <property-sheet>
    <show-property name="cm:tag" />
  </property-sheet>
</config>
Metadata aspects are also available for Dublin Core, DOD 5015.2 records management, basic Microsoft Office metadata, automatic counters, workflow process data, classification, auditing and localization, among others.
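To make the idea of an aspect concrete, here is a minimal sketch in plain Java. It is not Alfresco code: SketchNode and its methods are hypothetical names invented for this illustration, and the rule that a property can only be set once its aspect is present is a simplification chosen for the sketch, not a statement about how the repository enforces its model. It shows only the two points made above: aspects are attached after the node exists, and a node can carry more than one.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a repository node; not an Alfresco class.
class SketchNode {
    private final String type;
    private final Set<String> aspects = new LinkedHashSet<>();
    private final Map<String, Object> properties = new HashMap<>();

    SketchNode(String type) {
        this.type = type;
    }

    String getType() {
        return type;
    }

    // An aspect can be attached at any time after the node is created.
    void addAspect(String aspectName) {
        aspects.add(aspectName);
    }

    boolean hasAspect(String aspectName) {
        return aspects.contains(aspectName);
    }

    // Sketch-only rule: the owning aspect must already be applied.
    void setProperty(String aspectName, String propertyName, Object value) {
        if (!aspects.contains(aspectName)) {
            throw new IllegalStateException("aspect not applied: " + aspectName);
        }
        properties.put(propertyName, value);
    }

    Object getProperty(String propertyName) {
        return properties.get(propertyName);
    }
}

class AspectSketch {
    public static void main(String[] args) {
        SketchNode email = new SketchNode("cm:content");
        // Two aspects on the same node, both added after creation.
        email.addAspect("cm:emailed");
        email.addAspect("cm:tagged");
        email.setProperty("cm:tagged", "cm:tag", "invoice");
        System.out.println(email.getProperty("cm:tag"));
    }
}
```

The type of the node stays fixed while the set of aspects grows, which is the difference from a type that the text describes.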
Content Storage through JSR-170
Alfresco is a JSR-170 compliant repository and therefore supports the JCR API. For the purposes of this example, we assume that there is an email listener that stores each email as a JCR node using the JSR-170 interface. The node is placed in an “email drop zone”, a well-known path where content is processed according to rules in the repository. The drop zone is a node in its own right, and the email is attached as a child of that node. The actual binary content of the email is then added to this child node.
public class EmailListener
{
    ...
    private Node importEmail(InputStream msg) throws RepositoryException
    {
        // locate email dropzone folder
        Node rootNode = session.getRootNode();
        Node zone =
            rootNode.getNode("app:company_home/cm:email/cm:dropzone");
        // add email to folder (which fires registered rules)
        Node email =
            zone.addNode(GUID.generate(), "cm:content");
        // store the raw message stream as the content of the new node
        email.setProperty("cm:content", msg);
        return email;
    }
    ...
}
The purpose of placing content in a drop zone is that it allows the business logic of email filing to be specified independently of the application and more importantly by business users. This is done in Alfresco through rules and associated actions.
Defining Rules to Process Incoming Content
The Alfresco repository organizes information in a hierarchical structure similar to other enterprise content management repositories. These structures are called spaces. They are similar to folders in a file system, but also contain rules for processing content that is added to, removed from, moved within or updated in the space. Spaces also have associated users, who may have different roles in interacting with the content in the space.
The rules associated with the space determine the disposition of content in the space. Rules can be used to change the type of the content being added, add aspects of metadata, attach behavior such as locking and versioning, and transform and copy to other spaces. In the email example, we will use a rule to extract metadata from the content and use that metadata to classify and move the content to a new space determined by the metadata.
To do this, we use the Alfresco web client, navigate to the email drop zone space and specify rules for content entering the space. Through a set of wizards, we add the following actions for all new items of mimetype email or “message/rfc822”:
- Add the email aspect - this is the email metadata mentioned previously and can be combined with other aspects such as record data or process data.
- Add the tagged aspect - this is the aspect we defined earlier.
- Extract metadata - this is a standard capability of Alfresco that looks inside standard file formats to extract standard information such as author, title and subject. In this example, we will extend the system to extract additional standard metadata from emails.
- Execute the “emailtag.js” JavaScript - this is a server-side JavaScript example that we will show in a later section. JavaScript is stored in the repository and can be executed just like Alfresco internal actions.
Rules and actions can be combined and chained to create more complex logic. Rules can include tests on the type of content, which aspects are applied and what metadata has been set. These rules in turn fire their actions in sequential order, or asynchronously for long-running operations. Common actions performed in rules are transformation, copying, moving, and metadata setting and extraction.
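The relationship between a rule, its condition and its ordered actions can be sketched in a few lines of plain Java. This is an illustration only, not the Alfresco rule service: IncomingItem, SpaceRule and RuleSketch are hypothetical names, and each action merely records its name so that the order of execution is visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical item entering a space; not an Alfresco class.
class IncomingItem {
    final String mimetype;
    final List<String> log = new ArrayList<>();

    IncomingItem(String mimetype) {
        this.mimetype = mimetype;
    }
}

// A rule pairs one condition with an ordered list of actions.
class SpaceRule {
    private final Predicate<IncomingItem> condition;
    private final List<Consumer<IncomingItem>> actions = new ArrayList<>();

    SpaceRule(Predicate<IncomingItem> condition) {
        this.condition = condition;
    }

    SpaceRule then(Consumer<IncomingItem> action) {
        actions.add(action);
        return this;
    }

    // Runs every action in registration order if the condition matches.
    boolean apply(IncomingItem item) {
        if (!condition.test(item)) {
            return false;
        }
        for (Consumer<IncomingItem> action : actions) {
            action.accept(item);
        }
        return true;
    }
}

class RuleSketch {
    // Mirrors the four actions configured on the email drop zone.
    static SpaceRule emailRule() {
        return new SpaceRule(item -> "message/rfc822".equals(item.mimetype))
            .then(item -> item.log.add("add aspect cm:emailed"))
            .then(item -> item.log.add("add aspect cm:tagged"))
            .then(item -> item.log.add("extract metadata"))
            .then(item -> item.log.add("execute emailtag.js"));
    }

    public static void main(String[] args) {
        IncomingItem mail = new IncomingItem("message/rfc822");
        emailRule().apply(mail);
        mail.log.forEach(System.out::println);
    }
}
```

The ordering matters in our scenario: the script that derives the tag depends on the subject line that the metadata extraction step has already set.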
Adding a New Behavior to the Alfresco Repository
Although the Alfresco repository already has an action to extract metadata from email, since it is a relatively new extension of the existing repository, it is worth showing how it was added. In addition, it is a good example of how Alfresco uses the dependency injection pattern of Spring to add new functionality without requiring rebuilding the repository system. In this example, the metadata extraction action has a standard Java interface defined as follows:
public interface MetadataExtractor
{
    public double getReliability(String sourceMimetype);
    public long getExtractionTime();
    public void extract(ContentReader reader, Map<QName, Serializable> destination);
}
For this example, we will add a new implementation of the MetadataExtractor interface, injected through Spring, that uses the open source POI Java library to read the proprietary Microsoft file format. We first ensure that the file actually is a Microsoft Exchange message or rfc822, and then we read the fields identified by the following hex codes:
- 0C1F - The message originator
- 0037 - The message subject
- 39FE - The message addressee
The following code accesses these fields through POI and then sets the appropriate content properties on the metadata. Obviously, more complex processing or more metadata fields could be added to the code.
public class MailMetadataExtractor implements MetadataExtractor
{
    private static final String PREFIX = "__substg1.0_";
    private static final int PREFIX_LENGTH = PREFIX.length();
    private MetadataExtracterRegistry registry;
    ...
    public void extract(ContentReader reader, final Map<QName, Serializable> props)
    {
        POIFSReaderListener listener = new POIFSReaderListener()
        {
            public void processPOIFSReaderEvent(final POIFSReaderEvent event)
            {
                if (event.getName().startsWith(PREFIX))
                {
                    // the four characters after the prefix are the field's hex code
                    String type = event.getName();
                    type = type.substring(PREFIX_LENGTH,
                        PREFIX_LENGTH + 4);
                    if (type.equals("0C1F"))
                        props.put(PROP_ORIGINATOR, extractText());
                    else if (type.equals("0037"))
                        props.put(PROP_SUBJECT, extractText());
                    else if (type.equals("39FE"))
                        props.put(PROP_ADDRESSEE, extractText());
                    ...
                }
            }
        };
        // register the listener, then stream the message through POI
        POIFSReader poi = new POIFSReader();
        poi.registerListener(listener);
        poi.read(reader.getContentInputStream());
    }
}
To register this bean, we merely added the following Spring configuration:
<bean class="MailMetadataExtractor" init-method="register">
  <property name="registry">
    <ref bean="metadataExtractorRegistry"/>
  </property>
</bean>
Similar extensions can be added for transformations from one format to another, authentication interfaces, encryption and compression mechanisms on content transfer, and even rules and actions.
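Returning to the email extractor, the part that maps a stream name to a field can be exercised on its own, without POI or the repository. The sketch below is a standalone illustration under that assumption: SubstgNameSketch and its methods are hypothetical names, the sample stream names are illustrative inputs, and only the prefix and the three hex codes come from the listing above. Unlike the listing, it also guards against names that are too short to contain a code.

```java
import java.util.Optional;

class SubstgNameSketch {
    private static final String PREFIX = "__substg1.0_";

    // Returns the four-character code that follows the prefix, or empty
    // if the name does not start with the prefix or is too short.
    static Optional<String> propertyCode(String streamName) {
        if (streamName == null || !streamName.startsWith(PREFIX)) {
            return Optional.empty();
        }
        int start = PREFIX.length();
        if (streamName.length() < start + 4) {
            return Optional.empty();
        }
        return Optional.of(streamName.substring(start, start + 4));
    }

    // Maps the code to the field it represents in our example.
    static String describe(String streamName) {
        String code = propertyCode(streamName).orElse("");
        switch (code) {
            case "0C1F": return "originator";
            case "0037": return "subject";
            case "39FE": return "addressee";
            default:     return "ignored";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("__substg1.0_0037001E"));
        System.out.println(describe("__properties_version1.0"));
    }
}
```

Keeping this mapping in a small pure function makes it easy to add further codes later without touching the stream-reading logic.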
Using JavaScript to Add Repository Behavior
Previously, we mentioned the “emailtag.js” JavaScript for using the metadata to classify the emails. We could implement this in Java, but for simple tasks, it is often easier and just as efficient to implement them using JavaScript. Alfresco incorporates the Mozilla Rhino JavaScript engine. It includes all of the standard functions and classes of ECMAScript, but also has the ability to work with the Alfresco content model as well as the JBoss jBPM model for workflow applications. A special data dictionary space is provided for storing and managing scripts just as one would any other content, allowing complete versioning, locking, auditing and CIFS access.
In this example, the “emailtag.js” JavaScript is invoked through the rule associated with the email drop zone space. The script reads the subject line that has just been extracted by the previous rule action, finds the term delimited by square brackets and adds it to the tagged metadata aspect. All that is required is the following four lines.
var subject = document.properties.subjectline;
var tag = subject.substring(subject.indexOf('[')+1, subject.indexOf(']'));
document.properties.tag = tag;
document.save();
The script is atomic in that either the whole action occurs or it doesn’t. The script could also set up complex classifications or relationships to other content objects. Most of the Alfresco processing that can be done in Java can also be done in JavaScript, so the choice becomes one of performance and extension rather than capabilities.
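The script above does not check whether the subject line actually contains a pair of brackets. If the same logic were moved into Java, for example inside a custom action, the missing cases could be handled explicitly. The following is a sketch under that assumption; SubjectTagSketch and extractTag are hypothetical names, not part of the Alfresco API, and trimming the tag is a choice made for this sketch.

```java
import java.util.Optional;

class SubjectTagSketch {
    // Returns the text between the first '[' and the next ']' after it,
    // or empty if there is no such pair or the term is blank.
    static Optional<String> extractTag(String subject) {
        if (subject == null) {
            return Optional.empty();
        }
        int open = subject.indexOf('[');
        if (open < 0) {
            return Optional.empty();
        }
        // Search for the closing bracket only after the opening one.
        int close = subject.indexOf(']', open + 1);
        if (close < 0) {
            return Optional.empty();
        }
        String tag = subject.substring(open + 1, close).trim();
        return tag.isEmpty() ? Optional.empty() : Optional.of(tag);
    }

    public static void main(String[] args) {
        System.out.println(extractTag("[alfresco] Release notes").orElse("(no tag)"));
        System.out.println(extractTag("No tag here").orElse("(no tag)"));
    }
}
```

Returning an Optional leaves the caller to decide what an untagged email means: skip the tagged aspect, apply a default tag, or route the message elsewhere.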
Building an RSS Feed using FreeMarker
The Alfresco system also includes the FreeMarker templating engine. FreeMarker was chosen for its extensibility to other data models as well as its ability to handle XML. The templating language is particularly suited to the production of HTML and XML. Like other templating languages such as Velocity, Perl or PHP, directives to access and manipulate data are defined in tags interwoven with the static output to be delivered. The FreeMarker language has constructs for manipulating lists, defining reusable macros, and string and variable manipulation.
Alfresco has an open templating engine interface into which FreeMarker has been incorporated. FreeMarker has access to the Alfresco data model and can query and access content. FreeMarker can iterate through a folder, walk through a parent-child tree structure, and access properties and content. This ability provides a convenient tool for constructing complex content and provides for re-use of content. In addition, FreeMarker has access to the URLs and icons for content, which allows it to generate query-driven links and gives it good report writing capabilities. Although designed for generating HTML and XML, FreeMarker can be used to generate any type of content, and the generated content is URL-addressable from Alfresco.
For this example, we use a FreeMarker template to generate an RSS feed of specifically tagged emails that have been collected by our email listener over the last seven days. In the FreeMarker template, we set up the normal RSS headers and use references to the Alfresco model to set up the description of the feed. For brevity, we include only the heart of the RSS feed, which is a list generated by an XPath query for all content in the email space whose tag matches the tag argument. The template then pulls metadata out of each content node to populate the appropriate RSS tags.
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    ...
    <#assign weekms=1000*60*60*24*7>
    <#list space.childrenByXPath
        [".//*[@cm:tag='${args.tag}']"] as child>
      <#if (dateCompare(child.properties["cm:modified"], date, weekms) == 1)
        || (dateCompare(child.properties["cm:created"], date, weekms) == 1)>
        <item>
          <title>${child.properties.name}</title>
          <link>${hostname}${child.url}</link>
          <description>
            ${"<a href='${hostname}${child.url}'>"?xml}
            ${child.properties.name}
            ${"</a>"?xml}
            <#if child.properties["cm:description"]?exists
              && child.properties["cm:description"] != "">
              ${child.properties["cm:description"]}
            </#if>
          </description>
          <pubDate>
            ${child.properties["cm:modified"]?string(datetimeformat)}
          </pubDate>
          <guid isPermaLink="false">${hostname}${child.url}</guid>
        </item>
      </#if>
    </#list>
    ...
To invoke this RSS feed, first save the above template in the Presentation Templates space of the data dictionary. Then navigate to the email drop zone space and open the properties dialog. There is a tabbed area for RSS feeds. Apply the above template as the RSS feed and copy the URL link for the RSS feed. Append the argument “?tag=tag_name” to the URL and add the result to your RSS reader.
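The seven-day filter in the template can be expressed in plain Java to make the intended window explicit. This is a sketch of the intent described in the text (an item is included if it was created or modified within the last seven days), not a reimplementation of the template's dateCompare function; FeedWindowSketch and its methods are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;

class FeedWindowSketch {
    // Same length as the template's weekms value, 1000*60*60*24*7 ms.
    static final Duration WEEK = Duration.ofDays(7);

    // True if the timestamp is later than seven days before "now".
    static boolean isRecent(Instant timestamp, Instant now) {
        return timestamp.isAfter(now.minus(WEEK));
    }

    // An item belongs in the feed if either date falls in the window.
    static boolean includeInFeed(Instant created, Instant modified, Instant now) {
        return isRecent(modified, now) || isRecent(created, now);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant created = now.minus(Duration.ofDays(30));
        Instant modified = now.minus(Duration.ofDays(2));
        System.out.println(includeInFeed(created, modified, now));
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, keeps the check easy to test with fixed dates.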
Scalability and Clusterability
This application provides an example of the capabilities for storing, managing and accessing content from the Alfresco repository. This application can sit side by side with the other applications that Alfresco provides out of the box. Nothing additional is required to make this application and others scalable.
The Alfresco system can scale from small organizations to hundreds or even thousands of users on inexpensive off-the-shelf hardware. In benchmarks validated by independent parties, Alfresco running on RHEL 4 and MySQL 5.1 produced the following numbers on a SuperMicro system with dual 3GHz dual-core Opteron processors, 12Gbytes of memory (of which 4Gbytes were allocated to Java) and 6 x 100G RAID-configured drives:
- 10 Million objects total in repository
- Bulk load 60 documents per second into 10 Million object repository
- Up to 128 concurrent threads
- Access via unique id in under 0.1 seconds
- Concurrent active mix of reads and writes at 128 per second
To support even larger systems, the Alfresco system can be clustered on loosely coupled hardware to take advantage of existing hardware resources. This is possible because Alfresco is architected as a stateless system, with all operations performed in the context of transactions coordinated through the underlying database. Using the distributed EHCache open source cache means that all clustered systems share a common view of the contents of the cache and their freshness. Combined with a clustered database such as MySQL 5.1, the Alfresco system can be extremely scalable.
Conclusion
We have seen an example of how the Alfresco system can be used to build an enterprise-class application, such as the archival and retrieval of email, and to enhance that storage with rules that extract additional metadata and act upon it. We have seen how the Alfresco system can be used as a web conduit for monitoring and delivering content from an enterprise repository. The system itself can scale to the requirements of the enterprise using the inherent scalability of the components upon which Alfresco has been built and through the transactional clustering capability of the system.
If you would like to know more about Alfresco, please visit the developer web site at http://www.alfresco.org.
John Newton is Chief Technology Officer of Alfresco. David Caruana is the Chief Architect of Alfresco. Paul Holmes-Higgin is the Vice President of Engineering for Alfresco.