Question

A customer needs a document management system, and I'm gathering information about how to build one.

I know about SharePoint and Alfresco, but in this case I'm evaluating what it would take to build one from scratch, so please refrain from suggesting either of them (we are evaluating them separately; this question is about developing a system, not deploying an existing one).

These are the requirements:

  • A very specific legal document-management requirement that is particular to our local government, but apart from this:
  • Operation similar to Google Docs from the end user's point of view
  • Needs to store information from 200+ end users (UPDATE: it is really 700+ end users)
  • Mainly Office documents, PDF, and text. I already have plain-text extraction from these binary files.
  • No wiki, no portal creation, and barely any workflow (only a very simple one); it is only file management
  • Central repository, shared across the company, integrated with Active Directory
  • Fast searching
  • Transparent desktop integration
  • Web interface
  • Multiplatform, if possible

So, these are the things I have off the top of my head:

  • Storage: I know that SharePoint saves everything in the DB (Alfresco too?). That is a nightmare, IMHO. I prefer to put the metadata in a DB and the files on disk.

I'm thinking about requiring ZFS in this case and leveraging its capabilities for versioning, snapshots, and scaling. Or maybe using git as the storage backend (would git work well?)

So, where can I learn more about how to handle a large pool of documents, in ZFS or any regular file system? For example, how to lay out the folder structure for easy management, fast responses, easy backup, etc.

  • Metadata: I'm thinking of a regular DB here, but I wonder whether there is more merit in saving everything in Lucene (I have some experience with Lucene, but I worry because Lucene can't be federated, right?).

If I use a search engine as the metadata database I can save some work (no need for a second pass for indexing), but a regular database engine is more standard.

  • Tech: I will probably build this with Django, PyLucene, and PostgreSQL, and do the shell integration for Windows (I have no problem doing that).

I would appreciate any hints or information on how to implement this solution properly.


Solution

Personally, I find the "similar to Google Docs" and "Transparent desktop integration" requirements a bit vague. But judging from the question, you are more concerned about the backend and document storage, and are leaning toward a more open source stack (with AD integration)?

Anyway, I'm using KnowledgeTree as our document management system. In its implementation, all files reside in a directory on the file system, and the database keeps track of the path, the corresponding metadata, access logs, and versioning information. It basically keeps several versions of the same file when a document is updated, which I think is a fair enough implementation choice considering that Microsoft Office documents are mostly binary (up to Office 2003).

You may want to find out how many documents they currently have and how many they expect to flow into this system on a daily basis. (Or, from a different point of view, the kinds of documents they plan to store will generally give you hints about the load your server is supposed to handle.)

My guess is that you could most likely get away with a setup of local file systems plus a database storing the metadata, unless you are sure that the system is expected to handle a massive load of documents on a daily basis (imagine being Flickr for documents ;) ).
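As a rough sketch of that layout (not KnowledgeTree's actual schema), the database could hold one row per document and one row per stored version, each pointing at a file path on disk. The example uses Python's built-in sqlite3 as a stand-in for a server database such as PostgreSQL; the table, column, and function names are all assumptions made for illustration:

```python
import sqlite3

# Hypothetical schema: one row per logical document, one row per stored version.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS versions (
    document_id INTEGER NOT NULL REFERENCES documents(id),
    version     INTEGER NOT NULL,
    path        TEXT NOT NULL,   -- where the file content lives on disk
    author      TEXT NOT NULL,
    PRIMARY KEY (document_id, version)
);
"""


def open_db(target=":memory:"):
    """Open the metadata database and make sure the tables exist."""
    conn = sqlite3.connect(target)
    conn.executescript(SCHEMA)
    return conn


def add_version(conn, name, path, author):
    """Record a new version of `name` stored at `path`; return its version number."""
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT OR IGNORE INTO documents (name) VALUES (?)", (name,))
        doc_id = conn.execute(
            "SELECT id FROM documents WHERE name = ?", (name,)
        ).fetchone()[0]
        next_version = conn.execute(
            "SELECT COALESCE(MAX(version), 0) + 1 FROM versions WHERE document_id = ?",
            (doc_id,),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO versions (document_id, version, path, author) VALUES (?, ?, ?, ?)",
            (doc_id, next_version, path, author),
        )
    return next_version


def latest_path(conn, name):
    """Return the on-disk path of the newest version of `name`, or None."""
    row = conn.execute(
        "SELECT v.path FROM versions v JOIN documents d ON d.id = v.document_id "
        "WHERE d.name = ? ORDER BY v.version DESC LIMIT 1",
        (name,),
    ).fetchone()
    return row[0] if row else None
```

Older versions stay on disk untouched; updating a document only adds a row and a new file, which keeps backups and rollbacks simple.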

OTHER TIPS

  1. SharePoint and Alfresco are platforms where you can do quite a bit of customization, so even using them really means you are building something.

  2. SharePoint stores blobs in the DB by default, but has ways to put them on a filesystem.

  3. If you build it yourself, support the FrontPage extensions that Office apps use to communicate with SharePoint and Alfresco, and serve the documents with the right headers that tell IE to start the app. This way you get the same integration with Office apps that SharePoint has (users really love this feature). It's just a simple HTTP protocol.

  4. If you go with SharePoint, my company has a free document previewer that can view PDF and will soon support Office docs. We sell the underlying tech, but it's Windows only.

  5. I love Django, and use it for all personal projects, but I really think .NET and Java will have more third-party support for the things you need, and much of your code will be portable to SharePoint or Alfresco if you decide to go that way later.

EDIT: More info on #3 as requested

http://blogs.msdn.com/mikefitz/archive/2005/03/14/395112.aspx
http://blogs.msdn.com/stcheng/archive/2008/12/17/wss-use-rpc-protocol-to-access-wss-v3-site.aspx
Official docs: http://msdn.microsoft.com/en-us/library/ms442469.aspx

Alfresco should be a great solution here. It supports every requirement on your list except for the government one.

But if you are building "from scratch", maybe at least take ideas from it?

Storage: the file content is saved on the filesystem, which makes it easy to manage, store, and back up. The files do not keep their names, though: only their content is saved, in binary format, and each file is named with a hash (I guess a hash of the content?).
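A minimal sketch of that storage idea, assuming the file names are derived from a SHA-256 hash of the content (the answer only guesses at this, so treat it as an illustration of the technique rather than a description of Alfresco's actual store). The function names and the two-level directory fan-out are hypothetical choices:

```python
import hashlib
import os
import tempfile
from pathlib import Path


def blob_path(root: Path, digest: str) -> Path:
    """Map a digest to a path such as root/ab/cd/abcd...; the two directory
    levels keep any single directory from growing too large."""
    return root / digest[:2] / digest[2:4] / digest


def store_blob(root: Path, data: bytes) -> str:
    """Write `data` under a name derived from its SHA-256 hash and return the
    digest. Identical content is stored only once."""
    digest = hashlib.sha256(data).hexdigest()
    target = blob_path(root, digest)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary file first, then rename into place, so readers
        # never see a partially written blob.
        fd, tmp_name = tempfile.mkstemp(dir=target.parent)
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_name, target)
    return digest


def load_blob(root: Path, digest: str) -> bytes:
    """Read back the content stored under `digest`."""
    return blob_path(root, digest).read_bytes()
```

The digest returned by `store_blob` is what the metadata database would keep alongside the original file name, since the name is no longer present on disk.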

Metadata: it is placed in the database, where it is fast to access, change, and update. Each node has properties: name, title, description, dates, audit info, whatever you need. It is just information, and it is all saved in the "properties" table.

Search: Alfresco uses Solr for search; it used to use Lucene. I have run pretty big installations, and if you put the Lucene index on an SSD, it's blazing fast (Lucene is fast anyway). It indexes both file content and properties, so you get to the node ID very fast.
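To illustrate why indexing content and properties together leads straight to a node ID, here is a toy inverted index in plain Python. It is only a sketch of the idea, not how Solr or Lucene are implemented or called; the class name `TinyIndex` and its methods are made up for this example:

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each token to the set of node IDs containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, node_id, content, properties):
        """Index the extracted text and every property value of one node."""
        fields = [content, *(str(value) for value in properties.values())]
        for field in fields:
            for token in self._tokens(field):
                self._postings[token].add(node_id)

    def search(self, query):
        """Return the IDs of nodes that contain every term in the query."""
        terms = self._tokens(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self._postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result
```

A lookup is a handful of dictionary reads and set intersections, which is the reason a search engine can answer before the metadata database is even touched; a real engine adds ranking, stemming, and on-disk index segments on top of this.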

Alfresco has CIFS implemented, as well as WebDAV, FTP, and whatnot. The point is, you can just mount it on the users' desktops as folders or drives.

The web interface is there, central repo management is there, all the requirements. And since it is open source, you could take some of that source and use it in your project, although it would be much better to take Alfresco Community and just contribute back a bit if you feel okay with that.

Are you trying to build a document management system? Alfresco and SharePoint are project management solutions, not document management solutions. Alfresco is a DMS of sorts, but not a good one. As a project management solution, though, it is good software.

I suggest you buy a document management solution that handles the legal management of documents and is also specific to the local government. There are document management system providers such as Laserfiche and OnBase whose products work similarly to Google Docs. You can create an account for every employee of the firm or business.

Yes, all the documents can be in MS Office formats such as MS Word, MS Excel, and PPT, as well as PDF.

Workflow with a document management system is much more efficient and easy to handle.

Yes, by using a DMS you can easily find a file within minutes (Laserfiche software takes 10 minutes to extract a file or folder). Laserfiche DMS is web-interface software: you can log in and reach a file or folder easily from different locations.

Storage

In a DMS, all the data is secured and stored in cloud storage. You can easily reach a document just by logging in to your account. In case of loss or any mishap, you can get the lost data back from the company.

Metadata

A DMS uses a regular database engine, as all the business data is secured in cloud storage on a regular basis.

Tech

There is no need to build anything; you only need to purchase the DMS software. I recommend Laserfiche because we are using their services.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow