Syntex and Large Language Models
Syntex and Large Language Models: A particularly attractive combination
Syntex is one of the tools that Microsoft is pushing hard in the field of artificial intelligence (AI), especially since the breakthrough of ChatGPT, the chatbot from its partner OpenAI. The company offers a wide range of applications and tools based on AI, OpenAI models and Large Language Models (LLMs). Syntex is integrated into the Office 365 cloud, so you can do all sorts of smart things with documents without having to write software code.
For companies in the legal world, such as law firms, this is a particularly attractive combination. These organizations have sometimes stored hundreds of thousands of documents over the years, often without anyone still having a complete overview of what can be found where. The average lawyer is also not a software expert, and specialized software talent is becoming increasingly scarce and expensive, so an approachable and user-friendly solution like Syntex can be a godsend.
A role for AI: classification and extraction
For a company like Documentaal, with a customer base largely from the legal world, Syntex is therefore definitely worth trying out. We present a fictitious practical situation: from a library of several hundred scanned PDF documents stored in a SharePoint environment, we want to select the contracts originating from one particular organization. For those contracts, we also want to know the signer(s). In Syntex terminology, these two steps are called classification and extraction.
Depending on the exact requirements and the IT experience of the organization, it may still take an investment of time to get Syntex fully operational. Hiring external expertise for this purpose can be an interesting option. But once it is up and running, Syntex can be maintained by non-IT professionals. That is a major plus over custom-built software.
In terms of classification, Syntex does a very reasonable job: in our test set, 11 of the 12 documents tested were classified correctly.
In terms of extraction, performance is weaker: only 3 out of 12 signatures are recognized correctly. The algorithm used under the hood appears to be very sensitive to optical disturbances, such as a smudge on the scanner glass or a signature written over the printed text. Unfortunately, many people habitually sign that way. As a result, Syntex's extraction performance in this test is poor.
In terms of turnaround time, Syntex takes anywhere from a few minutes to half an hour, depending on how busy the server is. For a small number of documents that is quite long, but the nice thing is that this time hardly increases for larger numbers. So you can work through an entire SharePoint environment in a very reasonable amount of time.
Classification and extraction via Python, OpenAI and LLM
A previous blog described how ChatGPT, from OpenAI, can be used to unlock domain-specific knowledge. In short, you provide selected domain knowledge along with your query, and the model usually gives back a very acceptable answer. In doing so, however, you share your knowledge, in this case your document, with OpenAI, which is not always desirable. That is why Microsoft launched its own hosted version of the OpenAI models a few months ago. Everything you send there stays within your own Microsoft environment and is not shared with the outside world.
It is interesting to investigate whether this approach can do what Syntex does: deliver a good classification and extraction. It takes a bit of trial and error to formulate the questions in such a way that the answers actually make sense, but then the results turn out to be surprisingly good: all 12 documents are classified correctly, and for 10 of the 12 the signer is recognized correctly. The Python approach also suffers somewhat from optical interference, but to a much lesser extent: 10 out of 12 instead of 3 out of 12. It needs about 10-20 seconds per document, so for larger numbers that does add up.
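To give an impression of what such a question to the model can look like, here is a minimal sketch in Python. It is not the code used in our test: the prompt wording, the JSON answer format and all function names (build_prompt, parse_answer, classify_and_extract, fake_llm) are assumptions made for illustration. The actual call to the language model is passed in as a function, so the sketch runs without any account or network access.

```python
import json
from typing import Callable

# Hypothetical prompt wording; in practice this takes some trial and error.
PROMPT_TEMPLATE = (
    "Below is the text of a scanned document.\n"
    "Answer with one JSON object with exactly these keys:\n"
    '  "is_contract_from_org": true or false, whether this is a contract '
    'originating from "{organization}"\n'
    '  "signers": a list with the names of the people who signed it\n'
    "\n"
    "Document text:\n"
    "---\n"
    "{document_text}\n"
    "---\n"
)

EMPTY_RESULT = {"is_contract_from_org": False, "signers": []}


def build_prompt(document_text: str, organization: str) -> str:
    """Combine the document (the domain knowledge) with the question."""
    return PROMPT_TEMPLATE.format(
        organization=organization, document_text=document_text
    )


def parse_answer(raw: str) -> dict:
    """Pull the JSON object out of the reply, tolerating text around it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(EMPTY_RESULT)
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(EMPTY_RESULT)
    signers = data.get("signers") or []
    if isinstance(signers, str):
        signers = [signers]  # a single name given as plain text
    return {
        "is_contract_from_org": data.get("is_contract_from_org") is True,
        "signers": [str(s).strip() for s in signers if str(s).strip()],
    }


def classify_and_extract(
    document_text: str, organization: str, ask_llm: Callable[[str], str]
) -> dict:
    """Classification and extraction in one question to the model."""
    return parse_answer(ask_llm(build_prompt(document_text, organization)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, returning a canned reply."""
    return 'Result: {"is_contract_from_org": true, "signers": ["J. Jansen"]}'


if __name__ == "__main__":
    print(classify_and_extract("Contract text ...", "Example B.V.", fake_llm))
```

In a real setup, fake_llm would be replaced by a function that sends the prompt to the hosted model and returns its reply as text. Asking for a fixed JSON format is one way to make the answers easy to process automatically; other formats can work just as well.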
In summary, if your organization only needs to classify documents, Syntex is a good choice, especially if large numbers are involved. If extraction is needed and optical interference plays a role, the Python approach currently performs better. However, Syntex's character recognition can be expected to improve soon. Also, Syntex has the advantage that you do not have to write your own code, which makes the tool more user-friendly.
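The difference in turnaround time can be made concrete with a rough calculation. The sketch below rests on two simplifying assumptions that go beyond what we measured: the Python approach handles documents one after another at 10-20 seconds each, and Syntex is treated as a flat half hour (the upper end of what we observed) regardless of the number of documents. The function names are hypothetical.

```python
def sequential_minutes(n_documents: int, seconds_per_document: float) -> float:
    """Total time in minutes when documents are processed one by one."""
    return n_documents * seconds_per_document / 60


def break_even_documents(batch_minutes: float, seconds_per_document: float) -> float:
    """Number of documents at which one-by-one processing takes as long
    as a batch run with a roughly constant duration."""
    return batch_minutes * 60 / seconds_per_document


if __name__ == "__main__":
    SYNTEX_MINUTES = 30  # assumption: upper end of the observed range
    for n in (12, 300):
        fast = sequential_minutes(n, 10)
        slow = sequential_minutes(n, 20)
        print(f"{n} documents: {fast:.0f}-{slow:.0f} minutes via Python")
    low = break_even_documents(SYNTEX_MINUTES, 20)
    high = break_even_documents(SYNTEX_MINUTES, 10)
    print(f"Break-even at roughly {low:.0f}-{high:.0f} documents")
```

Under these assumptions, the 12 test documents take 2 to 4 minutes via Python, while a library of 300 documents takes 50 to 100 minutes. The break-even point with a half-hour Syntex run lies somewhere between 90 and 180 documents.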
What can Documentaal do for you?
Documentaal has decades of experience with document management systems and is also well versed in the latest techniques in AI and Large Language Models. Could your organization also benefit from expertise in unlocking large volumes of documents? Then get in touch with us! Our experts are happy to use their knowledge to give you even better insight into all the knowledge stored in your organization.
Kees Rinzema
Data Scientist/Data Analyst at Documentaal