Text Mining in Financial Industry: Implementing Text Mining Techniques on Bank Policies
Summary
As the volume of data that organisations collect and create increases, the need to leverage these resources has become apparent. This pool of data comprises two primary data structures, namely structured data and unstructured data. Each format comes with its own set of techniques for scrutinising the data and subsequently extracting information and knowledge. Besides lacking a predefined structure or representation, unstructured data also comprise roughly 80% of all the data that organisations possess. Policy documents are a good illustration of this kind of data, with their text-heavy format and domain-specific language. As written guidelines of acceptable actions to which organisations must adhere, policy documents are present across industries and in large numbers. This is especially true for organisations in the financial industry, such as banks, which continuously introduce policies in order to be fully compliant with the regulations that governing bodies impose. In an attempt to bring order and some understanding to policies, this research investigates the applicability and benefits of text mining (TM) for processing such documents. Following the DS principles, the literature was first consulted to determine the extent to which such techniques have been applied to policies. This investigation revealed that the use of TM on policy documents fell short in both qualitative and quantitative terms. In addition to the limited number of publications that treated these concepts, the variety of techniques that were examined was narrow. Hence, through a case study (CS) in one of the biggest banks in the Netherlands, a set of techniques not previously used on policy documents was applied to them. Information extraction (IE) was used to extract references between policies, automatic summarization to retrieve a concise representation of the documents, and keyword extraction to retrieve a set of descriptive labels (tags). All three were evaluated both statistically and by experts.
The results showed that, to a large extent, these techniques are capable of analysing internal policies and extracting reliable information from them. Furthermore, this led to the introduction of a new TM framework for processing policies. The framework is a MAM of the approach followed in this study, and it represents the combined use of three different techniques and the results that derive from their application. Thus, in addition to revealing the current state of the literature, this research also introduces a novel approach for processing policies with TM techniques.