Automated categorization of geoscience text documents

Document classification is the process of assigning categories or classes to documents to make them easier to manage, search, filter, or analyze. Traditionally, document classification is a largely manual task that takes considerable effort, especially when the documents to classify are scattered across a huge database.

Our objective was to develop an automatic workflow to classify documents and group them according to their topic. 

We developed a workflow that classifies documents based on their similarity to reference papers. To do so, we used 322 academic papers divided into 6 thematic classes: 3 geoscientific classes and 3 unrelated ones. In each class, one or more papers were designated as “archetypal references”. The methodology combined two concepts: text similarity, implemented in Elasticsearch through a search algorithm named “More Like This”, and text classification, a supervised machine learning approach.
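The idea of classifying a paper by its similarity to archetypal references can be sketched as follows. This is a minimal illustration, not the workflow actually used: the index name, field name, document ids, class labels, and sample texts are all hypothetical. `build_mlt_query` only assembles the request body of an Elasticsearch `more_like_this` query, and the rest is a plain TF-IDF and cosine similarity stand-in that runs without a search server. The supervised text classification step is not covered here.

```python
import math
from collections import Counter


def build_mlt_query(index, reference_ids, fields=("text",)):
    """Assemble an Elasticsearch `more_like_this` query body.

    Index name, field name, and ids are hypothetical placeholders.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # Reference papers already stored in the index.
                "like": [{"_index": index, "_id": rid} for rid in reference_ids],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        }
    }


def tokenize(text):
    # Lowercase and keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed inverse document frequency.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def classify(doc, references):
    """Return the label of the reference text most similar to `doc`."""
    labels, texts = [], []
    for label, refs in references.items():
        for ref in refs:
            labels.append(label)
            texts.append(ref)
    vectors = tfidf_vectors(texts + [doc])
    doc_vec = vectors[-1]
    scores = [cosine(doc_vec, v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]


# Hypothetical archetypal references, one short text per class.
REFERENCES = {
    "sedimentology": ["sandstone grain size and depositional facies in a river delta"],
    "geochemistry": ["isotope ratios and trace element concentrations in basalt samples"],
    "cooking": ["bake the bread dough in a hot oven with butter"],
}

print(classify("facies analysis of delta sandstone deposits", REFERENCES))
```

In a real deployment the similarity scores would come from Elasticsearch itself, with the body returned by `build_mlt_query` sent to the search endpoint of the index holding the papers.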

More than 90% of the papers were correctly classified into their own thematic class.

This methodology is promising and worth improving by testing it on a larger database and by adding thematic classes that are closer to each other (e.g., sedimentology and sequence stratigraphy).
