Indexing | Habefast
In IT, indexing usually means automatic document indexing. But what is it? It is a software method for organising documents so that their content is easier to retrieve, much like a library filing system, but computerised. The approach differs depending on the type of document being classified: text, images, audio, video or other media.
Google, for example, uses an index to classify and organise the websites that can appear in its search engine. The aim of Google’s indexing is to display results that are relevant to web users’ searches.
Each document is associated with metadata (title, date of publication, author, category, etc.) that helps to index it. This metadata is not always accurate or representative, however, which is why computerised indexing also relies on the content itself to classify documents better. Hence the importance of semantics and keywords, which make it possible to assign a document to categories and themes. Google’s algorithm relies essentially on content to classify and index web pages.
Automatic indexing is essential because the amount of data on the web is constantly increasing and a wide variety of information is published every day. Data therefore has to be classified according to its similarities in order to make searches easier for Internet users.
Text indexing:
To index a text on the web, we concentrate on the most frequently used words, which logically relate to the main theme of the page, after applying filters. The words that appear most often are naturally “and”, “of”, “the”, etc. These frequent but meaningless words are therefore filtered out so that the most frequent meaningful words can be found.
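The filtering step described above can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list, the function name top_keywords and the sample sentence are assumptions made for the example, and real search engines rely on far larger lists and more elaborate weighting.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"and", "of", "the", "a", "an", "to", "in", "is", "it", "on", "for"}

def top_keywords(text, n=3):
    """Return the n most frequent words in text, ignoring stop words."""
    # Lower-case the text and split it into simple alphabetic words.
    words = re.findall(r"[a-z']+", text.lower())
    # Drop frequent but meaningless words before counting.
    meaningful = [w for w in words if w not in STOP_WORDS]
    return Counter(meaningful).most_common(n)

page = "The index of the library lists the books and the authors of the books."
print(top_keywords(page, 1))  # [('books', 2)]
```

Without the filter, “the” would come out on top of this sentence. With it, “books” is the most frequent meaningful word.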
Image indexing:
Images are indexed in two ways: either by their metadata, i.e. their title or other textual information about them, or by their appearance, i.e. shapes, colours, graphics, etc.
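As a rough idea of what indexing by appearance can look like, the sketch below summarises an image by its colours and compares two images on that basis. It is one simple technique among many (a coarse colour histogram compared by histogram intersection), not a description of how any particular search engine works. The function names and the input format, a list of (r, g, b) tuples, are assumptions made for the example.

```python
from collections import Counter

BUCKET_SIZE = 64  # Assumed granularity: each 0-255 channel is cut into 4 ranges.

def colour_signature(pixels):
    """Return the share of pixels falling into each coarse colour bucket."""
    buckets = Counter(
        (r // BUCKET_SIZE, g // BUCKET_SIZE, b // BUCKET_SIZE) for r, g, b in pixels
    )
    total = sum(buckets.values())
    return {bucket: count / total for bucket, count in buckets.items()}

def similarity(sig_a, sig_b):
    """Histogram intersection: 1.0 for identical colour shares, 0.0 for no overlap."""
    return sum(min(share, sig_b.get(bucket, 0.0)) for bucket, share in sig_a.items())

# Tiny made-up "images" given as lists of (r, g, b) pixels.
mostly_red = [(250, 10, 10)] * 3 + [(10, 10, 250)]
all_red = [(240, 20, 20)] * 4
all_blue = [(5, 5, 245)] * 4

red_score = similarity(colour_signature(mostly_red), colour_signature(all_red))
blue_score = similarity(colour_signature(mostly_red), colour_signature(all_blue))
print(red_score, blue_score)  # 0.75 0.25
```

The mostly red image scores higher against the red image than against the blue one, so an index built on such signatures could group images with similar colours together.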
Audio and video indexing:
As with images, audio or video content can be classified by its metadata, such as its title, or by other data such as its duration or author.
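Classifying media by this kind of data can be sketched as follows. The catalogue entries, the field names and the 300-second threshold are all hypothetical values chosen for the example.

```python
from collections import defaultdict

# Hypothetical catalogue; titles, authors and durations are made up.
catalogue = [
    {"title": "Intro to SEO", "author": "Alice", "duration_s": 320, "kind": "video"},
    {"title": "Indexing basics", "author": "Bob", "duration_s": 1450, "kind": "audio"},
    {"title": "Keywords in practice", "author": "Alice", "duration_s": 95, "kind": "video"},
]

def index_by(items, field):
    """Group item titles under each distinct value of the given metadata field."""
    index = defaultdict(list)
    for item in items:
        index[item[field]].append(item["title"])
    return dict(index)

def duration_class(item, short_limit_s=300):
    """Label an item 'short' or 'long' using an assumed threshold in seconds."""
    return "short" if item["duration_s"] < short_limit_s else "long"

by_author = index_by(catalogue, "author")
print(by_author["Alice"])  # ['Intro to SEO', 'Keywords in practice']
```

The same index_by helper works for any field, so the catalogue can just as easily be grouped by "kind" to separate audio from video.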