With the enormous technological advances of recent years, the number of digitized historical documents, both handwritten and printed, has increased considerably. Such documents are not easily processed in their original image form; they must be transformed into a machine-readable representation before they can be automatically understood by computer vision tools. Word spotting is an important technique for understanding and exploiting document contents by creating indexes. It is an information retrieval technique that aims to identify all occurrences of a query word in a set of documents (for example, a book). In the word spotting task, the input is a set of unindexed documents and the output is a list of words ranked by their similarity to the query word. This allows quick and easy online access to cultural heritage materials and provides further opportunities to investigate these resources.

The present PhD thesis investigates the problem of word spotting in historical documents. The first contribution of this work is the development of an embedding space for word-image representation based on the combination of convolutional networks and triplet loss. Similarity distances are then employed to match the query word against all words present in the historical documents. The second contribution is an improved method for constructing the embedding space of a word spotting model through the adoption of multiple enhancement strategies: preprocessing steps, transfer learning, online triplet mining, and semi-hard triplet selection. The third contribution aims to enhance word spotting performance by developing a conditional generative adversarial network-based model that generates clean document images from highly degraded ones.
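The triplet-loss objective and the semi-hard selection strategy mentioned above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names, the Euclidean distance metric, and the margin value of 0.2 are placeholders for illustration, not the thesis's actual implementation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge-style triplet loss: pull the anchor toward the positive
    # (an image of the same word) and push it away from the negative
    # (an image of a different word) by at least `margin`.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def semi_hard_negative(anchor, positive, candidates, margin=0.2):
    # Semi-hard selection: among candidate negatives that are farther
    # from the anchor than the positive but still inside the margin,
    # pick the hardest (closest) one.
    d_pos = np.linalg.norm(anchor - positive)
    d_negs = np.linalg.norm(candidates - anchor, axis=1)
    mask = (d_negs > d_pos) & (d_negs < d_pos + margin)
    pool = np.where(mask)[0]
    if pool.size == 0:
        # No semi-hard candidate: fall back to the hardest negative overall.
        return candidates[np.argmin(d_negs)]
    return candidates[pool[np.argmin(d_negs[pool])]]
```

In online mining, `candidates` would be the other embeddings of the current mini-batch, so triplets are formed on the fly rather than precomputed.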
This enhancement model addresses several types of degradation, such as watermarks and chemical stains, with the goal of producing clean document images and recovering fine details. In the final contribution, we propose a vision transformer architecture for generating word-image representations. The approach uses triplet loss as the optimization criterion and incorporates transfer learning from two distinct domains to improve the quality of the word-image representation. All these contributions are evaluated on several public databases that present the different challenges of historical documents. The experimental results obtained on the word spotting task compare favorably with many recent state-of-the-art methods.
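As an illustration of how a vision transformer consumes a word image, the sketch below splits a grayscale image into non-overlapping patches, projects each patch to a token embedding, and prepends a [CLS] token whose final state would serve as the word-image embedding. The patch size, the random projection, and the helper names are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

def patchify(image, patch=16):
    # Split an (H, W) word image into non-overlapping patch x patch tiles
    # and flatten each tile into a vector, giving the transformer's
    # input sequence (rows of the returned array).
    h, w = image.shape
    tiles = (image[:h - h % patch, :w - w % patch]
             .reshape(h // patch, patch, w // patch, patch)
             .transpose(0, 2, 1, 3)
             .reshape(-1, patch * patch))
    return tiles

def embed_patches(tiles, proj, cls_token):
    # Linear projection of each flattened patch, plus a learnable [CLS]
    # token prepended to the sequence; its final state is typically used
    # as the image-level embedding. Weights here are random placeholders.
    tokens = tiles @ proj
    return np.vstack([cls_token, tokens])
```

A 32x48 word image with 16-pixel patches yields a sequence of 6 patch tokens plus the [CLS] token; in a real model, positional embeddings and transformer layers would follow.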