Inspired by this, there are many papers that use one shared projection matrix for the keys and the queries instead of two:

$$W_Q = W_K$$

In that case, the dot-product scores become:

$$\text{Dot-scores} = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right)$$

That would make the presented graph undirected. Why? Because when you multiply a matrix with its transpose you get a symmetric matrix. However, keep in mind that the rank of the resulting matrix will not be increased.
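To make the symmetry and rank claims concrete, here is a minimal PyTorch sketch (my own illustration, not code from the article; the token count and dimensions are arbitrary). It checks that the raw dot-product scores are symmetric when the projection is shared, and that their rank is still bounded by d_k:

```python
# Minimal sketch: with a shared projection (W_Q = W_K = W), the raw score matrix
# Q K^T = (X W)(X W)^T is symmetric (an undirected graph over the tokens),
# and its rank stays bounded by d_k.
import torch

torch.manual_seed(0)
n_tokens, d_model, d_k = 6, 16, 4

X = torch.randn(n_tokens, d_model)   # token representations
W = torch.randn(d_model, d_k)        # shared projection used for both queries and keys

Q = X @ W
K = X @ W
scores = Q @ K.T / d_k**0.5          # raw dot-product scores (before the row-wise softmax)

print(torch.allclose(scores, scores.T))                   # True: symmetric scores
print(int(torch.linalg.matrix_rank(scores)), "<=", d_k)   # rank is still at most d_k
```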
Attention as the routing of multiple local information

Based on the "Enhancing the Transformer With Explicit Relational Encoding for Math Problem Solving" paper:

"In such a case, the attention mechanism can be interpreted as the routing of multiple local information sources into one global tree structure of local representations." ~ Schlag et al.

We tend to think that using multiple heads allows each head to attend to a different part of the input, but this paper proves the initial guess wrong.

Insight 1: "This (their results) indicates that the attention mechanism incorporates not just a subspace of the states it attends to, but affine transformations of those states that preserve nearly the full information content."

The heads preserve almost all the content. This renders attention as a routing algorithm of the query sequence with respect to the keys/values.
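As a rough illustration of the routing view (a sketch of my own with arbitrary shapes, not the paper's code): each row of the attention matrix is a set of non-negative weights that sum to 1, and a head's output routes the projected value vectors according to those weights.

```python
# Sketch of attention as routing: the attention row of each query position is a set of
# non-negative routing weights (summing to 1) over the value vectors.
import torch

torch.manual_seed(0)
n_tokens, d_model, d_k = 5, 16, 8

X = torch.randn(n_tokens, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
A = torch.softmax(Q @ K.T / d_k**0.5, dim=-1)   # routing weights, each row sums to 1
out = A @ V                                     # each output is a weighted mix of value vectors

print(torch.allclose(A.sum(dim=-1), torch.ones(n_tokens)))   # True
print(out.shape)                                              # torch.Size([5, 8])
```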
Encoder weights can be classified and pruned efficiently

Voita et al. analyzed what happens when using multiple heads in their work "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned". They identified 3 types of important heads by looking at their attention matrices:

- Positional heads that attend mostly to their neighbor.
- Syntactic heads that point to tokens with a specific syntactic relation.
- Heads that point to rare words in the sentence.
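As a toy example of the first category (my own sketch, only loosely following the paper's description of positional heads; the function name and the toy attention matrix are made up for illustration), one can score a head by how often its most-attended token is an immediate neighbour of the query position:

```python
# Toy positional-head check: how often does a head put its largest attention weight
# on the token immediately to the left or right of the current position?
import torch

def positional_score(attn: torch.Tensor) -> float:
    """attn: (seq_len, seq_len) attention matrix of one head, rows summing to 1."""
    seq_len = attn.shape[0]
    most_attended = attn.argmax(dim=-1)              # most-attended key per query position
    positions = torch.arange(seq_len)
    hits = (most_attended - positions).abs() == 1    # left or right neighbour
    return hits.float().mean().item()

# A head that always looks one token to the left scores high.
seq_len = 8
left_attn = torch.zeros(seq_len, seq_len)
left_attn[torch.arange(1, seq_len), torch.arange(seq_len - 1)] = 1.0
left_attn[0, 0] = 1.0                                # first token has no left neighbour
print(positional_score(left_attn))                   # 0.875
```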
The best way to prove the significance of their head categorization is by pruning the others. Here is an example of their pruning strategy based on the head classification for the 48 heads (8 heads times 6 blocks) of the original Transformer:

Source: Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

By mostly keeping the heads that fall into the distinguished categories, as shown, they managed to retain 17 out of 48 heads with almost the same BLEU score. Below are the results of pruning the Transformer's encoder heads on two different datasets for machine translation:

Interestingly, the encoder attention heads were the easiest to prune, while the encoder-decoder attention heads appear to be the most important for machine translation. Note that the pruned heads correspond to roughly 2/3 of the heads of the encoder.

Insight 2: Based on the fact that the encoder-decoder attention heads are retained mostly in the last layers, it is highlighted that the first layers of the decoder account for language modeling, while the last layers condition on the source sentence.
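The mechanics of head pruning can be pictured with per-head gates. The sketch below is a simplification I wrote for illustration (the paper actually learns stochastic gates with an L0 relaxation and fine-tunes them; here the gates are fixed 0/1 masks, and all names and shapes are arbitrary): a gate of zero removes a head's contribution entirely.

```python
# Simplified head-gating sketch: each head's output is scaled by a gate before the
# heads are concatenated, so setting a gate to 0 is equivalent to pruning that head.
import torch

def gated_heads_attention(X, W_q, W_k, W_v, gates):
    """X: (seq, d_model); W_q/W_k/W_v: (heads, d_model, d_head); gates: (heads,)."""
    Q = torch.einsum("sd,hdk->hsk", X, W_q)
    K = torch.einsum("sd,hdk->hsk", X, W_k)
    V = torch.einsum("sd,hdk->hsk", X, W_v)
    d_head = Q.shape[-1]
    A = torch.softmax(Q @ K.transpose(-1, -2) / d_head**0.5, dim=-1)  # (heads, seq, seq)
    head_out = gates.view(-1, 1, 1) * (A @ V)                 # zero gate -> pruned head
    return head_out.transpose(0, 1).reshape(X.shape[0], -1)   # concatenate heads

torch.manual_seed(0)
seq, d_model, n_heads, d_head = 5, 16, 4, 4
X = torch.randn(seq, d_model)
W_q, W_k, W_v = (torch.randn(n_heads, d_model, d_head) for _ in range(3))

gates = torch.tensor([1.0, 0.0, 1.0, 0.0])           # "prune" heads 1 and 3
out = gated_heads_attention(X, W_q, W_k, W_v, gates)
print(out.shape)                                      # torch.Size([5, 16])
```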