System design and evaluation methodologies receive significant attention in natural language processing (NLP), with the systems typically being evaluated on a common task and against shared data sets. This enables direct system comparison and facilitates progress in the field. However, computational work on metaphor is considerably more fragmented than similar research efforts in other areas of NLP and semantics. Recent years have seen a growing interest in computational modeling of metaphor, with many new statistical techniques opening routes for improving system accuracy and robustness. However, the lack of a common task definition, shared data set, and evaluation strategy makes the methods hard to compare, and thus hampers our progress as a community in this area. The goal of this article is to review the system features and evaluation strategies that have been proposed for the metaphor processing task, and to analyze their benefits and downsides, with the aim of identifying the desired properties of metaphor processing systems and a set of requirements for their evaluation.