Axiomatic Attribution for Deep Networks

We study the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works. We identify two fundamental axioms, Sensitivity and Implementation Invariance, that attribution methods ought to satisfy. We show that they are not satisfied by most known attribution methods, which we consider to be a fundamental weakness of those methods. We use the axioms to guide the design of a new attribution method called Integrated Gradients. Our method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator. We apply this method to a couple of image models, a couple of text models and a chemistry model, demonstrating its ability to debug networks, to extract rules from a network, and to enable users to engage with models better.
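
To illustrate why a few gradient calls suffice, the sketch below approximates Integrated Gradients with a Riemann sum along the straight-line path from a baseline to the input. The abstract does not state the formula; the code assumes the usual definition, in which the attribution for feature i is (x_i - baseline_i) times the gradient averaged over that path. The toy `model`, its hand-written `model_grad`, the all-zeros baseline, and the step count are hypothetical choices made for this example; in practice the gradient would come from a framework's automatic differentiation.

```python
import numpy as np


def model(x):
    # Hypothetical toy "network": a smooth scalar function of three features.
    return np.tanh(x[0] * x[1]) + x[2] ** 2


def model_grad(x):
    # Hand-written gradient of the toy model, standing in for the
    # framework's gradient operator (an assumption of this sketch).
    s = 1.0 - np.tanh(x[0] * x[1]) ** 2
    return np.array([x[1] * s, x[0] * s, 2.0 * x[2]])


def integrated_gradients(grad_fn, x, baseline, steps=50):
    # Midpoint Riemann sum over the path baseline + alpha * (x - baseline).
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.stack([grad_fn(p) for p in path])
    # Scale the path-averaged gradient by the input difference.
    return (x - baseline) * grads.mean(axis=0)


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)  # assumed baseline; the appropriate choice is task-dependent
attributions = integrated_gradients(model_grad, x, baseline)
```

A quick sanity check under these assumptions is that `attributions.sum()` should be close to `model(x) - model(baseline)`, with the gap shrinking as `steps` grows.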