Mazda Moayeri¹, Samyadeep Basu¹, Sriram Balasubramanian¹, Priyatham Kattakinda¹, Atoosa Chengini¹, Robert Brauneis², Soheil Feizi¹
¹University of Maryland, Computer Science Department
²George Washington University, Law Department
Correspondence: {mmoayeri, sfeizi}@umd.edu
Abstract
Recent text-to-image generative models such as Stable Diffusion are extremely adept at mimicking and generating copyrighted content, raising concerns amongst artists that their unique styles may be improperly copied. Understanding how generative models copy “artistic style” is more complex than duplicating a single image, as a style comprises a set of elements (or signature) that frequently co-occurs across a body of work, where each individual work may vary significantly. In our paper, we first reformulate the problem of “artistic copyright infringement” as a classification problem over image sets, instead of probing image-wise similarities. We then introduce ArtSavant, a practical (i.e., efficient and easy to understand) tool to (i) determine the unique style of an artist by comparing it to a reference dataset of works from artists curated from WikiArt, and (ii) recognize if the identified style reappears in generated images. We leverage two complementary methods to perform artistic style classification over image sets, including TagMatch, a novel inherently interpretable and attributable method, making it more suitable for broader use by non-technical stakeholders (artists, lawyers, judges, etc.). Leveraging ArtSavant, we then perform a large-scale empirical study to provide quantitative insight into the prevalence of artistic style copying across 3 popular text-to-image generative models. Namely, amongst a dataset of prolific artists (including many famous ones), only 20 appear to have their styles at risk of copying via simple prompting of today’s popular text-to-image generative models.
1 Introduction
In recent years, diffusion-based text-to-image generative models such as Stable Diffusion, Imagen, Mid-Journey, and DeepFloyd [20, 21, 1, 16] have captured widespread attention due to their impressive image generation capabilities, demonstrating exceptional performance and very low FID scores on various conditional image generation benchmarks. These models are pre-trained on large data corpora such as LAION [23], containing up to 5B image-text pairs that mirror a vast range of internet content, including potentially copyrighted material. This raises an important question: to what extent do image generative models learn from these copyrighted images? While previous studies [4, 25, 26] have shown that direct copying of individual images by diffusion models is generally rare and mostly occurs due to duplications in the training data, the degree to which image generative models replicate art styles, as opposed to art works, remains unclear. This issue is increasingly critical as artists express concerns about generative models mimicking their unique styles, potentially saturating the market with imitations and undermining the value of original art. Furthermore, no laws currently exist to identify and protect an artist’s style, mainly due to challenges in definition and a previous lack of necessity.
Artistic styles are complex, broadly defined over the set of artworks an artist creates over their lifetime, making it challenging to determine a style by inspecting individual works of art (à la previous image-wise copy studies). We frame artistic style as characterized by a set of elements that co-occur frequently across works by that artist. For example, Vincent Van Gogh had a characteristic art style associated with Post-Impressionism, comprising expressive wavy lines, bright unblended coloring, and his signature choppy, textured brushwork. In Figure 2, we illustrate that while these models seldom reproduce Van Gogh’s artworks exactly, they frequently capture and replicate elements of his distinctive style.
To empirically study style copying in generative models and to build a corpus of artistic styles, we first collect an art dataset consisting of artworks from 372 artists on WikiArt, along with the artist labels. We then develop ArtSavant, a practical tool which can effectively detect and attribute an image to its original artist. The design of this tool is strongly motivated by (i) the notions of ‘holistic’ and ‘analytic’ comparisons from the copyright legal literature [11, 14] and (ii) shedding insight on the question: Is there a unique set of elements co-occurring across a given artist’s works, and if so, can we extract this ‘signature’?
For a style to be considered unique, it must be distinguishable from the styles of other artists. However, describing an art style is challenging and making a case for the distinctiveness between two styles is even more so, particularly as artists frequently draw inspiration from each other. An alternative approach to proving the uniqueness of artistic styles involves demonstrating that from a collection of artworks, one can identify the artist(s) who created them. Hence, if art can be accurately attributed to its creators, this would suggest the existence of unique styles that differentiate one artist’s works from another’s. Therefore, the task of showing the existence and distinctiveness of artistic styles can be reduced to a classification problem (arguably simpler than computing image-similarity).
Our tool’s first component – DeepMatch – is a neural network which classifies an artwork to its corresponding artist. DeepMatch implicitly maps each artist to a vector (via the classification head) during training, which can be interpreted as a neural signature representing that artist. Aggregating its predictions over a set of artworks via majority voting, we find that DeepMatch achieves high test accuracy. This success indicates that unique artistic styles do indeed exist for a large fraction of artists. Since deep features are not very interpretable, however, DeepMatch is not suited for articulating the elements that comprise each artistic style. For ArtSavant to be useful and trustworthy to artists, lawyers, judges, or juries, its output must be interpretable by design. Thus, there is a need to complement DeepMatch with a more transparent system which scaffolds a black-box AI component with interpretable intermediate outputs and combines these outputs to arrive at a final prediction. To fill this need, we introduce TagMatch, an inherently interpretable and attributable tag-based classifier.
TagMatch first tags individual artworks via a novel zero-shot, selective, multilabel classification scheme using CLIP [17]. These tags are drawn from a concept vocabulary spanning diverse aspects of artistic style, created using LLMs such as Vicuna-13B [31] and validated with an MTurk human study. They describe a wide variety of artistic attributes like composition, coloring, shapes, and medium. While individual tags are common across artists and thus cannot define unique styles, we propose an efficient algorithm to search the vast space of tag combinations to surface tag signatures. Namely, we compose common ‘atomic tags’, and find that tag compositions become less frequent as the number of atomic tags within them increases. If a tag combination is unique to a specific artist, it can then be interpreted as a tag signature representing that artist, and even be used to match a set of works to a known artistic style, defined over a portfolio of works for a reference artist. We find such signatures for all artists in our dataset, and tag signatures are reliable enough to detect the style of the artists in our dataset (on a held-out set) with strong top-1 and top-5 accuracy. TagMatch is also highly efficient: once the CLIP embeddings are computed and cached, it takes just around a minute to search and find tag combinations, despite the search space being combinatorial. Most importantly, with each style detected, TagMatch also articulates the stylistic elements uniquely present in both the test set of images and the matched reference set. Moreover, TagMatch is attributable, as one can inspect the subset of images from both sets that contain the matched tag signature.
Utilizing both these components, we use ArtSavant to classify the images generated by modern text-to-image models when prompted with an artist’s name. Somewhat surprisingly, even for the prolific artists we investigate, we find a high risk of style copying for only a small fraction of artists. While style copying is not highly prevalent for the artists in our dataset, it still occurs, and may become more common as models improve.
In summary, we make the following contributions in our paper:
- We reformulate the copyright infringement of artistic styles through the lens of classification over image sets, rather than single images.
- We introduce ArtSavant, a tool consisting of a reference dataset of artworks from prolific artists and two complementary methods (including a novel, highly interpretable and attributable one) which can effectively detect unique artistic styles.
- Leveraging ArtSavant, we perform a large-scale empirical study of style copying across 3 popular text-to-image generative models, highlighting that generated images (using simple prompting) from only 20 of the artists in our dataset appear to be at high risk of style copying.
2 Related Works
As image generative models have rapidly improved in scale and sophistication, the possibility of them mimicking artists’ personal styles has become an important topic of discussion in the literature [18]. Many previous works describe ways to either detect potential direct image copying in generated images, or to foil future copying attempts by imperceptibly altering artists’ works to prevent effective training by generative models. These include techniques like adding imperceptible watermarks to copyrighted artworks [28, 8, 7] and crafting “un-learnable” examples from which models struggle to learn style-relevant information [24, 29, 30]. These methods are typically computationally expensive and incur a loss in image quality, which may render them impractical for many artists. Moreover, they do not protect artworks previously uploaded to the internet without safeguards. Others have suggested mitigations from the model owner’s perspective: either de-duplicating the dataset before training [4, 25, 26], or removing concepts from the model after training (“unlearning”) [13, 9, 3]. These are also technically challenging and require the model owner to invest significant resources, which again inhibits their practicality. Methods like [4, 25, 26] also focus on analyzing direct image copying from the training data, and thus may not be applicable to preventing style copying.
None of these works tackles the problem of detecting potentially copied art styles in generated art, especially in a manner relevant to legal standards of copyright infringement. According to current US legal standards [2], an artwork has to meet the “substantial similarity” test for it to infringe on copyright. This similarity has to be established in analytic and holistic terms [14, 11]. Analytic here refers to explaining an artwork by breaking it down into its constituents using a concrete and objective technical vocabulary, while holistic refers to the overall “look and feel” of the artwork. It is thus highly desirable for any automated copy-detection method to reflect this dichotomy, and we design our tool based on these two notions.
3 Motivation
Recent works have investigated copying at an image-wise level, showing that state-of-the-art diffusion models can generate exact replicas of a small fraction of training set images [25, 26, 4]. Typically, these works represent images in a deep embedding space via models like SSCD [15] or DINO [5], and compute image-to-image similarities across generated and real images. These results, as well as anecdotal instances, have raised concerns amongst artists, since generative models may risk saturating the market with replications, jeopardizing artists’ livelihoods and cheapening creative work they likely feel personal ownership of and attachment to. Inspired by these valid concerns and existing results on image replication, we first explore whether generative models can recreate famous artworks, e.g., by Vincent Van Gogh. Specifically, we generate images with the prompt “{artwork title} by Vincent Van Gogh” for Van Gogh’s works, and compute the DINO similarity between corresponding real and generated pairs. Figure 2 visualizes the distribution of similarities, along with examples at each similarity level. We find that the vast majority of similarities are low, corresponding to (generated, real) pairs that are far from duplicates. However, even when the generated image differs significantly from the source image, certain stylistic elements associated with Van Gogh appear consistently in the generated works. Thus, while instance-wise copying of artwork appears rare even for the very famous Van Gogh, we argue that detecting style copying requires going beyond image-to-image comparisons, as artists may still have their personal styles, developed over a long time and at significant personal cost, infringed upon in ways that searching for exact replicas would miss.
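The pairwise comparison above can be sketched as follows; the function names and array shapes are illustrative assumptions, with any off-the-shelf encoder (e.g., DINO) supplying the embedding rows.

```python
import numpy as np

def pairwise_similarity(real_embs: np.ndarray, gen_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between corresponding (real, generated) pairs.

    real_embs, gen_embs: arrays of shape (n, d), where row i of each holds
    the embedding of a real artwork and of its generated counterpart.
    Returns one similarity per pair; values near 1 flag near-duplicates.
    """
    real = real_embs / np.linalg.norm(real_embs, axis=1, keepdims=True)
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    return (real * gen).sum(axis=1)  # row-wise dot product of unit vectors
```

Plotting the resulting similarity distribution (as in Figure 2) then separates rare near-duplicates from the common low-similarity pairs.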
Namely, to investigate style copying, one must first identify one (or more) styles used by an artist, which we define as a set of elements, or signature, that frequently occurs across multiple works.
Currently, copyright law does not protect an artist’s style; however, this may be due to the prior absence of a pressing need for such protection from humans, combined with the difficulty of defining a person’s style. Given the rapid advance of generative models, we attempt to tackle the problem of defining and identifying artistic styles. Importantly, so that our results are useful to a broad audience, we prioritize transparency and efficiency in our work. That is, we design a tool that is fast enough for an end-user (e.g., artist or lawyer) to run, and interpretable enough so that the user can understand and convey the results to another party (e.g., judge or jury). In line with the existing notions of ‘analytic’ and ‘holistic’ components of an artwork, we extend these notions to an artist’s style, which may be developed over a set of artworks comprising an art portfolio. Here, the holistic component would refer to the overall ‘look and feel’ of the artist’s style, while the analytic component would consist of a breakdown of an art style into its constituent components. We make this concrete by designing a style copy detection tool which corresponds closely with the mentioned concepts and operates on a set of images rather than on a single image.
4 Towards Practical Artistic Style Copy Detection
To argue that an artist’s style is copied, one must first demonstrate the existence of a unique style for that artist. An analytic approach is to articulate the frequently co-occurring elements that comprise the artist’s style. Alternatively, a holistic argument is to show that the artist’s work can consistently be distinguished from that of other artists: if so, then there must exist something unique present across the artist’s portfolio. In the latter case, we have reduced style copy detection to a classification problem over sets of images (i.e., artist portfolios), something neural networks are well suited to do. We now propose DeepMatch and TagMatch, two complementary methods (w.r.t. accuracy and interpretability) that detect artistic styles in holistic and analytic manners, respectively.
4.1 WikiArt Dataset
To distinguish one artist’s style from that of others, we need a corpus of artistic styles (consisting of portfolios from many artists) to compare against. To this end, we curate a dataset of artworks from WikiArt (https://www.wikiart.org/) to serve as (i) a reference set of artistic styles, (ii) a validation set of real art showing our methods can recognize an artist’s style when shown a held-out set of their works, and (iii) a test-bed to explore whether text-to-image models replicate the styles of the artists in our dataset in their generated images. Previous work [27] uses images from WikiArt for a different purpose (GAN training). Since the content of WikiArt has been updated since then, we re-scrape WikiArt, perform a filtering step, and curate a repository of 91k painting images encompassing 372 unique artists. The filtering step ensures that each artist has at least 100 images. We refer to each artist’s set of images as their portfolio. Each image in our dataset is annotated with its corresponding genre (e.g., landscape) and style (e.g., Impressionism), which provides useful information about the paintings beyond their titles. We provide an easy-to-execute script with all the necessary filtering steps, so that newer versions of WikiArt can be generated if desired (see Appendix). We now detail our two complementary methods that compare a test set of images to our reference corpus so as to detect if any of the reference styles reappear.
4.2 DeepMatch: Black-Box Detector
DeepMatch consists of a lightweight artist classifier (on images) and a majority-voting aggregation scheme to obtain one prediction for a set of images. Majority voting requires that at least half the images in a test set be predicted as a given artist for DeepMatch to predict that artist, allowing for abstention when no specific style is recognized with sufficient confidence. For our classifier, we train a two-layer MLP on top of embeddings from a frozen CLIP ViT-B/16 vision encoder [17] to classify artworks to their respective artists, using a train split of our dataset. We employ weighted sampling to account for class imbalance. Because we utilize frozen embeddings, training is very fast, taking only a few minutes. Thus, a new artist could easily retrain a detector to include their works (and thus encode their artistic style).
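The majority-vote aggregation with abstention described above can be sketched as a small helper; the per-image classifier itself (the MLP on frozen CLIP embeddings) is standard and omitted here.

```python
from collections import Counter

def deepmatch_vote(image_preds):
    """Aggregate per-image artist predictions into one portfolio prediction.

    image_preds: list of predicted artist names, one per test image.
    Predicts an artist only if they receive at least half of the votes;
    otherwise abstains (returns None), since no style is recognized with
    sufficient confidence.
    """
    if not image_preds:
        return None
    artist, count = Counter(image_preds).most_common(1)[0]
    return artist if count >= len(image_preds) / 2 else None
```

For example, a test set where 6 of 10 images are classified as Van Gogh is matched to Van Gogh, while a set split evenly among four artists yields an abstention.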
Validation of the Detector. We apply DeepMatch on the held-out test split of our dataset and observe that the image-wise classifier attains high accuracy per image across artists. When aggregating image-wise predictions via majority vote, the vast majority of artists are matched, validating our method and offering strong evidence for the existence of unique artistic styles. Specifically, neural classifiers capture unique and frequently co-occurring characteristics of the artists in their embedding space, which can be thought of as neural signatures. Figure 3 shows the distribution of image-wise accuracies per artist, shading correctly matched images (green). We also present an image from one of the few artists whose style is not matched by DeepMatch, along with an image from a similar artist. Notice that the styles of two artists can be extremely similar, making the existence of unique artistic styles for the vast majority of artists considered (by way of neural signatures) a non-trivial observation.
4.3 Interpretable Artistic Signatures
We now provide an interpretable alternative to matching via neural signatures, drawing inspiration from the interpretable failure mode extraction of [19]. Namely, we first tag images with atomic tags drawn from a vocabulary of stylistic elements. Then, we compose tags efficiently, going from atomic tags that are common across artists to longer tag compositions that are unique to each artist (i.e., tag signatures). Lastly, we detect artist styles in an attributable way by matching tag signatures in a test portfolio to those of artists in our reference corpus. We detail each step of TagMatch, as well as validation results, below.
Zero-shot Art Tagging. We utilize the zero-shot open-vocabulary recognition abilities of CLIP to tag images with descriptors of stylistic elements. First, we construct a concept vocabulary with help from LLMs. Namely, we prompt Vicuna-13B and ChatGPT to generate a dictionary of concepts along various aspects of art. We manually consolidate and amend the concept dictionary, resulting in a vocabulary of concepts spanning multiple aspects (see Appendix 0.D.1).
To assign concepts to images, we design a novel scheme consisting of selective multilabel classification per aspect. Namely, for an image, we compute CLIP similarities to all concepts and normalize the similarities within each aspect. Then, we only assign a concept if its normalized similarity (i.e., z-score) exceeds a threshold. This means a concept is only assigned for an aspect if the image is substantially more similar to this concept than to other concepts describing the same aspect. Classifying per aspect allows a diversity of descriptors to emerge; global thresholding results in a biased tag description, as concepts for certain aspects (e.g., subject matter) consistently have higher CLIP similarity than those for more nuanced aspects (e.g., brushwork). We call the assigned concepts atomic tags; Figure 4 shows atomic tags assigned for a few examples.
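The per-aspect selective tagging can be sketched as follows; the z-score threshold value and input format are illustrative assumptions, with the CLIP similarities assumed to be precomputed.

```python
import numpy as np

def selective_tags(sim_by_aspect: dict, z_thresh: float = 1.0) -> list:
    """Zero-shot selective multilabel tagging, one pass per aspect.

    sim_by_aspect maps an aspect name (e.g., "brushwork") to a dict of
    {concept: CLIP similarity of the image to that concept}. Within each
    aspect, similarities are z-normalized, and a concept is assigned only
    if its z-score exceeds z_thresh, i.e., the image is substantially more
    similar to it than to the aspect's other concepts.
    """
    tags = []
    for aspect, sims in sim_by_aspect.items():
        concepts = list(sims)
        vals = np.array([sims[c] for c in concepts])
        z = (vals - vals.mean()) / (vals.std() + 1e-8)  # per-aspect z-scores
        tags += [(aspect, c) for c, zc in zip(concepts, z) if zc > z_thresh]
    return tags
```

Note that an aspect with no clearly dominant concept yields no tag at all, which is the selectivity that distinguishes this scheme from ordinary top-1 classification.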
Validation of Tag Quality via Human Study. We validate the effectiveness of our tagging via a human study involving MTurk workers. In particular, given an image of an artwork and an assigned atomic tag from the vocabulary, MTurk workers are asked whether the term matches the artwork (i.e., whether the concept is present) and select a response accordingly. We collect responses for a sample of images, with three annotators each. We find that in only 17 cases does a majority of workers disagree with the provided tag, suggesting our tagging has a low false positive rate. We also observe that all three annotators agree in only a fraction of cases, reflecting that describing artistic style can be subjective. While our tagging is not perfect, it is a deterministic and automatic method of articulating artistic style elements, and it will improve as the underlying VLMs improve. See the appendix for more details and discussion of the human study.
Tag Composition for Artists. Using the atomic tags in the artistic concept vocabulary, we design a simple, easy-to-understand iterative algorithm to obtain a set of tag signatures for each artist. These signatures are compositions of subsets of atomic tags. In particular, our algorithm efficiently searches the space of tag compositions, going from atomic tags to compositions of tags that become more unique as the length of the composition grows. For example, while 40 of the artists may use simple colors, only 15 may use both simple colors and an Impressionist style.
To efficiently search the space of tag compositions for a given artist, we first assign a set of tags to each of their images via the zero-shot selective multilabel classification method described above. To get atomic tags for an artist, we aggregate the atomic tags over all their images and keep only those occurring in a sufficient fraction of works; we call this aggregate the artist’s common atomic tags. Then, for each of the artist’s images, we intersect the image’s predicted tags with the common atomic tags, compute the powerset of the tags in the intersection, and increment the count of each tag composition in the powerset. Note that the intersection is much smaller than the full set of predicted tags, so iterating through its powerset per image is far faster than enumerating compositions over all tags. Finally, we again filter the tag compositions, only including those that occur in a sufficient fraction of works. We provide the details of this tag composition algorithm in Algorithm 1, and discuss other details in the Appendix.
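The composition search can be sketched as below; the frequency threshold `min_frac` and length cap `max_len` are assumed hyperparameters for illustration, not the paper's exact values.

```python
from collections import Counter
from itertools import combinations

def tag_signatures(image_tags, min_frac=0.25, max_len=4):
    """Count tag compositions over one artist's portfolio.

    image_tags: list of per-image tag sets for a single artist.
    Step 1: keep atomic tags appearing in at least min_frac of works.
    Step 2: per image, intersect its tags with the common atomic tags and
            count every subset (up to max_len tags) of that intersection.
    Step 3: keep only compositions appearing in at least min_frac of works.
    """
    n = len(image_tags)
    atom_counts = Counter(t for tags in image_tags for t in set(tags))
    common = {t for t, c in atom_counts.items() if c >= min_frac * n}
    comp_counts = Counter()
    for tags in image_tags:
        present = sorted(set(tags) & common)  # small set -> cheap powerset
        for k in range(1, min(len(present), max_len) + 1):
            comp_counts.update(combinations(present, k))
    return {comp: c for comp, c in comp_counts.items() if c >= min_frac * n}
```

A composition surviving the final filter for one artist but for no other artist in the corpus is then a candidate tag signature.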
Do Unique Signatures Exist for Artists? Using our tag composition method on the curated WikiArt dataset, we find that a unique artistic signature, in the form of a unique tag composition, exists per artist. In Figure 5, we show that our tag composition algorithm selects tag compositions such that only very few artists exhibit a given composition in their paintings as the tag length increases. This shows that artists exhibit unique styles which can be effectively captured by our iterative algorithm. Leveraging these observations, in the next section we describe TagMatch, which can classify a set of artworks to an artist by uniquely matching such tags (or tag signatures).
4.4 TagMatch: Interpretable and Attributable Style Detection
In Section 4.2, we outlined a holistic approach to accurately detect artistic styles. While DeepMatch obtains high accuracy (recognizing the styles of most artists), the neural signatures it relies upon lack interpretability. For a copyright detection tool to be useful in practice (e.g., as assistive technology), providing explanations of the classification decisions can tremendously benefit the end-user. To this end, we leverage the efficient tag composition algorithm of Section 4.3 to develop TagMatch, an interpretable classification and attribution method which can effectively classify a set of artworks to an artist, as well as provide the reasoning behind the classification and example images from both sets presenting the matched tag signature. TagMatch follows the intuition of matching a test portfolio to the reference artist whose portfolio shares the most unique tag signatures. Given a set of test images, we first obtain a number of tag compositions for them using the iterative algorithm of Section 4.3. These tag compositions are then compared with the tag compositions of the artists in the reference corpus in order of uniqueness (i.e., we first consider tag signatures present in the test portfolio that occur for the fewest reference artists). We can then rank reference artists by how unique their shared tags with the test portfolio are. Detailed steps of the algorithm are in the Appendix (Algo. 2). Altogether, TagMatch is remarkably fast, taking only about a minute after caching the embeddings of all images.
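The uniqueness-ordered matching intuition can be sketched as follows; this is an illustrative ranking under assumed scoring rules (rarest shared signature first, ties broken by number shared), not the paper's exact Algorithm 2.

```python
def tagmatch_rank(test_sigs, reference_sigs):
    """Rank reference artists by the uniqueness of shared tag signatures.

    test_sigs: set of tag compositions found in the test portfolio.
    reference_sigs: dict {artist: set of that artist's tag compositions}.
    A signature exhibited by fewer reference artists is stronger evidence,
    so each candidate artist is scored by the rarity of their rarest shared
    signature, breaking ties by how many signatures they share.
    """
    # count how many reference artists exhibit each signature
    prevalence = {}
    for sigs in reference_sigs.values():
        for s in sigs:
            prevalence[s] = prevalence.get(s, 0) + 1
    scores = []
    for artist, sigs in reference_sigs.items():
        shared = test_sigs & sigs
        if shared:
            rarest = min(prevalence[s] for s in shared)
            scores.append((rarest, -len(shared), artist))
    return [a for _, _, a in sorted(scores)]  # most unique match first
```

The ranked list directly supports the top-1/top-5 evaluation used below, and each match carries the shared signatures as its explanation.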
Validation of TagMatch. We again utilize the test split of our WikiArt dataset to validate the proposed style detection method. We observe that TagMatch predicts the correct artist with reasonable top-1 accuracy, with top-5 and top-10 accuracies rising further. While less accurate than DeepMatch, the tag signatures provided by TagMatch allow analytic arguments to be made regarding style copying, as the exact tag signatures used in matching can be inspected. Moreover, the subsets of images in both the test portfolio and matched reference portfolio can be easily retrieved, offering direct attribution; examples can be seen in the next section, where we match generated images to our reference artists. Overall, we hope TagMatch and DeepMatch can serve as automatic and objective tools to navigate the subtle problem of identifying artistic styles, towards understanding when styles are copied and helping artists argue their case (e.g., in a court of law) in such instances.
5 Analysis on Generated Art from Text-to-Image Models
We now turn to generated images, towards two ends. First, we seek to demonstrate that the tools we validated on real art can be similarly effective in recognizing and articulating artistic styles in generated art. Second, by conducting a systematic empirical study, we aim to shed quantitative insight into the phenomenon of style infringement by generative models. While enough instances of style mimicry have been observed to raise concern [24, 18], the prevalence and nature of such instances remain nebulous. We hope our analysis can provide a more complete picture of the current state of style copying by generative models.
Specifically, we apply TagMatch and DeepMatch to generated versions of the art in our WikiArt dataset, so as to quantify the degree to which generative models reproduce the stylistic signatures of the artists in our dataset. These artists are somewhat representative in that they span a wide spectrum of broader styles and are each somewhat popular and prolific (having at least 100 works on WikiArt), making them good candidates to potentially have their styles infringed by generative models.
Setup. We extract the titles of the paintings in our WikiArt dataset and augment them with the name of the artist. Using these prompts (e.g., “the starry night by Vincent Van Gogh” or “the water lilies by Claude Monet”), we generate images from 3 text-to-image generative models: (i) Stable-Diffusion-v1.4; (ii) Stable-Diffusion-v2.0; and (iii) OpenJourney from PromptHero. We note that (i) and (ii) are pre-trained on a subset of the LAION dataset [22], while (iii) is pre-trained on LAION and then fine-tuned on Mid-Journey generated images. We also note that (ii) uses a stronger CLIP text-encoder, which can help generate images with better fidelity to the text prompt. These characteristics make the models distinct from one another, providing a diverse range of artistic interpretations and styles in our image generation experiments.
Thus, for each artist in our dataset, in addition to their set of real artworks, we obtain a corresponding set of generated images per generative model. We then compare each set of generated art to the entire corpus of existing art. Namely, we seek to quantify the frequency with which generated art prompted to be in the style of a specific artist is matched to that artist; we call this the match rate. Match rate is a percentage over artists, as each artist is either matched correctly or not (i.e., matched to the wrong artist or to no artist at all). We also consider top-5 and top-10 match rates, where a top-k match occurs when the artist is amongst the top k predictions for the set of images generated using prompts of the form “{title of a work by the artist} by {the artist}”.
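The top-k match rate computation described above can be sketched as a small helper; the input format (a ranked prediction list per prompted artist) is an illustrative assumption.

```python
def match_rates(rankings, ks=(1, 5, 10)):
    """Compute top-k match rates over artists.

    rankings: dict {artist: ordered list of predicted artists for the set of
    images generated with prompts naming that artist}. A top-k match occurs
    when the prompted artist appears among the top k predictions; the match
    rate is the fraction of artists for which this holds.
    """
    n = len(rankings)
    return {
        k: sum(artist in preds[:k] for artist, preds in rankings.items()) / n
        for k in ks
    }
```

For DeepMatch, the prediction list degenerates to a single (possibly abstained) artist, so only the top-1 rate applies.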
5.1 DeepMatch Recognizes Artistic Styles for Half of the Artists
We first employ DeepMatch, the more accurate but less interpretable of our two style recognition systems, to quantify the degree to which the unique styles of artists from our dataset are reproduced by generative models prompted to recreate their works. Recall that DeepMatch predicts an artist for a set of test images by first inferring the artist for each image and then aggregating predictions across the set. Thus, in addition to a match rate, we can investigate a match confidence: the fraction of generated images predicted as the prompted artist, i.e., the per-image accuracy of DeepMatch on a set of generated images if we take each image’s label to be the artist whose style it attempts to recreate. Note that match confidence is computed per artist.
Figure 6 shows match rate and match confidence across the three generative models. We observe a modest average match rate, indicating that for many of the artists in our study, generative models cannot reproduce their styles in a way recognizable to DeepMatch, which is highly accurate on real art. When we inspect match confidences (right subplot), we see that even for artists whose styles are recognized over a set of generations, the per-image accuracy is far lower. Averaged over models, a substantial share of artists see match confidences below 0.2, indicating that fewer than one in five images generated in their style are predicted as their work, and for all three models, more than half of the artists see low match confidences. On the other hand, we observe a handful of artists whose styles are matched with high confidence. These include very famous artists like Van Gogh, Claude Monet, and Renoir, whom we would expect generative models to emulate well. However, a few lesser-known artists are also present, such as Cindy Sherman and Jacek Yerka, who are still alive and thus could be negatively affected by generative models reproducing their styles.
5.2 Articulating Style Infringements with TagMatch
We now utilize TagMatch because of its enhanced interpretability. Recall that in addition to predicting an artistic style, TagMatch also names the specific signature shared between the test set of images and the reference set of images for the predicted style. Thus, we can inspect the shared signature, as well as instances from both sets where the signature is present. Figure 7 visualizes a number of examples. In comparing the subsets of generated and real images for top-1 matches, we observe that while pixel level differences are common across retrieved image subsets, stylistic elements are consistent in both sets with the labeled tags. Thus, TagMatch can serve as an effective tool for describing the way in which stylistic elements are copied via language, as well as providing direct evidence of the potential infringement.
TagMatch yields match rates of , , and for top-1, top-5, and top-10 matches respectively. Low top-1 match rates using TagMatch alone cannot be used to argue that generative models do not reproduce artistic styles, but convincing arguments can be made combining the two methods. TagMatch also allows for understanding image distributions from the perspective of interpretable tags. We explore this direction in the appendix, finding differences in the uniqueness of the tags present in generated art vs real art.
6 Conclusion
In our paper, we rethink the problem of copyright infringement in the context of artistic styles. We first argue that image-similarity approaches to copy detection may not fully capture the nuance of artistic style copying. After reformulating the task to a classification problem over image sets, we develop ArtSavant, a novel tool consisting of a dataset and two complementary methods that can effectively recognize artistic styles via neural and tag-based signatures. The success of our methods offers strong evidence for the existence of unique artistic signatures, a necessary prerequisite for styles to be protected. We highlight TagMatch, which scaffolds a black-box AI component with interpretable intermediate outputs and a transparent way of combining those outputs to arrive at a final prediction, resulting in a white(r)-box AI system that can classify a set of images to an artist with reasonable accuracy while also providing succinct text explanations and image attributions. Using these two detectors, we analyze generated images from different text-to-image generative models, highlighting that amongst all the artists in our study (including many famous ones), only 20 are recognized as having their style copied.
7 Acknowledgements
This project was supported in part by a grant from an NSF CAREER AWARD 1942230, ONR YIP award N00014-22-1-2271, ARO’s Early Career Program Award 310902-00001, HR00112090132 (DARPA/ RED), HR001119S0026 (DARPA/ GARD), Army Grant No. W911NF2120076, the NSF award CCF2212458, NSF Award No. 2229885 (NSF Institute for Trustworthy AI in Law and Society, TRAILS), an Amazon Research Award and an award from Capital One.
References
- [1]Deepfloyd (Apr 2023), https://github.com/deep-floyd/IF
- [2]Generative artificial intelligence and copyright law (Sep 2023), https://crsreports.congress.gov/product/pdf/LSB/LSB10922
- [3]Basu, S., Zhao, N., Morariu, V., Feizi, S., Manjunatha, V.: Localizing and editing knowledge in text-to-image generative models (2023)
- [4]Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramèr, F., Balle, B., Ippolito, D., Wallace, E.: Extracting training data from diffusion models (2023)
- [5]Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. CoRR abs/2104.14294 (2021), https://arxiv.org/abs/2104.14294
- [6]Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
- [7]Cui, Y., Ren, J., Lin, Y., Xu, H., He, P., Xing, Y., Fan, W., Liu, H., Tang, J.: FT-SHIELD: A watermark against unauthorized fine-tuning in text-to-image diffusion models (2024), https://openreview.net/forum?id=OQccFglTb5
- [8]Cui, Y., Ren, J., Xu, H., He, P., Liu, H., Sun, L., Xing, Y., Tang, J.: Diffusionshield: A watermark for copyright protection against generative diffusion models (2023)
- [9]Gandikota, R., Orgad, H., Belinkov, Y., Materzyńska, J., Bau, D.: Unified concept editing in diffusion models (2023)
- [10]Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2414–2423 (2016)
- [11]Goldstein, P.: Goldstein on Copyright, 3rd edition. Wolters Kluwer Legal & Regulatory U.S. (2014)
- [12]Huang, X., Huang, Y.J., Zhang, Y., Tian, W., Feng, R., Zhang, Y., Xie, Y., Li, Y., Zhang, L.: Open-set image tagging with multi-grained text supervision. arXiv e-prints pp. arXiv–2310 (2023)
- [13]Kumari, N., Zhang, B., Wang, S.Y., Shechtman, E., Zhang, R., Zhu, J.Y.: Ablating concepts in text-to-image diffusion models (2023)
- [14]Oakes, Calebrisi, Sotomayor: Tufenkian import export ventures inc v. einstein moomjy inc (2003), https://caselaw.findlaw.com/court/us-2nd-circuit/1455682.html
- [15]Pizzi, E., Roy, S.D., Ravindra, S.N., Goyal, P., Douze, M.: A self-supervised descriptor for image copy detection (2022)
- [16]Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=di52zR8xgf
- [17]Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021)
- [18]Ren, J., Xu, H., He, P., Cui, Y., Zeng, S., Zhang, J., Wen, H., Ding, J., Liu, H., Chang, Y., Tang, J.: Copyright protection in generative ai: A technical perspective (2024)
- [19]Rezaei, K., Saberi, M., Moayeri, M., Feizi, S.: Prime: Prioritizing interpretability in failure mode extraction (2023)
- [20]Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
- [21]Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding (2022)
- [22]Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: Laion-5b: An open large-scale dataset for training next generation image-text models (2022)
- [23]Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C.W., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S.R., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: LAION-5b: An open large-scale dataset for training next generation image-text models. In: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022), https://openreview.net/forum?id=M3Y74vmsMcY
- [24]Shan, S., Cryan, J., Wenger, E., Zheng, H., Hanocka, R., Zhao, B.Y.: Glaze: Protecting artists from style mimicry by text-to-image models (2023)
- [25]Somepalli, G., Singla, V., Goldblum, M., Geiping, J., Goldstein, T.: Diffusion art or digital forgery? investigating data replication in diffusion models (2022)
- [26]Somepalli, G., Singla, V., Goldblum, M., Geiping, J., Goldstein, T.: Understanding and mitigating copying in diffusion models. In: Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol.36, pp. 47783–47803. Curran Associates, Inc. (2023), https://proceedings.neurips.cc/paper_files/paper/2023/file/9521b6e7f33e039e7d92e23f5e37bbf4-Paper-Conference.pdf
- [27]Tan, W.R., Chan, C.S., Aguirre, H.E., Tanaka, K.: Artgan: Artwork synthesis with conditional categorial gans. CoRR abs/1702.03410 (2017), http://arxiv.org/abs/1702.03410
- [28]Wang, Z., Chen, C., Lyu, L., Metaxas, D.N., Ma, S.: DIAGNOSIS: Detecting unauthorized data usages in text-to-image diffusion models. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=f8S3aLm0Vp
- [29]Xue, H., Liang, C., Wu, X., Chen, Y.: Toward effective protection against diffusion based mimicry through score distillation (2024)
- [30]Zhao, Z., Duan, J., Xu, K., Wang, C., Guo, R.Z.Z.D.Q., Hu, X.: Can protective perturbation safeguard personal data from being exploited by stable diffusion? (2023)
- [31]Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-judge with MT-bench and chatbot arena. In: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023), https://openreview.net/forum?id=uccHPGDlao
Appendix 0.A ArtSavant Demonstration:
A Practical Tool to Protect Artists
The founding principle of ArtSavant was practical utility. In other words, we strove to build something people can actually use. Specifically, the motivating use case revolves around a hypothetical artist who is concerned that generative models may be copying their style. In figure 8, we outline the general flow of how our tool can be used. The concerned artist first presents a corpus of their works, along with their own name and the titles of each work. ArtSavant then creates an easy-to-understand report characterizing the degree to which generative models copy the artist's style. The artist can present a set of generated images, or we can generate them by prompting text-to-image models with captions of the form “{title of work} by {name of artist}” for each work provided by the artist.
As explained in the diagram, we first combine the provided works with our existing art repository, performing a train/test split as well. Using the train split, we extract neural and tag signatures. In other words, we train a classifier over the artists, and we also tag all images, compose tags within artists, and store extracted tags per artist. Then, using the extracted neural and tag signatures, we can apply DeepMatch and TagMatch respectively. Applying DeepMatch to the held-out art provides a measure of recognizability. That is, it establishes that the test artist has an identifiable style to begin with; otherwise, there can be no style copying. Then, running DeepMatch on generated images provides a quantitative way to understand whether the artist's style appears consistently in generated works (i.e., is there a match) and with what frequency (i.e., what is the match confidence). Finally, running TagMatch on the generated images helps articulate the particular style signatures that are copied, enabling an analytic way to argue infringement. Moreover, TagMatch surfaces stylistically similar examples in a way that is faithful to how TagMatch infers styles.
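The workflow above can be sketched at a high level. Every callable in this sketch is a hypothetical stand-in for a component described in the text (classifier training, tag extraction, DeepMatch, TagMatch), not the released tool's API, and the even train/held-out split is illustrative.

```python
# High-level sketch of the report workflow; every callable here is a
# hypothetical stand-in passed in by the caller, not ArtSavant's actual API.

def artsavant_report(artist_works, generated_images,
                     train_classifier, extract_tags, deep_match, tag_match):
    """Assemble the three pieces of evidence the report is built from."""
    # Split the artist's provided works into train and held-out halves.
    mid = len(artist_works) // 2
    train, held_out = artist_works[:mid], artist_works[mid:]
    classifier = train_classifier(train)   # neural signature
    tags = extract_tags(train)             # tag signature
    return {
        # 1. Is the artist's style recognizable on their own held-out art?
        "recognizability": deep_match(classifier, held_out),
        # 2. Does that style reappear in generated images, and how often?
        "generated_match": deep_match(classifier, generated_images),
        # 3. Which stylistic elements are shared, with attributions?
        "shared_signature": tag_match(tags, generated_images),
    }
```

The three report fields mirror the three uses of the detectors described above: recognizability on held-out real art, match statistics on generated art, and an articulated tag signature.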
Figure 9 shows an example report output by ArtSavant when presented with art from an artist named Canaletto, whom we observed to be at risk of style infringement. We design the report to be easy to read and understand, as well as grounded in evidence. Moreover, the report can be generated very quickly: because all steps operate on embeddings from a frozen CLIP encoder, the process takes about minutes, as we can simply compute embeddings once (and offline for the WikiArt corpus). Note that our empirical analysis suggests that most artists are not at risk of style copying. While we do not show an example report for such an artist, it would primarily consist of the first two components, and would hopefully relieve concerned artists by showing that the generated works fail to recognizably capture their unique styles.
Appendix 0.B Limitations
Our work tackles a novel problem of artistic style infringements. Style, however, is qualitative. We merely put forward one definition for artistic style, along with two implementations for demonstrating the existence of a style given example works from an artist and recognizing the identified style in other works.
Importantly, we argue that an artist's style is unique if we can consistently distinguish their work from that of other artists. However, we can only proxy the entire space of artists. We construct a dataset consisting of works from artists spanning diverse schools of art and time periods in an attempt to represent the space of existing artists, though of course we will always fall short in capturing all kinds of art. We provide tools to allow this dataset to grow with time, and we caution that if an artist is the only representative of some broader artistic style in our reference set, the uniqueness of that artist's style may be overestimated, and as such, generated images may be matched to this artist with overestimated confidence. However, if only one out of artists exhibits some style, then one could argue that this alone reflects a notable uniqueness of that artist. To employ a stricter criterion for alleging style copying, we recommend augmenting the reference set to include more artists with very similar styles to the artist in question. Nonetheless, we believe our reference dataset represents art broadly enough that analysis based on it is still informative.
We also note that our atomic tagging leverages an existing foundation model (CLIP) with no additional training. While we verify the precision of our tags, CLIP is known to have issues with complex concepts. Further, we do not claim our tags achieve perfect recall (most image taggers do not). We advise users to interpret the assignment of a tag to indicate a strong presence of that concept, relative to similar concepts (i.e. from the same aspect of artistic style). While our tagger is not perfect, it is objective and automatic, enabling interpretable style articulation and detection. Also, we note that the field of image tagging in general has seen rapid improvement in the past year [12], and an improved tagger could easily be swapped into our pipeline.
Lastly, we only analyze generated images from off-the-shelf text-to-image models. It is possible that particularly determined and AI-adept style thieves could fine-tune a model to more closely replicate specific artistic styles. This is a much more threatening scenario, though it requires greater effort and ability from the style thief. We elect to demonstrate the feasibility of our approach in the more broadly accessible setting of using models off-the-shelf, and note that our method can flexibly accept generated images produced in a different way (or perhaps discovered on the internet); notice that generated images are an optional input in figure 8. We look forward to explorations of more threatening scenarios in future work, and hope both our formulation and our methods for measuring style copying prove to be of use.
Appendix 0.C A nuance in artistic style infringements:
Existing Artists can have very similar styles
A crucial step in arguing that an artist's style has been infringed is to first demonstrate the existence of that artist's unique style. We note that doing so objectively is non-trivial, as a style may not have a clear definition, and thus it can be challenging to systematically compare against all other artistic styles so as to show uniqueness. In our work, we utilized classification, claiming that if an artist's works can consistently be mapped (i.e., at least half the time) to that artist over a large set of other artists, then that artist must have some underlying unique style (parameterized by a neural signature).
In doing so, we found that of artists could be recognized based on a set of (at least of) their works (held out when training the classifier). What about the remaining of artists? We now take a closer look at these artists, and also introduce a second, stricter style copying criterion. Namely, we consider the notion that it may be unfair to claim a generative model is copying the style of an artist if another existing artist seems to also be copying that artist. That is, we propose a way to verify that the generative model shows not only a substantial similarity to the copied artist, but an unprecedented similarity.
0.C.1 Artists whose styles were not recognized
First, we inspect more examples from artists who were not recognized using our majority voting threshold in DeepMatch. That is, less than half of their held-out works were predicted to them. Figure 10 shows a number of examples, from which we can make some qualitative observations. First, the styles of artists who operate in the same broader genre (e.g. portraiture, landscapes, narrative scenes in renaissance styles, etc) can be extremely similar. We even see an instance where an artist’s son’s style is indistinguishable from his father’s (Jamie and Andrew Wyeth). Lastly, we note that in most cases, the artists only marginally fall short of our recognition threshold (i.e. accuracy for their held-out works is only a bit below ). We utilize majority voting because (i) it is intuitive, (ii) it requires consistent appearance of the neural signature across works, and (iii) it allows for abstention when no particular style is strongly present. However, the exact threshold of can be altered as desired. In summary, as in Figure 3, we see artistic styles can be very similar, making the existence of unique artistic styles for the vast majority of artists a non-trivial observation.
If an artist's style cannot be recognized over their own held-out works, arguing that a generative model copies that style is tenuous, as the style itself is ill-defined. Notably, in these cases, the classifier had the option to predict the correct artist. However, in applying DeepMatch to generated images, there is no direct option for the classifier to abstain from predicting anyone, on the grounds that generated art comes from a “new artist” who takes inspiration from existing artists. Note that abstention is still possible (due to the majority voting in DeepMatch), and occurs when a match confidence falls below . To make comparisons fairer to generative models, we now discuss a stricter criterion of unprecedented similarity.
0.C.2 Unprecedented Similarity: Do generative models copy styles more than existing artists already do?
A nuance that requires consideration when studying artistic style copying is that it is possible for two artists to have very similar styles. Thus, it may be unfair to allege that a generative model is copying an artist A if there exists another artist whose style is just as similar, or in fact even more similar, to A's. Towards this end, we introduce unprecedented similarity, which requires that the similarity between works generated by a model in the style of artist A and A's own works be higher than the similarity of any existing artist's works to A's. That is, writing sim(·,·) for the similarity between two sets of works, G_A for the generated works, and W_B for the works of artist B, we require sim(G_A, W_A) > sim(W_B, W_A) for all other existing artists B ≠ A.
Note that this is a stricter criterion than our previous threshold. In DeepMatch, we required that at least half of the works in a given set of test images be predicted to a single artist in order for us to flag the test images as a potential style infringement. In other words, that threshold required that the match confidence of the generated set to the artist in question be at least one half, which in turn implies that it exceeds the match confidence to any other single artist, with room to spare (here we use match confidence to denote similarity).
Now, however, instead of only comparing the generated works to the works of other artists, we must also compare every other artist's works to those of the artist A in question. Rather than comparing against all other artists, we inspect the most similar artist B to A, identified by taking the artist with the highest rate of false-positive predictions to artist A. Then, we hold out B's works and train a new classifier on the remaining artists. Finally, we check for style matches to A for both the set of generated images and the works of the most similar artist B.
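As a minimal sketch (function and parameter names are ours, not the paper's code), the stricter criterion reduces to comparing two match confidences once the retrained classifier has been run: the confidence of the generated set to artist A, and the confidence of the held-out most-similar artist's works to A.

```python
# Hypothetical sketch of the unprecedented-similarity check (names ours).
# Copying is only flagged when the generative model matches artist A more
# strongly than the most similar existing artist B already does.

def unprecedented_similarity(gen_confidence, holdout_confidence, threshold=0.5):
    """Return True iff generated works both match artist A (DeepMatch-style
    majority vote) and exceed the held-out artist B's similarity to A.

    gen_confidence: fraction of generated images predicted as artist A.
    holdout_confidence: fraction of held-out artist B's works predicted as A,
        under a classifier retrained without B.
    """
    flagged = gen_confidence >= threshold   # basic DeepMatch match
    return flagged and gen_confidence > holdout_confidence
```

Under this sketch, a generated set matched to A with confidence 0.8 is flagged only if the most similar existing artist's works are matched to A with confidence below 0.8.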
Figure 11 summarizes our result for OpenJourney (all three models studied show consistent results). We find that only in three cases do we see a held-out artist’s work flagged as potential style copying. Notably, in all instances where generated work is flagged as potential style copying, the corresponding held-out artist’s work is either not flagged or is flagged with lower confidence, indicating that the instances of style copying of generative models that we observe always also satisfy the criterion of unprecedented similarity.
Taking a closer look at instances where held-out art is flagged for style copying (or perhaps style emulation?), we again see just how similar the works of different artists can be. Namely, we see that some artists' works seem to fall into a broader genre of art that many artists utilize (e.g., ukiyo-e or impressionism). In summary, while generative models can very closely resemble the style of a given artist, contextualizing copying by generative models with respect to the copying (or perhaps “style emulation”) already done by existing artists is crucial in order to afford generative models the same artistic liberties as have been afforded to other artists in the past.
Appendix 0.D Details on TagMatch
We now provide greater detail regarding the implementation of TagMatch, a central technical contribution of our work. TagMatch is a method to classify a set of images to a class; specifically, we map a set of artworks to one artist, selected over choices. TagMatch is not as accurate as DeepMatch, as it maps held-out works of each artist in our WikiArt dataset to the correct artist about of the time (compared to top-1 accuracy for DeepMatch). However, its top-5 accuracy is more reasonable, reaching above . Most notably, TagMatch is inherently interpretable and attributable. It consists of three steps: (i) assigning atomic tags to images, (ii) efficiently composing tags to obtain more unique tag signatures, and (iii) matching a test set of images to a reference artist based on the uniqueness of the tags shared between the test set and works from the predicted reference artist.
Our method is fast and flexible: after caching image embeddings, full inference takes only minutes, and it is easy to modify the concept vocabulary as desired, as the tagging is done in a zero-shot manner. Through MTurk studies, we verify that the atomic tags we assign are mostly precise, though we recognize that these descriptors can be subjective. Thus, while we do not claim perfect tagging, we stress that our method is easy to understand and, crucially, is deterministic per image. Therefore, our tagging may be more reliable and less biased than human judgments, particularly when the humans involved may be biased (e.g., an artist alleging copying and a lawyer defending a generative model would have strong and opposing stakes).
Below, we provide details for image tagging (§0.D.1), artist tagging (§0.D.2), artistic style inference via tag matching (§0.D.3), effect of hyperparameters (§0.D.4), details on efficiency (§0.D.5), and a review of validation (§0.D.6).
0.D.1 Image Tagging
As explained in §4, we utilize CLIP to attain a diverse set of atomic tags per image in a zero-shot manner. Specifically, we first define a vocabulary of descriptors along various aspects of artistic style. Then, given an image, we do selective multi-label zero-shot classification for each aspect. Performing zero-shot classification per aspect proves to be critical in order to achieve a diversity of tags and a similar number of tags per image. We find that some descriptors always lead to higher CLIP similarities than others. Specifically, descriptors for simple aspects, like colors and shapes, yield higher similarities than more complex aspects like brushwork and style. Thus, using a global threshold across descriptors would lead to a less diverse descriptor set. Moreover, we observe some images have higher similarities across the board than others, which again would lead global thresholding to result in a disparate number of tags per image. Our per-aspect scheme requires that the descriptors within each aspect are mostly mutually exclusive; we prioritize this in the construction of the concept vocabulary, via the prompt we present the LLM assistants and our manual verification.
Namely, we prompt both Vicuna-33b and ChatGPT with “I want to build a vocabulary of tags to be able to describe art. First, consider different aspects of art, and then for each aspect, list about 20 distinct descriptors that could describe that aspect of art. Please return your answer in the form of a python dictionary.” We then perform a filtering step with a human in the loop, where we manually remove tags that are difficult to recognize or redundant. After this filtering step, we add in a few new aspects. First, we incorporate the styles (e.g., “impressionism”) and genres (e.g., “portrait”) that are most common amongst works in our WikiArt dataset; note that all WikiArt images contain metadata for these categories. Finally, we add some easy-to-understand tags, such as colors and shapes, which can be important characteristics for describing a given painting. The concept vocabulary we use is shown below:
- Style, caption template: “{} style”. Descriptors: realism, impressionism, romanticism, expressionism, post impressionism, art nouveau modern, baroque, symbolism, surrealism, neoclassicism, naïve art primitivism, northern renaissance, rococo, cubism, ukiyo e, abstract expressionism, mannerism late renaissance, high renaissance, magic realism, neo impressionism
- Genre, caption template: “the genre of {}”. Descriptors: portrait, landscape, genre painting, religious painting, cityscape, sketch and study, illustration, abstract art, figurative, nude painting, design, still life, symbolic painting, marina, mythological painting, flower painting, self portrait, animal painting, photo, history painting, digital art
- Colors, caption template: “{} colors”. Descriptors: pale red, pale blue, pale green, pale brown, pale yellow, pale purple, pale gray, black and white, dark red, dark blue, dark green, dark brown, dark yellow, dark purple, dark gray
- Shapes, caption template: “{}”. Descriptors: circles, squares, straight lines, rectangles, triangles, curves, sharp angles, curved angles, cubes, spheres, cylinders, diagonal lines, spirals, swirling lines, radial symmetry, grid patterns
- Common Objects, caption template: “{}”. Descriptors: male figures, female figures, children, farm animals, pet animals, wild animals, geometric shapes, fruit, vegetables, instruments, flowers, boats, waves, roads, household items, the moon, the sun, saints, angels, demons
- Backgrounds, caption template: “{} in the background”. Descriptors: fields, blue sky, night sky, sunset or sunrise, forest, rolling hills, simple colors, beach, port, river, starry night, clouds, shadows, living room, bedroom, trees, buildings, chapels, heaven, hell, houses, streets
- Color Palette, caption template: “{} color palette”. Descriptors: vibrant, muted, monochromatic, complementary, pastel, bright, dull, earthy, bold, subdued, rich, simple, complex, varying, minimal, contrasting
- Medium, caption template: “the medium of {}”. Descriptors: oil painting, watercolor, acrylic, ink, pencil, charcoal, etching, screen printing, relief, intaglio, collage, montage, photography, sculpture, ceramics, glass
- Cultural Influence, caption template: “{} influences”. Descriptors: Indigenous, European, American, East Asian, Indian, Middle Eastern, Hispanic, Aztec, Contemporary, Greek, Roman, Byzantine, Russian, African, Egyptian, Tahitian, Polynesian, Dutch
- Texture, caption template: “{} texture”. Descriptors: rough, smooth, bumpy, glossy, matte, roughened, polished, textured, smoothed, brushstroked, layered, scraped, glazed, streaked, blended, uneven, smudged
- Other Elements, caption template: “{}”. Descriptors: stippled brushwork, chiaroscuro lighting, pointillist brushwork, multimedia composition, impasto technique, repetitive, pop culture references, written words, chinese characters, japanese characters
Now, we detail the implementation of our modified zero-shot classification. Recall that in zero-shot classification, one computes a text embedding per class, which amounts to the classification head, and an image embedding for the test input, so that the prediction is the class whose text embedding has the highest cosine similarity to the test image embedding. In computing the text embeddings, we take each descriptor (e.g., “Dutch”) and place it in an aspect-specific caption template (e.g., “Dutch influences”), and then average embeddings over multiple prompts (e.g., “artwork containing Dutch influences”, “a piece of art with Dutch influences”, etc.), as done in [17]. We modify standard zero-shot classification to allow for the fact that more than one descriptor (or perhaps none) from a given aspect may be present. Namely, instead of assigning the single most similar descriptor per aspect, we assign an atomic tag for any descriptor whose similarity is significantly higher than that of other descriptors for that aspect. We achieve this via z-score thresholding: per aspect, we convert similarities to z-scores by subtracting the mean and dividing by the standard deviation, and then admit atomic tags whose z-score is at least .
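A minimal sketch of this per-aspect z-score thresholding follows. The function name is ours, and the default cutoff of 1.0 is illustrative only (the paper's exact threshold is elided above); `sims` maps each descriptor within one aspect to its CLIP cosine similarity with the image.

```python
import math

# Illustrative sketch of per-aspect z-score thresholding (names and the
# cutoff value are ours, not the paper's exact choices). Similarities are
# normalized within the aspect, so descriptors are admitted only when they
# stand out relative to their peers, regardless of the aspect's overall
# similarity scale.

def tags_for_aspect(sims, z_threshold=1.0):
    """Return descriptors whose similarity z-score meets the threshold."""
    values = list(sims.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return []  # no descriptor stands out
    return [d for d, v in sims.items() if (v - mean) / std >= z_threshold]
```

Because normalization happens within each aspect, a simple aspect like color (which tends to have uniformly high raw CLIP similarities) cannot crowd out a complex aspect like brushwork, addressing the global-threshold failure mode described above.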
The template prompts we utilize for embedding each concept caption are as follows:
- art with
- a painting with
- an image of art with
- artwork containing
- a piece of art with
- artwork that has
- a work of art with
- famous art that has
- a cropped image of art with
0.D.2 From Image Tags to Unique Artist Tags
Recall that we define styles not per image, but over a set of images. Namely, we seek to surface tags that occur frequently. The simplest way to do so is to count the occurrences of each tag and discard the ones that rarely appear. However, each atomic tag on its own is not particularly unique with respect to artists. We therefore utilize efficient composition of atomic tags to arrive at more unique tag signatures, as shown in figure 5 and detailed in algorithm 1. Importantly, we utilize a threshold to define what counts as a common tag: we require a tag to appear in at least three works by an artist in order for the tag to count as frequently used by that artist. We note that tag composition can be done efficiently because we have a relatively low number of tags per image: on average, there are atomic tags per image. Moreover, because the number of occurrences of a composed tag is bounded above by the number of occurrences of each atomic tag in the composition, we can ignore all non-frequent atomic tags. Thus, we can iterate over the powerset of common atomic tags per image without it taking exorbitantly long. We include one fail-safe: in the rare instance where an image has a very high number of common atomic tags, we truncate the tag list to include only tags. Over the images that we encounter, this happens only once. We highlight that our tag composition takes inspiration from [19].
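A simplified sketch of this composition step, assuming atomic tags have already been assigned per image. The function and parameter names are ours, and this is a flat reimplementation of the idea rather than the paper's Algorithm 1; the defaults (minimum count of three, composition arity, truncation limit) illustrate the thresholds described above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sketch of frequent-tag composition (our simplification, not
# the paper's Algorithm 1). Within one artist's works, atomic tags appearing
# in at least `min_count` images are kept; co-occurring combinations of those
# frequent tags are then counted, and compositions that are themselves
# frequent become the artist's tag signature.

def compose_artist_tags(image_tags, min_count=3, max_arity=3, max_tags_per_image=10):
    """image_tags: list of sets of atomic tags, one set per image."""
    atomic = Counter(t for tags in image_tags for t in tags)
    frequent = {t for t, c in atomic.items() if c >= min_count}
    composed = Counter()
    for tags in image_tags:
        # Infrequent atomic tags can be ignored: a composition containing
        # them can never reach min_count. Truncation is the rare fail-safe.
        common = sorted(tags & frequent)[:max_tags_per_image]
        for r in range(1, min(max_arity, len(common)) + 1):
            for combo in combinations(common, r):
                composed[combo] += 1
    return {combo: c for combo, c in composed.items() if c >= min_count}
```

Dropping infrequent atomic tags before enumerating combinations is what keeps the powerset iteration tractable, exactly as argued in the text.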
0.D.3 Predicting Artistic Styles based on Matched Tags
Once we have converted tags per image to tags per artist, we can then utilize these artist tags to perform inference over a set of images. Namely, given a test set of images, we extract common tags (including tag compositions) for the test set and compare them to tags extracted for each artist in our reference corpus. Then, we predict the reference artist who shares the most unique tags with the test set.
Figure 12 best explains our method, as it shows the documented code; we note that all code will be released upon acceptance. We now explain it step by step. First, for each artist and for the test set of images, we find common tags by (i) assigning atomic tags to each image, (ii) finding the commonly occurring atomic tags, (iii) counting compositions of the commonly occurring atomic tags, and (iv) discarding tags (including compositions) that do not occur frequently enough. The code shows this for the test set of images; we perform the same steps per reference artist when the TagMatcher object (of which tag_match is a method) is initialized; notice fields like self.ref_tags_w_counts_by_artist, which contain useful information about the reference artists, computed once and re-used for each inference.
Then, we loop through the set of ‘matched’ tags (i.e., those that occur both in the test set of images and for at least one reference artist), starting with the most unique ones. Here, uniqueness refers to the number of reference artists that frequently use a tag. For each tag, we loop through all artists that also use that tag. For each artist, we add a score to that artist's list of scores for up to self.matches_per_artist_to_consider matched tags; the list is ultimately averaged. Each score has an integer and a decimal component. The integer component is the number of reference artists that share the matched tag. The decimal component is the absolute difference between the frequencies with which the tag appears over the reference artist's works and over the test set of images; note that this is always less than one. This way, when comparing two matched tags, a lower score is assigned to the more unique one, and when there is a tie in uniqueness, we break it based on how similar the frequency of the matched tag is between the test artist and the reference artist.
Finally, we average the list of scores per artist to get a single score per reference artist, analogous to a logit. We assign a score of inf to any artist with fewer than self.matches_per_artist_to_consider matched tags. This hyperparameter makes our tag matching less sensitive to individual matched tags, and empirically results in a substantial improvement in top-1 accuracy on held-out art from WikiArt artists (see next section).
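The scoring described above can be sketched as a simplified, stand-alone version of tag_match (the representation of tags as frequency dictionaries and the default value of k are illustrative assumptions, not our released code):

```python
import math

def tag_match_scores(test_tags, ref_tags_by_artist, k=3):
    """Score each reference artist against a test set of images.

    test_tags: dict mapping tag -> frequency (fraction of test images with it).
    ref_tags_by_artist: dict mapping artist -> {tag: frequency}.
    k: matched tags to consider per artist (mirrors
       self.matches_per_artist_to_consider; the default here is a placeholder).
    Lower scores indicate a better match.
    """
    # Uniqueness of a tag = number of reference artists that use it.
    uniqueness = {
        tag: sum(tag in tags for tags in ref_tags_by_artist.values())
        for tag in test_tags
    }
    scores = {}
    for artist, ref_tags in ref_tags_by_artist.items():
        matched = set(test_tags) & set(ref_tags)
        # Integer part: uniqueness; decimal part: frequency mismatch (< 1),
        # which breaks ties between equally unique tags.
        per_tag = sorted(
            uniqueness[t] + abs(test_tags[t] - ref_tags[t]) for t in matched
        )[:k]
        # Artists with fewer than k matched tags get an infinite (worst) score.
        scores[artist] = sum(per_tag) / k if len(per_tag) == k else math.inf
    return scores
```

The predicted artist is then simply `min(scores, key=scores.get)`.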
0.D.4 Choosing Hyperparameters
Overall, there are three hyperparameters in our method: the z-score threshold, the tag count threshold, and the number of matches to consider per artist. Here is a quick refresher on what each one does:
- •
The z-score threshold determines how much more similar a descriptor needs to be to an image compared to other descriptors for the same aspect in order for the descriptor to be assigned as an atomic tag of the image. The value we use is .
- •
The tag count threshold is the minimum number of an artist’s works in which a tag must be present in order for the tag to be deemed common for that artist. The value we use is .
- •
The number of matches to consider per artist determines how many matched tags are considered when computing the final score per artist in tag match. That is, the final score for an artist is the average over the top-k most unique tags that the artist shares with the test set of images, where k corresponds to this hyperparameter. The value we use is .
Now that the role of each hyperparameter is clear, let’s discuss how hyperparameters can be adjusted towards particular ends, along with the potential consequence of each action:
- •
To increase the number of atomic tags, lower the z-score threshold. Risk: atomic tags may be less precise, and the method will take longer to run, as there will be more atomic tags and more composed tags.
- •
To get more tags per artist, lower the tag count threshold. Risk: some tags will become less unique. Other tags will be introduced, and some may be very unique, which could skew tag matching. Also, the method may take longer to run, as there will be more tags.
- •
To make inference less sensitive to a low number of matched tags, increase the number of matches to consider per artist. Risk: when you consider more matches, interpretation is a little more difficult, as you have more reasons for each inference, and it will take longer to view them all.
To choose hyperparameters, we selected a small range of reasonable values and swept each hyperparameter individually. While a combined search would likely yield better accuracy, we opt not to hyper-tune TagMatch for accuracy, as its main objective is to provide an interpretable and attributable complement to DeepMatch. We find the accuracy numbers encouraging (relatively strong, considering the high number of artists involved), but do not consider further tuning a priority, as DeepMatch arguably provides a stronger and easier-to-understand signal of whether style copying is happening. TagMatch, on the other hand, tells us how and where it is happening (if observed with DeepMatch).
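The one-at-a-time sweep can be sketched generically as follows (the evaluate function, default values, and grids below are placeholders, not our actual search ranges):

```python
def sweep_one_at_a_time(evaluate, defaults, grids):
    """Sweep each hyperparameter individually, holding the others at their
    default values. evaluate maps a parameter dict to a validation score
    (e.g., top-1 accuracy on held-out art); all names are illustrative."""
    best = dict(defaults)
    for name, values in grids.items():
        # Vary only this hyperparameter; keep the rest at defaults.
        trials = {v: evaluate({**defaults, name: v}) for v in values}
        best[name] = max(trials, key=trials.get)
    return best
```

Unlike a combined grid search, this costs only the sum (not the product) of the grid sizes.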
0.D.5 Efficiency of TagMatch: Runs in Under a Minute
TagMatch is surprisingly fast. The longest step by far is computing CLIP embeddings for the reference artworks; this takes about 5 minutes on a single RTX 2080 GPU with four CPU cores, using a CLIP ViT-B/16 model to embed the training-split images. Importantly, this step is done only once and, in practice, offline. The other steps and the approximate time needed for each are as follows: embedding concepts (5 seconds), extracting common atomic tags and composing them (45 seconds), and reorganizing tags and removing non-common tags (3 seconds). Inference for a test set of works then takes about 10 to 15 seconds. Again, we will release all code upon acceptance, as we truly hope our tool can be of use to artists concerned about generative models potentially infringing upon their unique styles.
0.D.6 Validation
Because TagMatch has multiple steps, we perform multiple validations. First, for image tagging, we utilize an MTurk study. We collect separate human judgements on instances of assigned atomic tags. Namely, we show randomly selected (tag, image) pairs to three annotators each. Figure 14 shows an example of the form presented to MTurk workers. MTurkers provide consent and are compensated per task. For each task, they answer ‘yes’, ‘no’, or ‘unsure’ to the question ‘does the term {atomic tag} match the artwork below?’ They are also shown example artworks for each term, which were manually verified to be correct. In investigating inter-annotator agreement, we find that at least two annotators agree most of the time, but all three agree much less often. This reflects the subjectivity associated with assigning artistic tags, and partially motivates the need for a deterministic automated alternative, in order to objectively tag images at scale. All three annotators said no only rarely, and at least two said no infrequently, suggesting that our zero-shot tagging mechanism achieves reasonable precision.
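The agreement statistics reported above can be computed as follows (a minimal sketch; the triple-of-answers response format is an assumption about how annotations are stored):

```python
from collections import Counter

def agreement_stats(responses):
    """responses: one 3-annotator answer triple per (tag, image) pair,
    e.g. ("yes", "yes", "no"). Returns the fraction of pairs on which
    at least two annotators agree, and on which all three agree."""
    n = len(responses)
    # Majority agreement: the most common answer appears at least twice.
    at_least_two = sum(Counter(r).most_common(1)[0][1] >= 2 for r in responses) / n
    # Unanimity: only one distinct answer among the three.
    all_three = sum(len(set(r)) == 1 for r in responses) / n
    return at_least_two, all_three
```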
To validate the value of tag composition, we refer to figure 5, which shows how tags become more unique as they grow longer (i.e., consist of more atomic tags). Moreover, our timing analyses show that the added benefit of composing tags to find unique tag signatures does not come at the cost of our method’s efficiency. Finally, the non-trivial top-1 matching accuracy and strong top-5 matching accuracy show that the extracted tag signatures do indeed capture unique properties of artistic style. Figure 13 shows a few more examples of successful inference, interpretation, and attribution for the task of detecting style copying by generative models.
Appendix 0.E A Sim2Real Gap in Tag Distributions
An added advantage of ascribing tags to images is that we can better compare image distributions from an interpretable basis (the tags). We briefly explore this direction now.
|  |  | Top 1 | Top 5 | Top 10 |
|---|---|---|---|---|
| Generated Art | CompVis Stable Diffusion v1.4 |  |  |  |
|  | Stability AI Stable Diffusion v2 |  |  |  |
|  | PromptHero Openjourney |  |  |  |
|  | Average |  |  |  |
| Real Art (held out) |  |  |  |  |
First, we provide complete results from applying TagMatch to generated images from each of the three text-to-image models in our study, presented in table 1. Consistent with our DeepMatch results, we observe substantially lower matching accuracy for generated images than for real held-out artwork. While the primary takeaway is that for many artists, generative models struggle to replicate their styles, we can also hypothesize that generative models may output images that follow a different distribution than the distribution of real artworks.
Motivated by this hypothesis, we now compare the distributions of real and generated artworks from the perspective of tags. Because we consider composed tags, the total space of tags is vast and hard to reason over. However, we can look at properties of each tag; namely, we can inspect the uniqueness of tags. That is, for each tag present in generated images, we count the number of reference artists that also present that tag; we do the same for real art (subtracting one so as not to double-count the artist for whom a given tag is being considered). Figure 15 shows a kernel density estimation plot of the distributions of tag commonality, where a tag commonality of $c$ means that, for each tag assigned to a set of images (either from a real artist or from a generative model emulating an artist), $c$ other artists also commonly use that tag. We see that tags tend to be rather unique (due to our tag composition), and notably, tags for generated images are more unique.
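The commonality measure can be sketched as follows (a minimal illustration in which artists' common tags are represented as plain sets; the `exclude` argument implements the subtract-one correction for real art):

```python
def tag_commonality(tags, ref_tags_by_artist, exclude=None):
    """For each tag in `tags`, count how many reference artists also
    commonly use it. `exclude` names the artist whose own images are
    being tagged, so that artist is not double-counted (used when the
    test set is real held-out art from a reference artist)."""
    return {
        tag: sum(
            tag in ref and artist != exclude
            for artist, ref in ref_tags_by_artist.items()
        )
        for tag in tags
    }
```

The kernel density estimate in figure 15 is then taken over the values of this dictionary.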
Appendix 0.F Patch Match: Generating Additional Visual Evidence of Copying
Detecting artistic style copying in a given artwork requires analyzing local stylistic elements that manifest across an artist’s body of work. To this end, we employ a patch-based approach that compares small image regions between a given artwork and original artworks, enabling fine-grained analysis of stylistic and semantic (e.g., object-level) similarities at a local level. We consider three patch-matching methods: CLIP-based, DINO-based, and Gram-matrix-based.
Gram Matrix-based Patch Matching [10]: The Gram matrix is a measure of style similarity introduced in the context of neural style transfer. It captures the correlations between the activations of different feature maps in a convolutional neural network, representing the style of an image. For patch matching, the Gram matrices of patches from the given artwork and the original artworks can be computed and compared using a suitable distance metric (e.g., the Frobenius norm). Because the Gram matrix is specifically designed to capture stylistic elements, it is well suited to detecting style copying.
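A minimal sketch of the Gram-matrix comparison between two patches, assuming feature maps have already been extracted (e.g., from a convolutional layer of a pretrained network, as in [10]); the arrays here stand in for real features:

```python
import numpy as np

def gram_matrix(features):
    """features: (C, H, W) feature maps for one patch. Returns the C x C
    matrix of channel correlations, normalized by spatial size."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (h * w)

def gram_distance(patch_a_feats, patch_b_feats):
    """Style distance between two patches: Frobenius norm of the
    difference of their Gram matrices."""
    return np.linalg.norm(
        gram_matrix(patch_a_feats) - gram_matrix(patch_b_feats), ord="fro"
    )
```

Lower distances indicate more similar local style; the best match minimizes this distance over all reference patches.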
CLIP-based Patch Matching [17]: CLIP (Contrastive Language-Image Pre-training) is a powerful model that effectively captures semantic similarity between text and images. In the context of patch matching, CLIP embeddings can be used to measure the similarity between a patch from the given artwork and patches from the original artworks: the patches are encoded with the CLIP image encoder, and the cosine similarity between their embeddings is computed to find the closest matches. CLIP may not be as sensitive to low-level stylistic elements such as brushstrokes, textures, and color palettes; instead, it focuses on higher-level semantic concepts, which is useful for determining whether the given artwork depicts the same objects as the selected original patch.
DINO-based Patch Matching [6]: DINO is a self-supervised vision transformer that learns robust visual representations via self-distillation. DINO embeddings can be used for patch matching by computing the cosine similarity between the embeddings of patches from the given artwork and the original artworks. We use DINO to capture higher-level semantic similarities, checking whether the given artwork depicts subjects of interest and high-level visual features similar to those of the selected original artworks.
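Both embedding-based methods reduce to the same cosine-similarity ranking; a minimal sketch follows (the encoders themselves are not loaded here, so the embeddings are assumed given as arrays):

```python
import numpy as np

def best_patch_matches(query_emb, ref_embs, top_k=3):
    """Rank reference patch embeddings by cosine similarity to a query
    patch embedding. The same routine serves CLIP and DINO embeddings.

    query_emb: (D,) array; ref_embs: (N, D) array of reference patches.
    Returns (indices, similarities) of the top_k closest patches."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = r @ q
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]
```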
0.F.1 Experimental setting
For our experiments, given a reference image, we aim to identify the most similar artwork from a pool of original artworks in the WikiArt dataset. The reference image is first resized to a fixed resolution and normalized, and a square patch is selected from the normalized image. This process is repeated for all original artworks in the dataset, yielding a pool of patches from the original artworks for comparison with the reference patch. We then use the three methods, namely Gram matrix, CLIP, and DINO, to find the most similar patches.
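Extracting the grid of patches from a resized image can be sketched as follows (the resize resolution and patch size are left as parameters here, since the exact values depend on the experimental setup):

```python
import numpy as np

def extract_patches(image, patch_size):
    """Split an (H, W, C) image array into non-overlapping square patches;
    partial patches at the border are dropped."""
    h, w = image.shape[:2]
    return [
        image[i:i + patch_size, j:j + patch_size]
        for i in range(0, h - patch_size + 1, patch_size)
        for j in range(0, w - patch_size + 1, patch_size)
    ]
```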
Figure 16 showcases the patches that are deemed most similar to the image being referenced. These matches are determined using Gram-matrix, CLIP, and DINO methods.
We then select an artist and find patches from our original image dataset that closely match this artist’s style. In Figure 17, we use the Gram-matrix method to identify the patches most similar to three chosen artworks by Van Gogh. Our dataset includes all paintings by Van Gogh as well as works by nine other artists. The Gram-matrix method selects original artworks that closely resemble the style of the reference image, all of which are by Van Gogh; that is, it predominantly selects Van Gogh’s artworks because they are the most stylistically similar to the referenced paintings compared to the works of the other nine artists.
0.F.2 Discussion and limitations
Patch-matching methods like Gram-matrix, CLIP, and DINO are effective at detecting similarities between artworks by examining their local stylistic and semantic elements: Gram-matrix captures stylistic correlations, CLIP evaluates semantic similarity, and DINO concentrates on higher-level features. However, these methods have limitations. They focus primarily on local aspects of artworks and may overlook broader artistic characteristics, such as texture, composition, and brushwork, that are crucial for detecting copyright infringement. Moreover, finding the most similar patches for a given artwork takes approximately fifteen minutes over our pool of original artworks, and this duration would inevitably grow if more original artworks were included. Patch-matching methods are therefore computationally expensive, which restricts their practical application. Despite these limitations, patch matching is valuable for identifying instances of direct copying in artworks and aids in the detection of plagiarized content.
Appendix 0.G Details on WikiArt Scraping
WikiArt is a free project intended to collect art from various institutions, like museums and universities, to make them readily accessible to a broader audience. We design a scraper to collect a corpus of reference artists, with which we can define a test artist’s style in contrast to the other artists, and to provide a testbed to empirically study copying behavior of generative models. Some important landing pages to perform scraping are (i) the works by artist page (https://www.wikiart.org/en/Alphabet/j/text-list; url shows all artists starting with the letter ‘j’, and we loop through all letters), (ii) the page containing information on allowed usage (https://www.wikiart.org/en/terms-of-use), (iii) an example artist landing page (https://www.wikiart.org/en/vincent-van-gogh), and (iv) an example painting landing page (https://www.wikiart.org/en/vincent-van-gogh/the-starry-night-1889). As you can see, many pages have standard formats, making scraping particularly feasible. We will provide our scraping code, along with all other code, to facilitate easy updating of our dataset as time goes by.
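The letter-by-letter artist listing described above can be sketched as follows (the URL format is taken directly from the landing pages listed; the helper name is ours, and actual fetching/parsing of each page is omitted):

```python
import string

WIKIART = "https://www.wikiart.org"

def artist_list_urls():
    """One works-by-artist text-list page per letter of the alphabet,
    following the standard URL format (e.g., .../en/Alphabet/j/text-list)."""
    return [f"{WIKIART}/en/Alphabet/{c}/text-list" for c in string.ascii_lowercase]
```

Each listing page then links to individual artist pages (e.g., /en/vincent-van-gogh), which in turn link to per-painting pages in the same standard format.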
We obtain artworks only from artists with at least 100 works on WikiArt, so as to focus on somewhat famous artists who are arguably more likely to be copied. For every work, we also scrape the licensing information and annotations for style, genre, and title. In total, our dataset contains 90,960 artworks over 372 artists. There are 81 styles with at least 100 works, the most popular being realism, impressionism, romanticism, and expressionism, and 37 genres with at least 100 works, the most popular being portrait, landscape, religious painting, sketch and study, and cityscape. We note that we only include works whose license is either public domain or fair use, with the vast majority being public domain. Nonetheless, we strongly advise against using this dataset for commercial purposes, and especially for the purpose of copying artists.