Duplicate code

In software development, source code clones (also code duplicates, software clones, or simply clones) are similar sections in the source code of a program. Clones are a form of redundancy and arise mainly through copy-and-paste. Because they have a negative impact on maintenance, clones are a strong indicator of poor quality.


Similarity

Clones are, in general, similar sections of code. The definition of similarity is therefore a key point and is discussed intensively in research. There is, however, a consensus that no universal definition exists; the definition is always tailored to the specific application. Among others, the following aspects should be considered when defining similarity:

Minimum length: Since every program is composed of the atomic elements of its programming language (e.g. keywords, operators, identifiers), there is always similarity between different code fragments at this level. However, this level rarely offers meaningful opportunities for abstraction, and treating these atomic elements as clones is rarely helpful. The definition of similarity therefore includes a minimum length that specifies how long code sections must be to count as clones. The unit in which the minimum length is specified depends on the detection method. Examples of minimum lengths are 10 lines, 100 tokens, or seven statements.

Normalization: In addition to a minimum length, it has proven useful to normalize the code with respect to the definition of similarity. In many cases, for example, comments in the code are ignored when looking for clones. Another frequently used normalization is the abstraction of identifiers: two loops are considered clones even if one uses i and the other uses j as the loop variable. Literals can be normalized in the same way. More complex normalizations could, for example, disregard the order of the operands of commutative operations or abstract from the kind of loop in loop constructs. Which normalization is useful generally depends strongly on the application. Because normalization has a very large impact on the results, it should be chosen carefully, which not least requires some experience in clone detection.

Unequal sections ("gaps"): Even though normalization can refine the definition of similarity considerably, it remains to be decided whether clones may also contain smaller unequal sections ("gaps"). The advantage is that several small clones merge into larger ones, which makes the redundancy visible at a higher level. Gaps also point to differences that may be unintentional. However, not all tools support the detection of clones with gaps. It must also be decided how many gaps a clone may contain.
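The normalization aspect described above can be illustrated with a minimal sketch. It assumes a deliberately simplified regular-expression tokenizer and a small, hypothetical keyword list; neither is taken from any real clone detection tool. Comments are removed, every identifier is replaced by ID, and every numeric literal by LIT, so that two loops that differ only in variable names and bounds normalize to the same token string.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CloneNormalizer {
    // Hypothetical, deliberately incomplete keyword list for illustration only.
    private static final Set<String> KEYWORDS =
            Set.of("for", "while", "if", "else", "int", "return");

    // Matches a word (identifier or keyword), a numeric literal,
    // or any other single non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+(?:\\.\\d+)?|\\S");

    static String normalize(String code) {
        // Remove block comments and line comments before tokenizing.
        String withoutComments = code
                .replaceAll("(?s)/\\*.*?\\*/", " ")
                .replaceAll("//[^\\n]*", " ");
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(withoutComments);
        while (m.find()) {
            String token = m.group();
            char first = token.charAt(0);
            if (Character.isDigit(first)) {
                token = "LIT"; // abstract numeric literals
            } else if ((Character.isLetter(first) || first == '_')
                    && !KEYWORDS.contains(token)) {
                token = "ID";  // abstract identifiers
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String a = "for (int i = 0; i < 10; i++) { total += i; } // sum up";
        String b = "for (int j = 0; j < 20; j++) { sum += j; }";
        System.out.println(normalize(a).equals(normalize(b))); // prints "true"
    }
}
```

This sketch maps all identifiers to the same placeholder, so it does not check whether names were renamed consistently, and it does not handle string literals. A real tool would make such choices configurable, since they strongly affect the results.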

Distinction from redundant literals

Clones are defined by the fact that code sections of a certain minimum size are similar to each other. For this reason, individual duplicated tokens are usually not referred to as clones. This concerns mainly redundant literals in the source code that would be better represented by a symbolic constant (e.g. replacing all occurrences of the literal 3.14159 with a constant PI).
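As a small illustration of the symbolic constant mentioned above, the following sketch defines PI once and uses it in two places. The class and method names are hypothetical; in real Java code the standard library constant Math.PI would normally be preferred.

```java
class CircleMath {
    // Symbolic constant that replaces every occurrence of the literal 3.14159.
    static final double PI = 3.14159;

    static double circumference(double radius) {
        return 2 * PI * radius;
    }

    static double area(double radius) {
        return PI * radius * radius;
    }

    public static void main(String[] args) {
        System.out.println(circumference(1.0));
        System.out.println(area(2.0));
    }
}
```

If the precision of the value has to change, only the constant definition needs to be edited.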

Terms

There are a number of terms that have been established over time in the area of software clones.

Type-1 clones: Identical code sections except for differences in layout and comments.

Type-1 clone pair:

    int a = 0;
    int b = a * a;
    String str = "Peter";

    int a = 0; // Comment
    int b = a * a;
    String str = "Peter";

Type-2 clones: Type-1 clones in which additional differences in identifiers or literals exist.

Type-2 clone pair:

    int a = 0;
    int b = a * a;
    String str = "Peter";

    int a = 0; // Comment
    int x = a * a;
    String str = "Peter";

Type-3 clones: Type-2 clones that contain unequal sections (gaps). In the English-language literature in particular, the term near-miss clones is also used.

Type-3 clone pair:

    int a = 0;
    int b = a * a;
    String str = "Peter";

    int a = 0; // Comment
    int x = a * a;
    print(x);
    String str = "Peter";

Type-4 clones: Clones that are semantically similar but not necessarily syntactically similar. Because semantic similarity cannot be determined in general, this clone type has little practical use.

Clone pair: A clone pair represents the relation between exactly two similar code sections (clones).

Clone class: A clone class represents the relation between two or more similar code sections (clones).

Clone coverage: The proportion of the system that is part of a clone. It is often used to quantify the extent of redundancy in a system.

Reasons

Clones arise almost exclusively through copy-and-paste programming. There are a number of reasons for their creation, which can be divided into the following categories:

Development strategy: Existing functionality is used as a template for new functionality. The existing code is copied and adapted as needed. This also includes situations in which code is duplicated in order to try out changes without jeopardizing the functioning of the system (branching).

Maintenance benefits: Reusing existing code is intended to make future maintenance easier. For example, proven code is reused to minimize the risk of introducing new defects. In addition, copying code avoids unwanted dependencies between components and allows them to be maintained separately.

Bypassing limitations: Programming languages offer different kinds of abstraction mechanisms. If no suitable abstraction mechanism is available in a given situation, the problem is often solved by cloning. Clones can also arise from time pressure or from a developer's lack of knowledge.

Independent implementation: Because developers are unaware that a piece of functionality already exists, it is implemented multiple times. In addition, clones can result from using libraries that require a specific protocol.

Negative effects

Clones are generally considered an anti-pattern, since they contradict the "Don't repeat yourself" (DRY) principle and are regarded as the most common characteristic of bad code. For these reasons they head the list of so-called code smells. Although there are a variety of reasons for their creation, clones can have serious negative effects on the software and its maintenance.

Increased maintenance effort: Clones increase the cost of maintenance significantly. Identical code may have to be read and understood several times. Additional effort is needed to understand whether small differences are intentional or unintentional. When a change is made, it must be carried out and tested several times. For each copy, it must at least be checked whether the change is needed there as well, even if the copy ends up unchanged.

Inconsistent changes: Whenever clones are changed, there is a risk of overlooking individual copies. This problem grows as more developers work on a project and unknowingly copy from each other. Even when existing code is used as a template and has to be adapted, there is always a danger that the adaptation is not carried out correctly. Unintentionally inconsistent changes are particularly problematic when the change is a bug fix: the error is corrected in one place and the problem is assumed to be solved, although the error remains in at least one other place in the system. Such unintentionally inconsistent changes have been shown to exist in large numbers in production systems.

Increased memory requirements: A further consequence of cloning is that the code is larger than it needs to be. This applies both to the storage of the source files and to the size of the compiled system. Particularly in embedded systems, this can cause problems and require more expensive hardware. In addition, a larger code base also takes the compiler or interpreter longer to process.

Copied errors: There is a risk that the copied code contains errors. As a result, the error later has to be found and corrected in several places in the source code, with the risk of overlooking individual occurrences.
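The risk of inconsistent changes named in the list above can be shown with a small, hypothetical example: an off-by-one error in a summation loop was fixed in one copy, while an overlooked clone still contains it. All names are invented for illustration.

```java
class InconsistentFix {
    // First copy: the off-by-one bug has been fixed here (<= became <).
    static int sumFixed(int[] values) {
        int sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    // Overlooked clone: still reads one element past the end of the array.
    static int sumOverlooked(int[] values) {
        int sum = 0;
        for (int i = 0; i <= values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        System.out.println(sumFixed(data)); // prints 6
        try {
            sumOverlooked(data);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("The overlooked clone still fails");
        }
    }
}
```

Had the loop been extracted into a single method before the fix, one correction would have covered both call sites.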

Clone detection

There are a variety of clone detection methods, which can be roughly categorized by the program representation they work on. The methods differ in run-time complexity, quality of results, and available normalizations.

Text: These methods have no language-specific knowledge and treat the source code as plain text. They are therefore comparatively quick and easy to implement. The disadvantage is that they depend on the layout of the source code and offer very few options for normalization.
Tokens: These methods are based on a lexical analysis of the source code and work on a sequence of tokens. They are still comparatively quick and easy to implement. With some knowledge of the analyzed language, normalizations such as ignoring comments or abstracting identifiers can be performed. In addition, these methods are insensitive to differences in the layout of the source code.
Tree: These methods work on the abstract syntax tree (AST) of the program and search it for similar subtrees. They thus offer further normalization options, such as abstracting from the kind of loop or from the order of the operands of commutative operations. Tree-based methods are, however, complex to implement and require a significantly longer run time. They also require the source code to be syntactically correct so that it can be parsed, which is a clear disadvantage compared to text-based and token-based methods.
Graph: These methods mostly work on the program dependence graph (PDG) and search it for similar subgraphs. They are the most complex to implement and also the slowest. Their advantage is that they can detect clones with low syntactic similarity that none of the other methods find.
Metrics: These methods compute metrics for certain entities in the source code (e.g. methods) and identify clones by comparing the metric values. They are quite simple to implement and comparatively fast. The disadvantage is that clones can be found only at the predetermined granularity. For example, no clones within methods are found if the metrics are computed per method. In practice these methods are rarely used, since many entities have similar metric values purely by coincidence, without any syntactic or semantic similarity.
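To make the simplest category in the list above concrete, the following is a minimal sketch of a text-based detector. It is an assumption-laden illustration, not the algorithm of any particular tool: it compares windows of minLines consecutive lines, where minLines plays the role of the minimum length, and the only normalization is trimming leading and trailing whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TextCloneDetector {
    // For every window of minLines consecutive lines that occurs more than
    // once, returns the 0-based start indices of all its occurrences.
    static List<List<Integer>> findClones(List<String> lines, int minLines) {
        Map<String, List<Integer>> occurrences = new LinkedHashMap<>();
        for (int start = 0; start + minLines <= lines.size(); start++) {
            StringBuilder window = new StringBuilder();
            for (int k = 0; k < minLines; k++) {
                // Trimming removes differences in indentation only.
                window.append(lines.get(start + k).trim()).append('\n');
            }
            occurrences
                    .computeIfAbsent(window.toString(), w -> new ArrayList<>())
                    .add(start);
        }
        List<List<Integer>> clones = new ArrayList<>();
        for (List<Integer> starts : occurrences.values()) {
            if (starts.size() > 1) {
                clones.add(starts);
            }
        }
        return clones;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "sum1 = 0;",
                "for (int v : a) {",
                "    sum1 += v;",
                "}",
                "log(sum1);",
                "for (int v : a) {",
                "sum1 += v;",
                "}");
        System.out.println(findClones(lines, 3)); // prints [[1, 5]]
    }
}
```

Because the sketch works on raw text, renaming a variable or reflowing a statement across lines already hides a clone, which is the layout dependence described for text-based methods. Longer clones are reported as several overlapping windows; a real tool would merge them.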

Not every method can be clearly assigned to one of these categories, since some combine elements of several. There are, for example, hybrid methods that first build an abstract syntax tree, then serialize it, and apply a token-based method to detect clones in the sequence of serialized nodes.

Clone detection methods can furthermore be distinguished by whether or not they work incrementally. An incremental method detects clones in several successive versions of the source code. Compared to repeatedly applying a non-incremental method to each version, incremental methods make use of the analysis results of the previous version. They thereby offer a significant speed advantage when detecting clones across multiple versions of the source code.

Tools

There are various tools for the static analysis of program code that can also find source code clones. These include numerous free tools such as the PMD plug-in CPD (Copy/Paste Detector), Clone Digger (for Python and Java), Cppcheck (for C and C++), and ConQAT (for Ada, ABAP, C#, C, C++, COBOL, Java, Visual Basic, PL/I), as well as proprietary tools such as CCFinder (Code Clone Finder) and Simian (Similarity Analyser).

Clone management

Clone management covers all activities involved in dealing with clones. This includes both the tool-supported detection of clones and the selection and implementation of appropriate countermeasures. Possible countermeasures include:

Removal

The clones are removed by creating a suitable abstraction that unifies the redundant code sections. Often the Extract Method refactoring is applied for this purpose. In object-oriented languages, clones can also be removed by exploiting inheritance and merging the copies in a common base class (Pull Up Method refactoring). In languages with a preprocessor, clones can be removed by means of suitable macros. Clones that span many files or entire subsystems ("structural clones") can be removed by moving the code into modules or libraries.
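The removal of clones through a common base class can be sketched as follows. The classes are hypothetical: it is assumed that describe() was previously duplicated verbatim in Square and Rectangle and has now been moved into the shared superclass Shape.

```java
class PullUpDemo {
    public static void main(String[] args) {
        System.out.println(new Square(2).describe());
        System.out.println(new Rectangle(2, 3).describe());
    }
}

abstract class Shape {
    // Pulled-up method: formerly an identical copy in each subclass.
    String describe() {
        return name() + " with area " + area();
    }

    abstract String name();

    abstract double area();
}

class Square extends Shape {
    private final double side;

    Square(double side) {
        this.side = side;
    }

    @Override
    String name() {
        return "Square";
    }

    @Override
    double area() {
        return side * side;
    }
}

class Rectangle extends Shape {
    private final double width;
    private final double height;

    Rectangle(double width, double height) {
        this.width = width;
        this.height = height;
    }

    @Override
    String name() {
        return "Rectangle";
    }

    @Override
    double area() {
        return width * height;
    }
}
```

Only the parts that genuinely differ (name and area) remain in the subclasses, so a later change to the description format has to be made in one place.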

Monitoring

Clones cannot always be removed easily. An alternative approach is to monitor the clones with a suitable tool. When a clone is changed, the developer is alerted to the existence of further copies. This reduces the risk of unintentionally inconsistent changes.

Clones outside of the source code

Clone detection focuses heavily on similar sections in the source code of a program. However, clones also exist in other software development artifacts such as models or requirements specifications. The reasons for these clones and their negative effects are largely the same. The definition of similarity, however, has to be adapted. While there are well-established methods for detecting clones in source code, the detection of clones in other artifacts is still the subject of intensive research.

Example

The following example shows how clones can be removed by an Extract Method refactoring. The code calculates the sum of the values in two arrays.

    public int sum(int[] values1, int[] values2) {
        int sum1 = 0;
        int sum2 = 0;
        // First copy of the summation loop
        for (int i = 0; i < values1.length; i++) {
            sum1 += values1[i];
        }
        // Second copy: a clone of the loop above
        for (int i = 0; i < values2.length; i++) {
            sum2 += values2[i];
        }
        return sum1 + sum2;
    }

The loop that performs the actual calculation can be extracted into a separate function.

    public int sum(int[] values) {
        int sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

The new function can be called once for each array. The clones are thereby removed and the redundancy is reduced.

    public int sum(int[] values1, int[] values2) {
        return sum(values1) + sum(values2);
    }

Web links

Robert Tairas: Code Clones Literature. Accessed on 28 August 2013.
