DistCp uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

The legacy implementation of DistCp has its share of quirks and drawbacks, both in its usage and in its extensibility and performance. The purpose of the DistCp refactor was to fix these shortcomings, enabling it to be used and extended programmatically. New paradigms have been introduced to improve runtime and setup performance, while retaining the legacy behaviour as default.

This document aims to describe the design of the new DistCp, its new features, their optimal use, and any deviation from the legacy implementation.

The most common invocation of DistCp is an inter-cluster copy. When copying from multiple sources, DistCp will abort the copy with an error message if two sources collide, but collisions at the destination are resolved per the options specified. By default, files already existing at the destination are skipped. A count of skipped files is reported at the end of each job, but it may be inaccurate if a copier failed for some subset of its files and then succeeded on a later attempt.

It is important that each NodeManager can reach and communicate with both the source and destination file systems. For HDFS, both the source and destination must be running the same version of the protocol or use a backwards-compatible protocol (see Copying Between Versions of HDFS).

After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both Map/Reduce and the FileSystem API, issues in or between any of the three could adversely and silently affect the copy. Some have had success running with -update enabled to perform a second pass, but users should be acquainted with its semantics before attempting this.

It’s also worth noting that if another client is still writing to a source file, the copy will likely fail. Attempting to overwrite a file being written at the destination should also fail on HDFS. If a source file is (re)moved before it is copied, the copy will fail with a FileNotFoundException.

Please refer to the detailed Command Line Reference for information on all the options available in DistCp.

-update is used to copy files from source that don’t exist at the target or differ from the target version. -overwrite overwrites target-files that exist at the target. The Update and Overwrite options warrant special attention since their handling of source-paths varies from the defaults in a very subtle manner.

Consider a copy from /source/first/ and /source/second/ to /target/, where the source paths between them contain files named 1, 2, 10 and 20. If -update is used, 1 is skipped because the file-length and contents match. 2 is copied because it doesn’t exist at the target. 10 and 20 are overwritten since the contents don’t match the source.
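The skip, -update and -overwrite behaviour described above can be sketched as a small per-file decision function. This is only an illustration of the rules as stated in the text, not DistCp's actual implementation: the `FileInfo` type, the `decide` function, the mode names and the sample lengths and checksums are all hypothetical, and "contents match" is modelled here as a simple checksum comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for file metadata; not a Hadoop type."""
    length: int
    checksum: str


def decide(source: FileInfo, target: Optional[FileInfo], mode: str = "default") -> str:
    """Return "copy", "skip" or "overwrite" for a single source file.

    mode is "default", "update" or "overwrite" (names chosen for this sketch).
    target is None when the file does not exist at the destination.
    """
    if mode not in ("default", "update", "overwrite"):
        raise ValueError(f"unknown mode: {mode}")
    if target is None:
        # Nothing at the destination: the file is copied in every mode.
        return "copy"
    if mode == "overwrite":
        # Existing target files are always replaced.
        return "overwrite"
    if mode == "update":
        # Skip only when both file length and contents match.
        same = source.length == target.length and source.checksum == target.checksum
        return "skip" if same else "overwrite"
    # Default: files already existing at the destination are skipped.
    return "skip"


# Made-up metadata mirroring the example: 1 matches, 2 is missing at the
# target, 10 and 20 differ from the source.
source_files = {
    "1": FileInfo(32, "aa"),
    "2": FileInfo(32, "bb"),
    "10": FileInfo(32, "cc"),
    "20": FileInfo(32, "dd"),
}
target_files = {
    "1": FileInfo(32, "aa"),
    "10": FileInfo(64, "xx"),
    "20": FileInfo(32, "yy"),
}

plan = {
    name: decide(info, target_files.get(name), mode="update")
    for name, info in source_files.items()
}
print(plan)
```

With the sample data, the plan for the update mode skips 1, copies 2, and overwrites 10 and 20, which matches the outcome described in the text. Running the same data with the default mode would skip 1, 10 and 20 and copy only 2.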