The Hortonworks HDP distribution will soon be deprecated in favor of Cloudera’s CDP. One of our clients wanted a new Apache Hive feature backported into HDP 2.6.0. We thought it was a good opportunity to try to manually rebuild HDP 2.6.0’s Hive version. This article will go through the process of backporting a list of patches provided by Hortonworks on top of the open source Hive project in the most automated way.
Re-building HDP 2.6.0’s Hive
Our first step was to see if we were able to build Apache Hive as it is in HDP 2.6.0.
For each HDP release, Hortonworks (now Cloudera) describes the list of patches applied on top of a given component’s release in the release notes. We will retrieve here the list of patches applied on top of Apache Hive 2.1.0 for HDP 2.6.0 and put them in a text file:
cat hdp_2_6_0_0_patches.txt
HIVE-9941
HIVE-12492
HIVE-14214
...
...
HIVE-16319
HIVE-16323
HIVE-16325
We know there are 120 of them:
wc -l hdp_2_6_0_0_patches.txt
120 hdp_2_6_0_0_patches.txt
Now comes the question: how do we apply those patches?
Applying the patches in the order provided by Hortonworks
The first step is to clone the Apache Hive repository and check out the version HDP 2.6.0 embeds, which is the 2.1.0 release:
git clone https://github.com/apache/hive.git
git checkout rel/release-2.1.0
Our first guess was that the patches were listed in HDP’s release notes in the chronological order in which they were merged into Hive’s master branch, so we tried to fetch the patches and apply them in that order.
For each patch in the list, we need to fetch the patch file attached to the associated JIRA issue and run the git apply command against it. This can be tedious for a large number of patches, so we will see later in this article how to automate it.
Let’s start small by trying to apply the first patch of our list: HIVE-9941. We can see in the associated JIRA issue that there are multiple attachments. We will pick the most recent one and apply it:
wget https://issues.apache.org/jira/secure/attachment/12833864/HIVE-9941.3.patch
git apply HIVE-9941.3.patch
HIVE-9941.3.patch:17: new blank line at EOF.
+
HIVE-9941.3.patch:42: new blank line at EOF.
+
HIVE-9941.3.patch:71: new blank line at EOF.
+
HIVE-9941.3.patch:88: new blank line at EOF.
+
warning: 4 lines add whitespace errors.
The git apply command gave us a few warnings about whitespace and blank lines, but the patch applied successfully. This is confirmed with git status:
git status
HEAD detached at rel/release-2.1.0
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	HIVE-9941.3.patch
	ql/src/test/queries/clientnegative/authorization_alter_drop_ptn.q
	ql/src/test/queries/clientnegative/authorization_export_ptn.q
	ql/src/test/queries/clientnegative/authorization_import_ptn.q
	ql/src/test/queries/clientnegative/authorization_truncate_2.q
	ql/src/test/results/clientnegative/authorization_alter_drop_ptn.q.out
	ql/src/test/results/clientnegative/authorization_export_ptn.q.out
	ql/src/test/results/clientnegative/authorization_import_ptn.q.out
	ql/src/test/results/clientnegative/authorization_truncate_2.q.out

nothing added to commit but untracked files present (use "git add" to track)
In this case, the patch only created new files and did not modify any existing ones, but it worked regardless.
Moving on to the second patch of the list, HIVE-12492:
wget https://issues.apache.org/jira/secure/attachment/12854164/HIVE-12492.02.patch
git apply HIVE-12492.02.patch
HIVE-12492.02.patch:362: trailing whitespace.
Map 1
[...]
error: src/java/org/apache/hadoop/hive/conf/HiveConf.java: No such file or directory
error: src/test/resources/testconfiguration.properties: No such file or directory
error: src/java/org/apache/hadoop/hive/ql/optimizer/ConvertJoinMapJoin.java: No such file or directory
error: src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java: No such file or directory
error: src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java: No such file or directory
This time, we encountered multiple No such file or directory errors. Tracking down the cause of this error is somewhat tricky. Let us show why, and why it makes the whole patching process hard to automate. Once we download and open the HIVE-12492.02.patch file, we see at the beginning that the “No such file or directory” error applies to the following file:
diff --git common/src/java/org/apache/hadoop/hive/conf/HiveConf.java common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
index 3777fa9..08422d5 100644
--- common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
+++ common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
@@ -1422,6 +1422,11 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "This controls how many partitions can be scanned for each partitioned table.\n" +
[...]
But the error says that src/java/org/apache/hadoop/hive/conf/HiveConf.java does not exist, instead of common/src/java/org/apache/hadoop/hive/conf/HiveConf.java. Notice how the second path is the first one prefixed with common. This is because git apply has a -p option (see the git apply documentation) which removes leading components from the file paths. By default it is set to 1 and not 0, removing the first directory, common in our case.
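To make this path-stripping behavior concrete, here is a small shell sketch. The strip_components helper is our own illustration, not a git command; it mimics how git apply -pN trims N leading components before locating the target file:

```shell
# strip_components mimics how `git apply -pN` trims N leading
# path components before locating the file to patch.
# (Hypothetical helper for illustration, not part of git.)
strip_components() {
  local n=$1 path=$2
  while [ "$n" -gt 0 ]; do
    path=${path#*/}   # drop one leading directory
    n=$((n - 1))
  done
  printf '%s\n' "$path"
}

# Default -p1 behavior: the first component is dropped.
strip_components 1 common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
# → src/java/org/apache/hadoop/hive/conf/HiveConf.java

# With -p0, the path is used as-is.
strip_components 0 common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
# → common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
```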
We did not encounter this problem with the previous patch because its diff used the a/ and b/ prefixes in the file paths, for example:
diff --git a/ql/src/test/queries/clientnegative/authorization_alter_drop_ptn.q b/ql/src/test/queries/clientnegative/authorization_alter_drop_ptn.q
Let’s try again with the -p0 option:
git apply -p0 HIVE-12492.02.patch
HIVE-12492.02.patch:362: trailing whitespace.
Map 1
[...]
error: patch failed: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:1422
error: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java: patch does not apply
error: patch failed: itests/src/test/resources/testconfiguration.properties:501
error: itests/src/test/resources/testconfiguration.properties: patch does not apply
error: patch failed: ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConvertJoinMapJoin.java:53
error: ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConvertJoinMapJoin.java: patch does not apply
error: patch failed: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:51
error: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java: patch does not apply
This is better, but the patch still won’t apply. This time we get a patch does not apply error, which clearly indicates that the patch cannot be applied to the current state of the code.
Looking back at the JIRA list given by Hortonworks, we realized the issues were actually listed in alphanumerical order. Take the example of two issues such as HIVE-14432 and another one created after it: the first one was created before the second, but the patch for the latter was released before the former’s. If these two patches were to modify the same file, it could lead to the error we just encountered.
Applying the patches in alphanumerical order seems like a bad choice in this case. We will now order the patches chronologically, according to the date when they were committed to the master branch.
Applying the JIRA patches in the commit order
Let’s try to iterate through every patch to see how many patches we can apply by going with this strategy.
Getting and applying every patch file automatically from each JIRA’s attachments can be tedious:
- There are 120 of them
- We have just seen that some patches apply at the -p0 level while others apply at -p1
- Some JIRAs indicated by Hortonworks do not have any attachments (e.g. HIVE-16238)
- Some JIRAs have attachments that are not patches (e.g. HIVE-16323 has a screenshot attached next to the patches)
- Some JIRAs have patches in .txt format instead of .patch
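The -p0 versus -p1 ambiguity can at least be detected automatically: diffs generated by git diff carry the a/ and b/ path prefixes, while plain diffs do not. Here is a shell sketch of such a heuristic; patch_level is our own helper and the two sample patch headers are fabricated for the demonstration:

```shell
# patch_level guesses the `git apply` -p level from the diff header:
# headers with the "a/" prefix apply at the default -p1 level,
# headers without it apply at -p0.
patch_level() {
  if grep -q '^diff --git a/' "$1"; then
    echo "-p1"
  else
    echo "-p0"
  fi
}

# Fabricated sample headers illustrating both cases:
printf 'diff --git a/ql/src/Foo.java b/ql/src/Foo.java\n' > prefixed.patch
printf 'diff --git ql/src/Foo.java ql/src/Foo.java\n' > unprefixed.patch
patch_level prefixed.patch    # → -p1
patch_level unprefixed.patch  # → -p0
```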
Here is an excerpt (in CoffeeScript) from the script we made to get the latest version of each patch from JIRA:
generate_patch_manifest = ->
  patches_manifest = hdp_2_6_0_0_patches
  .map (patch) ->
    # Each entry has the form "HIVE-XXXXX: commit description"
    [jira_id, commit_description] = patch.split /:(.+)/
    response = await axios.get "https://issues.apache.org/jira/rest/api/2/search?jql=key=#{jira_id}&fields=attachment"
    attachments = response.data.issues[0].fields.attachment
    .filter (o) -> o.filename.includes jira_id
    most_recent_created_time = Math.max.apply Math, attachments.map (o) -> Date.parse o.created
    [most_recent_created_attachment] = attachments.filter (o) -> Date.parse(o.created) is most_recent_created_time
    id: jira_id
    description: commit_description
    created: Date.parse most_recent_created_attachment.created
    url: most_recent_created_attachment.content
  .sort (a, b) -> a.created - b.created
  fs.writeFile './patch_manifest.yml', yaml.safeDump(patches_manifest), (err) ->
    if err then console.error err else console.info 'done'
This method generates a patch_manifest.yml file that looks like:
- id: $JIRA_NUMBER
  description: $COMMIT_MESSAGE
  created: $TIMESTAMP
  url: $LATEST_PATCH_URL
Creating this manifest will help us apply the patches in the right order.
Once we have built this manifest, we iterate through the list, download and try to apply as many patches as we can. Here is another code excerpt doing that:
for patch, index in patches_manifest
  try
    await nikita.execute """
    # Download the patch file
    wget -q -N #{patch.url}
    # If the patch has the "a/" leading part in the files, apply with -p1
    if grep "git \\\\ba/" patches/#{patch.url.split('/').pop()}; then
      git apply patches/#{patch.url.split('/').pop()}
    # If not, then apply with -p0
    else
      git apply -p0 patches/#{patch.url.split('/').pop()}
    fi
    # Commit changes
    git add -A
    git commit -m "Applied #{patch.url.split('/').pop()}"
    """
    console.info "Patch #{patch.url.split('/').pop()} applied successfully"
  catch err
    console.error "Patch #{patch.url.split('/').pop()} failed to apply"
This did not go as well as expected: only 19 out of the 120 patches in the list applied successfully. All failures were patch does not apply errors, which could be because:
- We are still not applying the fixes in the right order
- We are missing some code prerequisites (i.e. the JIRA list is incomplete)
- Some code needs to be modified to be backported
- All of the above
In the next part, we will use another strategy by cherry-picking the commits related to the HIVE-XXXXX JIRAs we want to have in our build.
Applying the patches in pushed-to-master chronological order
Every commit message in Apache Hive has the related JIRA issue in its description:
In this attempt, we are going to get the list of all the commits pushed to the Apache Hive master branch, in order to generate, in chronological order, the list of commits containing the patches we want to apply.
git checkout master
git log --pretty=format:'%at;%H;%s' > apache_hive_master_commits.csv
The git log command with the right parameters generates a CSV file with the format:
head apache_hive_master_commits.csv
1597212601;a430ac31441ad2d6a03dd24e3141f42c79e022f4;HIVE-23995:Don't set location for managed tables in case of replication (Aasha Medhi, reviewed by Pravin Kumar Sinha)
1597206125;ef47b9eb288dec6e6a014dd5e52accdf0ac3771f;HIVE-24030: Upgrade ORC to 1.5.10 (#1393)
1597144690;d4af3840f89408edea886a28ee4ae7d79a6f16f8;HIVE-24014: Need to delete DumpDirectoryCleanerTask (Arko Sharma, reviewed by Aasha Medhi)
1597135126;71c9af7b8ead470520a6c3a4848be9c67eb80f10;HIVE-24001: Don't cache MapWork in tez/ObjectCache during query-based compaction (Karen Coppage, reviewed by Marta Kuczora)
1597108110;af911c9fe0ff3bb49d80f6c0109d30f5e1046849;HIVE-23996: Remove unused line in UDFArgumentException (Guo Philipse reviewed by Peter Vary, Jesus Camacho Rodriguez)
1597107811;17ee33f4f4d8dc9437d1d3a47663a635d2e47b58;HIVE-23997: Some logs in ConstantPropagateProcFactory are not straightforward (Zhihua Deng, reviewed by Jesus Camacho Rodriguez)
1596833809;5d9a5cf5a36c1d704d2671eb57547ea50249f28b;HIVE-24011: Flaky test AsyncResponseHandlerTest ( Mustafa Iman via Ashutosh Chauhan)
1596762366;9fc8da6f32b68006c222ccfd038719fc89ff8550;HIVE-24004: Improve performance for filter hook for superuser path(Sam An, reviewed by Naveen Gangam)
1595466890;7293956fc2f13ccbe72825b67ce1b53dce536359;HIVE-23901: Overhead of Logger in ColumnStatsMerger damage the performance
1593631609;4457c3ec9360650be021ea84ed1d5d0f007d8308;HIVE-22934 Hive server interactive log counters to error stream ( Ramesh Kumar via Ashutosh Chauhan)
This next script keeps only the commits we want to apply by matching the JIRA ids provided by Hortonworks. Then we sort the CSV by date, from oldest to newest:
cat hdp_2_6_0_0_patches.txt | while read in; do grep $in apache_hive_master_commits.csv; done > commits_to_apply.csv
sort commits_to_apply.csv > sorted_commits_to_apply.csv
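A note on the sort above: it works because the 10-digit epoch timestamps happen to sort identically in lexicographic and numeric order. An explicit numeric sort on the first field is more defensive; here is a sketch on a few sample rows (the hashes are truncated for readability):

```shell
# Demo of the ordering step on shortened sample rows.
cat > commits_demo.csv <<'EOF'
1597212601;a430ac3;HIVE-23995
1469348583;b28ec7f;HIVE-14214
1597108110;af911c9;HIVE-23996
EOF

# Sort numerically on the first ';'-separated field (the commit date).
sort -t ';' -k1,1n commits_demo.csv > sorted_demo.csv
head -1 sorted_demo.csv
# → 1469348583;b28ec7f;HIVE-14214 (the oldest commit comes first)
```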
Now that we have an ordered list of commits to apply, we can go back to the 2.1.0 release tag and cherry-pick these commits.
Cherry-picking seems like a better solution than applying patches in our situation. Indeed, we have just seen that some patches could not be applied because of missing code dependencies.
Let’s try with the first one, the JIRA HIVE-14214:

head -1 sorted_commits_to_apply.csv
1469348583;b28ec7fdd8317b47973c6c8f7cdfe805dc20a806;HIVE-14214: ORC Schema Evolution and Predicate Push Down do not work together (no rows returned) (Matt McCline, reviewed by Prasanth Jayachandran/Owen O'Malley)

git cherry-pick -x 15dd7f19bc49ee7017fa5bdd65f9b4ec4dd019d2 --strategy-option theirs
CONFLICT (modify/delete): ql/src/test/results/clientpositive/tez/explainanalyze_5.q.out deleted in HEAD and modified in 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong). Version 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong) of ql/src/test/results/clientpositive/tez/explainanalyze_5.q.out left in tree.
[...]
CONFLICT (modify/delete): ql/src/test/queries/clientpositive/explainanalyze_1.q deleted in HEAD and modified in 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong). Version 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong) of ql/src/test/queries/clientpositive/explainanalyze_1.q left in tree.
error: could not apply 15dd7f19bc... HIVE-15084: Flaky test: TestMiniTezCliDriver:explainanalyze_1, 2, 3, 4, 5 (Pengcheng Xiong)
We removed some of the output, but the takeaway here is that we have conflicts. git cherry-pick tries to git merge a commit from the official Hive repository into our current local branch. The benefit of cherry-picking is that we compare two versions of a file instead of just applying a diff, leaving us the choice of how to merge the content.
To do that, we add the --strategy-option theirs parameter to the cherry-pick command. This solves most conflicts, but we can still encounter some like:
Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)

	deleted by us:   llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java
	added by them:   llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/VectorDeserializeOrcWriter.java
The conflict resolution must then be done manually with git add or git rm, depending on the situation.
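The unmerged paths can be extracted from the git status output with sed and fed to git add. Here is a dry run on a captured status excerpt; the sed pattern is one possible choice, tuned to the "deleted by us" / "added by them" lines:

```shell
# Captured excerpt of a `git status` conflict report.
cat > status_excerpt.txt <<'EOF'
        deleted by us:   llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java
        added by them:   llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/VectorDeserializeOrcWriter.java
EOF

# Keep only the file paths after the "... by ...:" markers;
# in a real run, pipe the result to `xargs git add`.
sed -n 's/.* by .*: *//p' status_excerpt.txt
```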
Using the cherry-picking method and the manual resolutions for some patches, we were able to import all 120 JIRA patches in our code branch.
Important note: we tried to automate as much as possible, but keep in mind that running the unit tests after patching is strongly recommended before building and applying the fix to any production cluster. We will see that in the next part.
Here is the script automating the cherry-picking and the trivial conflict resolutions:

git checkout rel/release-2.1.0
while read line; do
  commit=$(echo $line | awk -F "\"*;\"*" '{print $2}')
  if git cherry-pick -x $commit --strategy-option theirs > patches/logs/cherry_pick_$commit.out 2>&1; then
    echo "Cherry picking $commit SUCCEEDED"
  else
    git status > patches/logs/cherry_pick_$commit.out 2>&1
    git status | sed -n 's/.* by .*://p' | xargs git add
    git -c core.editor=true cherry-pick --continue
    echo "Cherry picking $commit SUCCEEDED with complications"
  fi
done < sorted_commits_to_apply.csv
With the script above, we are able to apply 100% of the patches. In the next part, we will go over the steps to build and test the Hive distribution.
Before running the unit tests, let’s try to see if we can actually build the Hive release:
mvn clean package -Pdist -DskipTests
After a few minutes, the build stops with the following error:
[ERROR] /home/leo/Apache/hive/storage-api/src/test/org/apache/hadoop/hive/ql/exec/vector/TestStructColumnVector.java:[106,5] cannot find symbol
  symbol:   class Timestamp
  location: class org.apache.hadoop.hive.ql.exec.vector.TestStructColumnVector
Looking at the file ./storage-api/src/test/org/apache/hadoop/hive/ql/exec/vector/TestStructColumnVector.java, the class java.sql.Timestamp is indeed not imported.
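One way to unblock the build would be to add the missing import ourselves. The fix_timestamp_import helper below is our own sketch (it assumes GNU sed for the in-place edit) and is demonstrated on a stub file, not on the real TestStructColumnVector.java:

```shell
# Add `import java.sql.Timestamp;` after the package declaration
# if it is not already present. Assumes GNU sed (-i without suffix).
fix_timestamp_import() {
  local f=$1
  grep -q '^import java\.sql\.Timestamp;' "$f" || \
    sed -i '/^package /a import java.sql.Timestamp;' "$f"
}

# Demo on a stub file standing in for TestStructColumnVector.java:
printf 'package org.apache.hadoop.hive.ql.exec.vector;\n\npublic class T {}\n' > T.java
fix_timestamp_import T.java
sed -n '2p' T.java
# → import java.sql.Timestamp;
```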
This file was modified by cherry-picking HIVE-16245, a patch that changes 3 files.
How did Hortonworks apply this specific patch without breaking the build? In the next part, we will dig into Hortonworks’ public Hive repository to find clues on how they applied the patches listed for HDP 2.6.0.0.
Exploring Hortonworks’ hive2-release project
On GitHub, Hortonworks keeps a public repository for every component in the HDP stack. The one we want to check is hive2-release:
git clone https://github.com/hortonworks/hive2-release.git
cd hive2-release && ls
README.md
The repository seems empty at first but the files can actually be found by checking out the tags. There is one tag per HDP release.
git checkout HDP-2.6.0.3-8-tag
Let’s try to find out how Hortonworks applied the patch which provoked the build error we have seen earlier.
git log --pretty=format:'%at;%H;%s' | grep HIVE-16245
1489790314;2a9150604edfbc2f44eb57104b4be6ac3ed084c7;BUG-77383 backport HIVE-16245: Vectorization: Does not handle non-column key expressions in MERGEPARTIAL mode
Now, let’s see which files were modified by this commit:
git diff 2a9150604edfbc2f44eb57104b4be6ac3ed084c7^ | grep diff
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java b/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
diff --git a/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch.java b/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch.java
As opposed to what is in the patch (and was pushed to Apache Hive’s master branch), the backport operated by Hortonworks only changes 2 out of the 3 files. There seems to be some black magic involved, or the process is manual and not fully documented. The file that provoked the build error we encountered earlier is not changed here.
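The difference is easy to pinpoint by comparing the two lists of touched files. Assuming the upstream patch touched the three files below (the first two appear in the Hortonworks diff, the third being the test file that broke our build), comm -23 on the sorted lists isolates what the backport dropped:

```shell
# Files touched by the upstream HIVE-16245 patch (assumed list).
cat > upstream_files.txt <<'EOF'
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch.java
storage-api/src/test/org/apache/hadoop/hive/ql/exec/vector/TestStructColumnVector.java
EOF

# Files touched by the Hortonworks backport commit.
cat > hortonworks_files.txt <<'EOF'
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatch.java
EOF

# comm -23 prints lines present only in the first (sorted) file.
comm -23 upstream_files.txt hortonworks_files.txt
# → storage-api/src/test/org/apache/hadoop/hive/ql/exec/vector/TestStructColumnVector.java
```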
Backporting a list of patches to an open source project like Apache Hive is hard to automate. With the HDP distribution and depending on the patches, Hortonworks has applied some manual changes to the source code or its engineers were working on a forked version of Hive. A good understanding of the global project structure (modules, unit tests, etc.) and its ecosystem (dependencies) is required in order to build a functional and tested release.