<!DOCTYPE HTML>
<html lang="zh-Hans">
<head>
<meta charset="utf-8">
<title>Oh Captain, My Captain - Du00</title>
<meta name="author" content="Du00">
<meta name="description" content="Du00's blog">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<meta property="og:site_name" content="Oh Captain, My Captain - Du00"/>
<meta property="og:image" content="undefined"/>
<meta http-equiv="Content-Language" content="zh-Hans"/>
<link href="/img/favicon.png" rel="icon">
<link rel="apple-touch-icon" href="/img/apple-icon.png">
<link rel="apple-touch-icon-precomposed" href="/img/apple-icon.png">
<link rel="alternate" href="/atom.xml" title="Oh Captain, My Captain - Du00" type="application/atom+xml">
<link rel="stylesheet" href="/css/style.css" media="screen" type="text/css">
<style type="text/css">
/* Tim Pietrusky advanced checkbox hack (Android <= 4.1.2) */
body{ -webkit-animation: bugfix infinite 1s; }
@-webkit-keyframes bugfix { from {padding:0;} to {padding:0;} }
/* Chinese readability improvements */
article {font-weight: 400;letter-spacing: .01rem;}
article .entry{line-height:2;}
article .post-content-index .entry{ overflow:hidden;}
</style>
<!--[if lt IE 9]><script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]-->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-56718947-1', 'auto');
ga('send', 'pageview');
</script>
<!-- 360 Font and Baidu CDN in China -->
<link href='http://fonts.useso.com/css?family=Open+Sans:300,400|Playball' rel='stylesheet' type='text/css'>
<link href='http://apps.bdimg.com/libs/fontawesome/4.1.0/css/font-awesome.css' rel='stylesheet' type='text/css'>
<script src="http://libs.baidu.com/jquery/1.11.1/jquery.min.js"></script>
</head>
<body>
<header id="header" class="inner"><div class="padding">
<div class="alignleft logo">
<h1><a href="/">Oh Captain, My Captain - Du00</a></h1>
</div>
<nav id="main-nav" class="alignright">
<input type="checkbox" id="toggle" />
<label for="toggle" class="toggle" data-open="Main Menu" data-close="Close Menu" onclick><i class="fa fa-bars"></i></label>
<ul class="menu">
<li><a href="/">Home</a></li>
<li><a href="/archives">Archives</a></li>
</ul>
</nav>
<div class="clearfix"></div>
</div>
</header>
<div id="page-heading-wrap">
<div class="inner">
<div class="padding">
<h2>Qzone, Tianya, 163, Baidu, Sina: picked through every cold branch, unwilling to roost on any</h2>
</div>
</div>
</div>
<div id="content" class="inner">
<div id="main-col" class="alignleft"><div id="wrapper" class="padding">
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2015/03/spark-map-mapvalues/">Learning Spark: map &amp; mapValues</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2015-03-27T12:31:00.000Z">2015-03-27</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2015/03/spark-map-mapvalues/#ds-thread"><span class="ds-thread-count" data-thread-key="2015/03/spark-map-mapvalues/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
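<p> Sometimes I wonder whether never having read the source code would be something to regret; then I try reading the Spark source and find that I cannot follow it...</p>
<p> <code>map</code> and <code>mapValues</code> differ more than their names suggest. The key difference is that <code>mapValues</code> transforms only the second element of each Tuple2 and leaves the first element, the key, unchanged (obviously). Conclusions first:</p>
<ul>
<li><code>mapValues</code> transforms the data without changing how it is partitioned, so the result keeps the parent's partitioner and later operations such as join/reduce can avoid an unnecessary shuffle. This is worth knowing if you want a narrow dependency;</li>
<li>after <code>map</code> the partitioner is lost, because the function is free to change the keys; later join/reduce-style operations will therefore shuffle.</li>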
</ul>
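<p>A minimal sketch of the two conclusions above, assuming a Spark 1.x dependency on the classpath and a local master. The object name <code>PartitionerDemo</code>, the app name and the sample data are made up for illustration and do not come from the Spark source:</p>
<figure class="highlight scala"><table><tr><td class="code"><pre>import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.0-1.2

object PartitionerDemo {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-demo")
    val sc = new SparkContext(conf)
    try {
      // Give the pair RDD an explicit partitioner.
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
        .partitionBy(new HashPartitioner(4))

      // mapValues cannot touch the key, so the partitioner is carried over.
      val viaMapValues = pairs.mapValues(_ + 1)
      println(viaMapValues.partitioner.isDefined) // prints true

      // map may rewrite the key, so the result has no partitioner.
      val viaMap = pairs.map { case (k, v) =&gt; (k, v + 1) }
      println(viaMap.partitioner.isDefined) // prints false
    } finally {
      sc.stop()
    }
  }
}
</pre></td></tr></table></figure>
<p>Printing <code>rdd.partitioner</code> at a few points in a job is a cheap way to check whether a later join or reduceByKey can reuse the existing partitioning.</p>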
<p>How map/mapValues show up in the Spark source:<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="comment">// partitioner as declared in RDD</span></span><br><span class="line"> <span class="comment">/** Optionally overridden by subclasses to specify how they are partitioned. */</span></span><br><span class="line"> <span class="annotation">@transient</span> <span class="function"><span class="keyword">val</span> <span class="title">partitioner</span>:</span> <span class="type">Option</span>[<span class="type">Partitioner</span>] = <span class="type">None</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// MappedRDD</span></span><br><span class="line"><span class="keyword">private</span>[spark]</span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">MappedRDD</span>[</span><span class="type">U</span>: <span class="type">ClassTag</span>, <span class="type">T</span>: <span class="type">ClassTag</span>](prev: <span class="type">RDD</span>[<span class="type">T</span>], f: <span class="type">T</span> =&gt; <span class="type">U</span>)</span><br><span class="line"> <span class="keyword">extends</span> <span class="type">RDD</span>[<span class="type">U</span>](prev) {</span><br><span class="line"></span><br><span class="line"> <span class="keyword">override</span> <span class="function"><span class="keyword">def</span> <span class="title">getPartitions</span>:</span> <span class="type">Array</span>[<span class="type">Partition</span>] = firstParent[<span class="type">T</span>].partitions</span><br><span class="line"></span><br><span class="line"> <span class="keyword">override</span> <span class="function"><span class="keyword">def</span> <span class="title">compute</span>(</span>split: <span class="type">Partition</span>, context: <span class="type">TaskContext</span>) =</span><br><span class="line"> firstParent[<span class="type">T</span>].iterator(split, context).map(f)</span><br><span class="line">}</span><br></pre></td></tr></table></figure></p>
<!-- more >
The code behind map lives in RDD. Note that the partitioner field there is marked `@transient`; for what the annotation means, see:
<blockquote><p>Finally, Scala provides a @transient annotation for fields that should not be serialized at all. If you mark a field as @transient, then the framework should not save the field even when the surrounding object is serialized. When the object is loaded, the field will be restored to the default value for the type of the field annotated as @transient.</p>
<footer><strong>Annotations</strong><cite><a href="https://www.artima.com/pins1ed/annotations.html">Programming in Scala</a></cite></footer></blockquote>
<p> <code>@transient</code> only affects serialization. What matters here is that MappedRDD does not override <code>partitioner</code>, so it keeps the default <code>None</code>; RDDs produced by map therefore carry no partitioner information. Now compare <code>MappedValuesRDD</code>:<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="keyword">private</span>[spark]</span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">MappedValuesRDD</span>[</span><span class="type">K</span>, <span class="type">V</span>, <span class="type">U</span>](prev: <span class="type">RDD</span>[_ &lt;: <span class="type">Product2</span>[<span class="type">K</span>, <span class="type">V</span>]], f: <span class="type">V</span> =&gt; <span class="type">U</span>)</span><br><span class="line"> <span class="keyword">extends</span> <span class="type">RDD</span>[(<span class="type">K</span>, <span class="type">U</span>)](prev) {</span><br><span class="line"></span><br><span class="line"> <span class="keyword">override</span> <span class="function"><span class="keyword">def</span> <span class="title">getPartitions</span> =</span> firstParent[<span class="type">Product2</span>[<span class="type">K</span>, <span class="type">U</span>]].partitions</span><br><span class="line"></span><br><span class="line"> <span class="keyword">override</span> <span class="function"><span class="keyword">val</span> <span class="title">partitioner</span> =</span> firstParent[<span class="type">Product2</span>[<span class="type">K</span>, <span class="type">U</span>]].partitioner</span><br><span class="line"></span><br><span class="line"> <span class="keyword">override</span> <span class="function"><span class="keyword">def</span> <span class="title">compute</span>(</span>split: <span class="type">Partition</span>, context: <span class="type">TaskContext</span>): <span class="type">Iterator</span>[(<span class="type">K</span>, <span class="type">U</span>)] = {</span><br><span class="line"> firstParent[<span class="type">Product2</span>[<span class="type">K</span>, <span class="type">V</span>]].iterator(split, context).map { pair =&gt; (pair._1, f(pair._2)) }</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure></p>
<p> In MappedValuesRDD, <code>partitioner</code> is taken directly from the first parent, so the RDD's partitioner information is preserved.</p>
-->
</div>
<footer>
<div class="alignright">
<a href="/2015/03/spark-map-mapvalues/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2014/12/spark-tips-01/">Spark troubleshooting notes</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2014-12-12T09:46:00.000Z">2014-12-12</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2014/12/spark-tips-01/#ds-thread"><span class="ds-thread-count" data-thread-key="2014/12/spark-tips-01/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
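<p> Back in the day MapReduce tormented me to no end; switching to Spark just continued the torment. These notes are only a memo to myself.</p>
<h3 id="1-_Spark/Scala在Eclipse中的设置">1. Setting up Spark/Scala in Eclipse</h3><p> IntelliJ IDEA has a corresponding plugin too, which I will not cover here. With Eclipse you can install the Scala IDE plugin into an existing installation, but I still recommend the standalone Scala IDE bundle, specifically the <a href="http://scala-ide.org/download/milestone.html" target="_blank" rel="external">Milestone</a> release. Unlike hardware drivers, new IDE versions usually bring better features, so there is no need to cling to an old version; besides, open-source projects do not seem to maintain old versions much. </p>
<p> Scala projects in Eclipse come with a few problems that are basically unavoidable: </p>
<ol>
<li>First, <strong>how do you create a Scala Maven project?</strong><br> See the <a href="/2014/11/11/2014-11-11-spark-scala-introduction/">Spark/Scala getting-started material</a>: create a new project and configure the Scala Maven compiler plugin in the pom.</li>
<li><strong>What needs configuring after importing a Scala Maven project?</strong><br>It varies from case to case; go through this checklist item by item:<ul>
<li>The project is not recognized as a Scala project (the project folder icon shows M and J badges but no S): <strong>add the Scala nature to the project</strong><br><code>Project -&gt; right-click -&gt; Configure -&gt; Add Scala Nature</code></li>
<li>The Scala code is not recognized (it is not shown as Java-style packages): add the Scala code path as a Source Folder<br>For example, if the Scala sources are under src/main/scala: <code>Scala source folder -&gt; right-click -&gt; Build Path -&gt; Use as Source Folder</code></li>
<li>Possible compatibility problem: change the JDK compliance level from 1.5 to 1.6 via <code>JRE System Library -&gt; right-click -&gt; Properties -&gt; select JavaSE-1.6</code> (Thrift, for example, needs at least JDK 1.6; if nothing is flagged in red you can ignore this)</li>
</ul>
</li>
<li><strong>Scala version conflicts among dependencies</strong><br> The Problems view may show a red error saying that xx.jar was compiled with 2.10 while you are using 2.11. Select the problem, then <code>Ctrl+1 or right-click -&gt; Quick Fix -&gt; choose the matching version (2.10) from the Scala Installation drop-down</code></li>
<li><strong>The IDE is slow during Build workspace or reports insufficient memory</strong><br> Raise the maximum heap size; this really is worth doing. <code>Menu bar -&gt; Scala -&gt; Run Setup Diagnostics -&gt; check Use recommended default settings</code>. If Heap settings already shows 1.5 GB or more, nothing needs changing; otherwise edit eclipse.ini in the installation folder and change 512m to 2048m.</li>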
</ol>
<!-- more >
### 2. org.apache.spark.SparkException: Error communicating with MapOutputTracker
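Problem: the job succeeds on a small amount of data, but fails with this error once the data grows by a factor of several dozen. Here is the official description of the [spark.akka.frameSize](http://spark.apache.org/docs/latest/configuration.html) parameter:
>Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB. Increase this if your tasks need to send back large results to the driver (e.g. using collect() on a large dataset).
With that explanation the adjustment becomes clear: 1. relax the message size limit; 2. make the messages smaller (the configuration below does this by switching to Kryo serialization).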
<figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">sparkConf</span> =</span> <span class="keyword">new</span> <span class="type">SparkConf</span>()</span><br><span class="line"> .setAppName(<span class="string">"Host Access Freq Stats"</span>)</span><br><span class="line"> .set(<span class="string">"spark.serializer"</span>, <span class="string">"org.apache.spark.serializer.KryoSerializer"</span>)</span><br><span class="line"> .set(<span class="string">"spark.akka.frameSize"</span>, <span class="string">"30"</span>); <span class="comment">// the default is 10 (MB)</span></span><br></pre></td></tr></table></figure>
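<p>If that does not solve it, look for other causes: for example, are there too many file blocks (10,000+)? You can pass an explicit partition count, for instance to reduceByKey:<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line">rdd.reduceByKey((a, b) =&gt; ???, <span class="number">80</span>)</span><br></pre></td></tr></table></figure></p>
<p>When there are many file blocks, tuning the partition count also avoids producing large numbers of small 10 KB/1 MB files; too many small files hurt I/O efficiency.</p>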
<h3 id="3-_查看Accumulator/Counter">3. Viewing Accumulators/Counters</h3><p> For jobs launched in yarn-cluster mode, if the cluster runs a web service for Spark logs, you can inspect the job both while it runs and after it finishes (without such a dedicated service, a temporary Spark UI is started and disappears when the job ends). Since Spark 1.1, named Accumulators can be viewed in the Spark application UI: under Stages, click a stage's Description and the counters appear in the Accumulators section (whether any show up depends on the stage; a collect stage obviously has none).<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="comment">// named accumulator (Double, so that fractional increments type-check)</span></span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">logLinesAcc</span> =</span> sc.accumulator(<span class="number">0.0</span>, <span class="string">"(daily) log input"</span>)</span><br><span class="line"><span class="comment">// usage</span></span><br><span class="line">logLinesAcc += <span class="number">1.5</span></span><br></pre></td></tr></table></figure></p>
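<h3 id="最后">Finally</h3><p> I will add notes as they come up and start a new post once this one gets long enough... The styling here really needs work; it is ugly!</p>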
-->
</div>
<footer>
<div class="alignright">
<a href="/2014/12/spark-tips-01/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2014/12/scala-tips-01/">Scala bits and pieces: JSON, regex, and command-line parsing</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2014-12-01T09:11:00.000Z">2014-12-01</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2014/12/scala-tips-01/#ds-thread"><span class="ds-thread-count" data-thread-key="2014/12/scala-tips-01/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
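<h1 id="1-_JSON处理包">1. JSON library</h1><p>As the <a href="https://github.com/json4s/json4s" target="_blank" rel="external">JSON4S</a> site itself says, there are already six JSON parsing libraries for Scala, so why use this one? It is fast and simple! JSON4S can parse strings into objects and collections. I will skip anything complicated: I only want to learn the simplest thing, extracting into a Map, and handle the rest myself.</p>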
<h2 id="1-1_依赖">1.1 Dependencies</h2><p> If you are using Spark, json4s is already among its dependencies and need not be added again. Otherwise add it in Maven:<br><figure class="highlight xml"><table><tr><td class="code"><pre><span class="line"><span class="tag">&lt;<span class="title">dependency</span>&gt;</span></span><br><span class="line"> <span class="tag">&lt;<span class="title">groupId</span>&gt;</span>org.json4s<span class="tag">&lt;/<span class="title">groupId</span>&gt;</span></span><br><span class="line"> <span class="tag">&lt;<span class="title">artifactId</span>&gt;</span>json4s-native_${scala.version}<span class="tag">&lt;/<span class="title">artifactId</span>&gt;</span></span><br><span class="line"> <span class="tag">&lt;<span class="title">version</span>&gt;</span>3.2.11<span class="tag">&lt;/<span class="title">version</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;/<span class="title">dependency</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;<span class="title">dependency</span>&gt;</span></span><br><span class="line"> <span class="tag">&lt;<span class="title">groupId</span>&gt;</span>org.json4s<span class="tag">&lt;/<span class="title">groupId</span>&gt;</span></span><br><span class="line"> <span class="tag">&lt;<span class="title">artifactId</span>&gt;</span>json4s-jackson_${scala.version}<span class="tag">&lt;/<span class="title">artifactId</span>&gt;</span></span><br><span class="line"> <span class="tag">&lt;<span class="title">version</span>&gt;</span>3.2.11<span class="tag">&lt;/<span class="title">version</span>&gt;</span></span><br><span class="line"><span class="tag">&lt;/<span class="title">dependency</span>&gt;</span></span><br></pre></td></tr></table></figure></p>
<h2 id="1-2_样例">1.2 Example</h2><p> My favourite example (also the only one I have actually used):<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> org.json4s._</span><br><span class="line"><span class="keyword">import</span> org.json4s.jackson.<span class="type">JsonMethods</span>._ <span class="comment">// only the parse method is used below</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">implicit</span> <span class="function"><span class="keyword">val</span> <span class="title">formats</span> =</span> <span class="type">DefaultFormats</span> <span class="comment">// without this line the compiler complains that no formats can be found and suggests bringing org.json4s.DefaultFormats into scope</span></span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">s</span> =</span> <span class="string">"""{"a":"","b":"mobile","c":"dior","d":"stable","e":"4.4.2"}"""</span></span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">deviceInfo</span> =</span> parse(s, <span class="literal">false</span>).extract[<span class="type">Map</span>[<span class="type">String</span>, <span class="type">String</span>]]</span><br><span class="line">println(deviceInfo)</span><br><span class="line"><span class="comment">// Map(e -&gt; 4.4.2, a -&gt; , b -&gt; mobile, c -&gt; dior, d -&gt; stable)</span></span><br></pre></td></tr></table></figure></p>
<!-- more >
The official docs also have an example that extracts into case classes; take what you need (I only need the one above...)
<figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> org.json4s._</span><br><span class="line"><span class="keyword">import</span> org.json4s.jackson.<span class="type">JsonMethods</span>._</span><br><span class="line"><span class="keyword">implicit</span> <span class="function"><span class="keyword">val</span> <span class="title">formats</span> =</span> <span class="type">DefaultFormats</span> <span class="comment">// Brings in default date formats etc.</span></span><br><span class="line"><span class="keyword">case</span> <span class="class"><span class="keyword">class</span> <span class="title">Child</span>(</span>name: <span class="type">String</span>, age: <span class="type">Int</span>, birthdate: <span class="type">Option</span>[java.util.<span class="type">Date</span>])</span><br><span class="line"><span class="keyword">case</span> <span class="class"><span class="keyword">class</span> <span class="title">Address</span>(</span>street: <span class="type">String</span>, city: <span class="type">String</span>)</span><br><span class="line"><span class="keyword">case</span> <span class="class"><span class="keyword">class</span> <span class="title">Person</span>(</span>name: <span class="type">String</span>, address: <span class="type">Address</span>, children: <span class="type">List</span>[<span class="type">Child</span>])</span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">json</span> =</span> parse(<span class="string">"""</span><br><span class="line">{ "name": "joe",</span><br><span class="line">"address": {</span><br><span class="line"> "street": "Bulevard",</span><br><span class="line"> "city": "Helsinki"</span><br><span class="line"> },</span><br><span class="line"> "children": [</span><br><span class="line"> {</span><br><span class="line"> "name": "Mary",</span><br><span class="line"> "age": 5,</span><br><span class="line"> "birthdate": 
"2004-09-04T18:06:22Z"</span><br><span class="line"> },</span><br><span class="line"> {</span><br><span class="line"> "name": "Mazy",</span><br><span class="line"> "age": 3</span><br><span class="line"> }</span><br><span class="line"> ]</span><br><span class="line"> }</span><br><span class="line"> """</span>)</span><br><span class="line">json.extract[<span class="type">Person</span>]</span><br><span class="line"><span class="comment">// res0: Person =Person(joe,Address(Bulevard,Helsinki),List(Child(Mary,5,Some(Sat Sep 04 18:06:22 EEST 2004)), Child(Mazy,3,None)))</span></span><br></pre></td></tr></table></figure>
<p> The docs contain much more; again, take what you need.</p>
<h1 id="2-_正则表达式">2. Regular expressions</h1><p> Regular expressions in Scala are rather lovable: they are simple, and simple is lovable... No extra package needs importing; straight to the example:<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">nonHttpPattern</span> =</span> <span class="string">"""^(?!http).*"""</span>.r <span class="comment">// triple quotes avoid escaping; (?!http) means "does not start with http", whereas [^(http)] would be a character class</span></span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">ipAccessPattern</span> =</span> <span class="string">"""^https{0,1}://\d+\.\d+\.\d+.*"""</span>.r</span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">downloadPattern</span> =</span> <span class="string">""".*\.(apk|exe|zip|rar)"""</span>.r <span class="comment">// mainly to filter out apk downloads</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">isUrlValid</span>(</span>url: <span class="type">String</span>): <span class="type">Int</span> = {</span><br><span class="line"> url <span class="keyword">match</span> {</span><br><span class="line"> <span class="keyword">case</span> nonHttpPattern(_*) =&gt; <span class="number">4</span></span><br><span class="line"> <span class="keyword">case</span> ipAccessPattern(_*) =&gt; <span class="number">5</span></span><br><span class="line"> <span class="keyword">case</span> downloadPattern(_*) =&gt; <span class="number">6</span></span><br><span class="line"> <span class="keyword">case</span> _ =&gt; <span class="number">0</span></span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">date</span> =</span> <span class="string">"""(\d\d\d\d)-(\d\d)-(\d\d)"""</span>.r</span><br><span class="line"><span class="string">"2004-01-20"</span> <span class="keyword">match</span> {</span><br><span class="line"> <span class="keyword">case</span> date(year, month, day) =&gt; s<span class="string">"$year was a good year for PLs."</span></span><br><span class="line">} <span class="comment">// case can bind capture groups, which is quite powerful</span></span><br></pre></td></tr></table></figure></p>
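<p>One pitfall worth a small sketch: a pattern written as <code>[^(http)]</code> is a negated character class (any single character other than <code>(</code>, <code>h</code>, <code>t</code>, <code>p</code>, <code>)</code>), not "anything except the word http". The snippet below compares it with a negative lookahead. The object name <code>RegexDemo</code>, the helper <code>fullMatch</code> and the sample URLs are made up for illustration:</p>
<figure class="highlight scala"><table><tr><td class="code"><pre>import scala.util.matching.Regex

object RegexDemo extends App {
  // Negated character class: rejects any FIRST CHARACTER in the set ( h t p )
  val charClass: Regex = """^[^(http)].*""".r
  // Negative lookahead: rejects strings that start with the literal "http"
  val lookahead: Regex = """^(?!http).*""".r

  // Whole-string match, the same semantics a `case pattern(_*)` clause uses.
  def fullMatch(p: Regex, s: String): Boolean = p.pattern.matcher(s).matches()

  println(fullMatch(charClass, "tcp://host"))  // false: 't' is in the class
  println(fullMatch(lookahead, "tcp://host"))  // true
  println(fullMatch(charClass, "http://host")) // false
  println(fullMatch(lookahead, "http://host")) // false
}
</pre></td></tr></table></figure>
<p>Both patterns reject <code>http://host</code>, but only the lookahead accepts a non-HTTP scheme that happens to begin with <code>h</code>, <code>t</code> or <code>p</code>.</p>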
<h1 id="3-_命令行解析">3. 命令行解析</h1><p> 首先,如果你已经熟悉一个java版的并且没有时间(不想)学,那直接用就好了。其次,如果时间足够可以去挨个挨个比较比较scala各种各样的命令行解析包。这里只介绍scopt,scopt确实是很简洁以及简单的,直接上例子了:<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="comment">//所有参数都需要有默认值,这样才能无参初始化一个实例</span></span><br><span class="line"><span class="keyword">case</span> <span class="class"><span class="keyword">class</span> <span class="title">Config</span>(</span>logBase: <span class="type">String</span> = <span class="string">"."</span>,</span><br><span class="line"> hist: <span class="type">String</span> = <span class="string">"."</span>,</span><br><span class="line"> hostFilter: <span class="type">String</span> = <span class="string">"."</span>,</span><br><span class="line"> maxRecordPerUser: <span class="type">Int</span> = <span class="number">200</span>,</span><br><span class="line"> minHostAccess: <span class="type">Int</span> = <span class="number">100</span>,</span><br><span class="line"> decay: <span class="type">Double</span> = <span class="number">0.95</span>f,</span><br><span class="line"> day: <span class="type">DateTime</span> = <span class="literal">null</span>, <span class="comment">//joda datetime</span></span><br><span class="line"> partitions: <span class="type">Int</span> = <span class="number">80</span>)</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">parser</span> =</span> <span class="keyword">new</span> scopt.<span class="type">OptionParser</span>[<span class="type">Config</span>](<span class="string">"run"</span>) {</span><br><span class="line"> head(<span class="string">"User Host Access Accumulator"</span>, <span class="string">""</span>)</span><br><span class="line"> <span class="comment">//每个少写的空格其实都相当于一个"."</span></span><br><span class="line"> opt[<span class="type">String</span>]('i', <span class="string">"input"</span>) required () valueName (<span 
class="string">"<file>"</span>) action { (x, c) ⇒</span><br><span class="line"> c.copy(logBase = x)</span><br><span class="line"> } text (<span class="string">"input - hdfs路径,浏览器行为基础日志"</span>)</span><br><span class="line"> <span class="comment">//'h'是对应-h,"history"对应的是--history,还是很好理解的</span></span><br><span class="line"> opt[<span class="type">String</span>]('h', <span class="string">"history"</span>) required () valueName (<span class="string">"<file>"</span>) action { (x, c) ⇒</span><br><span class="line"> c.copy(hist = x)</span><br><span class="line"> } text (<span class="string">"input - hdfs路径,浏览器访问累积日志"</span>)</span><br><span class="line"> opt[<span class="type">String</span>]('f', <span class="string">"host-filter"</span>) required () valueName (<span class="string">"<file>"</span>) action { (x, c) ⇒</span><br><span class="line"> c.copy(hostFilter = x)</span><br><span class="line"> } text (<span class="string">"output - hdfs路径,host过滤文件"</span>)</span><br><span class="line"> opt[<span class="type">String</span>]('d', <span class="string">"day"</span>) required () valueName (<span class="string">"<date>"</span>) action { (x, c) ⇒</span><br><span class="line"> c.copy(day = <span class="keyword">new</span> <span class="type">DateTime</span>(x)) <span class="comment">//不只是能copy,简单的处理逻辑也是可以有的!</span></span><br><span class="line"> } text (<span class="string">"date - yyyy-MM-dd日期"</span>)</span><br><span class="line"> opt[<span class="type">Int</span>](<span class="string">"max-record-per-user"</span>) optional () action {</span><br><span class="line"> (x, c) ⇒ c.copy(maxRecordPerUser = x)</span><br><span class="line"> } text (<span class="string">"每个用户记录访问频次的host最大数量"</span>)</span><br><span class="line"> opt[<span class="type">Int</span>](<span class="string">"min-host-access-limit"</span>) optional () action {</span><br><span class="line"> (x, c) ⇒ c.copy(minHostAccess = x)</span><br><span class="line"> } text (<span 
class="string">"全局host访问频次最低过滤条件"</span>)</span><br><span class="line"> opt[<span class="type">Int</span>](<span class="string">"decay"</span>) optional () action {</span><br><span class="line"> (x, c) ⇒ c.copy(decay = x)</span><br><span class="line"> } text (<span class="string">"访问频次衰减因子"</span>)</span><br><span class="line"> opt[<span class="type">Int</span>](<span class="string">"partitions"</span>) optional () action {</span><br><span class="line"> (x, c) ⇒ c.copy(partitions = x)</span><br><span class="line"> } text (<span class="string">"保存文件块的数量"</span>)</span><br><span class="line">}</span><br></pre></td></tr></table></figure></p>
<p>上面是解析部分,调用见下:<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">main</span>(</span>args: <span class="type">Array</span>[<span class="type">String</span>]): <span class="type">Unit</span> = {</span><br><span class="line"> parser.parse(args, <span class="type">Config</span>()) map { config ⇒</span><br><span class="line"> <span class="comment">//config已经获取到了</span></span><br><span class="line"> } getOrElse {</span><br><span class="line"> <span class="comment">//额外的错误处理逻辑,默认会把命令行帮助打印出来</span></span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure></p>
<p><a href="https://github.com/scopt/scopt">scopt</a>在github的官网上还有很复杂的例子,我第一次就是被它吓住了。scopt使用时的maven坐标<br><figure class="highlight xml"><table><tr><td class="code"><pre><span class="line"><span class="tag"><<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>com.github.scopt<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>scopt_2.10<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">version</span>></span>3.2.0<span class="tag"></<span class="title">version</span>></span></span><br><span class="line"><span class="tag"></<span class="title">dependency</span>></span></span><br></pre></td></tr></table></figure></p>
<p>Finally, here is the nicely formatted help text that scopt generates for us (the descriptions are the strings passed to <code>text</code> in the parser):<br><figure class="highlight haml"><table><tr><td class="code"><pre><span class="line">User Host Access Accumulator</span><br><span class="line">Usage: run [options]</span><br><span class="line"></span><br><span class="line">-<span class="ruby">i <file> | --input <file></span><br><span class="line"></span> input - hdfs路径,浏览器行为基础日志</span><br><span class="line">-<span class="ruby">h <file> | --history <file></span><br><span class="line"></span> input - hdfs路径,浏览器访问累积日志</span><br><span class="line">-<span class="ruby">f <file> | --host-filter <file></span><br><span class="line"></span> output - hdfs路径,host过滤文件</span><br><span class="line">-<span class="ruby">d <date> | --day <date></span><br><span class="line"></span> date - yyyy-MM-dd日期</span><br><span class="line">-<span class="ruby">-max-record-per-user <value></span><br><span class="line"></span> 每个用户记录访问频次的host最大数量</span><br><span class="line">-<span class="ruby">-min-host-access-limit <value></span><br><span class="line"></span> 全局host访问频次最低过滤条件</span><br><span class="line">-<span class="ruby">-decay <value></span><br><span class="line"></span> 访问频次衰减因子</span><br><span class="line">-<span class="ruby">-partitions <value></span><br><span class="line"></span> 保存文件块的数量</span><br></pre></td></tr></table></figure></p>
<p>If any arguments are wrong, the errors are printed first, followed by the help text. The error messages look like this:<br><figure class="highlight vbscript"><table><tr><td class="code"><pre><span class="line"><span class="keyword">Error</span>: Unknown <span class="keyword">option</span> -o</span><br><span class="line"><span class="keyword">Error</span>: Unknown argument <span class="comment">'ojjljlkl'</span></span><br><span class="line"><span class="keyword">Error</span>: Missing <span class="keyword">option</span> --host-<span class="built_in">filter</span></span><br><span class="line"><span class="keyword">Error</span>: Missing <span class="keyword">option</span> --<span class="built_in">day</span></span><br></pre></td></tr></table></figure></p>
-->
</div>
<footer>
<div class="alignright">
<a href="/2014/12/scala-tips-01/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2014/11/hexo-github-pages-3/">Building a Personal Blog with GitHub: Theming (3)</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2014-11-27T12:39:00.000Z">2014-11-27</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2014/11/hexo-github-pages-3/#ds-thread"><span class="ds-thread-count" data-thread-key="2014/11/hexo-github-pages-3/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
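<p>Theming and fine-tuning is really the main event, but for someone like me who knows no HTML/JS it comes down to just two steps:</p>
<ol>
<li>Use someone else's theme</li>
<li>Tune the parameters, adjusting each setting a little at a time<br>This part is fiddly, and most of the time goes into comparing results back and forth.</li>
</ol>
<h2 id="应用主题">Applying a theme</h2><p>Hexo enthusiasts have built all kinds of <a href="https://github.com/hexojs/hexo/wiki/Themes" target="_blank" rel="external">themes</a> and shared them. Many are really cool and worth a try. Installation is simple: to install <code>metro-light</code>, for example, run the following in the blog's root directory:<br><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/halfer53/metro-light.git themes/metro-light</span><br></pre></td></tr></table></figure></p>
<p>Then set theme to <code>metro-light</code> in the root <code>_config.yml</code>. Each shared theme's page explains how to install it and which options it offers, so remember to skim it!</p>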
<blockquote>
<p>Note: if you use GitHub to sync the whole blog, remember to delete the <strong>.git</strong> folder under <strong>themes/metro-light</strong>.</p>
</blockquote>
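<p>As a rough sketch of the note above, the cleanup can be wrapped in a small shell helper. The function name <code>strip_theme_git</code> is invented for this illustration (it is not a Hexo or git command), and the path in the usage comment assumes the theme was cloned into <code>themes/metro-light</code> as shown earlier.</p>

```sh
# Hypothetical helper (not part of Hexo or git): remove the nested .git
# directory of a cloned theme so that the outer blog repository tracks the
# theme files as ordinary files.
strip_theme_git() {
    theme_dir="$1"
    if [ ! -d "$theme_dir" ]; then
        echo "no such theme directory: $theme_dir"
        return 1
    fi
    # Only the theme's own repository metadata is removed; theme files stay.
    rm -rf "$theme_dir/.git"
}

# Usage, run from the blog's root directory (path assumed from the clone command):
#   strip_theme_git themes/metro-light
```

<p>Once the nested <code>.git</code> folder is gone, the outer repository tracks the theme's files like any other files in the blog.</p>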
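<h2 id="配置参数">Configuring parameters</h2><p>Configuration is simple: the authors have left placeholders for everything, so you just fill them in one by one. First, note that the root directory and the theme directory each have their own <code>_config.yml</code>, and settings have to be made in each of the two places.</p>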
<ul>
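<li>Configuring the root _config.yml<br>Changes to the root _config.yml do not show up immediately on the local site (<code>hexo s</code>); you need to run <code>hexo g</code> once. Changes to the theme's config file, by contrast, are visible after refreshing the page. The minimum set of things to configure is listed below. Adjust each item to your own situation, and preview locally if unsure.</li>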
</ul>
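<figure class="highlight"><table><tr><td class="code"><pre><span class="line"># Site</span><br><span class="line">title: Oh Captain, My Captain - Du00</span><br><span class="line">subtitle: "Qzone, Tianya, 163, Baidu, Sina: rather than trim my feet to fit their shoes, I built my own"</span><br><span class="line">description: Du00's blog</span><br><span class="line">author: Du00</span><br><span class="line">email: du00cs@gmail.com</span><br><span class="line">language: zh-CN</span><br><span class="line"></span><br><span class="line"># URL</span><br><span class="line">## If your site is put in a subdirectory, set url as 'http://yoursite.com/child' and root as '/child/'</span><br><span class="line">url: http://du00cs.github.com</span><br><span class="line"></span><br><span class="line"># Extensions</span><br><span class="line">theme: metro-light</span><br><span class="line"></span><br><span class="line"># Deployment</span><br><span class="line">deploy:</span><br><span class="line">  type: github</span><br><span class="line">  repo: https://github.com/du00cs/du00cs.github.io.git</span><br></pre></td></tr></table></figure>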
<ul>
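<li>Configuring the theme's _config.yml<br>The configurable items differ from theme to theme. The example below again uses <code>metro-light</code> and covers mainly the comment system and the share buttons.</li>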
</ul>
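<figure class="highlight"><table><tr><td class="code"><pre><span class="line"># duoshuo_short_name has to be applied for at Duoshuo; a wrong value simply does not work</span><br><span class="line">comment:</span><br><span class="line">  duoshuo: true</span><br><span class="line">  duoshuo_short_name: du00cs</span><br><span class="line">  ## to enable disqus, you need to fill in the disqus_shortname in config.yml</span><br><span class="line">  ## to enable duoshuo, you need duoshuo id and set duosuo to true</span><br><span class="line"></span><br><span class="line">#share plugins at the bottom of the article</span><br><span class="line">share:</span><br><span class="line">  enable: true</span><br><span class="line">  jiathis: true ## Jiathis is a sharing plugin aimed at users in China; you won't want to share to google/twitter...</span><br><span class="line">  twitter: false</span><br><span class="line">  google: false</span><br><span class="line"></span><br><span class="line">bottom_link:</span><br><span class="line">  github: du00cs ## just the username</span><br><span class="line">  weibo: du00cs ## the numeric Weibo ID or the username (not the nickname)</span><br><span class="line">  renren: ##e.g. 333333333 for http://www.renren.com/333333333</span><br><span class="line"></span><br><span class="line">#google analytics id, usable for site statistics; this also has to be applied for</span><br><span class="line">google_analytics: UA-56718947-1</span><br></pre></td></tr></table></figure>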
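<h2 id="点点滴滴,需要耐心">Bit by bit, with patience</h2><ul>
<li>hexo was written by a student from Taiwan, which is impressive</li>
<li>First-line indent: in English it hardly matters whether the first line is indented, but Chinese text looks bad without it. Typing two full-width spaces does the job (usually <code>shift+space</code> switches to full-width input)</li>
<li>Adding formula support: some guides online add script statements by hand, but installing a plugin is all it takes (especially for a beginner like me). Plugins are installed with npm:<figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">npm install hexo-renderer-mathjax --save</span><br></pre></td></tr></table></figure>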
</li>
</ul>
<p>Then declare the plugin in _config.yml (<strong>mind the spaces</strong>)<br><figure class="highlight"><table><tr><td class="code"><pre><span class="line">plugins:</span><br><span class="line">- hexo-renderer-mathjax</span><br></pre></td></tr></table></figure></p>
<ul>
<li><p>Formula preview in Atom: install markdown-preview-plus, and note that display formulas must be written as</p>
<figure class="highlight elixir"><table><tr><td class="code"><pre><span class="line"><span class="variable">$$</span></span><br><span class="line"> ax^<span class="number">2</span>+bx+c=<span class="number">0</span></span><br><span class="line"><span class="variable">$$</span></span><br></pre></td></tr></table></figure>
</li>
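<li><p>Adjusting the Markdown styles: a theme's styles may include things you dislike. If, say, you find block quotes centered and want to change that, edit <code>metro-light/source/css/_partial/article.styl</code>. Or, if there is a style you like, such as the alternating row colors of tables in Mou's GitHub2 theme, find a CSS-to-Stylus converter (npm install stylus) and paste the table portion of the generated file over the existing rules. Likewise, if the text is justified and you want it left-aligned, use the browser's "Inspect Element" feature to locate the text block and look at its CSS to find the rule to change.</p>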
</li>
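<li>Showing only an excerpt on the home page: with the stock template the home page shows every post in full. If the theme has no excerpt feature, you can manually add <code><!-- more --></code> to a post, and everything after that marker is left off the home page.</li>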
</ul>
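<p>Themes and plugins are there to be put to work for you. There are plenty of articles about them, though they largely say the same things. If you don't know JS and the like, the range of what you can adjust is quite limited, but there is no harm in trying. Look at other people's themes, borrow some code and tweak it, and you may be pleasantly surprised. Finally, the whole point of building a blog with Hexo is to write well; content is still what matters most.</p>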
</div>
<footer>
<div class="alignright">
<a href="/2014/11/hexo-github-pages-3/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2014/11/hexo-github-pages-2/">Building a Personal Blog with GitHub: Sharpening the Tools (2)</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2014-11-22T05:31:00.000Z">2014-11-22</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2014/11/hexo-github-pages-2/#ds-thread"><span class="ds-thread-count" data-thread-key="2014/11/hexo-github-pages-2/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
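<p>To do good work, one must first sharpen one's tools. So the saying goes, but with something new it usually happens the other way around: you first find the task worthwhile, then partway through you wonder whether you are doing it the dumb way, and only then discover that certain tools would make everything go more smoothly. People who have been through this usually tell the story in reverse when they explain how things work.<br>Setting up the blogging tools is mostly a matter of fine-tuning them. Specifically:</p>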
<ol>
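<li>Learn the basic hexo commands. They are self-explanatory: create a post, generate pages, preview, and publish;</li>
<li>(Optional, recommended) Install Atom and configure Markdown Writer</li>
<li>(Optional, recommended) Track the blog's source files with GitHub</li>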
</ol>
<p><em>Setting all this up on Linux/Mac OSX is almost too easy; only Windows brings all sorts of trouble.</em></p>
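<h2 id="1-_hexo相关命令">1. hexo commands</h2><p>Everything below is done on the command line.</p>
<h3 id="1-1_创建文章">1.1 Creating a post</h3><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">hexo new [layout] title</span><br></pre></td></tr></table></figure>
<p>If layout is omitted it defaults to post, and a file named title.md is created in <code>source/_posts</code>. Writing a blog post is then just a matter of editing Markdown text, which is why hexo/Jekyll claim to let users focus on producing content.<br>The post layout is fine for articles that are about to be published. But if you have revised one article that needs republishing while another is only half written, you need to tell the two apart. In that case choose the draft layout: the file goes into <code>source/_drafts</code>, a directory that is ignored when pages are generated. Once the article is finished, either drag the file into _posts or run<br><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">hexo publish [layout] <filename></span><br></pre></td></tr></table></figure></p>
<p>and it is published.</p>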
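<p>The manual alternative mentioned above, moving a finished draft into <code>_posts</code> by hand, can be sketched in shell as follows. The function name <code>publish_draft_manually</code> is made up for this example, and the <code>source/_drafts</code> and <code>source/_posts</code> directory names are assumed to be Hexo's defaults; <code>hexo publish</code> remains the built-in way to do this.</p>

```sh
# Hypothetical helper mirroring the manual "drag the file into _posts" step.
# Directory names are assumed to follow Hexo's default layout.
publish_draft_manually() {
    blog_root="$1"   # root directory of the blog
    name="$2"        # draft name without the .md extension
    src="$blog_root/source/_drafts/$name.md"
    dst="$blog_root/source/_posts/$name.md"
    if [ ! -f "$src" ]; then
        echo "draft not found: $src"
        return 1
    fi
    mkdir -p "$blog_root/source/_posts"
    mv "$src" "$dst"
}

# Usage (paths are illustrative):
#   publish_draft_manually /path/to/blog my-half-written-post
```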
<!-- more >
### 1.2 Generating pages
<figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">hexo clean <span class="comment"># if the pages look odd, this command removes the generated pages</span></span><br><span class="line">hexo generate <span class="comment"># generate the pages; the generated files end up under .deploy</span></span><br><span class="line">hexo g <span class="comment"># g is short for generate; in practice you will rarely type the long form above</span></span><br></pre></td></tr></table></figure>
<h3 id="1-3_本地预览">1.3 Local preview</h3><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">hexo server <span class="comment"># start the local preview, reachable at localhost:4000</span></span><br><span class="line">hexo s <span class="comment"># short for server</span></span><br></pre></td></tr></table></figure>
<h3 id="1-4_文章发布">1.4 Publishing posts</h3><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">hexo deploy <span class="comment"># deploy the site; this actually git-pushes the generated pages to the GitHub repository</span></span><br><span class="line">hexo d <span class="comment"># short form</span></span><br></pre></td></tr></table></figure>
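<p>With these commands you are fully equipped to start, but there are always other tools that make writing more pleasant and comfortable. Such tools are conveniences handed to us by those who came before, so feel free to use them. As the old saying goes, the gentleman is not born different from others; he is simply good at making use of things.</p>
<h2 id="2-_Atom编辑器与Markdown_Writer">2. The Atom editor and Markdown Writer</h2><p>Choosing a good editor with Markdown support is well worth it. You could of course torture yourself with Notepad/gedit, but I doubt anyone objects to a little convenience. In my view a good Markdown editor should offer the following:</p>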
<ol>
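<li>Live or near-live preview of the post (with Notepad you would have to start hexo s and refresh the browser to see the result);</li>
<li>Syntax support at least on par with GitHub, for example ``` code blocks that accept a language name; otherwise the highlighting is just for show;</li>
<li>Mathjax support (can be skipped if you don't write formulas)</li>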
</ol>
<p><a href="2014/11/22/利用GitHub搭建个人博客-工具准备/index.html">Installing Atom</a> was covered earlier, using Chocolatey. To install the <a href="https://github.com/zhuochun/md-writer">Markdown Writer</a> plugin, press <code>ctrl+,</code> to open <code>settings</code>, search for <code>markdown writer</code> under <code>packages</code>, and choose install. At this point Atom reports some errors (none on Mac, and presumably none on Linux). If you follow its prompts and install Visual Studio, Python and so on, life becomes unbearable: Visual Studio Express alone is over 6 GB. Fortunately it need not be that painful. Since the message says node-gyp is missing, just install it with npm.<br><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">npm install node-gyp</span><br></pre></td></tr></table></figure></p>
<p>Then click install again and it goes through. The plugin provides many shortcuts for Markdown writing, such as <code>ctrl+B</code> for bold and <code>ctrl+I</code> for italics, and inserting an image pops up a dialog, which is really convenient. Worth mentioning: Atom puts every command into a searchable box, so you can call up a feature by typing a few words instead of memorizing which shortcut does what. To insert an image, for example, press <code>ctrl+shift+p</code> to open the box, type <code>insert image</code>, then press Enter or pick an entry to bring up the insert-image dialog.<br><img src="http://du00.qiniudn.com/2014/11/命令输入面板-insert_im.png" alt="Command palette: typing insert im(age)"><br>Those are the Markdown conveniences. Markdown Writer can also be linked to a hexo/Jekyll working directory. To set up the link, first use a plugin to generate the three json files it needs:<br><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line"><span class="comment"># go to the blog directory</span></span><br><span class="line">e:</span><br><span class="line"><span class="built_in">cd</span> e:/blog</span><br><span class="line">npm install --save hexo-generator-atom-markdown-writer-meta</span><br><span class="line">hexo g <span class="comment"># the generated files now include three extra json files</span></span><br><span class="line">hexo d <span class="comment"># deploy</span></span><br></pre></td></tr></table></figure></p>
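<p>The link to hexo is configured under <strong>Settings->Filter Packages->type Markdown Writer</strong>; in the Settings pane on the right, enter something like the following:<br><img src="http://du00.qiniudn.com/2014/11/markdown_writer的hexo设置.png" alt="Screenshot: connecting Markdown Writer to the hexo working directory"><br>This actually modifies the file <code>~/.atom/packages\markdown-writer\lib\config.coffee</code>, which you can open and look at.<br>Once this is configured you can create posts without the hexo command: type <code>new post</code> or <code>new draft</code> in the all-purpose <code>ctrl+shift+p</code> box to generate a new post. The feature still has some problems for now, though: filename generation is unreliable, so the filename has to be fixed by hand.</p>
<h2 id="3-_GitHub管理站点生成代码/源文件">3. Managing the site's generator code and source files with GitHub</h2><p>Put simply, hand the blog directory over to git, leaving out what does not need to be committed. Examples are the .deploy directory, whose contents can be generated by <code>hexo g</code>, and the node_modules directory, whose contents come from <code>npm install</code>. Check that the blog directory has a .gitignore file and that it contains the following entries:</p>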
<blockquote>
<p>.DS_Store<br>Thumbs.db<br>db.json<br>debug.log<br>node_modules/<br>public/<br>.deploy/</p>
</blockquote>
<p>A few things to watch out for:</p>
<ol>
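<li>If themes contains a theme pulled from git, delete the .git folder inside the theme directory, otherwise it will not sync (if unsure, check whether the theme shows up in the GitHub repository);</li>
<li>A blog directory freshly pulled from GitHub needs one run of <code>npm install</code> to install the required modules, otherwise the pages cannot be generated;</li>
<li>Remember to push (git push), or nobody can help you</li>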
</ol>
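<p>As a small sketch of the .gitignore check described above, the following shell function lists which of the expected entries are missing from a given file. The function name <code>missing_gitignore_entries</code> is invented for this example; the entry list is copied from the post.</p>

```sh
# Hypothetical helper: print every expected entry that is missing from the
# given .gitignore file. The entry list is the one quoted in the post.
missing_gitignore_entries() {
    gitignore="$1"
    for entry in .DS_Store Thumbs.db db.json debug.log node_modules/ public/ .deploy/; do
        # -x matches whole lines only, -F treats the entry as a fixed string.
        if ! grep -qxF -- "$entry" "$gitignore" 2>/dev/null; then
            echo "$entry"
        fi
    done
}

# Usage, run from the blog directory:
#   missing_gitignore_entries .gitignore
```

<p>An empty result means all of the listed entries are present; if the file does not exist, every entry is reported as missing.</p>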
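<p>Why not use a cloud drive or sync folder for this? Because the content that actually needs syncing is tiny (if you change two lines, only those two lines need to be synced), and the same directory holds many files that need no syncing at all, especially everything under .deploy: run <code>hexo g</code> once and you will see the cloud drive's sync icon start spinning, which is unnecessary. And of course, using GitHub feels more programmer-like and more self-indulgent.</p>
<p>The next step is all kinds of fine-tuning to improve the look and feel. In practice that is the most important part, because presentation matters most; everything above just makes the work more enjoyable.</p>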
-->
</div>
<footer>
<div class="alignright">
<a href="/2014/11/hexo-github-pages-2/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2014/11/hexo-github-pages-1/">Building a Personal Blog with GitHub: Preparing the Tools (1)</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2014-11-21T19:34:00.000Z">2014-11-22</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2014/11/hexo-github-pages-1/#ds-thread"><span class="ds-thread-count" data-thread-key="2014/11/hexo-github-pages-1/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
<blockquote>
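<p>For the past two weeks I have been fascinated by <code>Github Pages</code>: it can actually be turned into a DIY blog of your own! After two weeks of exploring I finally worked out a reasonably good way to install and configure the environment. I am recording it here as a memo.</p>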
</blockquote>
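<h1 id="1-_什么是Github_Pages?">1. What is <strong>Github Pages</strong>?</h1><p>Github Pages is a home page that Github lets users define themselves, with a very high degree of customization (static pages only). How people came to use it for personal blogs I do not know; I just find it a pleasure to use. Take a look at the screenshot below: doesn't it feel much fresher once you escape the annoying frames of Sina, Baidu and the like? <img src="http://du00.qiniudn.com/2014/11/博客首页截图.png" alt="Screenshot of the blog home page"></p>
<h1 id="2-_博客搭建工具一览">2. An overview of the blogging tools</h1><p>To be honest, apart from the posts, which I wrote myself (this requires Markdown; trust me, you will love it once you learn it), everything in this blog was borrowed (copied) from others. For this page, what I actually wrote is the plain text shown below, while what you see is the rendered page. It really does let the user focus on producing content rather than on all sorts of special formatting.<br><img src="http://du00.qiniudn.com/2014/11/markdown原始文件.png" alt="The original Markdown file"><br>If the people we may (for now) call "geeks" had not developed so very many tools, it is hard to imagine how difficult building all this by hand would be, especially for someone like me who knows nothing about <code>html/js/css</code>. The tools involved are:</p>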
<ol>
<li>node.js</li>
<li>hexo</li>
<li>git/github</li>
</ol>
<!-- more >
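### 2.1 Installing chocolatey
[chocolatey](https://chocolatey.org) is a package manager for Windows in the style of apt/yum/brew. With it, setting up a programming environment on Windows is no longer so complicated. Paste the following statement into the command prompt (running the quoted part in Power Shell fails on my machine):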
<figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">@powershell -NoProfile -ExecutionPolicy unrestricted -Command <span class="string">"iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))"</span> && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin</span><br></pre></td></tr></table></figure>
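<p>and the installation is done. If, like me, you prefer not to install things on the C drive, proceed as follows:<br>By default Chocolatey is installed under <code>C:\ProgramData\</code>. You can move that folder elsewhere and then, under <strong>System -> Advanced system settings -> Environment Variables</strong>, change <code>ChocolateyInstall</code> and <code>PATH</code> to the new location.</p>
<h3 id="2-2_安装git">2.2 Installing git</h3><p>Publishing posts to GitHub Pages relies on Git, a tool programmers will surely know well.<br><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">choco install git</span><br></pre></td></tr></table></figure></p>
<p>Argh, why did git get installed on the C drive? Nothing to be done: fully automatic installers do as they please. That does not make moving Chocolatey pointless, though; other, larger packages will still be installed there.</p>
<h3 id="2-3_安装node-js">2.3 Installing node.js</h3><p>The tool that builds the blog and generates the pages is written in it, and plugins and dependencies are also handled automatically through node.<br><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">choco install nodejs</span><br><span class="line">choco install npm</span><br></pre></td></tr></table></figure></p>
<h3 id="2-4_安装hexo">2.4 Installing hexo</h3><p><a href="http://hexo.io">hexo</a> is a tool that automatically generates web pages from plain Markdown text. Compared with the similar tool <code>Jekyll</code> it is far easier to pick up (I never managed to make sense of Jekyll and nearly gave up on GitHub Pages; luckily I found hexo).<br><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">npm install -g hexo</span><br></pre></td></tr></table></figure></p>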
<p>Note that typing hexo directly on the command line results in a command-not-found error (the tooling is still rough), so the directory containing hexo has to be added to PATH by hand. Open <strong>Environment Variables</strong> and add a path like <code>E:\Programming\chocolatey\lib\nodejs.commandline.0.10.33\tools;</code> to <code>PATH</code>. Remember to open a new command prompt so that the change takes effect. Then initialize the structure of a hexo-managed blog with the following commands:<br><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">mkdir blog</span><br><span class="line"><span class="built_in">cd</span> blog</span><br><span class="line">hexo init</span><br><span class="line"><span class="comment"># hexo will now remind you that npm install may be needed to finish the initialization</span></span><br><span class="line">npm install</span><br><span class="line"><span class="comment"># the rest is optional; afterwards the site can be viewed in its out-of-the-box form</span></span><br><span class="line">hexo g <span class="comment"># same as hexo generate: builds the static pages</span></span><br><span class="line">hexo s <span class="comment"># same as hexo server: starts a local site; open localhost:4000 in a browser to view the pages, stop it with `ctrl + c`</span></span><br></pre></td></tr></table></figure></p>
<p><img src="http://du00.qiniudn.com/2014/11/hexo本地截图.png" alt="Figure: the initial page"></p>
<h3 id="2-5(可选)安装Atom">2.5 (Optional) Installing <a href="https://atom.io/">Atom</a></h3><p>Atom is an editor with Markdown support, similar to SublimeText but with a more powerful plugin system. A Chinese developer wrote <code>Atom Markdown Writer</code> for it, which is very handy for editing and managing static sites generated by hexo/Jekyll.<br><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line">choco install atom</span><br></pre></td></tr></table></figure></p>
<p>Configuring Atom will be covered later.</p>
<p>That concludes the basic tools that need to be installed.</p>
<h1 id="2-_配置自己的GitHub_Pages">3. Configuring your own GitHub Pages</h1><p>This part links the local tools to the remote site so that pages can be pushed to it. With GitHub Pages, pushing the locally generated static site to one specific repository is all it takes to build or update the site.</p>
<h3 id="2-1_创建GitHub_Pages">3.1 Creating GitHub Pages</h3><p>Create a repository on GitHub whose name must follow the pattern <code>du00cs.github.io</code>. On the next page choose <code>Settings</code>, find <code>Automatic page generator</code>, click through the steps to <code>Publish page</code>, wait ten minutes or more as instructed, and then visit <code>du00cs.github.io</code> to see the initial page.</p>
<h3 id="2-2_将hexo与GitHub远程库关联">3.2 Linking hexo to the GitHub remote repository</h3><ol>
<li>To make pushing convenient, add your <code>SSH-KEY</code> to GitHub; the help documentation for <a href="https://github.com/settings/ssh">SSH Keys</a> describes the procedure in detail.</li>
<li>Find <code>_config.yml</code> in the blog directory and edit the <code>deploy</code> section along these lines<figure class="highlight"><table><tr><td class="code"><pre><span class="line">deploy:</span><br><span class="line">  type: github</span><br><span class="line">  repo: https://github.com/du00cs/du00cs.github.io.git</span><br></pre></td></tr></table></figure>
</li>
</ol>
<h3 id="2-3_发布博客">3.3 Publishing the blog</h3><figure class="highlight sh"><table><tr><td class="code"><pre><span class="line"><span class="comment"># hexo clean # clear the generated files first if necessary</span></span><br><span class="line">hexo g <span class="comment"># generate the site content</span></span><br><span class="line">hexo s <span class="comment"># you may want to preview locally</span></span><br><span class="line">hexo d <span class="comment"># publish; the update shows up on github shortly afterwards, usually without much delay</span></span><br></pre></td></tr></table></figure>
<p>That completes the tool setup. What remains is a producer of content and a designer to adjust the styles and theme.</p>
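<p>The publish sequence above can be bundled into one shell function, sketched below. Both the function name <code>publish_blog</code> and the <code>HEXO_BIN</code> variable are assumptions of this example rather than anything Hexo defines; the variable only exists so that the hexo executable can be swapped out, for instance for a dry run.</p>

```sh
# Sketch of the publish sequence as a single function.
# HEXO_BIN is an assumption of this example (not a Hexo setting): it names
# the executable to call and defaults to hexo.
HEXO_BIN="${HEXO_BIN:-hexo}"

publish_blog() {
    "$HEXO_BIN" clean &&      # optional: drop previously generated files
    "$HEXO_BIN" generate &&   # same as: hexo g
    "$HEXO_BIN" deploy        # same as: hexo d
}

# Usage, run from the blog directory:
#   publish_blog
# Dry run that only prints the three subcommands:
#   HEXO_BIN=echo; publish_blog
```

<p>Chaining the steps with <code>&&</code> stops the sequence at the first failing step, so a failed generate never reaches the deploy.</p>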
-->
</div>
<footer>
<div class="alignright">
<a href="/2014/11/hexo-github-pages-1/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2014/11/thrift-serialization-des/">Comparing Thrift Serialization/Deserialization Methods</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2014-11-20T12:00:52.000Z">2014-11-20</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2014/11/thrift-serialization-des/#ds-thread"><span class="ds-thread-count" data-thread-key="2014/11/thrift-serialization-des/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
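<p>I remember that when I first joined the company, Thrift serialization still used the JSON protocol; looking back, that was far too inefficient. The conclusions come first.</p>
<h1 id="结论">Conclusions</h1><p>Thrift offers (at least) three serialization protocols, Json, Binary and Compact, and the performance gap among them is sizable. Json is usually not chosen for its efficiency, so the comparison below is between the two binary protocols, Binary and Compact:</p>
<ol>
<li>Serialization: compared with binary, compact output is 19.3% smaller and takes 20.4% less time</li>
<li>Deserialization: compared with binary, compact takes 3.3% more time</li>
</ol>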
<table>
<thead>
<tr>
<th></th>
<th>Method</th>
<th>Length (bytes)</th>
<th>Time (s, 100,000 runs)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Serialization</td>
<td>binary</td>
<td>26039</td>
<td>16.433</td>
</tr>
<tr>
<td></td>
<td>compact</td>
<td>21020</td>
<td>13.085</td>
</tr>
<tr>
<td></td>
<td>json</td>
<td>25096</td>
<td>56.137</td>
</tr>
<tr>
<td>Deserialization</td>
<td>binary</td>
<td></td>
<td>8.963</td>
</tr>
<tr>
<td></td>
<td>compact</td>
<td></td>
<td>9.257</td>
</tr>
<tr>
<td></td>
<td>json</td>
<td></td>
<td>61.511</td>
</tr>
</tbody>
</table>
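<p>The percentages quoted in the conclusions can be re-derived from the table above with a few lines of shell. This is only an arithmetic check and not part of the original benchmark; the helper name <code>pct_change</code> is made up for the example.</p>

```sh
# Relative change from a baseline to a new value, in percent with one decimal.
# A negative result means the new value is smaller than the baseline.
pct_change() {
    LC_ALL=C awk -v b="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (n - b) / b * 100 }'
}

pct_change 26039 21020    # serialized length, binary to compact:  -19.3
pct_change 16.433 13.085  # serialization time, binary to compact: -20.4
pct_change 8.963 9.257    # deserialization time, binary to compact: 3.3
```

<p>A negative value means compact is smaller or faster than binary; a positive one means it is larger or slower.</p>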
<h1 id="测试用例">Test case</h1><ol>
<li>Thrift definition<figure class="highlight thrift"><table><tr><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">STestObject</span></span>{</span><br><span class="line"> <span class="number">1</span>: <span class="built_in">i64</span> userId; </span><br><span class="line"> <span class="number">2</span>: <span class="built_in">i64</span> timestamp; </span><br><span class="line"> <span class="number">3</span>: <span class="stl_container">list<<span class="keyword">string</span>></span> apps; </span><br><span class="line"> <span class="number">4</span>: <span class="stl_container">list<<span class="keyword">i32</span>></span> pos; </span><br><span class="line">}</span><br></pre></td></tr></table></figure>
</li>
</ol>
<!-- more >
How the test data is populated:
<figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">d</span> =</span> <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">apps</span> =</span> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">1000</span>) <span class="keyword">yield</span> <span class="string">"com.xiaomi.channel"</span></span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">pos</span> =</span> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">1000</span>) <span class="keyword">yield</span> <span class="keyword">new</span> <span class="type">Integer</span>(<span class="number">100</span>)</span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">o</span> =</span> <span class="keyword">new</span> <span class="type">STestObject</span>(<span class="number">123123</span>, d.getTime(), apps.toList.asJava, pos.toList.asJava)</span><br></pre></td></tr></table></figure>
<p>Finally, serialization and deserialization are each run 100,000 times.</p>
<p>The actual code is in Scala, but it still illustrates the point (fully).<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> com.xiaomi.data.o2o.model.<span class="type">STestObject</span></span><br><span class="line"><span class="keyword">import</span> scala.collection.<span class="type">JavaConverters</span>._</span><br><span class="line"><span class="keyword">import</span> scala.collection.<span class="type">JavaConversions</span>._</span><br><span class="line"><span class="keyword">import</span> org.apache.thrift.<span class="type">TSerializer</span></span><br><span class="line"><span class="keyword">import</span> org.apache.thrift.protocol.<span class="type">TBinaryProtocol</span></span><br><span class="line"><span class="keyword">import</span> org.apache.thrift.protocol.<span class="type">TCompactProtocol</span></span><br><span class="line"><span class="keyword">import</span> org.apache.thrift.<span class="type">TDeserializer</span></span><br><span class="line"><span class="keyword">import</span> java.util.<span class="type">Date</span></span><br><span class="line"><span class="keyword">import</span> org.apache.thrift.protocol.<span class="type">TJSONProtocol</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/**</span><br><span class="line"> * @author du00</span><br><span class="line"> *</span><br><span class="line"> */</span></span><br><span class="line"><span class="class"><span class="keyword">object</span> <span class="title">ThriftSerializationTest</span> {</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">main</span>(</span>args: <span class="type">Array</span>[<span class="type">String</span>]): <span class="type">Unit</span> = {</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">d</span> =</span> <span class="keyword">new</span> <span 
class="type">Date</span></span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">apps</span> =</span> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">1000</span>) <span class="keyword">yield</span> <span class="string">"com.xiaomi.channel"</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">pos</span> =</span> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">1000</span>) <span class="keyword">yield</span> <span class="keyword">new</span> <span class="type">Integer</span>(<span class="number">100</span>)</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">o</span> =</span> <span class="keyword">new</span> <span class="type">STestObject</span>(<span class="number">123123</span>, d.getTime(), apps.toList.asJava, pos.toList.asJava)</span><br><span class="line"></span><br><span class="line"> <span class="comment">//binary</span></span><br><span class="line"> <span class="keyword">var</span> se = <span class="keyword">new</span> <span class="type">TSerializer</span>(<span class="keyword">new</span> <span class="type">TBinaryProtocol</span>.<span class="type">Factory</span>())</span><br><span class="line"> <span class="keyword">var</span> start = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">100000</span>) se.serialize(o)</span><br><span class="line"> <span class="keyword">var</span> stop = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">binary</span> =</span> se.serialize(o)</span><br><span class="line"> 
println(<span class="string">"binary"</span>, binary.length, (stop.getTime() - start.getTime()) / <span class="number">1000.0</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment">//compact</span></span><br><span class="line"> se = <span class="keyword">new</span> <span class="type">TSerializer</span>(<span class="keyword">new</span> <span class="type">TCompactProtocol</span>.<span class="type">Factory</span>())</span><br><span class="line"> start = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">100000</span>) se.serialize(o)</span><br><span class="line"> stop = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">compact</span> =</span> se.serialize(o)</span><br><span class="line"> println(<span class="string">"compact"</span>, compact.length, (stop.getTime() - start.getTime()) / <span class="number">1000.0</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment">//json</span></span><br><span class="line"> se = <span class="keyword">new</span> <span class="type">TSerializer</span>(<span class="keyword">new</span> <span class="type">TJSONProtocol</span>.<span class="type">Factory</span>())</span><br><span class="line"> start = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">100000</span>) se.serialize(o)</span><br><span class="line"> stop = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">json</span> =</span> se.serialize(o)</span><br><span class="line"> println(<span 
class="string">"json"</span>, json.length, (stop.getTime() - start.getTime()) / <span class="number">1000.0</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment">//binary</span></span><br><span class="line"> <span class="keyword">var</span> de = <span class="keyword">new</span> <span class="type">TDeserializer</span>(<span class="keyword">new</span> <span class="type">TBinaryProtocol</span>.<span class="type">Factory</span>())</span><br><span class="line"> start = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">100000</span>) de.deserialize(o, binary)</span><br><span class="line"> stop = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> println(<span class="string">"binary"</span>, (stop.getTime() - start.getTime()) / <span class="number">1000.0</span>, o)</span><br><span class="line"></span><br><span class="line"> <span class="comment">//compact</span></span><br><span class="line"> de = <span class="keyword">new</span> <span class="type">TDeserializer</span>(<span class="keyword">new</span> <span class="type">TCompactProtocol</span>.<span class="type">Factory</span>())</span><br><span class="line"> start = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">100000</span>) de.deserialize(o, compact)</span><br><span class="line"> stop = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> println(<span class="string">"compact"</span>, (stop.getTime() - start.getTime()) / <span class="number">1000.0</span>, o)</span><br><span class="line"></span><br><span class="line"> <span class="comment">//json</span></span><br><span class="line"> de = <span class="keyword">new</span> <span 
class="type">TDeserializer</span>(<span class="keyword">new</span> <span class="type">TJSONProtocol</span>.<span class="type">Factory</span>())</span><br><span class="line"> start = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> <span class="keyword">for</span> (i <- <span class="number">1</span> to <span class="number">100000</span>) de.deserialize(o, json)</span><br><span class="line"> stop = <span class="keyword">new</span> <span class="type">Date</span></span><br><span class="line"> println(<span class="string">"json"</span>, (stop.getTime() - start.getTime()) / <span class="number">1000.0</span>, o)</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure></p>
-->
</div>
<footer>
<div class="alignright">
<a href="/2014/11/thrift-serialization-des/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2014/11/coursera-func-scala/">Functional Programming Principles in Scala - Week 1</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2014-11-11T09:26:23.000Z">2014-11-11</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2014/11/coursera-func-scala/#ds-thread"><span class="ds-thread-count" data-thread-key="2014/11/coursera-func-scala/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
<p><em>An old post. It happened to be written in Markdown, so I pasted it here as is.</em></p>
<blockquote>
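<p>I had assumed this was just a course about a programming language, and I hesitated over whether to take it. My reluctance to learn new languages is an after-effect of studying COBOL, the "language of computer archaeology", as an undergraduate. Only after finishing the first week's lectures and the assignment did I realize how wrong I was: the instructor is not just teaching a language, he is also selling a way of thinking about programming, namely functional programming.</p>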
</blockquote>
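<p> Ever since the MOOC wave swept the world, learning (getting started in particular) has become much easier. It is hard to believe that you can follow excellent teachers through cutting-edge courses without paying a cent. I still remember the Machine Learning videos I downloaded years ago: no subtitles (I could just about follow), dim footage, lecture notes you had to hunt for on the quiet, and painfully hard-to-read blackboard writing... I have to admit this is a good era: if you are willing to learn, the opportunities are there. On the forums I have met quite a few students like me who keep up with a course or two after work, and the assignments are sometimes far from easy. So here's to the effort! What is it all good for? Maybe nothing, but at least it keeps you in a learning frame of mind.</p>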
<!-- more >>
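## Functional Programming vs. Imperative Programming
This course teaches functional programming; its counterpart, and the style most of us know better, is imperative programming. Both get the same work done, but the ideas behind them differ considerably. **Imperative programming**, as in C or Java, is essentially a stream of statements that mirrors how the machine executes and maps easily onto machine instructions. Functional programming abstracts the problem further: In a restricted sense, functional programming (FP) means programming without **mutable variables**, **assignments**, **loops**, and other imperative control structures. In a wider sense, functional programming means focusing on the functions.
The following passage, quoted from a [blog post](http://coolshell.cn/articles/10822.html), explains what functional programming is: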
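As a rough illustration of this contrast (not taken from the course material), the sketch below sums a list in both styles. The object and method names are made up for this example.

```scala
object SumStyles {
  // Imperative style: a mutable accumulator is updated inside a loop.
  def sumImperative(xs: List[Int]): Int = {
    var total = 0
    var rest = xs
    while (rest.nonEmpty) {
      total += rest.head
      rest = rest.tail
    }
    total
  }

  // Functional style: no var, no loop; the result is described recursively.
  def sumRecursive(xs: List[Int]): Int = xs match {
    case Nil          => 0
    case head :: tail => head + sumRecursive(tail)
  }

  // Functional style using a higher-order function from the standard library.
  def sumFold(xs: List[Int]): Int = xs.foldLeft(0)(_ + _)

  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3, 4)
    println(sumImperative(xs))
    println(sumRecursive(xs))
    println(sumFold(xs))
  }
}
```

All three methods return the same value for the same list; only the way the computation is expressed differs.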
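* Three main features of functional programming:
    * immutable data: as in Clojure, variables are immutable by default; to change one you copy it and modify the copy. This removes many bugs, because program state is hard to maintain and even harder to maintain under concurrency. (Imagine a program with complex state: whoever changes the code later can easily introduce bugs, and in parallel code the problem gets worse.)
    * first class functions: functions can be used like variables. They can be created, modified, passed around and returned like variables, or nested inside other functions. This is a bit like JavaScript's Prototype (see object-oriented programming in JavaScript).
    * tail recursion optimization: deep recursion is harmful because the stack cannot take it and performance drops sharply. Tail recursion optimization reuses the stack frame on every recursive call, which improves performance. It needs support from the language or the compiler; Python does not support it.
* Some functional programming techniques
    * map & reduce: the most common functional technique is applying map and reduce to a collection. The code is easier to read than in procedural languages (which need for/while loops and shuffle data back and forth among variables). It resembles functions such as `for_each`, `find_if` and `count_if` in the C++ STL.
    * pipeline: turn functions into individual actions, put a group of actions into an array or list, and pass the data to this action list. The data flows through the functions in order, like a pipeline, and yields the result we want.
    * recursing: the biggest benefit of recursion is simpler code; it can describe a complex problem with very little code. Note: the essence of recursion is describing the problem, and that is also the essence of functional programming.
    * currying: split a function's multiple parameters across multiple functions and wrap them in layers, each layer returning a function that accepts the next parameter. This simplifies functions with many parameters. In C++ it resembles `bind1st` or `bind2nd` in the STL.
    * higher order function: a function that takes a function as a parameter, wraps it, and returns the wrapped function. In practice functions get passed in and out everywhere, much as objects fly around in object-oriented code.
* Some benefits of the functional style
    * parallelization: in a parallel environment, threads need no synchronization or mutual exclusion.
    * lazy evaluation: this needs compiler support. An expression is not evaluated as soon as it is bound to a variable, but only when its value is used. A statement such as x:=expression; (assigning the result of an expression to a variable) apparently calls for the expression to be computed and the result stored in x, yet what is actually in x does not matter until a later expression references x and needs its value. Evaluation of that later expression can itself be deferred, and in the end this rapidly growing dependency tree is computed only to produce some symbol that the outside world can see.
    * determinism: as with f(x) = y in mathematics, the function yields the same result in every situation. This is unlike many functions in programs, which compute different results for the same argument in different situations, that is, depending on some runtime state.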
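A few of the items quoted above can be sketched in Scala. This is an illustrative example rather than code from the quoted post; `FpTechniques` and its members are hypothetical names chosen for the sketch.

```scala
import scala.annotation.tailrec

object FpTechniques {
  // Tail recursion: the recursive call is the last action, so the compiler
  // can turn it into a loop. @tailrec makes compilation fail if it cannot.
  @tailrec
  def factorial(n: Int, acc: BigInt = 1): BigInt =
    if (n <= 1) acc else factorial(n - 1, acc * n)

  // Currying: one parameter list per argument.
  def add(x: Int)(y: Int): Int = x + y
  // Supplying only the first argument yields a function awaiting the second.
  val addFive: Int => Int = add(5)

  // Higher-order function: takes a function and returns a wrapped function.
  def twice(f: Int => Int): Int => Int = x => f(f(x))

  // Lazy evaluation: the right-hand side of a lazy val runs on first use only.
  // The local counter exists purely to observe when evaluation happens.
  def lazyDemo(): (Int, Int, Int) = {
    var evaluations = 0
    lazy val expensive = { evaluations += 1; 21 * 2 }
    val before = evaluations          // nothing has been evaluated yet
    val value = expensive + expensive // evaluated once, then cached
    (before, value, evaluations)
  }

  def main(args: Array[String]): Unit = {
    println(factorial(5))
    println(addFive(3))
    println(twice(_ + 1)(0))
    println(lazyDemo())
  }
}
```

Unlike Python, Scala does compile self-recursive tail calls such as `factorial` into a loop, which is why the course can lean on recursion so heavily.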
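One point I think is worth noting: functional programming is just a way of thinking, and you can follow this model in any language. Python, for example, is not classed as a functional language, yet its functions are first-class objects, so as long as you stick to the functional style, Python is perfectly up to the job.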
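## The Scala Interpreter
The package downloaded from the Scala website pulls a large number of dependencies the first time sbt runs. Is it recompiling something? This puzzled me: aren't these all just Java packages being fetched, so why would anything need recompiling? Once the lengthy initialization finishes, you can type `sbt console` to enter a Python-like interactive interpreter.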
## Expression & Evaluation
Expressions and how they are evaluated need no special study; the following examples make them clear.
<figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">size</span> =</span> <span class="number">2</span> <span class="comment">// a definition: the right-hand side is evaluated on each use</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">power</span>(</span>x: <span class="type">Double</span>, y: <span class="type">Int</span>): <span class="type">Double</span> = ??? <span class="comment">// a function definition (body left out)</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">abs</span>(</span>x: <span class="type">Double</span>) = <span class="keyword">if</span> (x >= <span class="number">0</span>) x <span class="keyword">else</span> -x <span class="comment">// if-else is an expression</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">square</span>(</span>x: <span class="type">Double</span>) = x * x <span class="comment">// helper used by isGoodEnough below</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">val</span> <span class="title">x</span> =</span> <span class="number">2</span> <span class="comment">// a value definition, evaluated once</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">sqrt</span>(</span>x: <span class="type">Double</span>) = { <span class="comment">// {} forms a block; the parameter x is visible to the functions inside it</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">sqrtIter</span>(</span>guess: <span class="type">Double</span>, x: <span class="type">Double</span>): <span class="type">Double</span> =</span><br><span class="line"> <span class="keyword">if</span> (isGoodEnough(guess, x)) guess</span><br><span class="line"> <span class="keyword">else</span> sqrtIter(improve(guess, x), x)</span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">improve</span>(</span>guess: <span class="type">Double</span>, x: <span class="type">Double</span>) =</span><br><span class="line"> (guess + x / guess) / <span class="number">2</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">isGoodEnough</span>(</span>guess: <span class="type">Double</span>, x: <span class="type">Double</span>) =</span><br><span class="line"> abs(square(guess) - x) < <span class="number">0.001</span></span><br><span class="line"> sqrtIter(<span class="number">1.0</span>, x)</span><br><span class="line">}</span><br></pre></td></tr></table></figure>
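<h2 id="小结">Summary</h2><p> This week's material is relatively easy, but writing programs in the functional style still feels very unfamiliar. With no mutable variables and no loops, designing the computation takes some thought, especially as almost every assignment has to be solved with recursion.<br>On another note, writing blog posts in Markdown turns out to be really comfortable; I will keep learning and using it.</p>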
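<h2 id="后记">Postscript</h2><p> Looking back after a year (or was it half a year?), I found that the content above is not quite right: not every language can be used functionally, Java being an example (at least before Java 8 added lambdas), because functions in Java are not first-class values and cannot be passed as arguments. Python, on the other hand, really can be: <code>map</code>, <code>filter</code> and <code>reduce</code> are ordinary functions rather than keywords (in Python 3, <code>reduce</code> lives in <code>functools</code>).</p>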
-->
</div>
<footer>
<div class="alignright">
<a href="/2014/11/coursera-func-scala/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2014/11/spark-scala-introduction/">Spark/Scala Quick-Start Notes</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2014-11-11T07:09:26.000Z">2014-11-11</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2014/11/spark-scala-introduction/#ds-thread"><span class="ds-thread-count" data-thread-key="2014/11/spark-scala-introduction/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
<p>Some introductory material on Spark/Scala. I hope it helps anyone who wants a quick overview; for me it serves as a memo.</p>
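<h2 id="Scala极速入门">Scala Quick Start</h2><p>For a first feel, read the <a href="http://www.scala-lang.org/what-is-scala.html" target="_blank" rel="external">official introduction</a> to Scala. In short, Scala is a hybrid functional and imperative language that leans functional. It is also object-oriented, functions are first-class objects, it runs on the Java Virtual Machine, and it interoperates seamlessly with Java.</p>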
<figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="keyword">package</span> com.xiaomi.data.ctr.feature.analysis</span><br><span class="line"></span><br><span class="line"><span class="comment">/**</span><br><span class="line"> * Scala极速入门材料,可以直接贴入ScalaIDE的worksheet</span><br><span class="line"> */</span></span><br><span class="line"><span class="class"><span class="keyword">object</span> <span class="title">test</span> {</span></span><br><span class="line"> println(<span class="string">"Welcome to the Scala worksheet"</span>) <span class="comment">//> Welcome to the Scala worksheet</span></span><br><span class="line"> <span class="comment">//val(ue) 是引用不变,不能改变val值变量的『值』</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">n</span> =</span> <span class="number">8</span> <span class="comment">//> n : Int = 8</span></span><br><span class="line"> <span class="comment">//n += 1 : value += is not a member of Int</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">//var(riable) 是变量,能用val用val,技穷用var</span></span><br><span class="line"> <span class="keyword">var</span> nn = <span class="number">7</span> <span class="comment">//> nn : Int = 7</span></span><br><span class="line"> nn += <span class="number">1</span></span><br><span class="line"> nn <span class="comment">//> res0: Int = 8</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">//tuple - 使用得非常重的数据结构,同python,但是不能(显式地)按下标取到每个元素</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">t</span> =</span> (<span class="number">1</span>, <span class="string">"a"</span>, <span class="type">None</span>) <span class="comment">//> t : (Int, String, None.type) = (1,a,None)</span></span><br><span class="line"> t._1 <span class="comment">//> res1: Int = 1</span></span><br><span class="line"> t._2 
<span class="comment">//> res2: String = a</span></span><br><span class="line"> t._3 <span class="comment">//> res3: None.type = None</span></span><br><span class="line"> <span class="comment">//纯语法,相信看过就不会忘记的</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> (</span>no, name, score) = t <span class="comment">//> no : Int = 1</span></span><br><span class="line"> <span class="comment">//| name : String = a</span></span><br><span class="line"> <span class="comment">//| score : None.type = None</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">//collections, 取下标用()</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">list</span> =</span> <span class="type">List</span>(<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>) <span class="comment">//> list : List[Int] = List(1, 2, 3)</span></span><br><span class="line"> list(<span class="number">0</span>) <span class="comment">//> res4: Int = 1</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">m</span> =</span> <span class="type">Map</span>(</span><br><span class="line"> <span class="string">"a"</span> -> <span class="number">1</span>,</span><br><span class="line"> <span class="string">"c"</span> -> <span class="number">2</span>,</span><br><span class="line"> <span class="string">"b"</span> -> <span class="number">3</span>,</span><br><span class="line"> <span class="string">"d"</span> -> <span class="number">4</span>) <span class="comment">//> m : scala.collection.immutable.Map[String,Int] = Map(a -> 1, c -> 2, b -> 3</span></span><br><span class="line"> <span class="comment">//| , d -> 4)</span></span><br><span class="line"> <span class="comment">//取前两个</span></span><br><span class="line"> m.take(<span class="number">2</span>) <span class="comment">//> res5: 
scala.collection.immutable.Map[String,Int] = Map(a -> 1, c -> 2)</span></span><br><span class="line"> m(<span class="string">"a"</span>) <span class="comment">//> res6: Int = 1</span></span><br><span class="line"> <span class="comment">//删除一个键,得到一个新的map</span></span><br><span class="line"> m - <span class="string">"a"</span> <span class="comment">//> res7: scala.collection.immutable.Map[String,Int] = Map(c -> 2, b -> 3, d -></span></span><br><span class="line"> <span class="comment">//| 4)</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">//函数:变量名在前、类型在后,函数头到函数体之间有“=”号</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">isPalindrome</span>(</span>str: <span class="type">String</span>) = (str == str.reverse.toString())</span><br><span class="line"> <span class="comment">//> isPalindrome: (str: String)Boolean</span></span><br><span class="line"> <span class="comment">//函数:可以显式指定返回值类型,函数的返回值就是最后一行的值</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">isPalindromeDetail</span>(</span>str: <span class="type">String</span>): <span class="type">Boolean</span> = {</span><br><span class="line"> println(str)</span><br><span class="line"> str == str.reverse.toString</span><br><span class="line"> } <span class="comment">//> isPalindromeDetail: (str: String)Boolean</span></span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">isPalindromeDetailUn</span>(</span>str: <span class="type">String</span>): <span class="type">Boolean</span> = ???</span><br><span class="line"> <span class="comment">//> isPalindromeDetailUn: (str: String)Boolean</span></span><br><span class="line"> <span class="comment">//函数式编程:map/filter/reduce,其它的基本都是变种</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span 
class="title">ab</span> =</span> <span class="type">List</span>(<span class="number">1</span>, <span class="number">2</span>, <span class="number">3</span>, <span class="number">4</span>, <span class="number">5</span>, <span class="number">6</span>) <span class="comment">//> ab : List[Int] = List(1, 2, 3, 4, 5, 6)</span></span><br><span class="line"> <span class="comment">//每个元素乘3</span></span><br><span class="line"> ab.map(_ * <span class="number">3</span>) <span class="comment">//> res8: List[Int] = List(3, 6, 9, 12, 15, 18)</span></span><br><span class="line"> <span class="comment">//取出偶数</span></span><br><span class="line"> ab.filter(_ % <span class="number">2</span> == <span class="number">0</span>) <span class="comment">//> res9: List[Int] = List(2, 4, 6)</span></span><br><span class="line"> <span class="comment">//reduce实现sum</span></span><br><span class="line"> ab.reduce(_ + _) <span class="comment">//> res10: Int = 21</span></span><br><span class="line"> <span class="comment">//reduce实现max</span></span><br><span class="line"> ab.reduce((x, y) => <span class="keyword">if</span> (x > y) x <span class="keyword">else</span> y) <span class="comment">//> res11: Int = 6</span></span><br><span class="line">}</span><br></pre></td></tr></table></figure>
<!-- more >
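One style is "for each thing in the collection: find where it is, fetch it, and perform some operation on it"; the other is "apply an operation to every element of a collection".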
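The two styles can be sketched in Scala as follows. This is an illustrative example with hypothetical names (`EachElement`, `tripleByIndex`, `tripleByMap`), not code from the worksheet above.

```scala
object EachElement {
  // Style 1: walk the indices, fetch each element, operate on it, store it.
  def tripleByIndex(xs: Array[Int]): Array[Int] = {
    val out = new Array[Int](xs.length)
    var i = 0
    while (i < xs.length) {
      out(i) = xs(i) * 3
      i += 1
    }
    out
  }

  // Style 2: say what happens to every element and let map do the walking.
  def tripleByMap(xs: Array[Int]): Array[Int] = xs.map(_ * 3)

  def main(args: Array[String]): Unit = {
    println(tripleByIndex(Array(1, 2, 3)).mkString(", "))
    println(tripleByMap(Array(1, 2, 3)).mkString(", "))
  }
}
```

The second form carries no index bookkeeping, and the same shape carries over to Spark, where `map` is applied to an RDD instead of a local array.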
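The web is full of articles on how functional and imperative languages differ. Here is a concise one (I will not judge whether it gets everything right; I wrote it myself from other sources...): "[Course notes: Coursera - Functional Programming Principles in Scala - Week 1](/2014/11/11/课程学习Couresera-Functional-Programming-Principles-in-Scala-Week-1)"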
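## Scala Resources
- The one you will use most from now on: [Scala Standard Library API Docs](http://www.scala-lang.org/api/current/#package). You may finish reading and still not know which page to look at, but you really do have to rely on it.
- Books: Duokan carries 《[Scala程序设计:Java虚拟机多核编程实战](http://www.duokan.com/book/68639)》, and there are many others, such as **Scala for the Impatient** and **Programming in Scala: A comprehensive Step-by-step Guide**
- [Coursera](https://class.coursera.org) offers a functional programming course taught in Scala, *Functional Programming Principles in Scala*. Be aware that you may go through the whole course without learning that `var` exists, because the course really only covers functional programming. Also, do not be surprised that the assignments take a long time.
If you are as lazy as I am, take the Coursera course anyway; studying systematically helps you grasp the big picture.
- Added by Fangfang: [Typesafe Activator](http://www.typesafe.com/) has many code templates; Twitter uses Scala heavily internally and runs [Scala School](https://twitter.github.io/scala_school/index.html)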
## Getting Started with Spark

In a nutshell, Spark takes just three steps: Create (an RDD), Transform (RDD to RDD) and Action (RDD to non-RDD).
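1. Create
   There are three common ways to create an RDD (all start from a SparkContext instance):
   - textFile
   - sequenceFile
   - parallelize, starting from a Scala collection

   `textFile` and `sequenceFile` handle most record files stored on HDFS, and `parallelize` is handy for testing. Reading from HBase and other sources is a bit more involved and cannot be done in a single statement.
2. Transform
   - Basic: map/filter
   - Key-value processing: groupByKey, reduceByKey, combineByKey, and so on

   You must know this part well. After `import org.apache.spark.SparkContext._`, any rdd of type `RDD[(K, V)]` automatically gains all the methods of PairRDDFunctions. Note: a Transform takes an RDD as input and still outputs an RDD.
3. Action
   An Action converts an RDD into ordinary data types, so the result is no longer an RDD. Typically this means collecting results on the driver node or writing straight to HDFS. Hadoop users must know `saveAsTextFile` and `saveAsSequenceFile`; to gather results on the driver for further work, use collect() or reduce().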
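The three steps can be sketched as a small word count. This is a minimal sketch, assuming spark-core 1.x is on the classpath and a local master is acceptable; `ThreeSteps` and `countWords` are hypothetical names used only for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// On Spark 1.x before 1.3 this import supplies the implicit PairRDDFunctions.
import org.apache.spark.SparkContext._

object ThreeSteps {
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)       // create: an RDD from a Scala collection
      .flatMap(_.split(" "))    // transform: RDD[String] to RDD[String]
      .map(word => (word, 1))   // transform: build an RDD of key-value pairs
      .reduceByKey(_ + _)       // transform: sum the counts per key
      .collect()                // action: bring the result to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so that the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("three-steps").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(countWords(sc, Seq("a b a", "b c")))
    finally sc.stop()
  }
}
```

Nothing is computed until `collect()` runs; the calls before it only describe the RDD lineage.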
## A Spark Project
Create a Maven project and put the following in the pom:
<figure class="highlight xml"><table><tr><td class="code"><pre><span class="line"><span class="tag"><<span class="title">dependencies</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>org.apache.hadoop<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>hadoop-client<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">version</span>></span>2.4.0-mdh2.0.5<span class="tag"></<span class="title">version</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">type</span>></span>jar<span class="tag"></<span class="title">type</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">scope</span>></span>compile<span class="tag"></<span class="title">scope</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">exclusions</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">exclusion</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>asm<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>asm<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">exclusion</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">exclusion</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>org.jboss.netty<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span 
class="title">artifactId</span>></span>netty<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">exclusion</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">exclusion</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>servlet-api<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>javax.servlet<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">exclusion</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">exclusions</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>org.apache.spark<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>spark-core_2.10<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">version</span>></span>1.1.0<span class="tag"></<span class="title">version</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>junit<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>junit<span class="tag"></<span 
class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">version</span>></span>4.8.1<span class="tag"></<span class="title">version</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">scope</span>></span>test<span class="tag"></<span class="title">scope</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>org.scalatest<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>scalatest_2.10<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">version</span>></span>2.2.1<span class="tag"></<span class="title">version</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">scope</span>></span>test<span class="tag"></<span class="title">scope</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">dependency</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>joda-time<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>joda-time<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">version</span>></span>2.4<span class="tag"></<span class="title">version</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">dependency</span>></span></span><br><span 
class="line"><span class="tag"></<span class="title">dependencies</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="title">build</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">plugins</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">plugin</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>org.apache.maven.plugins<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>maven-shade-plugin<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">version</span>></span>2.3<span class="tag"></<span class="title">version</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">configuration</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactSet</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">includes</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">include</span>></span><span class="tag"></<span class="title">include</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">includes</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">artifactSet</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">configuration</span>></span></span><br><span class="line"></span><br><span class="line"> <span class="tag"><<span class="title">executions</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">execution</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">phase</span>></span>package<span class="tag"></<span 
class="title">phase</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">goals</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">goal</span>></span>shade<span class="tag"></<span class="title">goal</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">goals</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">execution</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">executions</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">plugin</span>></span></span><br><span class="line"></span><br><span class="line"> <span class="comment"><!-- maven的scala支持插件,适当的时候可以去用一用新的版本 --></span></span><br><span class="line"> <span class="tag"><<span class="title">plugin</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">groupId</span>></span>net.alchim31.maven<span class="tag"></<span class="title">groupId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">artifactId</span>></span>scala-maven-plugin<span class="tag"></<span class="title">artifactId</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">version</span>></span>3.1.3<span class="tag"></<span class="title">version</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">executions</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">execution</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">goals</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">goal</span>></span>compile<span class="tag"></<span class="title">goal</span>></span></span><br><span class="line"> <span class="tag"><<span class="title">goal</span>></span>testCompile<span class="tag"></<span class="title">goal</span>></span></span><br><span 
class="line"> <span class="tag"></<span class="title">goals</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">execution</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">executions</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">plugin</span>></span></span><br><span class="line"> <span class="tag"></<span class="title">plugins</span>></span></span><br><span class="line"><span class="tag"></<span class="title">build</span>></span></span><br></pre></td></tr></table></figure>
<p>Below is a code example. It takes a sample file whose records have the format<br><code>label pos1:value1 pos2:value2 ... posn:valuen#imei,appid</code><br>and drops the values of the five columns <code>0, 37, 38, 39, 51</code>.<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line"><span class="keyword">package</span> demo</span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> org.apache.spark.<span class="type">SparkConf</span></span><br><span class="line"><span class="keyword">import</span> org.apache.spark.<span class="type">SparkContext</span></span><br><span class="line"><span class="keyword">import</span> org.apache.spark.<span class="type">SparkContext</span>._</span><br><span class="line"><span class="keyword">import</span> org.apache.hadoop.io.compress.<span class="type">GzipCodec</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/**</span><br><span class="line"> * Remove the five features 0, 37, 38, 39, 51 from the base samples</span><br><span class="line"> */</span></span><br><span class="line"><span class="class"><span class="keyword">object</span> <span class="title">SampleReduction</span> {</span></span><br><span class="line"> <span class="comment">/**</span><br><span class="line"> * A carrier class for parsed records. Why is this all it takes? It is just syntax: val constructor parameters become fields</span><br><span class="line"> */</span></span><br><span class="line"> <span class="class"><span class="keyword">class</span> <span class="title">AppStoreRecord</span>(</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">imei</span>:</span> <span class="type">String</span>,</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">appid</span>:</span> <span class="type">Int</span>,</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span
class="title">label</span>:</span> <span class="type">Int</span>,</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">pv</span>:</span> <span class="type">Map</span>[<span class="type">Int</span>, <span class="type">Double</span>])</span><br><span class="line"></span><br><span class="line"> <span class="comment">/**</span><br><span class="line"> * 解析一条记录的函数</span><br><span class="line"> */</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">parseRecord</span>(</span>line: <span class="type">String</span>): <span class="type">AppStoreRecord</span> = {</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">cols</span> =</span> line.split(<span class="string">"\t"</span>)</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">label</span> =</span> cols(<span class="number">0</span>).toInt</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">splits</span> =</span> cols(<span class="number">1</span>).split(<span class="string">"#"</span>)</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">Array</span>(</span>imei, appId) = splits(<span class="number">1</span>).split(<span class="string">","</span>)</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">pvPart</span> =</span> splits(<span class="number">0</span>).split(<span class="string">" "</span>)</span><br><span class="line"> splits(<span class="number">1</span>).split(<span class="string">","</span>)(<span class="number">0</span>)</span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">values</span> =</span> pvPart.map(cell => { <span class="function"><span class="keyword">val</span> <span 
class="title">t</span> =</span> cell.split(<span class="string">":"</span>); (t(<span class="number">0</span>).toInt, t(<span class="number">1</span>).toDouble) })</span><br><span class="line"> <span class="keyword">new</span> <span class="type">AppStoreRecord</span>(imei, appId.toInt, label, values.toMap)</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">/**</span><br><span class="line"> * 需要有这个函数来成为入口,同java的main函数,格式固定</span><br><span class="line"> */</span></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">main</span>(</span>args: <span class="type">Array</span>[<span class="type">String</span>]): <span class="type">Unit</span> = {</span><br><span class="line"> <span class="comment">//固定,不可少</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">sparkConf</span> =</span> <span class="keyword">new</span> <span class="type">SparkConf</span>().setAppName(<span class="string">"Sample Reduction"</span>)</span><br><span class="line"> <span class="comment">//外部执行spark程序时,master会指定,这里手工指定方便在IDE中跑起来</span></span><br><span class="line"> <span class="keyword">if</span> (!sparkConf.contains(<span class="string">"spark.master"</span>)) sparkConf.setMaster(<span class="string">"local[4]"</span>)</span><br><span class="line"> println(<span class="string">"master: "</span> + sparkConf.get(<span class="string">"spark.master"</span>))</span><br><span class="line"> <span class="comment">//固定,必有这么一句</span></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">sc</span> =</span> <span class="keyword">new</span> <span class="type">SparkContext</span>(sparkConf)</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">val</span> <span class="title">zeroFeatures</span> =</span> <span 
class="type">Array</span>(<span class="number">0</span>, <span class="number">37</span>, <span class="number">38</span>, <span class="number">39</span>, <span class="number">51</span>)</span><br><span class="line"></span><br><span class="line"> sc</span><br><span class="line"> <span class="comment">//create RDD</span></span><br><span class="line"> .textFile(args(<span class="number">0</span>))</span><br><span class="line"> <span class="comment">//transform 1</span></span><br><span class="line"> .map(parseRecord)</span><br><span class="line"> <span class="comment">//transform 2</span></span><br><span class="line"> <span class="comment">// Save as label[tab]pos:value[space]...#imei,appid</span></span><br><span class="line"> .map(t => {</span><br><span class="line"> t.label + <span class="string">"\t"</span> + (t.pv -- zeroFeatures).map { <span class="keyword">case</span> (p, v) => p + <span class="string">":"</span> + v }.mkString(<span class="string">" "</span>) + <span class="string">"#"</span> + t.imei + <span class="string">","</span> + t.appid</span><br><span class="line"> })</span><br><span class="line"> <span class="comment">//action</span></span><br><span class="line"> .saveAsTextFile(args(<span class="number">1</span>), classOf[<span class="type">GzipCodec</span>])</span><br><span class="line"></span><br><span class="line"> <span class="comment">// The examples bundled with Spark all end with this call</span></span><br><span class="line"> sc.stop()</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure></p>
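<p>To make the record layout concrete, here is a minimal sketch of the same parse, filter, and re-serialize round trip in plain Scala, with no Spark dependency, so it can be tried in a scratch file. The names <code>RecordFormatDemo</code> and <code>SampleRecord</code> and the sample line (including its IMEI and app id) are made up for illustration and are not part of the original program. The sketch also sorts the remaining features by position to get stable output, which the Spark job above does not do.<br><figure class="highlight scala"><table><tr><td class="code"><pre><span class="line">// Illustrative sketch only: RecordFormatDemo, SampleRecord, and the sample line are hypothetical.</span><br><span class="line">object RecordFormatDemo {</span><br><span class="line">  final case class SampleRecord(imei: String, appid: Int, label: Int, pv: Map[Int, Double])</span><br><span class="line"></span><br><span class="line">  // Parse one line laid out as label[tab]pos:value[space]...#imei,appid</span><br><span class="line">  def parse(line: String): SampleRecord = {</span><br><span class="line">    val Array(label, rest) = line.split("\t", 2)</span><br><span class="line">    val Array(pvPart, idPart) = rest.split("#", 2)</span><br><span class="line">    val Array(imei, appId) = idPart.split(",", 2)</span><br><span class="line">    val pv = pvPart.split(" ").map { cell =&gt;</span><br><span class="line">      val Array(pos, value) = cell.split(":", 2)</span><br><span class="line">      pos.toInt -&gt; value.toDouble</span><br><span class="line">    }.toMap</span><br><span class="line">    SampleRecord(imei, appId.toInt, label.toInt, pv)</span><br><span class="line">  }</span><br><span class="line"></span><br><span class="line">  // Drop the given feature positions and write the record back in the same layout</span><br><span class="line">  def format(r: SampleRecord, dropped: Set[Int]): String = {</span><br><span class="line">    val features = (r.pv -- dropped).toSeq.sortBy(_._1).map { case (p, v) =&gt; s"$p:$v" }.mkString(" ")</span><br><span class="line">    s"${r.label}\t${features}#${r.imei},${r.appid}"</span><br><span class="line">  }</span><br><span class="line"></span><br><span class="line">  def main(args: Array[String]): Unit = {</span><br><span class="line">    val line = "1\t0:1.0 3:0.5 37:2.0#860000000000000,42"</span><br><span class="line">    println(format(parse(line), Set(0, 37, 38, 39, 51)))</span><br><span class="line">  }</span><br><span class="line">}</span><br></pre></td></tr></table></figure></p>
<p>For the made-up sample line, features 0 and 37 are in the dropped set, so the program prints <code>1[tab]3:0.5#860000000000000,42</code>. Keeping parsing and formatting as plain functions like this makes them easy to check before they are passed to <code>map</code> on an RDD.</p>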
<p>IDE usage is beyond the scope of this post. Once the jar is packaged, run it with the command below (only program arguments may follow the main jar; every other option must come before it):<br><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">spark-submit \</span><br><span class="line"> --master yarn-client \</span><br><span class="line"> --num-executors <span class="number">3</span> \</span><br><span class="line"> --executor-memory <span class="number">2</span>G \</span><br><span class="line"> --queue user_profile_default \</span><br><span class="line"> --class com.your.mainclass \</span><br><span class="line"> target/feature-analysis-<span class="number">0.0</span>.<span class="number">1</span>-SNAPSHOT.jar \</span><br><span class="line"> [program arguments]</span><br></pre></td></tr></table></figure></p>
<p>You can add a script to the project so that you don't have to retype the common options every time:<br><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="shebang">#!/bin/bash</span><br><span class="line"></span></span><br><span class="line"><span class="comment"># Submits a job to YARN. Arguments:</span></span><br><span class="line"><span class="comment"># class arg1...</span></span><br><span class="line"></span><br><span class="line">class=<span class="variable">$1</span></span><br><span class="line"><span class="built_in">shift</span></span><br><span class="line"></span><br><span class="line">spark-submit \</span><br><span class="line"> --master yarn-client \</span><br><span class="line"> --num-executors <span class="number">3</span> \</span><br><span class="line"> --executor-memory <span class="number">2</span>G \</span><br><span class="line"> --queue user_profile_default \</span><br><span class="line"> --class <span class="variable">$class</span> \</span><br><span class="line"> target/feature-analysis-<span class="number">0.0</span>.<span class="number">1</span>-SNAPSHOT.jar \</span><br><span class="line"> <span class="string">"<span class="variable">$@</span>"</span></span><br></pre></td></tr></table></figure></p>
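<h2 id="Spark相关资料">Spark resources</h2><ul>
<li>Official Spark material: the concise <a href="https://spark.apache.org/docs/latest/programming-guide.html">Programming Guide</a>, plus the <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package">API Docs</a>, which you will come back to again and again. At a minimum, get comfortable with three classes: <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext">SparkContext</a>, <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD">RDD</a>, and <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions">PairRDDFunctions</a>. You will find they are pretty much enough for processing logs.</li>
<li>Open course: <em>Kaikeba (开课吧)</em> offers a course called "<a href="http://www.kaikeba.com/courses/60">Spark实战演练</a>" (Spark in Action), and it is free. The instructor's expressions, gestures, and tone are not exactly lively, but hey, it's free. I won't tell you that the site also hosts Baidu's Office course series.</li>
<li>E-book: "<a href="https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/">Learning Spark</a>". Safari has the first few chapters available to read online, or you can go find a PDF; the one I found is only 95 pages...</li>
<li>Slides: <a href="http://www.slideshare.net/">SlideShare</a> has piles and piles of them, but don't expect any of them to walk you through things step by step. You need to get around the Great Firewall to reach it. I do wish fellow Chinese users would show some restraint on foreign sites: a moment of venting ends up blocking everyone else's access to useful resources.<br>If you are as lazy as I am, just go take the course on <em>Kaikeba</em>. Don't think I'm advertising; I really am just lazy...</li>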
</ul>
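<h2 id="小结">Summary</h2><p> Learning Spark together with Scala has a cost, but for you programmers, who will pick up dozens or even hundreds of languages over a lifetime, it should be no pressure at all! The amount of code one person can maintain is limited, and Spark/Scala really can free you from long stretches of repetitive Java code. Of course, that freedom only comes with fluency... you need to keep at it until you have digested Scala's odd corners. Google/AOL are your friends; make good use of them.</p>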
-->
</div>
<footer>
<div class="alignright">
<a href="/2014/11/spark-scala-introduction/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content-index">
<header>
<div class="icon"></div>
<h1 class="title transition"><a href="/2014/11/Github-Pages-good-night/">GitHub Pages, Good Night</a></h1>
<ul>
<li>
<span class="heading-span">Posted on: </span>
<time datetime="2014-11-10T17:35:57.000Z">2014-11-11</time>
</li>
<li>
<span class="heading-span">By: </span>
<a href="/">Du00</a>
</li>
<li>
<span class="heading-span">With: </span>
<a href="/2014/11/Github-Pages-good-night/#ds-thread"><span class="ds-thread-count" data-thread-key="2014/11/Github-Pages-good-night/" data-count-type="comments"></span></a>
</ul>
</header>
<div class="entry">
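<p> It has been a long time since I last blogged, sometimes simply because it felt like a hassle; my time on Baidu Space in particular left me feeling that I was at the platform's mercy. I still remember how painful it was to typeset a formula there. Something as simple as $x^2$ could just about be done with the superscript and subscript buttons of Baidu Space's feeble editor. But anything that has to be stacked over two lines, even something as uncomplicated as $$ \frac{x - \min}{\max - \min} $$, needed an image. After finishing a note in Word, I still had to spend another hour editing before I could publish it, which was no fun at all. </p>
<p> Later, Baidu Space forced through a redesign and, ignoring tens of thousands of replies of feedback from Space users on Tieba, launched its light-blog format anyway. However simple and fresh the light blog was, all I knew was that my formatting was gone and so were my figures (whose positions had needed manual adjustment, too). I then moved to Sina Blog, which was somewhat more stable, but that kind of tedious editing no longer held any appeal. </p>
<p> It was only when I came across <code>Markdown</code> that I realized writing documents properly and getting decent typesetting are not at odds! I used an online Markdown editor for a while; what I ended up using most was Maxiang (马克飞象), which integrates with Evernote / Yinxiang Biji. It was built by a fellow alumnus, and the experience is really excellent. Evernote is still not great for sharing, though, and I always want to write something; posting the notes I take for my Coursera courses is also a good push. </p>
<p> As for <em>GitHub Pages</em>, it was a whim while I was submitting homework on GitHub yesterday, and I tinkered with it until 3 a.m. I started with Jekyll, and honestly I never got it configured properly (I write code for a living, but I know nothing about web front ends). At noon today I found <code>hexo</code>, and it was so simple that the blog was up in no time. I then tried out various themes and added comment and sharing plugins, and to my surprise that was it. </p>
<p> This really does feel fresh, and since I already write my everyday material in Markdown, publishing a post will no longer cost much time. With all the material I look up online every day, not writing it down would be a shame. </p>
<p> <strong>Sharing is human nature.</strong> </p>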
</div>
<footer>
<div class="alignright">
<a href="/2014/11/Github-Pages-good-night/#more" class="more-link">Read more<i class="fa fa-long-arrow-right fa-1"></i></a>
</div>
<div class="clearfix"></div>
</footer>
</div>
</article>